Hands, Gestures and Popping back to ‘Prague’

Just a short post to follow up on this previous post;

Hands, Gestures and a Quick Trip to ‘Prague’

I said that if I ‘found time’ then I’d revisit that post and that code and see if I could make it work with the Kinect for Windows V2 sensor rather than with the Intel RealSense SR300 which I used in that post.

In all honesty, I haven’t ‘found time’ but I’m revisiting it anyway Smile

I dug my Kinect for Windows V2 and all of its lengthy cabling out of the drawer, plugged it into my Surface Book and … it didn’t work. Instead, I got the flashing white light which usually indicates that things aren’t going so well.

Not to be deterred, I did some deep, internal Microsoft research (ok, I searched the web Smile) and came up with this;

Kinect Sensor is not recognized on a Surface Book

and getting rid of the text value within that registry key sorted out that problem and let me test that my Kinect for Windows V2 was working in the sense that the configuration verifier says;

image

which after many years of experience I have learned to interpret as “Give it a try!” Winking smile. I tried out a couple of the SDK samples and they worked fine for me, so I reckoned I was in a good place to get started.

However, the Project Prague bits were not so happy and I found they were logging a bunch of errors in the ‘Geek View’ about not being able to connect/initialise to either the SR300 or the Kinect camera.

This seemed to get resolved by me updating my Kinect drivers – I did an automatic update and Windows found new drivers online which took me to this version;

image

which I was surprised I didn’t have already as it’s quite old, but that seemed to make the Project Prague pieces happy and the Geek View was back in business showing output from the Kinect;

image

and from the little display window on the left there it felt like this operated at a range of approx 0.5m to 1.0m. I wondered whether I could move further away but that didn’t seem to be the case in the quick experiment that I tried.

The big question for me then was whether the code that I’d previously written and run against the SR300 would “just work” on the Kinect for Windows V2 and, of course, it does Smile. Revisit the previous post for the source code if you’re interested, but I found my “counting on four fingers” gesture was recognised quickly and reliably here;

image

This is very cool – it’d be interesting to know exactly what ‘Prague’ relies on from the perspective of the camera and also from the POV of system requirements (CPU, RAM, GPU, etc) in order to make this work but it looks like they’ve got a very decent system going for recognising hand gestures across different cameras.

Hands, Gestures and a Quick Trip to ‘Prague’

Sorry for the title – I couldn’t resist and, no, I’ve not switched to writing a travel blog just yet although I’ll keep the idea in my back pocket for the time when the current ‘career’ hits the ever-looming buffers Winking smile

But, no, this post is about ‘Project Prague’ and hand gestures and I’ve written quite a bit in the past about natural gesture recognition with technologies like the Kinect for Windows V2 and with the RealSense F200 and SR300 cameras.

Kinect has great capabilities for colour, depth and infra-red imaging and a smart (i.e. cloud-trained AI) runtime which can bring all those streams together and give you (human) skeletal tracking of 25 joints on 6 bodies at 30 frames per second. It can also do some facial tracking and has an AI-based gesture recognition system which can be trained to recognise human-body-based gestures like “hands above head” or “golf swing” and so on.

That camera has a range of approximately 0.5m to 4.5m and, perhaps because of this long range, it doesn’t have a great deal of support for hand-based gestures; it can report some hand joints and a few different hand states like open/closed, but it doesn’t go much beyond that.
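To make that concrete, here’s a minimal sketch (my own, rather than code from this post) of reading those hand states via the Body API in the Kinect for Windows SDK 2.0, assuming that the SDK is installed and a sensor is plugged in;

namespace HandStateSketch
{
  using System;
  using System.Linq;
  using Microsoft.Kinect;

  class Program
  {
    static Body[] bodies;

    static void Main()
    {
      // Open the default sensor and watch body frames for the coarse hand
      // states (Open/Closed/Lasso/NotTracked/Unknown).
      var sensor = KinectSensor.GetDefault();
      var reader = sensor.BodyFrameSource.OpenReader();

      reader.FrameArrived += (s, e) =>
      {
        using (var frame = e.FrameReference.AcquireFrame())
        {
          if (frame == null)
          {
            return;
          }
          bodies = bodies ?? new Body[frame.BodyCount];
          frame.GetAndRefreshBodyData(bodies);

          foreach (var body in bodies.Where(b => b.IsTracked))
          {
            // This is about as far as the built-in hand support goes - there
            // are no per-finger joints here.
            Console.WriteLine(
              $"Left hand {body.HandLeftState}, right hand {body.HandRightState}");
          }
        }
      };
      sensor.Open();

      Console.WriteLine("Hit return to exit...");
      Console.ReadLine();

      sensor.Close();
    }
  }
}

and it’s that gap which the RealSense SDK and, now, ‘Prague’ go quite a bit further in filling.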

I’ve also written about the RealSense F200 and SR300 cameras, although I never had a lot of success with the SR300. Those cameras have a much shorter range (< 1m) than the Kinect for Windows V2 but have/had some different capabilities in that they surfaced functionality like;

  • Detailed facial detection providing feature positions etc and facial recognition.
  • Emotion detection providing states like ‘happy’, ‘sad’ etc (although this got removed from the original SDK at a later point)
  • Hand tracking features
    • The SDK has great support for tracking of hands down to the joint level with > 20 joints reported by the SDK
    • The SDK also has support for hand-based gestures such as “V sign”, “full pinch” etc.

With any of these cameras and their SDKs, the processing happens locally on the (high bandwidth) data at frame rates of 15/30/60 FPS, so it’s quite different to those scenarios where you might be selectively capturing data and sending it to the cloud for processing as you see with the Cognitive Services. Both approaches have their benefits and are open to being used in combination.

In terms of this functionality around hand tracking and gestures, I bundled some of what I knew about this into a video last year and published it to Channel9 although it’s probably quite a bit out of date at this point;

image

but it’s been a topic that interested me for a long time and so when I saw ‘Project Prague’ announced a few weeks ago I was naturally interested.

My first question on ‘Prague’ was whether it would make use of a local-processing or a cloud-based-processing model and, if the former, whether it would require a depth camera or would be based purely on a web cam.

It turns out that ‘Prague’ processes data locally and requires either a Kinect for Windows V2 camera or a RealSense SR300 camera, with the recommendation on the website being to use the SR300.

I dug my Intel RealSense SR300 out of the drawer where it’s been living for a few months, plugged it in to my Surface Book and set about seeing whether I could get a ‘Prague’ demo up and running on it.

Plugging in the SR300

I hadn’t plugged the SR300 into my Surface Book since I reinstalled Windows and so I wondered how that had progressed since the early days of the camera and since Windows has moved to Creators Update (I’m running 15063.447).

I hadn’t installed the RealSense SDK onto this machine but Windows seemed to recognise the device and install it regardless. I did find that the initial install left some “warning triangles” in Device Manager that had to be resolved by a manual “Scan for hardware changes” from the Device Manager menu, but then things seemed to sort themselves out and Device Manager showed;

image

which the modern devices app shows as;

image

and that seemed reasonable. I didn’t have to visit the troubleshooting page, although based on my previous experience with the SR300 I wasn’t surprised to see that it existed, and instead I went off to download ‘Project Prague’.

Installing ‘Prague’

Nothing much to report here – there’s an MSI that you download and run;

image

and “It Just Worked” so nothing to say about that.

Once installation had finished, as per the docs, the “Microsoft Gestures Service” app ran up and I tried to do as the documentation advised and make sure that the app was recognising my hand. It didn’t seem to be working, as below;

image

but then I tried with my right hand and things seemed to be working better;

image

This is actually the window view (called the ‘Geek View’!) of a system tray application (the “gestures service”) which doesn’t seem to be a true service in the NT sense but instead seems to be a regular app configured to run at startup on the system;

image

so, much like the Kinect runtime, it seems that this is the code which sits and watches frames from the camera while applications become “clients” of this service. The “DiscoveryClient”, which is also highlighted in the screenshot as being configured to run at startup, is one such demo app which picks up gestures from the service and (according to the docs) routes them through to the shell.

Here’s the system tray application;

image

and if I perform the “bloom” gesture (familiar from Windows Mixed Reality) then the system tray app pops up;

image

and tells me that there are other gestures already active to open the start menu and toggle the volume. The gestures animate on mouse over to show how to execute them and I had no problem with using the gesture to toggle the volume on my machine but I did struggle a little with the gesture to open the start menu.

The ‘timeline’ view in the ‘Geek View’ here is interesting because it shows gestures being detected or not in real time and you can perhaps see on the timeline below how I’m struggling to execute the ‘Shell_Start’ gesture and it’s getting recognised as a ‘Discovery_Tray’ gesture. In that screenshot the white blob indicates a “pose” whereas the green blobs represent completed “gestures”.

image

There’s also a ‘settings’ section here which shows me;

image

and then on the GestPacks section;

image

which suggests that the service has integration for various apps. At the time of writing, the “get more online” option didn’t seem to link to anything that I could spot, but by running PowerPoint I noticed that the app monitors which app is in the foreground and switches its gestures list to relate to that foreground app.

So, when running PowerPoint, the gesture service shows;

image

and those gestures worked very well for me in PowerPoint – it was easy to start a slideshow and then advance the slides by just tapping through in the air with my finger. These details can also be seen in the settings app;

image

which suggests that these gestures are contextual within the app – for example the “Rotate Right 90” option doesn’t show up until I select an object in PowerPoint;

image

and I can see this dynamically changing in the ‘Geek View’ – here’s the view when no object is selected;

image

and I can see that there are perhaps 3 gestures registered whereas if I select an object in PowerPoint then I see;

image

and those gestures worked pretty well for me Smile 

Other Demo Applications

I experimented with the ‘Camera Viewer’ app which works really well. Once again, from the ‘Geek View’ I can see that this app has registered some gestures and you can perhaps see below that I am trying out the ‘peace’ gesture and the geek view is showing that this is registered, that it has completed and the app is displaying some nice doves to show it’s seen the gesture;

image

One other interesting aspect of this app is that it displays a ‘Connecting to Gesture Service’ message as you bring it back into focus suggesting that there’s some sort of ‘connection’ to the gestures service that comes/goes over time.

These gestures worked really well for me and by this point I was wondering how these gestures apps were plugging into the architecture here, how they were implemented and so I wanted to see if I could write some code. I did notice that the GestPacks seem to live in a folder under the ‘Prague’ installation;

image

and a quick look at one of the DLLs (e.g. PowerPoint) shows that this is .NET code interop’ing into PowerPoint as you’d expect although the naming suggests there’s some ATL based code in the chain here somewhere;

image

Coding ‘Prague’

The API docs link leads over to this web page, which points to a Microsoft.Gestures namespace that appears to be part of .NET Core 2.0. That would seem to suggest that (right now) you’re not going to be able to reference this from a Universal Windows App project, but you can reference it from a .NET Framework project and so I just referenced it from a command line project targeting .NET Framework 4.6.2.

The assemblies seem to live in the equivalent of;

“C:\Users\mtaulty\AppData\Roaming\Microsoft\Prague\PragueVersions\LatestVersion\SDK”

and I added a reference to 3 of them;

image

It’s also worth noting that there are a number of code samples over in this github repository;

https://github.com/Microsoft/Gestures-Samples

although, at the time of writing, I haven’t really referred to those too much as I was trying to see what my experience was like ‘starting from scratch’. To that end, I had a quick look at what seemed to be the main assembly in the object browser;

image

and the structure seemed to suggest that the library is using TCP sockets as an ‘RPC’ mechanism to communicate between an app and the gestures service. A quick look at the gestures service process with Process Explorer did show that it was listening for traffic;

image

So, how to get a connection? It seems fairly easy in that the docs point you to the GesturesServiceEndpoint class and there’s a GesturesServiceEndpointFactory to make those, and then IntelliSense popped up as below to reinforce the idea that there is some socket-based comms going on here;

image

From there, I wanted to define my own gesture which would allow the user to start with an open, spread hand and then tap their thumb onto their four fingers in sequence, which seemed to consist of 5 stages. I read the docs around how gestures, poses and motion work and added some code to my console application to see if I could code up this gesture;

namespace ConsoleApp1
{
  using Microsoft.Gestures;
  using Microsoft.Gestures.Endpoint;
  using System;
  using System.Collections.Generic;
  using System.Threading.Tasks;

  class Program
  {
    static void Main(string[] args)
    {
      ConnectAsync();

      Console.WriteLine("Hit return to exit...");

      Console.ReadLine();

      ServiceEndpoint.Disconnect();
      ServiceEndpoint.Dispose();
    }
    static async Task ConnectAsync()
    {
      Console.WriteLine("Connecting...");

      try
      {
        var connected = await ServiceEndpoint.ConnectAsync();

        if (!connected)
        {
          Console.WriteLine("Failed to connect...");
        }
        else
        {
          await serviceEndpoint.RegisterGesture(CountGesture, true);
        }
      }
      catch
      {
        Console.WriteLine("Exception thrown in starting up...");
      }
    }
    static void OnTriggered(object sender, GestureSegmentTriggeredEventArgs e)
    {
      Console.WriteLine($"Gesture {e.GestureSegment.Name} triggered!");
    }
    static GesturesServiceEndpoint ServiceEndpoint
    {
      get
      {
        if (serviceEndpoint == null)
        {
          serviceEndpoint = GesturesServiceEndpointFactory.Create();
        }
        return (serviceEndpoint);
      }
    }
    static Gesture CountGesture
    {
      get
      {
        if (countGesture == null)
        {
          var poses = new List<HandPose>();

          var allFingersContext = new AllFingersContext();

          // Hand starts upright, forward and with fingers spread...
          var startPose = new HandPose(
            "start",
            new FingerPose(
              allFingersContext, FingerFlexion.Open),
            new FingertipDistanceRelation(
              allFingersContext, RelativeDistance.NotTouching));

          poses.Add(startPose);

          foreach (Finger finger in
            new[] { Finger.Index, Finger.Middle, Finger.Ring, Finger.Pinky })
          {
            poses.Add(
              new HandPose(
              $"pose{finger}",
              new FingertipDistanceRelation(
                Finger.Thumb, RelativeDistance.Touching, finger)));
          }
          countGesture = new Gesture("count", poses.ToArray());
          countGesture.Triggered += OnTriggered;
        }
        return (countGesture);
      }
    }
    static Gesture countGesture;
    static GesturesServiceEndpoint serviceEndpoint;
  }
}

I’m very unsure as to whether my code is specifying my gesture ‘completely’ or ‘accurately’ but what amazed me about this is that I really only took one stab at it and it “worked”.

That is, I can run my app and see my gesture being built up from its 5 constituent poses in the ‘Geek View’ and then my console app has its event triggered and displays the right output;

image

What I’d flag about that code is that it’s using async/await in a console app, so it’s likely that thread pool threads are being used to dispatch all the “completions”, which means that lots of threads are potentially running through this code and interacting with objects which may or may not have thread affinity – I’ve not done anything to mitigate that here.
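For what it’s worth, a minimal way of tidying that up (my own sketch rather than anything from the ‘Prague’ docs, and assuming the rest of the class stays exactly as above) would be to block Main until the connection and the gesture registration have completed and to serialise access to any shared state touched from the Triggered handler;

    static readonly object padlock = new object();

    static void Main(string[] args)
    {
      // Wait for the connection + gesture registration to complete rather
      // than firing-and-forgetting the async call.
      ConnectAsync().GetAwaiter().GetResult();

      Console.WriteLine("Hit return to exit...");
      Console.ReadLine();

      ServiceEndpoint.Disconnect();
      ServiceEndpoint.Dispose();
    }
    static void OnTriggered(object sender, GestureSegmentTriggeredEventArgs e)
    {
      // Triggered is likely to arrive on a thread pool thread so take a lock
      // around anything shared if this handler ever grows beyond a WriteLine.
      lock (padlock)
      {
        Console.WriteLine($"Gesture {e.GestureSegment.Name} triggered!");
      }
    }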

Other than that, I’m impressed – this was a real joy to work with and I guess the only way it could be made easier would be to allow for the visual drawing or perhaps the recording of hand gestures.

The only other thing that I noticed is that my CPU can get a bit active while using these bits and they seem to run at about 800MB of memory but then Project Prague is ‘Experimental’ right now so I’m sure that could change over time.

I’d like to also try this code on a Kinect for Windows V2 – if I do that, I’ll update this post or add another one.

Windows 10, 1607, UWP and Experimenting with the Kinect for Windows V2 Update

I was really pleased to see this blog post;

Kinect demo code and new driver for UWP now available

announcing a new driver which provides more access to the functionality of the Kinect for Windows V2 into Windows 10 including for the UWP developer.

I wrote a little about this topic in this earlier post around 10 months ago when some initial functionality became available for the UWP developer;

Kinect V2, Windows Hello and Perception APIs

and so it’s great to see that more functionality has become available and, specifically, that skeletal data is being surfaced.

I plugged my Kinect for Windows V2 into my Surface Pro 3 and had a look at the driver being used for Kinect.

image

and I attempted to do an update but no newer driver seemed to be offered, so it’s possible that the version of the driver which I have;

image

is the latest driver as it seems to be a week or two old. At the time of writing, I haven’t confirmed this driver version but I went on to download the C++ sample from GitHub;

Camera Stream Correlation Sample

and ran it up on my Surface Pro 3 where it initially displayed the output of the rear webcam;

image

and so I pressed the ‘Next Source’ button and it attempted to work with the RealSense camera on my machine;

image

and so I pressed the ‘Next Source’ button again and things seemed to hang. I’m unsure of the status of my RealSense drivers on this machine and so I disabled the RealSense virtual camera driver;

image

and then re-ran the sample and, sure enough, I could use the ‘Next Source’ button to move to the Kinect for Windows V2 sensor. I then used the ‘Toggle Depth Fading’ button to turn that option off and the ‘Toggle Skeletal Overlay’ button to switch that option on and, sure enough, I got a (flat) skeletal overlay on the colour frames with very smooth performance here;

image

and so that’s great to see working. Given that the sample seemed to be C++ code, I wondered what this might look like for a C# developer working with the UWP and so I set about seeing if I could reproduce some of the core of what the sample is doing here.

Getting Skeletal Data Into a C# UWP App

Rather than attempting to ‘port’ the C++ sample, I started by lifting pieces of the code that I’d written for that earlier blog post into a new project.

I made a blank app targeting SDK 14393, made sure that it had access to webcam and microphone and then added in win2d.uwp as a NuGet package and added a little UI;

<Page
    x:Class="KinectTestApp.MainPage"
    xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
    xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
    xmlns:local="using:KinectTestApp"
    xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
    xmlns:w2d="using:Microsoft.Graphics.Canvas.UI.Xaml"
    mc:Ignorable="d">

    <Grid Background="{ThemeResource ApplicationPageBackgroundThemeBrush}">
        <TextBlock
            FontSize="36"
            HorizontalAlignment="Center"
            VerticalAlignment="Center"
            TextAlignment="Center"
            Text="No Cameras" />
        <w2d:CanvasControl
            x:Name="canvasControl"
            Visibility="Collapsed"
            SizeChanged="OnCanvasControlSizeChanged"
            Draw="OnDraw"/>
    </Grid>
</Page>

From there, I wanted to see if I could get a basic render of the colour frame from the camera along with an overlay of some skeletal points.

I’d spotted that the official samples include a project which builds a WinRT component that is then used to interpret the custom data that comes from the Kinect via a MediaFrameReference, so I added a reference to this project to my solution so that I could use it from my C# code. That project is here and looks to stand independent of the surrounding sample. I made my project reference as below;

image

and then set about trying to see if I could write some code that got colour data and skeletal data onto the screen.

I wrote a few small, supporting classes and named them all with an mt* prefix to try and make it more obvious which code here is mine rather than from the framework or the sample. This simple class delivers a SoftwareBitmap containing the contents of the colour frame, to be fired as an event;

namespace KinectTestApp
{
  using System;
  using Windows.Graphics.Imaging;

  class mtSoftwareBitmapEventArgs : EventArgs
  {
    public SoftwareBitmap Bitmap { get; set; }
  }
}

whereas this class delivers the data that I’ve decided I need in order to draw a subset of the skeletal data onto the screen;

namespace KinectTestApp
{
  using System;

  class mtPoseTrackingFrameEventArgs : EventArgs
  {
    public mtPoseTrackingDetails[] PoseEntries { get; set; }
  }
}

and it’s a simple array which will be populated with one of these types below for each user being tracked by the sensor;

namespace KinectTestApp
{
  using System;
  using System.Linq;
  using System.Numerics;
  using Windows.Foundation;
  using Windows.Media.Devices.Core;
  using WindowsPreview.Media.Capture.Frames;

  class mtPoseTrackingDetails
  {
    public Guid EntityId { get; set; }
    public Point[] Points { get; set; }

    public static mtPoseTrackingDetails FromPoseTrackingEntity(
      PoseTrackingEntity poseTrackingEntity,
      CameraIntrinsics colorIntrinsics,
      Matrix4x4 depthColorTransform)
    {
      mtPoseTrackingDetails details = null;

      var poses = new TrackedPose[poseTrackingEntity.PosesCount];
      poseTrackingEntity.GetPoses(poses);

      var points = new Point[poses.Length];

      colorIntrinsics.ProjectManyOntoFrame(
        poses.Select(p => Multiply(depthColorTransform, p.Position)).ToArray(),
        points);

      details = new mtPoseTrackingDetails()
      {
        EntityId = poseTrackingEntity.EntityId,
        Points = points
      };
      return (details);
    }
    static Vector3 Multiply(Matrix4x4 matrix, Vector3 position)
    {
      return (new Vector3(
        position.X * matrix.M11 + position.Y * matrix.M21 + position.Z * matrix.M31 + matrix.M41,
        position.X * matrix.M12 + position.Y * matrix.M22 + position.Z * matrix.M32 + matrix.M42,
        position.X * matrix.M13 + position.Y * matrix.M23 + position.Z * matrix.M33 + matrix.M43));
    }
  }
}

which would be a simple class containing a GUID to identify the tracked person and an array of Points representing their tracked joints, except that I wanted those 2D Points to be in colour space, which means mapping them from the depth space that the sensor presents them in. So the FromPoseTrackingEntity() method takes a PoseTrackingEntity, which is one of the types from the referenced C++ project, and;

  1. Extracts the ‘poses’ (i.e. joints in my terminology)
  2. Uses the CameraIntrinsics from the colour camera to project them onto its frame having first transformed them using a matrix which maps from depth space to colour space.

Step 2 is code that I largely duplicated from the original C++ sample after trying a few other routes which didn’t end well for me Smile

I then wrote this class which wraps up a few areas;

namespace KinectTestApp
{
  using System;
  using System.Linq;
  using System.Threading.Tasks;
  using Windows.Media.Capture;
  using Windows.Media.Capture.Frames;

  class mtMediaSourceReader
  {
    public mtMediaSourceReader(
      MediaCapture capture, 
      MediaFrameSourceKind mediaSourceKind,
      Action<MediaFrameReader> onFrameArrived,
      Func<MediaFrameSource, bool> additionalSourceCriteria = null)
    {
      this.mediaCapture = capture;
      this.mediaSourceKind = mediaSourceKind;
      this.additionalSourceCriteria = additionalSourceCriteria;
      this.onFrameArrived = onFrameArrived;
    }
    public bool Initialise()
    {
      this.mediaSource = this.mediaCapture.FrameSources.FirstOrDefault(
        fs =>
          (fs.Value.Info.SourceKind == this.mediaSourceKind) &&
          ((this.additionalSourceCriteria != null) ? 
            this.additionalSourceCriteria(fs.Value) : true)).Value;   

      return (this.mediaSource != null);
    }
    public async Task OpenReaderAsync()
    {
      this.frameReader =
        await this.mediaCapture.CreateFrameReaderAsync(this.mediaSource);

      this.frameReader.FrameArrived +=
        (s, e) =>
        {
          this.onFrameArrived(s);
        };

      await this.frameReader.StartAsync();
    }
    Func<MediaFrameSource, bool> additionalSourceCriteria;
    Action<MediaFrameReader> onFrameArrived;
    MediaFrameReader frameReader;
    MediaFrameSource mediaSource;
    MediaCapture mediaCapture;
    MediaFrameSourceKind mediaSourceKind;
  }
}

This type takes a MediaCapture and a MediaFrameSourceKind and can then report, via the Initialise() method, whether that media source kind is available on that media capture. It can also apply some additional criteria if they are provided in the constructor. The class can also create a frame reader and redirect its FrameArrived events into the method provided to the constructor. There should be some way to stop this class as well but I haven’t written that yet.

With those classes in place, I added the following mtKinectColorPoseFrameHelper;

namespace KinectTestApp
{
  using System;
  using System.Collections.Generic;
  using System.Linq;
  using System.Numerics;
  using System.Threading.Tasks;
  using Windows.Media.Capture;
  using Windows.Media.Capture.Frames;
  using Windows.Media.Devices.Core;
  using Windows.Perception.Spatial;
  using WindowsPreview.Media.Capture.Frames;

  class mtKinectColorPoseFrameHelper
  {
    public event EventHandler<mtSoftwareBitmapEventArgs> ColorFrameArrived;
    public event EventHandler<mtPoseTrackingFrameEventArgs> PoseFrameArrived;

    public mtKinectColorPoseFrameHelper()
    {
      this.softwareBitmapEventArgs = new mtSoftwareBitmapEventArgs();
    }
    internal async Task<bool> InitialiseAsync()
    {
      bool necessarySourcesAvailable = false;

      // We try to find the Kinect by asking for a group that can deliver
      // color, depth, custom and infrared. 
      var allGroups = await GetGroupsSupportingSourceKindsAsync(
        MediaFrameSourceKind.Color,
        MediaFrameSourceKind.Depth,
        MediaFrameSourceKind.Custom,
        MediaFrameSourceKind.Infrared);

      // We assume the first group here is what we want which is not
      // necessarily going to be right on all systems so would need
      // more care.
      var firstSourceGroup = allGroups.FirstOrDefault();

      // Got one that supports all those types?
      if (firstSourceGroup != null)
      {
        this.mediaCapture = new MediaCapture();

        var captureSettings = new MediaCaptureInitializationSettings()
        {
          SourceGroup = firstSourceGroup,
          SharingMode = MediaCaptureSharingMode.SharedReadOnly,
          StreamingCaptureMode = StreamingCaptureMode.Video,
          MemoryPreference = MediaCaptureMemoryPreference.Cpu
        };
        await this.mediaCapture.InitializeAsync(captureSettings);

        this.mediaSourceReaders = new mtMediaSourceReader[]
        {
          new mtMediaSourceReader(this.mediaCapture, MediaFrameSourceKind.Color, this.OnFrameArrived),
          new mtMediaSourceReader(this.mediaCapture, MediaFrameSourceKind.Depth, this.OnFrameArrived),
          new mtMediaSourceReader(this.mediaCapture, MediaFrameSourceKind.Custom, this.OnFrameArrived,
            DoesCustomSourceSupportPerceptionFormat)
        };

        necessarySourcesAvailable = 
          this.mediaSourceReaders.All(reader => reader.Initialise());

        if (necessarySourcesAvailable)
        {
          foreach (var reader in this.mediaSourceReaders)
          {
            await reader.OpenReaderAsync();
          }
        }
        else
        {
          this.mediaCapture.Dispose();
        }
      }
      return (necessarySourcesAvailable);
    }
    void OnFrameArrived(MediaFrameReader sender)
    {
      var frame = sender.TryAcquireLatestFrame();

      if (frame != null)
      {
        switch (frame.SourceKind)
        {
          case MediaFrameSourceKind.Custom:
            this.ProcessCustomFrame(frame);
            break;
          case MediaFrameSourceKind.Color:
            this.ProcessColorFrame(frame);
            break;
          case MediaFrameSourceKind.Infrared:
            break;
          case MediaFrameSourceKind.Depth:
            this.ProcessDepthFrame(frame);
            break;
          default:
            break;
        }
        frame.Dispose();
      }
    }
    void ProcessDepthFrame(MediaFrameReference frame)
    {
      if (this.colorCoordinateSystem != null)
      {
        this.depthColorTransform = frame.CoordinateSystem.TryGetTransformTo(
          this.colorCoordinateSystem);
      }     
    }
    void ProcessColorFrame(MediaFrameReference frame)
    {
      if (this.colorCoordinateSystem == null)
      {
        this.colorCoordinateSystem = frame.CoordinateSystem;
        this.colorIntrinsics = frame.VideoMediaFrame.CameraIntrinsics;
      }
      this.softwareBitmapEventArgs.Bitmap = frame.VideoMediaFrame.SoftwareBitmap;
      this.ColorFrameArrived?.Invoke(this, this.softwareBitmapEventArgs);
    }
    void ProcessCustomFrame(MediaFrameReference frame)
    {
      // We also need the depth->colour transform before we can map the
      // joint positions into colour space.
      if ((this.PoseFrameArrived != null) &&
        (this.colorCoordinateSystem != null) &&
        (this.depthColorTransform != null))
      {
        var trackingFrame = PoseTrackingFrame.Create(frame);
        var eventArgs = new mtPoseTrackingFrameEventArgs();

        if (trackingFrame.Status == PoseTrackingFrameCreationStatus.Success)
        {
          // Which of the entities here are actually tracked?
          var trackedEntities =
            trackingFrame.Frame.Entities.Where(e => e.IsTracked).ToArray();

          var trackedCount = trackedEntities.Count();

          if (trackedCount > 0)
          {
            eventArgs.PoseEntries =
              trackedEntities
              .Select(entity =>
                mtPoseTrackingDetails.FromPoseTrackingEntity(entity, this.colorIntrinsics, this.depthColorTransform.Value))
              .ToArray();
          }
          this.PoseFrameArrived(this, eventArgs);
        }
      }
    }
    async static Task<IEnumerable<MediaFrameSourceGroup>> GetGroupsSupportingSourceKindsAsync(
      params MediaFrameSourceKind[] kinds)
    {
      var sourceGroups = await MediaFrameSourceGroup.FindAllAsync();

      var groups =
        sourceGroups.Where(
          group => kinds.All(
            kind => group.SourceInfos.Any(sourceInfo => sourceInfo.SourceKind == kind)));

      return (groups);
    }
    static bool DoesCustomSourceSupportPerceptionFormat(MediaFrameSource source)
    {
      return (
        (source.Info.SourceKind == MediaFrameSourceKind.Custom) &&
        (source.CurrentFormat.MajorType == PerceptionFormat) &&
        (Guid.Parse(source.CurrentFormat.Subtype) == PoseTrackingFrame.PoseTrackingSubtype));
    }
    SpatialCoordinateSystem colorCoordinateSystem;
    mtSoftwareBitmapEventArgs softwareBitmapEventArgs;
    mtMediaSourceReader[] mediaSourceReaders;
    MediaCapture mediaCapture;
    CameraIntrinsics colorIntrinsics;
    const string PerceptionFormat = "Perception";
    private Matrix4x4? depthColorTransform;
  }
}

This is essentially doing;

  1. InitialiseAsync
    1. Using the MediaFrameSourceGroup type to try and find a source group that looks like the Kinect by searching for Infrared+Color+Depth+Custom source kinds. This isn’t a complete test and it could be made more robust (there’s a sketch of one option after this list). Also, there’s an assumption that the first group found is the best, which isn’t likely to always hold true.
    2. Initialising a MediaCapture for the group found in step 1 above.
    3. Initialising three of my mtMediaSourceReader types for the Color/Depth/Custom source kinds and adding some extra criteria for the Custom source type to try and make sure that it supports the ‘Perception’ media format – this code is essentially lifted from the original sample.
    4. Opening frame readers on those three items and handling the events as frame arrives.
  2. OnFrameArrived simply passes the frame on to sub-functions based on type and this could have been done by deriving specific mtMediaSourceReaders.
  3. ProcessDepthFrame tries to get a transformation from depth space to colour space for later use.
  4. ProcessColorFrame fires the ColorFrameArrived event with the SoftwareBitmap that has been received.
  5. ProcessCustomFrame handles the custom frame by;
    1. Using the PoseTrackingFrame.Create() method from the referenced C++ project to interpret the raw data that comes from the custom sensor.
    2. Determining how many bodies are being tracked by the data.
    3. Converting the data types from the referenced C++ project to my own data types which include less of the data and which try to map the positions of joints given as 3D depth points to their respective 2D colour space points.
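
As an example of firming up that source group selection (item 1.1 above), here’s a sketch of a method that could replace the FirstOrDefault() call in InitialiseAsync. It prefers a group whose DisplayName mentions ‘Kinect’ and falls back to the first candidate otherwise; the display name check is an assumption on my part about how the sensor shows up rather than anything I’ve taken from the documentation;

    async static Task<MediaFrameSourceGroup> GetLikelyKinectGroupAsync()
    {
      // Of the groups that offer all of the source kinds we need, prefer one
      // whose display name mentions "Kinect" and fall back to the first
      // group otherwise.
      var candidates = (await GetGroupsSupportingSourceKindsAsync(
        MediaFrameSourceKind.Color,
        MediaFrameSourceKind.Depth,
        MediaFrameSourceKind.Custom,
        MediaFrameSourceKind.Infrared)).ToList();

      return (
        candidates.FirstOrDefault(
          g => g.DisplayName.IndexOf("Kinect", StringComparison.OrdinalIgnoreCase) >= 0) ??
        candidates.FirstOrDefault());
    }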

Lastly, there’s some code-behind which tries to glue this into the UI;

namespace KinectTestApp
{
  using Microsoft.Graphics.Canvas;
  using Microsoft.Graphics.Canvas.UI.Xaml;
  using System.Numerics;
  using System.Threading;
  using Windows.Foundation;
  using Windows.Graphics.Imaging;
  using Windows.UI;
  using Windows.UI.Core;
  using Windows.UI.Xaml;
  using Windows.UI.Xaml.Controls;

  public sealed partial class MainPage : Page
  {
    public MainPage()
    {
      this.InitializeComponent();
      this.Loaded += this.OnLoaded;
    }
    void OnCanvasControlSizeChanged(object sender, SizeChangedEventArgs e)
    {
      this.canvasSize = new Rect(0, 0, e.NewSize.Width, e.NewSize.Height);
    }
    async void OnLoaded(object sender, RoutedEventArgs e)
    {
      this.helper = new mtKinectColorPoseFrameHelper();

      this.helper.ColorFrameArrived += OnColorFrameArrived;
      this.helper.PoseFrameArrived += OnPoseFrameArrived;

      var supported = await this.helper.InitialiseAsync();

      if (supported)
      {
        this.canvasControl.Visibility = Visibility.Visible;
      }
    }
    void OnColorFrameArrived(object sender, mtSoftwareBitmapEventArgs e)
    {
      // Note that when this function returns to the caller, we have
      // finished with the incoming software bitmap.
      if (this.bitmapSize == null)
      {
        this.bitmapSize = new Rect(0, 0, e.Bitmap.PixelWidth, e.Bitmap.PixelHeight);
      }

      if (Interlocked.CompareExchange(ref this.isBetweenRenderingPass, 1, 0) == 0)
      {
        this.lastConvertedColorBitmap?.Dispose();

        // Sadly, the format that comes in here, isn't supported by Win2D when
        // it comes to drawing so we have to convert. The upside is that 
        // we know we can keep this bitmap around until we are done with it.
        this.lastConvertedColorBitmap = SoftwareBitmap.Convert(
          e.Bitmap,
          BitmapPixelFormat.Bgra8,
          BitmapAlphaMode.Ignore);

        // Cause the canvas control to redraw itself.
        this.InvalidateCanvasControl();
      }
    }
    void InvalidateCanvasControl()
    {
      // Fire and forget.
      this.Dispatcher.RunAsync(CoreDispatcherPriority.High, this.canvasControl.Invalidate);
    }
    void OnPoseFrameArrived(object sender, mtPoseTrackingFrameEventArgs e)
    {
      // NB: we do not invalidate the control here but, instead, just keep
      // this frame around (maybe) until the colour frame redraws which will 
      // (depending on race conditions) pick up this frame and draw it
      // too.
      this.lastPoseEventArgs = e;
    }
    void OnDraw(CanvasControl sender, CanvasDrawEventArgs args)
    {
      // Capture this here (in a race) in case it gets over-written
      // while this function is still running.
      var poseEventArgs = this.lastPoseEventArgs;

      args.DrawingSession.Clear(Colors.Black);

      // Do we have a colour frame to draw?
      if (this.lastConvertedColorBitmap != null)
      {
        using (var canvasBitmap = CanvasBitmap.CreateFromSoftwareBitmap(
          this.canvasControl,
          this.lastConvertedColorBitmap))
        {
          // Draw the colour frame
          args.DrawingSession.DrawImage(
            canvasBitmap,
            this.canvasSize,
            this.bitmapSize.Value);

          // Have we got a skeletal frame hanging around?
          if (poseEventArgs?.PoseEntries?.Length > 0)
          {
            foreach (var entry in poseEventArgs.PoseEntries)
            {
              foreach (var pose in entry.Points)
              {
                var centrePoint = ScalePosePointToDrawCanvasVector2(pose);

                args.DrawingSession.FillCircle(
                  centrePoint, circleRadius, Colors.Red);
              }
            }
          }
        }
      }
      Interlocked.Exchange(ref this.isBetweenRenderingPass, 0);
    }
    Vector2 ScalePosePointToDrawCanvasVector2(Point posePoint)
    {
      return (new Vector2(
        (float)((posePoint.X / this.bitmapSize.Value.Width) * this.canvasSize.Width),
        (float)((posePoint.Y / this.bitmapSize.Value.Height) * this.canvasSize.Height)));
    }
    Rect? bitmapSize;
    Rect canvasSize;
    int isBetweenRenderingPass;
    SoftwareBitmap lastConvertedColorBitmap;
    mtPoseTrackingFrameEventArgs lastPoseEventArgs;
    mtKinectColorPoseFrameHelper helper;
    static readonly float circleRadius = 10.0f;
  }
}

I don’t think there’s too much in there that would require explanation other than that I took a couple of arbitrary decisions;

  1. That I essentially process one colour frame at a time using a form of ‘lock’ to try and drop any colour frames that arrive while I am still in the process of drawing the last colour frame and that ‘drawing’ involves both the method OnColorFrameArrived and the async call to OnDraw it causes.
  2. That I don’t force a redraw when a ‘pose’ frame arrives. Instead, the data is held until the next OnDraw call which comes from handling the colour frames. It’s certainly possible that the various race conditions involved there might cause that frame to be dropped and another to replace it in the meantime.

Even though there are a lot of allocations going on in that code as it stands, here’s a screenshot of it running. The performance isn’t bad at all on my Surface Pro 3 and I’m particularly pleased with the red nose that I end up with here Smile

image

The code is quite rough and ready as I was learning as I went along and some next steps might be to;

  1. Draw joints that are inferred in a different colour to those that are properly tracked.
  2. Draw the skeleton rather than just the joints (there’s a rough sketch of one approach below).
  3. Do quite a lot of optimisations as the code here allocates a lot.
  4. Do more tracking around entities arriving/leaving based on their IDs and handle multiple people with different colours.
  5. Refactor to specialise the mtMediaSourceReader class to have separate types for Color/Depth/Custom and thereby tidy up the code which uses this type.

but, for now, I was just trying to get some basics working.
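
As a rough illustration of drawing the skeleton rather than just the joints, here’s a sketch that could sit in MainPage and be called from OnDraw for each tracked entry. The joint index pairs assume that the poses arrive in the standard Kinect V2 JointType order (SpineBase=0, SpineMid=1, Neck=2, Head=3 and so on), which is an assumption I haven’t verified against the custom stream;

    // Sketch: joint index pairs to join with lines. These assume the poses
    // follow the Kinect V2 JointType ordering, which I haven't verified.
    static readonly int[][] bones =
    {
      new[] { 3, 2 }, new[] { 2, 1 }, new[] { 1, 0 },   // head, neck, spine
      new[] { 2, 4 }, new[] { 4, 5 }, new[] { 5, 6 },   // left shoulder, elbow, wrist
      new[] { 2, 8 }, new[] { 8, 9 }, new[] { 9, 10 }   // right shoulder, elbow, wrist
    };

    void DrawBones(CanvasDrawEventArgs args, mtPoseTrackingDetails entry)
    {
      foreach (var bone in bones)
      {
        if ((bone[0] < entry.Points.Length) && (bone[1] < entry.Points.Length))
        {
          // Draw a line between the two scaled joint positions.
          args.DrawingSession.DrawLine(
            this.ScalePosePointToDrawCanvasVector2(entry.Points[bone[0]]),
            this.ScalePosePointToDrawCanvasVector2(entry.Points[bone[1]]),
            Colors.Yellow,
            3.0f);
        }
      }
    }

and OnDraw could then call DrawBones(args, entry) alongside (or instead of) the FillCircle loop.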

Here’s the code on GitHub if you want to try things out and note that you’d need that additional sample code from the official samples to make it work.