A Follow-On Prague Experiment with Skeletons

A developer dropped me a line having found my previous blog posts around Project Prague;

Project Prague in the Cognitive Services Labs

They’d noticed that it seemed really easy and powerful to define and monitor for gestures with Project Prague, but wanted to know where the support was for tracking lower-level data such as hand positions and movement. I have a suspicion that they are looking for something similar to what the Kinect SDK offered: out-of-the-box support for treating a user’s hand as a pointer and driving an on-screen UI with it.

As usual, I hadn’t the foggiest clue how this might be done, so I thought I’d better take a quick look and this post is the result of a few minutes spent with the APIs and the documentation.

If you haven’t seen Prague at all then I did write a couple of other posts;

Project Prague Posts

and so feel free to have a read of those if you want the background on what I’m posting here and I’ll attempt to avoid repeating what I wrote in those posts.

Project Prague and the UWP

Since I last looked at Project Prague, “significant things” have happened in that the Windows 10 Fall Creators Update has been released and, along with it, support for .NET Standard 2.0 in UWP apps which I just wrote about an hour or two ago in this post;

UWP and .NET Standard 2.0–Remembering the ‘Forgotten’ APIs

These changes mean that I now seem to be free to use Project Prague from inside a UWP app (targeting .NET Standard 2.0 on Windows 16299+). I’m unsure whether this is a supported scenario yet, or what it might mean for an app that wanted to go into the Store but, technically, it seems that I can make use of the Prague SDK from a UWP app and so that’s what I did.

Project Prague and Skeleton Tracking

I revisited the Project Prague documentation and scanned over this one page, which covers a lot of ground but mostly focuses on how to get gestures working and doesn’t drop down to the lower-level details.

However, there’s a response to a comment further down the page which does talk in terms of;

“The SDK provides both the high level abstraction of the gestures as they are described in the overview above and also the raw skeleton we produce. The skeleton we produce is ‘light-weight’ namely it exposes the palm & fingertips’ locations and directions vectors (palm also has an orientation vector).

In the slingshot example above, you would want to register to the skeleton event once the slingshot gesture reaches the Pinch state and then track the motion instead of simply expecting a (non negligible) motion backwards as defined above.

Depending on your needs, you could either user the simplistic gesture-states-only approach or weave in the use of raw skeleton stream.

We will followup soon with a code sample in https://aka.ms/gestures/samples that will show how to utilize the skeleton stream”

and that led me back to the sample;

3D Camera Sample

which essentially uses gestures as a start/stop mechanism and, in between, makes use of the API;

GesturesServiceEndpoint.RegisterToSkeleton

in order to get raw hand-tracking data, including the positions of the palm and digits, so it felt like this was the API I might want to take a look at – it seemed that this might be the key to the question I’d been asked.
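To make that concrete, here’s a rough sketch of that pattern based only on the API names mentioned in this post rather than on the sample’s actual code – treat the shape of the calls (and the idea of registering the skeleton handler from inside the gesture’s Triggered handler) as my assumption rather than the sample’s implementation;

using System;
using System.Threading.Tasks;
using Microsoft.Gestures;
using Microsoft.Gestures.Endpoint;

static class SkeletonOnGestureSketch
{
    // Sketch only: connect to the gestures service, register a 'start' gesture
    // and, once that gesture completes, ask for the raw hand skeleton stream.
    // The definition of 'startGesture' is omitted here - any Gesture would do.
    public static async Task RunAsync(Gesture startGesture)
    {
        var endpoint = GesturesServiceEndpointFactory.Create();
        await endpoint.ConnectAsync();

        startGesture.Triggered += async (s, e) =>
        {
            // The 'start' gesture has completed so switch over to raw skeleton
            // data (a real app would guard against registering more than once).
            await endpoint.RegisterToSkeleton(OnSkeletonDataReceived);
        };

        await endpoint.RegisterGesture(startGesture, true);
    }
    static void OnSkeletonDataReceived(object sender, HandSkeletonsReadyEventArgs e)
    {
        // e.HandSkeletons carries the palm/fingertip positions and direction
        // vectors described in the documentation comment quoted above.
    }
}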

Alongside discovering this API, I also had a look through the following document, which is targeted at Unity but generally useful;

“3D Object Manipulation”

because it talks about the co-ordinate system in which the SDK reports positions, directions and so on, and also the units;

“The hand-skeleton is provided in units of millimeters, in the following left-handed coordinate system”

although what wasn’t clear to me from the docs was whether I had to think in terms of different distance ranges for the different cameras that the SDK supports. I was using a RealSense SR300 as it is easier to plug in than a Kinect, and one of my outstanding questions remains what sort of range of motion in the horizontal and vertical planes I should expect the SDK to be able to track for that camera.

Regardless, I set about trying to put together a simple UWP app that let me move something around on the screen using my hand and the Prague SDK.

Experimenting in a UWP App

I made a new UWP project (targeting 16299) and I referenced the Prague SDK assemblies (see previous post for details of where to find them);

image

and then added a small piece of XAML UI with a green dot which I want to move around purely by dragging my index finger in front of the screen;

<Page
    x:Class="App2.MainPage"
    xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
    xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
    xmlns:local="using:App2"
    xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
    mc:Ignorable="d">

    <Grid>
        <Canvas HorizontalAlignment="Stretch" VerticalAlignment="Stretch" Background="{ThemeResource ApplicationPageBackgroundThemeBrush}" SizeChanged="CanvasSizeChanged">
            <Ellipse Width="10" Height="10" Fill="Green" x:Name="marker" Visibility="Collapsed"/>
        </Canvas>
        <TextBlock FontSize="24" x:Name="txtDebug" HorizontalAlignment="Left" VerticalAlignment="Bottom"/>
    </Grid>
</Page>

With that in place, I added some code behind which attempts to permanently track the user’s right hand and link its movement to the green dot. The code’s fairly self-explanatory, I think, with the exception that I limited the hand range to –200mm to +200mm on the X axis and –90mm to +90mm on the Y axis based on experimentation; I’m unsure whether this is “right” or not at the time of writing. I did experiment with normalising the vectors and trying to use those to drive my UI, but that didn’t work out well for me as I never seemed to be able to get more than around +/- 0.7 units along the X or Y axis.

using Microsoft.Gestures;
using Microsoft.Gestures.Endpoint;
using Microsoft.Gestures.Samples.Camera3D;
using System;
using System.Linq;
using Windows.Foundation;
using Windows.UI.Core;
using Windows.UI.Xaml;
using Windows.UI.Xaml.Controls;

namespace App2
{
    public sealed partial class MainPage : Page
    {
        public MainPage()
        {
            this.InitializeComponent();
            this.Loaded += OnLoaded;
        }
        async void OnLoaded(object sender, RoutedEventArgs e)
        {
            this.gestureService = GesturesServiceEndpointFactory.Create();
            await this.gestureService.ConnectAsync();

            this.smoother = new IndexSmoother();
            this.smoother.SmoothedPositionChanged += OnSmoothedPositionChanged;

            await this.gestureService.RegisterToSkeleton(this.OnSkeletonDataReceived);
        }
        void CanvasSizeChanged(object sender, SizeChangedEventArgs e)
        {
            this.canvasSize = e.NewSize;
        }
        void OnSkeletonDataReceived(object sender, HandSkeletonsReadyEventArgs e)
        {
            var right = e.HandSkeletons.FirstOrDefault(h => h.Handedness == Hand.RightHand);

            if (right != null)
            {
                this.smoother.Smooth(right);
            }
        }
        async void OnSmoothedPositionChanged(object sender, SmoothedPositionChangeEventArgs e)
        {
            // AFAIK, the positions here are defined in terms of millimetres and range
            // -ve to +ve with 0 at the centre.

            // I'm unsure what range the different cameras have in terms of X,Y,Z and
            // so I've made up my own range which is X from -200 to 200 and Y from
            // -90 to 90 and that seems to let me get "full scale" on my hand 
            // movements.

            // I'm sure there's a better way. X is also reversed for my needs so I
            // went with a * -1.

            var xPos = Math.Clamp(e.SmoothedPosition.X * - 1.0, 0 - XRANGE, XRANGE);
            var yPos = Math.Clamp(e.SmoothedPosition.Y, 0 - YRANGE, YRANGE);
            xPos = (xPos + XRANGE) / (2.0d * XRANGE);
            yPos = (yPos + YRANGE) / (2.0d * YRANGE);

            await this.Dispatcher.RunAsync(
                CoreDispatcherPriority.Normal,
                () =>
                {
                    this.marker.Visibility = Visibility.Visible;

                    var left = (xPos * this.canvasSize.Width);
                    var top = (yPos * this.canvasSize.Height);

                    Canvas.SetLeft(this.marker, left - (this.marker.Width / 2.0));
                    Canvas.SetTop(this.marker, top - (this.marker.Height / 2.0));
                    this.txtDebug.Text = $"{left:N1},{top:N1}";
                }

            );
        }
        static readonly double XRANGE = 200;
        static readonly double YRANGE = 90;
        Size canvasSize;
        GesturesServiceEndpoint gestureService;
        IndexSmoother smoother;
    }
}

As part of writing that code, I modified the PalmSmoother class provided with the 3D sample into an IndexSmoother class, which essentially performs the same job but on a different piece of data and with some different parameters. It looks like a place where something like the Reactive Extensions might be a better fit than writing these custom classes, but I went with it for speed/ease.
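I haven’t included that class here but, in case it’s useful, the basic idea boils down to an exponential moving average over the index fingertip position. Below is a minimal sketch along those lines – the class and member names are my own placeholders rather than the sample’s, and the real PalmSmoother/IndexSmoother may well do something more sophisticated;

using System;

namespace App2
{
    // Hypothetical event args carrying a smoothed position; the sample's
    // SmoothedPositionChangeEventArgs plays a similar role.
    public sealed class SmoothedPositionEventArgs : EventArgs
    {
        public SmoothedPositionEventArgs(double x, double y, double z)
        {
            this.X = x;
            this.Y = y;
            this.Z = z;
        }
        public double X { get; }
        public double Y { get; }
        public double Z { get; }
    }

    // A minimal exponential-moving-average smoother. The caller feeds it the
    // index fingertip position (in millimetres) from each skeleton frame and
    // reacts to the SmoothedPositionChanged event.
    public sealed class SimpleIndexSmoother
    {
        public SimpleIndexSmoother(double smoothingFactor = 0.3)
        {
            this.alpha = smoothingFactor;
        }
        public event EventHandler<SmoothedPositionEventArgs> SmoothedPositionChanged;

        public void Smooth(double x, double y, double z)
        {
            if (!this.hasValue)
            {
                // First sample - nothing to smooth against yet.
                this.sx = x;
                this.sy = y;
                this.sz = z;
                this.hasValue = true;
            }
            else
            {
                // smoothed = alpha * sample + (1 - alpha) * previous
                this.sx = (this.alpha * x) + ((1.0 - this.alpha) * this.sx);
                this.sy = (this.alpha * y) + ((1.0 - this.alpha) * this.sy);
                this.sz = (this.alpha * z) + ((1.0 - this.alpha) * this.sz);
            }
            this.SmoothedPositionChanged?.Invoke(
                this, new SmoothedPositionEventArgs(this.sx, this.sy, this.sz));
        }
        readonly double alpha;
        bool hasValue;
        double sx, sy, sz;
    }
}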

Wrapping Up

This was just a quick experiment but I learned something from it. The code’s here if it’s of use to anyone else glancing at Project Prague and, as always, feel free to feed back if I’ve messed anything up – I’m very new to using Project Prague.

Hands, Gestures and Popping back to ‘Prague’

Just a short post to follow up on this previous post;

Hands, Gestures and a Quick Trip to ‘Prague’

I said that if I ‘found time’ then I’d revisit that post and that code and see if I could make it work with the Kinect for Windows V2 sensor rather than with the Intel RealSense SR300 which I used in that post.

In all honesty, I haven’t ‘found time’ but I’m revisiting it anyway :)

I dug my Kinect for Windows V2 and all of its lengthy cabling out of the drawer, plugged it into my Surface Book and … it didn’t work. Instead, I got the flashing white light which usually indicates that things aren’t going so well.

Not to be deterred, I did some deep, internal Microsoft research (ok, I searched the web) :) and came up with this;

Kinect Sensor is not recognized on a Surface Book

and getting rid of the text value within that registry key sorted out that problem and let me confirm that my Kinect for Windows V2 was working, in the sense that the configuration verifier says;

image

which, after many years of experience, I have learned to interpret as “Give it a try!” ;) I tried out a couple of the SDK samples and they worked fine for me, so I reckoned I was in a good place to get started.

However, the Project Prague bits were not so happy and I found they were logging a bunch of errors in the ‘Geek View’ about not being able to connect to or initialise either the SR300 or the Kinect camera.

This seemed to be resolved by updating my Kinect drivers – I did an automatic update and Windows found new drivers online, which took me to this version;

image

which surprised me as it’s quite old and I’d have expected to have it already, but it seemed to make the Project Prague pieces happy and the Geek View was back in business showing output from the Kinect;

image

and from the little display window on the left there, it felt like this operated at a range of approximately 0.5m to 1.0m. I wondered whether I could move further away but that didn’t seem to work in the quick experiment that I tried.

The big question for me then was whether the code that I’d previously written and run against the SR300 would “just work” on the Kinect for Windows V2 and, of course, it does :) Revisit the previous post for the source code if you’re interested; I found my “counting on four fingers” gesture was recognised quickly and reliably here;

image

This is very cool – it’d be interesting to know exactly what ‘Prague’ relies on from the perspective of the camera and also in terms of system requirements (CPU, RAM, GPU, etc.) in order to make this work, but it looks like they’ve got a very decent system going for recognising hand gestures across different cameras.

Hands, Gestures and a Quick Trip to ‘Prague’

Sorry for the title – I couldn’t resist and, no, I’ve not switched to writing a travel blog just yet, although I’ll keep the idea in my back pocket for the time when the current ‘career’ hits the ever-looming buffers ;)

But, no, this post is about ‘Project Prague’ and hand gestures and I’ve written quite a bit in the past about natural gesture recognition with technologies like the Kinect for Windows V2 and with the RealSense F200 and SR300 cameras.

Kinect has great capabilities for colour, depth and infra-red imaging, and a smart (i.e. cloud-trained AI) runtime which can bring all those streams together and give you (human) skeletal tracking of 25 joints on 6 bodies at 30 frames per second. It can also do some facial tracking and has an AI-based gesture recognition system which can be trained to recognise human-body-based gestures like “hands above head” or “golf swing” and so on.

That camera has a range of approximately 0.5m to 4.5m and, perhaps because of this long range, it does not have a great deal of support for hand-based gestures – it can report some hand joints and a few hand states like open/closed, but it doesn’t go much beyond that.

I’ve also written about the RealSense F200 and SR300 cameras, although I never had a lot of success with the SR300. Those cameras have a much shorter range (< 1m) than the Kinect for Windows V2 but have/had some different capabilities in that they surfaced functionality like;

  • Detailed facial detection providing feature positions etc and facial recognition.
  • Emotion detection providing states like ‘happy’, ‘sad’ etc (although this got removed from the original SDK at a later point)
  • Hand tracking features
    • The SDK has great support for tracking of hands down to the joint level with > 20 joints reported by the SDK
    • The SDK also has support for hand-based gestures such as “V sign”, “full pinch” etc.

With any of these cameras and their SDKs, the processing happens locally on the (high-bandwidth) data at frame rates of 15/30/60 FPS, so it’s quite different to scenarios where you selectively capture data and send it to the cloud for processing as you might with the Cognitive Services – but both approaches have their benefits and are open to being used in combination.

In terms of this functionality around hand tracking and gestures, I bundled some of what I knew into a video last year and published it to Channel 9, although it’s probably quite a bit out of date at this point;

image

but it’s been a topic that has interested me for a long time and so, when I saw ‘Project Prague’ announced a few weeks ago, I naturally wanted to take a look.

My first question on ‘Prague’ was whether it would make use of a local-processing or a cloud-based-processing model and, if the former, whether it would require a depth camera or would be based purely on a web cam.

It turns out that ‘Prague’ processes data locally and does require either a Kinect for Windows V2 camera or a RealSense SR300 camera, with the recommendation on the website being to use the SR300.

I dug my Intel RealSense SR300 out of the drawer where it’s been living for a few months, plugged it in to my Surface Book and set about seeing whether I could get a ‘Prague’ demo up and running on it.

Plugging in the SR300

I hadn’t plugged the SR300 into my Surface Book since I reinstalled Windows, so I wondered how the experience had progressed since the early days of the camera and since Windows moved to the Creators Update (I’m running 15063.447).

I hadn’t installed the RealSense SDK onto this machine but Windows seemed to recognise the device and install it regardless. I did find that the initial install left some “warning triangles” in Device Manager that had to be resolved by a manual “Scan for hardware changes” from the Device Manager menu, but then things seemed to sort themselves out and Device Manager showed;

image

which the modern devices app shows as;

image

and that seemed reasonable. I didn’t have to visit the troubleshooting page although, based on my previous experience with the SR300, I wasn’t surprised to see that it existed; instead, I went off to download ‘Project Prague’.

Installing ‘Prague’

Nothing much to report here – there’s an MSI that you download and run;

image

and “It Just Worked” so nothing to say about that.

Once installation had completed, as per the docs, the “Microsoft Gestures Service” app ran up and I tried to do as the documentation advised and make sure that the app was recognising my hand – it didn’t seem to be working, as below;

image

but then I tried with my right hand and things seemed to be working better;

image

This is actually the window view (called the ‘Geek View’!) of a system tray application (the “gestures service”) which doesn’t seem to be a true service in the NT sense but instead seems to be a regular app configured to run at startup on the system;

image

so, much like the Kinect Runtime, it seems that this is the code which sits and watches frames from the camera, with applications becoming “clients” of the service. The “DiscoveryClient”, which is also highlighted in the screenshot as being configured to run at startup, is one such demo app – it picks up gestures from the service and (according to the docs) routes them through to the shell.

Here’s the system tray application;

image

and if I perform the “bloom” gesture (familiar from Windows Mixed Reality) then the system tray app pops up;

image

and tells me that there are other gestures already active to open the start menu and toggle the volume. The gestures animate on mouse over to show how to execute them and I had no problem with using the gesture to toggle the volume on my machine but I did struggle a little with the gesture to open the start menu.

The ‘timeline’ view in the ‘Geek View’ here is interesting because it shows gestures being detected (or not) in real time. You can perhaps see on the timeline below how I’m struggling to execute the ‘Shell_Start’ gesture and it’s getting recognised as a ‘Discovery_Tray’ gesture instead. In that screenshot, the white blob indicates a “pose” whereas the green blobs represent completed “gestures”.

image

There’s also a ‘settings’ section here which shows me;

image

and then on the GestPacks section;

image

suggests that the service has integration for various apps. At the time of writing, the “get more online” option didn’t seem to link to anything that I could spot but, by running PowerPoint, I noticed that the service monitors which app is in the foreground and switches its gesture list to match that app.

So, when running PowerPoint, the gesture service shows;

image

and those gestures worked very well for me in PowerPoint – it was easy to start a slideshow and then advance the slides by just tapping through in the air with my finger. These details can also be seen in the settings app;

image

which suggests that these gestures are contextual within the app – for example the “Rotate Right 90” option doesn’t show up until I select an object in PowerPoint;

image

and I can see this dynamically changing in the ‘Geek View’ – here’s the view when no object is selected;

image

and I can see that there are perhaps 3 gestures registered whereas if I select an object in PowerPoint then I see;

image

and those gestures worked pretty well for me :)

Other Demo Applications

I experimented with the ‘Camera Viewer’ app, which works really well. Once again, from the ‘Geek View’ I can see that this app has registered some gestures and you can perhaps see below that I am trying out the ‘peace’ gesture – the ‘Geek View’ shows that it is registered and has completed, and the app displays some nice doves to show it’s seen the gesture;

image

One other interesting aspect of this app is that it displays a ‘Connecting to Gesture Service’ message as you bring it back into focus, suggesting that there’s some sort of ‘connection’ to the gestures service that comes and goes over time.

These gestures worked really well for me and, by this point, I was wondering how these gesture apps plug into the architecture here and how they are implemented, so I wanted to see if I could write some code. I did notice that the GestPacks seem to live in a folder under the ‘Prague’ installation;

image

and a quick look at one of the DLLs (e.g. PowerPoint) shows that this is .NET code interop’ing into PowerPoint as you’d expect, although the naming suggests there’s some ATL-based code in the chain here somewhere;

image

Coding ‘Prague’

The API docs link leads over to this web page, which points to a Microsoft.Gestures namespace that appears to be part of .NET Core 2.0. That suggests that (right now) you’re not going to be able to reference this from a Universal Windows App project, but you can reference it from a .NET Framework project and so I just referenced it from a command-line project targeting .NET Framework 4.6.2.

The assemblies seem to live in the equivalent of;

“C:\Users\mtaulty\AppData\Roaming\Microsoft\Prague\PragueVersions\LatestVersion\SDK”

and I added a reference to 3 of them;

image

It’s also worth noting that there are a number of code samples over in this GitHub repository;

https://github.com/Microsoft/Gestures-Samples

although, at the time of writing, I haven’t really referred to those too much as I was trying to see what the experience was like ‘starting from scratch’. To that end, I had a quick look at what seemed to be the main assembly in the object browser;

image

and the structure seemed to suggest that the library uses TCP sockets as an ‘RPC’ mechanism to communicate between an app and the gestures service; a quick look at the gestures service process with Process Explorer did show that it was listening for traffic;

image

So, how to get a connection? It seems fairly easy in that the docs point you to the GesturesServiceEndpoint class and there’s a GesturesServiceEndpointFactory to make those, and then IntelliSense popped up as below to reinforce the idea that there is some socket-based comms going on here;

image

From there, I wanted to define my own gesture which would allow the user to start with an open, spread hand and then tap their thumb onto each of their four fingers in sequence – which seemed to consist of 5 stages. I read the docs around how gestures, poses and motion work and added some code to my console application to see if I could code up this gesture;

namespace ConsoleApp1
{
  using Microsoft.Gestures;
  using Microsoft.Gestures.Endpoint;
  using System;
  using System.Collections.Generic;
  using System.Threading.Tasks;

  class Program
  {
    static void Main(string[] args)
    {
      ConnectAsync();

      Console.WriteLine("Hit return to exit...");

      Console.ReadLine();

      ServiceEndpoint.Disconnect();
      ServiceEndpoint.Dispose();
    }
    static async Task ConnectAsync()
    {
      Console.WriteLine("Connecting...");

      try
      {
        var connected = await ServiceEndpoint.ConnectAsync();

        if (!connected)
        {
          Console.WriteLine("Failed to connect...");
        }
        else
        {
          await serviceEndpoint.RegisterGesture(CountGesture, true);
        }
      }
      catch
      {
        Console.WriteLine("Exception thrown in starting up...");
      }
    }
    static void OnTriggered(object sender, GestureSegmentTriggeredEventArgs e)
    {
      Console.WriteLine($"Gesture {e.GestureSegment.Name} triggered!");
    }
    static GesturesServiceEndpoint ServiceEndpoint
    {
      get
      {
        if (serviceEndpoint == null)
        {
          serviceEndpoint = GesturesServiceEndpointFactory.Create();
        }
        return (serviceEndpoint);
      }
    }
    static Gesture CountGesture
    {
      get
      {
        if (countGesture == null)
        {
          var poses = new List<HandPose>();

          var allFingersContext = new AllFingersContext();

          // Hand starts upright, forward and with fingers spread...
          var startPose = new HandPose(
            "start",
            new FingerPose(
              allFingersContext, FingerFlexion.Open),
            new FingertipDistanceRelation(
              allFingersContext, RelativeDistance.NotTouching));

          poses.Add(startPose);

          foreach (Finger finger in
            new[] { Finger.Index, Finger.Middle, Finger.Ring, Finger.Pinky })
          {
            poses.Add(
              new HandPose(
              $"pose{finger}",
              new FingertipDistanceRelation(
                Finger.Thumb, RelativeDistance.Touching, finger)));
          }
          countGesture = new Gesture("count", poses.ToArray());
          countGesture.Triggered += OnTriggered;
        }
        return (countGesture);
      }
    }
    static Gesture countGesture;
    static GesturesServiceEndpoint serviceEndpoint;
  }
}

I’m very unsure as to whether my code specifies my gesture ‘completely’ or ‘accurately’, but what amazed me is that I really only took one stab at it and it ‘worked’.

That is, I can run my app and see my gesture being built up from its 5 constituent poses in the ‘Geek View’ and then my console app has its event triggered and displays the right output;

image

What I’d flag about that code is that it’s a bit careless in its use of async/await in a console app – it’s likely that thread pool threads are being used to dispatch all the “completions”, which means that lots of threads are potentially running through this code and interacting with objects which may or may not have thread affinity. I’ve not done anything to mitigate that here.
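For what it’s worth, swapping the Main and OnTriggered methods in the listing above for something like the following would at least make the startup deterministic and serialise the work done in the callback – this is just a sketch of one option, not a recommendation;

    // One option: block on the connection in Main (or use an async Main with
    // C# 7.1+) and take a lock in the callback, since those callbacks arrive
    // on thread pool threads and may touch shared state.
    static readonly object sync = new object();

    static void Main(string[] args)
    {
      ConnectAsync().GetAwaiter().GetResult();

      Console.WriteLine("Hit return to exit...");
      Console.ReadLine();

      ServiceEndpoint.Disconnect();
      ServiceEndpoint.Dispose();
    }
    static void OnTriggered(object sender, GestureSegmentTriggeredEventArgs e)
    {
      lock (sync)
      {
        Console.WriteLine($"Gesture {e.GestureSegment.Name} triggered!");
      }
    }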

Other than that, I’m impressed – this was a real joy to work with and I guess the only way it could be made easier would be to allow for the visual drawing or perhaps the recording of hand gestures.

The only other thing that I noticed is that my CPU can get a bit active while using these bits and they seem to use about 800MB of memory but, then, Project Prague is ‘Experimental’ right now so I’m sure that could change over time.

I’d like to also try this code on a Kinect for Windows V2 – if I do that, I’ll update this post or add another one.