Kinect for Windows V2 SDK: Hello ‘Custom Gesture’ World

Continuing this series of posts exploring the Kinect for Windows V2 SDK, I have watched the video on Channel 9 which talks about custom gesture recognition;

and this is another great source of info;

and I’ve also seen my colleague Pete demonstrating how custom gesture recognition works but, like anything, I wanted to try it out for myself.

As you’d expect, based on the skeletal tracking data that comes back from the sensor a developer can do some gesture recognition themselves. I even managed to do a little of it in this post;

Kinect for Windows SDK V2: Mixing with some Reactive Extensions

where I tried to recognise a simple “clap” gesture by programmatically monitoring the distance between the hands as it moves from some non-zero value towards zero and then away from zero again, all within some time period.
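To make that hand-coded approach concrete, here is a minimal sketch of the idea in Python rather than the code from that earlier post. The class name, the thresholds (in metres) and the one second window are all assumptions chosen for illustration, and the hand positions are taken to be plain (x, y, z) tuples rather than anything from the Kinect SDK.

```python
import math
from dataclasses import dataclass, field
from typing import Tuple

Point = Tuple[float, float, float]


def hand_distance(left: Point, right: Point) -> float:
    """Straight-line distance between two hand positions."""
    return math.dist(left, right)


@dataclass
class ClapDetector:
    """Looks for hands apart -> together -> apart within a time window.

    All thresholds are illustrative guesses, not values from the SDK.
    """
    apart_m: float = 0.30      # hands count as "apart" at or beyond this
    together_m: float = 0.08   # hands count as "together" at or below this
    window_s: float = 1.0      # max time for each half of the clap
    _state: str = field(default="idle", init=False)
    _last_apart: float = field(default=0.0, init=False)

    def update(self, t: float, distance: float) -> bool:
        """Feed one sample (seconds, metres); True when a clap completes."""
        if distance >= self.apart_m:
            # Hands are apart: a clap completes if they were together
            # and got apart again quickly enough.
            clapped = (self._state == "together"
                       and t - self._last_apart <= self.window_s)
            self._state = "apart"
            self._last_apart = t
            return clapped
        if distance <= self.together_m and self._state == "apart":
            if t - self._last_apart <= self.window_s:
                self._state = "together"
                self._last_apart = t   # restart the clock for the way out
            else:
                self._state = "idle"   # closed too slowly, start over
        return False
```

Even for something as simple as a clap there are already three tunable numbers and a small state machine, which is the complexity that the recorded-data approach is meant to avoid.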

That’s a valid thing to do but it involves writing custom code and that code may well become quite complicated for a gesture that’s not so easy to describe in such simple terms.

So, I figured I’d try out the other approach described in the video referenced above, which is to use the Kinect Studio tool to record sensor output while a user performs a gesture and then feed that data into the Gesture Builder tool, which applies machine learning techniques to figure out what does/doesn’t define a gesture.

The first thing to do is to come up with a suitable gesture. What I wanted to try out was a simple left/right swipe. That’s not a brilliant example because it’s too easy for a user to casually swipe with their hands while talking or doing some other activity, so I thought I’d try and refine it a little by having a gesture where the user;

  • raises up their left hand in front of their body
  • swipes their right hand away from their left hand horizontally across the body

this would be a “forward swipe” gesture and I’d perhaps also define a “reverse swipe” gesture that goes in the other direction. It’s perhaps easier to illustrate with a video;

At the time of writing, I only have myself to provide training data whereas it’d more than likely be beneficial to have a bunch of users that I can ask to try out these gestures so that the algorithms get more variety of data to work on as input.

Regardless, I switched Kinect Studio over into “record” mode and made sure that I was recording the body frame, depth, IR and sensor telemetry sources;


and then I recorded myself performing this gesture over and over again – in total, around 10-15 repetitions of the gesture. Each time, I started with hands-by-side and then raised up my left hand, closed my fist and then brought up my right hand and did a swipe away from the left hand across the body.

In order to see if I could get this recognised automatically, I ran up the “Gesture Builder” app that ships as part of the SDK. In all honesty, it feels like the UI on this could use a little bit of love but it does have “PREVIEW” very clearly stamped across it and it’s doing some pretty complex stuff so I’m more than happy to give it the benefit of the doubt;


I made a new solution and added a new project to that solution using the wizard and tried as best as I could to answer the questions it asked sensibly;








and, after the questions, the Wizard wanted to make me 2 gestures but I only let it make one – the one which represented the “progress” version of the gesture.

There are two detection technologies in play here;

Detection Technologies

with one being a binary “Yes/No – the user is currently performing the gesture” which I guess would work for something like “the user is standing on one leg” or “the user has one hand above their head” and so on.

In my case, I wanted more of a “progressive” gesture and so that’s the road that I went down here and just asked the tool to make me that version of the gesture and the tool then emitted 2 projects for me into the solution.

Those projects are;


where the red project is the training project, in which I get to teach the algorithm how to recognise the gesture using as many recorded clips as I have available to train with, and the green project is an analysis project, in which I can run the data generated by training against some clips and see how well/badly the recognition works.

I added my recorded clip to the training project as you can see above. At this point, I should note that my clip is over 2GB in size even though it was a fairly short recording – watch out for those file sizes, these files get big.

The next step is to use the UI to try and train the algorithm;


In the image above;

  • the green arrow indicates the clip I’m using for training. I only have 1 here so that’s fairly simple.
  • the red arrow indicates the position of the play head in the data streams
  • the blue arrows are pointing to the places in the video where I am telling the algorithm “we are not doing the gesture here”
  • the black arrows are pointing to the places in the video where I am telling the algorithm “this is the gesture completed”

and the ramp below, with all the arrows on it repeatedly running from 0 to 1.0, shows the progress of the gesture across the recorded video/IR/body data.

It’s relatively easy to do this markup – the tool allows you to use the right cursor key to slide along the recorded data frames, hitting the SPACE bar for “not doing the gesture” and the ENTER key for “this is the end of the gesture” and, in that way, you mark up the frames.
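My reading of that ramp is that every frame between a “not doing the gesture” mark (0) and a “gesture completed” mark (1.0) gets a progress value somewhere in between. The sketch below shows that idea with simple linear interpolation. That is an assumption on my part about what the ramp represents rather than a description of the tool’s internals, and the function name and frame numbers are made up for illustration.

```python
import bisect
from typing import Dict, List


def progress_labels(marks: Dict[int, float], frame_count: int) -> List[float]:
    """Give every frame a progress value by interpolating between marks.

    `marks` maps a frame index to 0.0 (SPACE, not doing the gesture) or
    1.0 (ENTER, gesture completed). Frames before the first mark or after
    the last mark simply take the value of the nearest mark.
    """
    if not marks:
        raise ValueError("at least one marked frame is needed")
    keys = sorted(marks)
    labels = []
    for frame in range(frame_count):
        if frame <= keys[0]:
            labels.append(marks[keys[0]])
        elif frame >= keys[-1]:
            labels.append(marks[keys[-1]])
        else:
            # Find the marks on either side of this frame.
            i = bisect.bisect_right(keys, frame)
            lo, hi = keys[i - 1], keys[i]
            fraction = (frame - lo) / (hi - lo)
            labels.append(marks[lo] + fraction * (marks[hi] - marks[lo]))
    return labels


# Hands by side at frame 0, swipe completed at frame 4, back at rest by frame 6.
ramp = progress_labels({0: 0.0, 4: 1.0, 6: 0.0}, frame_count=8)
```

With those marks, `ramp` climbs steadily from 0.0 at frame 0 to 1.0 at frame 4 and then falls back to 0.0, which is the repeating shape that the markup produces for each repetition of the gesture.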

With that done on my single recording, I went ahead and told the tool to “BUILD”;


and it whirs away for a while and emits some output;


and out pops a .gba file containing the results. Armed with that, I can do one of 3 things;

  • Try it out in the live previewer tool.
  • Test it out using the analysis project in the gesture builder tool here.
  • Load it up with some code and see how that works out.

The Gesture Builder Tool – Analysing

Having used some clips (1 in my case) to train the algorithm, I can now use some clips to test the algorithm.

In my case, I’m going to use the same clip as I used to train the algorithm so I’d expect a fairly good correlation between the training data and the analysed data.

I can load up the clip in the analysis project that the wizard generated for me and then click Analyze;


and then I can have a look at this analysis, as shown below;


You can perhaps see that I’ve analysed this file twice (red arrows) and at the particular point in the timeline (bottom green arrow) the algorithm was pretty confident that I was around 67% of my way through the gesture.

Naturally, I’d need to get a bunch of other recordings of other people performing the gesture to analyse against to know if my training was up to scratch and, if not, I’d probably need to do some more training against those recordings.
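One rough way to put a number on “up to scratch” would be to compare the progress that I marked up by hand with the progress that the trained gesture reports for each frame of a clip. The sketch below does that with a mean absolute error per clip. It assumes that the two lists of per-frame values have somehow been obtained for each clip, and the function names, clip names and numbers are all invented for illustration.

```python
from typing import Dict, Sequence, Tuple


def mean_abs_error(labelled: Sequence[float], predicted: Sequence[float]) -> float:
    """Average per-frame gap between hand-marked and reported progress."""
    if not labelled or len(labelled) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(a - b) for a, b in zip(labelled, predicted))
    return total / len(labelled)


def score_clips(
    clips: Dict[str, Tuple[Sequence[float], Sequence[float]]]
) -> Dict[str, float]:
    """Score each clip by name; lower means closer agreement."""
    return {name: mean_abs_error(lab, pred) for name, (lab, pred) in clips.items()}


# Hypothetical per-frame values for two clips (labelled, predicted).
scores = score_clips({
    "me": ([0.0, 0.5, 1.0, 0.0], [0.0, 0.5, 1.0, 0.0]),
    "someone_else": ([0.0, 0.5, 1.0, 0.0], [0.0, 0.75, 0.75, 0.0]),
})
```

A clip used for training would be expected to score close to zero, so the interesting numbers are the ones for recordings that the algorithm has never seen.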

But…for now, the algorithm seems to have understood what I taught it.

The Live Preview Tool

The other way in which I can try out my training data is to run the Live Preview tool which can be quick-launched from the right-click menu on either of the 2 projects in my solution.

I then feed that tool the .gba file that was emitted as part of my “Building” the training project and it opens up this view below (in a strangely non-resizable window).


This view is running code which has loaded up my .gba file and is trying to show in real time whether my gesture is being executed and, if so, at what percentage of completion. It’s worth saying that the tool has 2 modes, with the other mode showing the binary nature of a gesture which doesn’t have progress but is more of an On/Off decision.

Here’s a little screen recording of my trying to put this training data through its paces;

It’s surprisingly good given how little I’ve tried to train it but you’ll notice that there are definitely some “similar” gestures that I perform here which get a positive response when they probably shouldn’t.

However, more training would no doubt help out on that.

Detecting Gestures from an App

The next step would be for me to be able to write my own app code which can pick up some of this trained data, load up some or all of the gestures within that data and then respond as/when a user is performing one of those gestures.

This post is getting long though so I’ll put that into a follow-on post…