Kinect for Windows V2 SDK: Hello ‘Custom Gesture’ World Part 3

After a brief hiatus, I wanted to follow up on that previous post;

Kinect for Windows V2 SDK- Hello ‘Custom Gesture’ World Part 2

which was itself a follow up from an earlier post;

Kinect for Windows V2 SDK- Hello ‘Custom Gesture’ World

with a post that tried to make use of the basic, custom gesture that I’d created.

Mainly because I have the code around, I thought I’d go back to one of my favourite topics, which is searching for photos on flickR. I attempted to make a little WPF application which uses the gesture that I built in those previous posts for moving “forwards” through a potentially large set of photos.

Here’s a video of the WPF application running on my desktop – I included a KinectUserViewer so that you could see my attempts at the gesture and the sort of success rate I am (or am not) getting. Please remember that I’ve done very little to teach the algorithm about this gesture.

In this case, the application is hard-coded to search for pictures using the term “beach”. I think I may be pining for a holiday :-) At the start of the video, I unplug the Kinect sensor and plug it back in just to show that the code works ok.

Given how little training I did on that gesture, I think this is pretty good. In terms of putting this together, I’d say there are a few different areas.

Code to Search on Flickr

I’ve built and published this sort of .NET code many times but it seems like every time I want to do some kind of search, I find myself tweaking this code rather than just re-using it. In this particular case, I wanted a paged flickR search for images to appear as though it was an “endless list” and my first intention was to deliver it into my application as an IObservable<T> but, to be honest, I struggled with getting that to work in the way in which I wanted it to. I kept producing observables that were too “aggressive” in the way that they went about their HTTP work.

So, I went via a different route. I wrote this little class to represent a search URL for flickR;

namespace WpfApplication12.Flickr
{
  using System.Collections.Generic;
  using System.Text;

  internal class FlickrSearchUrl
  {
    static string apiKey = "OMITTED";
    static string serviceUri = "https://api.flickr.com/services/rest/?method=";
    static string baseUri = serviceUri + "flickr.photos.search&";
    public int ContentType { get; set; }
    public int PerPage { get; set; }
    public int Page { get; set; }
    public List<string> SearchTags { get; private set; }

    private FlickrSearchUrl()
    {

    }
    public FlickrSearchUrl(
        string searchTag,
        int pageNo = 0,
        int perPage = 5,
        int contentType = 1
    )
      : this()
    {
      this.SearchTags = new List<string>()
      {
        searchTag
      };
      this.Page = pageNo;
      this.PerPage = perPage;
      this.ContentType = contentType;
    }
    public override string ToString()
    {
      // flickR has pages numbered 1..N (I think :-)) whereas this class thinks
      // of 0..N-1. 
      StringBuilder tagBuilder = new StringBuilder();
      for (int i = 0; i < this.SearchTags.Count; i++)
      {
        tagBuilder.AppendFormat("{0}{1}",
          i != 0 ? "," : string.Empty,
          this.SearchTags[i]);
      }
      return (
        string.Format(
          baseUri +
          "api_key={0}&" +
          "safe_search=1&" +
          "tags={1}&" +
          "page={2}&" +
          "per_page={3}&" +
          "content_type={4}",
          apiKey, tagBuilder.ToString(), this.Page + 1, this.PerPage, this.ContentType));
    }
  }
}

and here’s a little class that I use to represent an individual search result returned by flickR (in XML format in my case);


namespace WpfApplication12.Flickr
{
  using System.Xml.Linq;

  public class FlickrPhotoResult
  {
    public FlickrPhotoResult(XElement photo)
    {
      Id = (long)photo.Attribute("id");
      Secret = photo.Attribute("secret").Value;
      Farm = (int)photo.Attribute("farm");
      Server = (int)photo.Attribute("server");
      Title = photo.Attribute("title").Value;
    }
    public long Id { get; private set; }
    public string Secret { get; private set; }
    public int Farm { get; private set; }
    public string Title { get; private set; }
    public int Server { get; private set; }

    public string ImageUrl
    {
      get
      {
        return (string.Format(
          "http://farm{0}.static.flickr.com/{1}/{2}_{3}_z.jpg",
          Farm, Server, Id, Secret));
      }
    }
  }   
}

and this is a class that I knocked together to actually do HTTP searches, bring back XML results, deserialize them into objects and then manage the paging of data;

namespace WpfApplication12.Flickr
{
  using System.Collections.Generic;
  using System.IO;
  using System.Linq;
  using System.Net.Http;
  using System.Threading.Tasks;
  using System.Xml.Linq;

  class FlickrSearchResults
  {
    FlickrSearchResults()
    {
      this.currentResults = new Queue<FlickrPhotoResult>();
    }
    public FlickrSearchResults(string searchTerm) : this()
    {
      this.url = new FlickrSearchUrl(searchTerm);
    }
    public FlickrSearchResults(FlickrSearchUrl url) : this()
    {
      this.url = url;
    }
    public async Task<FlickrPhotoResult> GetNextResultAsync()
    {
      if (this.currentResults.Count == 0)
      {
        await this.GetAnyNextPageAsync();
      }
      return (
        this.currentResults.Count == 0 ? null : this.currentResults.Dequeue());
    }
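    async Task GetAnyNextPageAsync()
    {
      // url.Page is zero-based, so the last valid page is maxPages - 1. Until
      // the first response arrives we don't know maxPages and fetch page 0.
      if ((this.maxPages == null) || (this.url.Page + 1 < this.maxPages))
      {
        if (this.maxPages.HasValue)
        {
          this.url.Page++;
        }
        await this.GetPageForCurrentUrlAsync();
      }
    }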
    static async Task<XElement> GetHttpResponseToXElement(string url)
    {
      XElement xml = null;

      // one short-lived client per request keeps the sample simple
      using (HttpClient client = new HttpClient())
      using (HttpResponseMessage response = await client.GetAsync(url))
      {
        if (response.IsSuccessStatusCode)
        {
          using (Stream stream = await response.Content.ReadAsStreamAsync())
          {
            xml = XElement.Load(stream);
          }
        }
      }
      return (xml);
    }
    async Task GetPageForCurrentUrlAsync()
    {
      XElement xml = await GetHttpResponseToXElement(this.url.ToString());

      // a failed request returns null; leave the queue empty in that case so
      // that the caller sees "no result" rather than an exception.
      if (xml == null)
      {
        return;
      }
      var photoList =
      (
          from p in xml.DescendantsAndSelf("photo")
          select new FlickrPhotoResult(p)
      );

      if (this.maxPages == null)
      {
        this.maxPages = (int)xml.DescendantsAndSelf("photos").First().Attribute("pages");
      }
      foreach (var photo in photoList)
      {
        this.currentResults.Enqueue(photo);
      }
    }
    Queue<FlickrPhotoResult> currentResults;
    FlickrSearchUrl url;
    int? maxPages;
  }
}

at the end of all that, the idea is that a consumer of these classes can write code such as;

      FlickrSearchResults results = new FlickrSearchResults("beach");
      bool loop = true;

      while (loop)
      {        
        FlickrPhotoResult result = await results.GetNextResultAsync();
        loop = result != null;

        if (loop)
        {
          // do something with the result
        }
      }

to consume a “potentially very large” set of results from flickR one at a time.
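
For reference, the XML that these classes pick apart has a `photos` element carrying the page count and one `photo` element per result. The snippet below is only a sketch: the sample document is hand-written with made-up values, and the class name FlickrXmlSketch is hypothetical. It reads the same attributes as FlickrPhotoResult and builds image URLs in the same format.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
static class FlickrXmlSketch
{
  // Hand-written sample with made-up values, shaped like a
  // flickr.photos.search reply in XML format.
  public const string Sample =
    "<rsp stat=\"ok\">" +
    "<photos page=\"1\" pages=\"2\" perpage=\"2\" total=\"4\">" +
    "<photo id=\"111\" secret=\"abc\" server=\"22\" farm=\"3\" title=\"one\" />" +
    "<photo id=\"444\" secret=\"def\" server=\"55\" farm=\"6\" title=\"two\" />" +
    "</photos></rsp>";

  // The total number of pages, as read into maxPages in FlickrSearchResults.
  public static int PageCount(XElement rsp)
  {
    return ((int)rsp.DescendantsAndSelf("photos").First().Attribute("pages"));
  }
  // One image URL per photo element, in the same format as ImageUrl.
  public static string[] ImageUrls(XElement rsp)
  {
    return (rsp.DescendantsAndSelf("photo")
      .Select(p => string.Format(
        "http://farm{0}.static.flickr.com/{1}/{2}_{3}_z.jpg",
        (int)p.Attribute("farm"),
        (int)p.Attribute("server"),
        (long)p.Attribute("id"),
        (string)p.Attribute("secret")))
      .ToArray());
  }
}
```

Parsing the sample with XElement.Parse(FlickrXmlSketch.Sample) gives a page count of 2, and the first URL comes out as http://farm3.static.flickr.com/22/111_abc_z.jpg.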

Code to Track a Single Body in Front of the Sensor

I wanted to know when the sensor had “sight” of a single, tracked human body and so I wrote this little class which surfaces an IObservable<Body> which is the stream representing the first (or no) Body that the sensor is currently seeing;

namespace WpfApplication12
{
  using Microsoft.Kinect;
  using System;
  using System.Linq;
  using System.Reactive.Subjects;

  class SingleBodyDataSource
  {
    public SingleBodyDataSource()
    {
      this.subBody = new Subject<Body>();
    }
    public KinectSensor Sensor
    {
      get
      {
        return (this.sensor);
      }
    }
    public void Open(KinectSensor sensor)
    {
      this.sensor = sensor;
      this.bodies = new Body[this.sensor.BodyFrameSource.BodyCount];

      this.bodyReader = this.sensor.BodyFrameSource.OpenReader();
      this.bodyReader.FrameArrived += OnBodyFrameArrived;
    }
    void OnBodyFrameArrived(object sender, BodyFrameArrivedEventArgs e)
    {
      using (var frame = e.FrameReference.AcquireFrame())
      {
        Body firstTrackedBody = null;

        if (frame != null)
        {
          frame.GetAndRefreshBodyData(this.bodies);

          firstTrackedBody = this.bodies.Where(b => b.IsTracked).FirstOrDefault();
        }
        // might be publishing a null or not...
        this.subBody.OnNext(firstTrackedBody);
      }
    }
    public void Close()
    {
      this.bodies = null;

      this.bodyReader.FrameArrived -= this.OnBodyFrameArrived;

      this.subBody.OnCompleted();
      this.subBody.Dispose();

      // related to the comment on SingleTrackedBodySequence below
      this.subBody = new Subject<Body>();

      this.bodyReader.Dispose();
      this.bodyReader = null;
    }
    // TBD on this. If someone calls subscribe then Open/Close then they'll
    // get end of sequence. They'd need to then ask for this property again
    // in order to subscribe if they wanted to Open/Close again which is a
    // bit poor on my part. Need to improve that.
    public IObservable<Body> SingleTrackedBodySequence
    {
      get
      {
        return (this.subBody);
      }
    }
    Body[] bodies;
    BodyFrameReader bodyReader;
    KinectSensor sensor;
    Subject<Body> subBody;
  }
}

Code to Track My Custom Gesture

I wanted to know whenever the user performed my custom gesture and so I added the gesture database into the project;

and then wrote this class which takes into its constructor;

  • the path to the gesture database that you want to open
  • the set of names of gestures within that database that you’re interested in

and then it surfaces an IObservable<GestureInfo> where I defined the little class GestureInfo to look like this;

  class GestureInfo
  {
    public string Name { get; set; }
    public float Progress { get; set; }
  }

and the class itself is;

namespace WpfApplication12
{
  using Microsoft.Kinect;
  using Microsoft.Kinect.VisualGestureBuilder;
  using System;
  using System.Collections.Generic;
  using System.Linq;
  using System.Reactive.Subjects;

  class GestureInfo
  {
    public string Name { get; set; }
    public float Progress { get; set; }
  }
  class GestureDataSource
  {
    public GestureDataSource(
      string databasePath,
      string[] gestureNames)
    {    
      this.databasePath = databasePath;
      this.gestureNames = gestureNames;
      this.subGestureInfo = new Subject<GestureInfo>();
    }
    public IObservable<GestureInfo> GestureInfoSequence
    {
      get
      {
        return (this.subGestureInfo);
      }
    }
    public void Open(SingleBodyDataSource singleBodySource)
    {
      this.singleBodyDataSource = singleBodySource;

      this.interestedGestures = new List<Gesture>();

      this.database = new VisualGestureBuilderDatabase(this.databasePath);

      foreach (var gesture in database.AvailableGestures)
      {
        if (this.gestureNames.Contains(gesture.Name))
        {
          this.interestedGestures.Add(gesture);
        }
      }
      this.source = new VisualGestureBuilderFrameSource(
        this.singleBodyDataSource.Sensor, 0);

      this.source.AddGestures(this.interestedGestures);
      this.source.TrackingIdLost += OnTrackingIdLost;
      this.reader = this.source.OpenReader();
      this.reader.IsPaused = true;
      this.reader.FrameArrived += this.OnGestureFrameArrived;

      this.singleBodiesSubscription = this.singleBodyDataSource.SingleTrackedBodySequence.Subscribe(
        OnSingleBodyFrame);
    }
    void OnTrackingIdLost(object sender, TrackingIdLostEventArgs e)
    {
      this.reader.IsPaused = true;
    }
    void OnSingleBodyFrame(Body body)
    {
      if (body != null)
      {
        this.source.TrackingId = body.TrackingId;

        if (this.reader.IsPaused)
        {
          this.reader.IsPaused = false;
        }
      }
    }
    void OnGestureFrameArrived(object sender, VisualGestureBuilderFrameArrivedEventArgs e)
    {
      using (var frame = e.FrameReference.AcquireFrame())
      {
        if (frame != null)
        {
          // we're only doing continuous gestures here. easy to switch to add in
          // discrete ones too.
          var frameGestures = frame.ContinuousGestureResults;

          if (frameGestures != null)
          {
            foreach (var gesture in this.interestedGestures)
            {
              if (frameGestures.ContainsKey(gesture))
              {
                // allocs like crazy...:-S
                var info = new GestureInfo()
                {
                  Name = gesture.Name,
                  Progress = frameGestures[gesture].Progress
                };
                this.subGestureInfo.OnNext(info);
              }
            }
          }
        }
      }
    }
    public void Close()
    {
      this.subGestureInfo.OnCompleted();
      this.subGestureInfo.Dispose();
      this.subGestureInfo = new Subject<GestureInfo>();

      this.singleBodiesSubscription.Dispose();
      this.singleBodiesSubscription = null;

      this.reader.FrameArrived -= this.OnGestureFrameArrived;
      this.reader.Dispose();
      this.reader = null;

      this.source.TrackingIdLost -= this.OnTrackingIdLost;
      this.source.Dispose();
      this.source = null;

      foreach (var gesture in this.interestedGestures)
      {
        gesture.Dispose();
      }
      this.interestedGestures = null;
      this.database.Dispose();
      this.database = null;
    }
    VisualGestureBuilderDatabase database;
    VisualGestureBuilderFrameSource source;
    VisualGestureBuilderFrameReader reader;
    SingleBodyDataSource singleBodyDataSource;
    IDisposable singleBodiesSubscription;
    Subject<GestureInfo> subGestureInfo;
    string databasePath;
    string[] gestureNames;
    List<Gesture> interestedGestures;
  }
}

The idea of this class is that it takes the SingleBodyDataSource so that it can keep track of the first body that is being tracked by the sensor. It then uses the VisualGestureBuilderFrameSource as per my previous post in order to open up a VisualGestureBuilderFrameReader to read frames from the sensor that relate to the gestures that the class has been tasked with monitoring.

It only attempts to monitor “continuous” gestures rather than “discrete” gestures, although that functionality could easily be added. Whenever the class sees progress being reported for one of its nominated gestures, it pushes that data out via its GestureInfoSequence property.

That means that I now have a “stream” consisting of tuples of name/progress coming out of this class whenever it detects something happening with one of the gestures that I’ve given it.

Code to Bring Data Sources Together

I bundled those two previous data sources into a single DataSource class to surface all the data that I’ve got available via 3 observable properties – SingleTrackedBodySequence, GestureInfoSequence and an additional observable called SensorAvailableSequence which surfaces when the Kinect sensor itself is/is not available.

namespace WpfApplication12
{
  using Microsoft.Kinect;
  using Microsoft.Kinect.Wpf.Controls;
  using System;
  using System.Reactive.Subjects;

  class DataSource
  {
    public DataSource(
      string gestureDatabasePath,
      string[] gestureNames)
    {
      this.subSensorAvailable = new Subject<bool>();

      this.singleBodySource = new SingleBodyDataSource();
      
      this.gestureSource = new GestureDataSource(gestureDatabasePath,
        gestureNames);
    }
    public IObservable<Body> SingleTrackedBodySequence
    {
      get
      {
        return (this.singleBodySource.SingleTrackedBodySequence);
      }
    }
    public IObservable<GestureInfo> GestureInfoSequence
    {
      get
      {
        return (this.gestureSource.GestureInfoSequence);
      }
    }
    public void Open()
    {
      this.sensor = KinectSensor.GetDefault();
      this.sensor.Open();
      this.sensor.IsAvailableChanged += OnSensorAvailabilityChanged;

      this.singleBodySource.Open(this.sensor);
      this.gestureSource.Open(this.singleBodySource);
    }
    public void Close()
    {
      this.gestureSource.Close();
      this.singleBodySource.Close();

      this.sensor.Close();
      this.sensor.IsAvailableChanged -= OnSensorAvailabilityChanged;
      this.sensor = null;
    }
    public IObservable<bool> SensorAvailableSequence
    {
      get
      {
        return (this.subSensorAvailable);
      }
    }
    void OnSensorAvailabilityChanged(bool isAvailable)
    {
      this.subSensorAvailable.OnNext(isAvailable);
    }
    void OnSensorAvailabilityChanged(object sender, 
      IsAvailableChangedEventArgs e)
    {
      this.OnSensorAvailabilityChanged(e.IsAvailable);
    }
    public KinectSensor Sensor
    {
      get
      {
        return (this.sensor);
      }
    }
    Subject<bool> subSensorAvailable;
    KinectSensor sensor;
    SingleBodyDataSource singleBodySource;
    GestureDataSource gestureSource;
  }
}

From here, I’m at the point where I can put a little UI on top of this code.

Adding a Minimal UI

I added a little bit of XAML to display an Image and to display a KinectUserViewer so that I can see that something is going on because the custom gesture that I came up with is a little unintuitive to say the least;

<Window xmlns:my="http://schemas.microsoft.com/kinect/2014"
        x:Class="WpfApplication12.MainWindow"
        xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
        xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
        Title="MainWindow"
        Height="600"
        Width="800">

  <Window.Resources>
    <Style TargetType="Grid">
      <Setter Property="Background"
              Value="Black" />
    </Style>
    <Style TargetType="TextBlock">
      <Setter Property="Foreground"
              Value="White" />
      <Setter Property="FontSize"
              Value="24" />
      <Setter Property="HorizontalAlignment"
              Value="Center" />
      <Setter Property="VerticalAlignment"
              Value="Center" />
      <Setter Property="TextAlignment"
              Value="Center" />
    </Style>
    <Style TargetType="Image">
      <Setter Property="Stretch"
              Value="UniformToFill" />
      <Setter Property="Margin"
              Value="48" />
    </Style>
  </Window.Resources>
  <Grid>
    <Grid>
      <TextBlock Text="loading..." />
      <Image x:Name="imgFlickr" />
    </Grid>
    <Grid x:Name="gridNotAvailable"
          Visibility="Collapsed">
      <TextBlock Text="no sensor" />
    </Grid>
    <Grid x:Name="gridNoSingleUser">
      <TextBlock Text="not seeing a single user to play with" />
    </Grid>
    <Grid Width="Auto"
          Height="Auto"
          HorizontalAlignment="Right"
          VerticalAlignment="Bottom"
          Background="Transparent">
      <my:KinectRegion x:Name="region">
        <Grid Background="Transparent">
          <my:KinectUserViewer DefaultUserColor="Orange"
                               x:Name="userViewer" />
        </Grid>
      </my:KinectRegion>
    </Grid>
  </Grid>
</Window>

the idea of the UI being that I have layers of;

  • A “loading…” textblock under
  • An Image under
  • A Grid saying “no sensor” under
  • A Grid saying “not seeing a user”

and then the code-behind can play around with the visibilities here to try and show/hide these elements at the right times. That code-behind then looks like this;

namespace WpfApplication12
{
  using Microsoft.Kinect;
  using Microsoft.Kinect.Wpf.Controls;
  using System;
  using System.Reactive.Linq;
  using System.Threading;
  using System.Threading.Tasks;
  using System.Windows;
  using System.Windows.Media.Imaging;
  using WpfApplication12.Extensions;
  using WpfApplication12.Flickr;
  public partial class MainWindow : Window
  {
    public MainWindow()
    {
      InitializeComponent();
      this.Loaded += OnLoaded;
    }
    async void OnLoaded(object sender, RoutedEventArgs e)
    {
      // open up our data source, feeding through the path to the database
      // of custom gestures and the name of the gesture we care about.
      this.source = new DataSource(DB, new string[] { GESTURE_NAME });

      // create our search for flickR images of "beach"
      this.flickrResults = new FlickrSearchResults("beach");

      // we subscribe so that we can show/hide UI as and when the Kinect sensor
      // becomes available/unavailable.
      this.availableSubscription =
        this.source.SensorAvailableSequence.Subscribe(this.OnSensorAvailable);

      // we subscribe so that we can show/hide UI as and when we get info that
      // there is at least one tracked user in front of the sensor.
      this.singleUserSubscription =
        this.source.SingleTrackedBodySequence.Subscribe(this.OnBodyAvailable);

      // we subscribe to the gesture events for our gesture name (in this case,
      // this is redundant as we only have one gesture).
      // We take a look at the progress over 2 seconds. When that progress ranges
      // from a low value (0.3) to a high value (0.9) within 2 seconds then we
      // consider the gesture to have happened and we only want to know about
      // the ones where the value is true rather than all the false values.
      var gestureSequence =
        this.source.GestureInfoSequence
          .Where(gi => gi.Name == GESTURE_NAME)
          .Select(gi => gi.Progress)
          .Buffer(TimeSpan.FromSeconds(2))
          .ObserveOn(SynchronizationContext.Current)
          .Select(list => list.Spans(LOW_PROGRESS, HIGH_PROGRESS))
          .DistinctUntilChanged()
          .Where(b => b);

      // subscribe to that.
      this.gestureSubscription = 
        gestureSequence.Subscribe(b => this.OnGesture());

      // open up the source so that it makes all its internal subscriptions.
      this.source.Open();

      // initialise the bits that the KinectUserViewer needs. Added this most
      // recently.
      this.InitialiseUserViewer();

      // make sure that we display the first image (although it might be hidden
      // by other pieces of UI).
      await this.MoveToAnyNextImageAsync();
    }
    void InitialiseUserViewer()
    {
      // I'm not 100% sure of the "minimum" code needed to get a KinectUserViewer
      // working.

      this.region.KinectSensor = this.source.Sensor;
      this.userViewer.KinectSensor = this.source.Sensor;

      KinectRegion.SetKinectRegion(this.userViewer, this.region);
    }
    void OnSensorAvailable(bool available)
    {
      this.gridNotAvailable.Visibility = available ?
        Visibility.Collapsed : Visibility.Visible;

      this.gridNoSingleUser.Visibility = available ?
        Visibility.Visible : Visibility.Collapsed;
    }
    async void OnGesture()
    {
      this.imgFlickr.Source = null;
      await MoveToAnyNextImageAsync();
    }
    private async Task MoveToAnyNextImageAsync()
    {
      var photoResult = await this.flickrResults.GetNextResultAsync();

      // GetNextResultAsync returns null once the search results run out.
      if (photoResult != null)
      {
        BitmapImage image = new BitmapImage(new Uri(photoResult.ImageUrl));
        this.imgFlickr.Source = image;
      }
    }
    void OnBodyAvailable(Body body)
    {
      this.gridNoSingleUser.Visibility =
        body == null ? Visibility.Visible : Visibility.Collapsed;
    }
    void OnStop()
    {
      this.gestureSubscription.Dispose();
      this.gestureSubscription = null;
      this.singleUserSubscription.Dispose();
      this.singleUserSubscription = null;
      this.availableSubscription.Dispose();
      this.availableSubscription = null;
      this.source.Close();
    }
    IDisposable gestureSubscription;
    IDisposable singleUserSubscription;
    IDisposable availableSubscription;
    DataSource source;
    FlickrSearchResults flickrResults;

    static readonly float LOW_PROGRESS = 0.3f;
    static readonly float HIGH_PROGRESS = 0.9f;
    static readonly string DB = @"Gestures\GestureDatabase.gbd";
    static readonly string GESTURE_NAME = "swipeForwardProgress";
  }
}

and that’s pretty much it – it’s largely just a subscription to a couple of those observables that I’ve built up in the previous snippets and then linking that in to requesting the next image from the flickR search at the point in time that the gesture is detected.
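
The gesture pipeline in the code-behind calls a Spans extension method from the WpfApplication12.Extensions namespace, which isn't listed in this post. What follows is only a rough sketch of what such a method might look like, not the original code. The class name ProgressListExtensions is hypothetical, and the rule it applies (a value at or below the low threshold followed later in the same buffer by a value at or above the high threshold) is an assumption based on the comment in OnLoaded. It extends IList<float> because that is what Buffer hands to the Select call.

```csharp
namespace WpfApplication12.Extensions
{
  using System.Collections.Generic;

  // Hypothetical helper: the class name and the exact rule are assumptions.
  static class ProgressListExtensions
  {
    // True when some value at or below 'low' is followed, later in the
    // buffer, by a value at or above 'high'.
    public static bool Spans(this IList<float> values, float low, float high)
    {
      bool seenLow = false;

      foreach (float value in values)
      {
        if (value <= low)
        {
          seenLow = true;
        }
        else if (seenLow && (value >= high))
        {
          return (true);
        }
      }
      return (false);
    }
  }
}
```

With thresholds of 0.3 and 0.9, a buffer of 0.1, 0.5, 0.95 would count as the gesture, whereas 0.95, 0.5, 0.1 would not because the progress runs in the wrong direction. A simpler variant could just compare the minimum and maximum of the buffer, at the cost of ignoring the order.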

Code

Here’s the code that I wrote for this for download if you want all/pieces of it.