Kinect for Windows V2: Hello (Audio) World for the .NET Windows App Developer & Harmonica Player ;-)

Returning to this series of posts, I wanted to continue experimenting with the Kinect for Windows V2 SDK and try out how I’d go about collecting audio data from the sensor.

I tried to come up with something “interesting” or “semi-real” to do with the audio data rather than just drawing the angle that the Kinect thinks the audio beam is coming from based upon its array of microphones. The best that I could come up with was to link the Kinect up with another little hobby of mine – trying (not very successfully) to learn to play the diatonic harmonica.

I’m still very much a beginner with the harmonica and one of the areas where I’m still struggling is with getting decent, clean, bent notes.

If you’ve not tried a harmonica then, essentially, the diatonic has 10 holes, each producing 2 notes (one on the blow, one on the draw), but a number of holes can also produce additional notes by altering the air flow over the reeds inside the device, causing a note to “bend”.

So, I figured that I’d try and put together a simple prototype app which could help me identify whether I was producing the right note from the harmonica as I’m playing it. In doing so I wanted the app to figure out;

  1. Is there anyone standing in front of the Kinect sensor?
  2. Are they alone? If so, my hope is that the Kinect will automatically focus on the sound coming from them rather than from anywhere else, although I could have taken a manual approach to directing the sound input – the device supports that.
  3. Do they look like they are playing a harmonica? For this, I went with the simple idea of them having their hands near their head. That isn’t always going to work – especially not if they’re, say, playing a guitar at the same time – but it works for what I need. I can’t play the harmonica on its own, never mind combined with a guitar.
  4. What note are they playing? Can the note be displayed and can some guidance be given as to whether it’s accurate or needs tweaking a bit?

and I wanted all of that to work across the user entering/leaving the scene, putting the harmonica down, unplugging the sensor and plugging it back in and so on.

Here’s a little video of where I’ve got to so far in terms of the app that I knocked together. Please excuse the harmonica notes and you might notice that the highest frequency notes don’t work – I’m still trying to see if there’s anything that I can do about that;

I should add some credits for the images used in that code;

In order to get this going, I needed to capture audio from the sensor and I was very pleased to find that the mechanism for capture followed the same model as all the other APIs that I’ve used so far.

I call this model “SRF” or ‘surf’ – i.e.

  • get the Sensor
  • open a Reader
  • process the Frames

and it’s so nice to see it’s the same approach all across the API surfaces.
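
To make that concrete, here’s a minimal sketch of the pattern against the body stream – purely illustrative, but it’s the same set of calls that the fuller class below uses;

// A minimal 'SRF' sketch against the body stream - the fuller KinectDataSource
// class below follows exactly the same shape for both the audio and body data.
KinectSensor sensor = KinectSensor.GetDefault();              // S - get the Sensor
sensor.Open();

BodyFrameReader reader = sensor.BodyFrameSource.OpenReader(); // R - open a Reader

reader.FrameArrived += (s, args) =>                           // F - process the Frames
{
  using (BodyFrame frame = args.FrameReference.AcquireFrame())
  {
    if (frame != null)
    {
      // do something with the frame's data...
    }
  }
};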

I needed a library to try and do pitch detection for me on a PCM stream and I managed to find this library up on CodePlex which does “auto correlation” on the audio streams it receives;

[screenshot: the pitch tracking library’s project page on CodePlex]

Note that this code is licensed under the Ms-PL and so my code here also becomes licensed under the Ms-PL.

As far as I know, the Kinect produces audio data every 16ms and, from testing, this library seems to be more than fast enough to handle the sampled audio data at that rate. I’m not sure that it’s going to give me the accuracy that I need here, though – the author of the library (in the discussions on CodePlex) has said that they’ve given up on it and are trying out a new approach – but I found that this library gives me 95% of what I need and I’ll see if I can get that last 5% somehow.
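
As an aside – and purely as my own illustration rather than what the library actually does – the basic idea behind autocorrelation-based pitch detection is to find the lag at which the signal best correlates with a time-shifted copy of itself and then turn that lag into a frequency. Something along these lines, where the min/max frequency limits are just made-up parameters;

// Illustration only - a naive autocorrelation estimate, not the library's
// actual implementation.
static float EstimatePitch(float[] samples, float sampleRate,
  float minFrequency, float maxFrequency)
{
  int minLag = (int)(sampleRate / maxFrequency);
  int maxLag = (int)(sampleRate / minFrequency);
  int bestLag = 0;
  float bestCorrelation = 0.0f;

  for (int lag = minLag; (lag <= maxLag) && (lag < samples.Length); lag++)
  {
    float correlation = 0.0f;

    for (int i = 0; (i + lag) < samples.Length; i++)
    {
      correlation += samples[i] * samples[i + lag];
    }
    if (correlation > bestCorrelation)
    {
      bestCorrelation = correlation;
      bestLag = lag;
    }
  }
  // The lag (in samples) with the strongest correlation approximates the
  // period of the waveform, so frequency = sampleRate / lag.
  return ((bestLag == 0) ? 0.0f : (sampleRate / bestLag));
}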

Anyway, I made a Windows Store app, built a little library out of this CodePlex code and referenced both that library and the Kinect SDK. I also decided that I’d use the Reactive Extensions here so I referenced those too;

[screenshot: the project’s references – the Kinect SDK, the pitch tracking library and the Reactive Extensions]

With that in place, I tried to knock up a static class which would surface some observable streams of data from the Kinect representing;

  1. When the sensor was/wasn’t available.
  2. When a user was/wasn’t in front of the sensor.
  3. When the user did/didn’t have their hands near their face.
  4. Notes as detected by the pitch processing algorithm.

That class ended up looking like this;

namespace App247
{
  using App247.Extensions;
  using Pitch;
  using System;
  using System.Linq;
  using System.Reactive.Subjects;
  using WindowsPreview.Kinect;

  static class KinectDataSource
  {
    static KinectDataSource()
    {
      subSensorAvailable = new Subject<bool>();
      subSingleUserInFrontOfSensor = new Subject<bool>();
      subPitchRecords = new Subject<PitchTracker.PitchRecord>();
      subUserHandsNearFace = new Subject<bool>();
    }
    public static void Initialise()
    {      
      pitchTracker = new PitchTracker() 
      { 
        SampleRate = SAMPLE_RATE 
      };

      sensor = KinectSensor.GetDefault();
      sensor.Open();
      sensor.IsAvailableChanged += OnIsSensorAvailableChanged;

      bodies = new Body[sensor.BodyFrameSource.BodyCount];
      audioFrameData = new byte[sensor.AudioSource.SubFrameLengthInBytes];
      audioPitchData = new float[sensor.AudioSource.SubFrameLengthInBytes / sizeof(float)];

      OpenOrCloseSensor();
    }
    static void OnIsSensorAvailableChanged(KinectSensor sender, 
      IsAvailableChangedEventArgs args)
    {
      OpenOrCloseSensor();
    }
    static void OpenOrCloseSensor()
    {
      subSensorAvailable.OnNext(sensor.IsAvailable);

      if (sensor.IsAvailable)
      {
        OpenReaders();
      }
      else 
      {
        CloseReaders();
      }
    }
    static void CloseReaders()
    {
      if (audioReader != null)
      {
        audioReader.FrameArrived -= OnAudioFrameArrived;
        audioReader.Dispose();
        audioReader = null;
      }
      if (bodyReader != null)
      {
        bodyReader.FrameArrived -= OnBodyFrameArrived;
        bodyReader.Dispose();
        bodyReader = null;
      }
    }
    static void OpenReaders()
    {
      audioReader = sensor.AudioSource.OpenReader();
      audioReader.FrameArrived += OnAudioFrameArrived;
      bodyReader = sensor.BodyFrameSource.OpenReader();
      bodyReader.FrameArrived += OnBodyFrameArrived;
    }
    static void OnBodyFrameArrived(BodyFrameReader sender, 
      BodyFrameArrivedEventArgs args)
    {
      using (BodyFrame frame = args.FrameReference.AcquireFrame())
      {
        if (frame != null)
        {
          frame.GetAndRefreshBodyData(bodies);

          var singleBody = bodies.Count(b => b.IsTracked) == 1;
          var handsNearHead = false;

          subSingleUserInFrontOfSensor.OnNext(singleBody);

          if (singleBody)
          {
            var body = bodies.Single(b => b.IsTracked);
            var leftHand = body.Joints[JointType.HandLeft];
            var rightHand = body.Joints[JointType.HandRight];
            var head = body.Joints[JointType.Head];

            if ((leftHand.Position.DistanceTo(head.Position) < HANDS_NEAR_FACE_DISTANCE) && 
              (rightHand.Position.DistanceTo(head.Position) < HANDS_NEAR_FACE_DISTANCE))
            {
              handsNearHead = true;
            }
          }
          subUserHandsNearFace.OnNext(handsNearHead);
        }
      }
    }
    static void OnAudioFrameArrived(
      AudioBeamFrameReader sender, 
      AudioBeamFrameArrivedEventArgs args)
    {
      using (var frames = args.FrameReference.AcquireBeamFrames() as AudioBeamFrameList)
      {
        if (frames != null)
        {
          foreach (var frame in frames)
          {
            for (int i = 0; i < frame.SubFrames.Count; i++)
            {
              if (i == 0)
              {
                frame.SubFrames[i].CopyFrameDataToArray(audioFrameData);

                for (int j = 0; j < audioPitchData.Length; j++)
                {
                  audioPitchData[j] = 
                    BitConverter.ToSingle(audioFrameData, j * sizeof(float));
                }
                // I played around with timing this on data and it seems to execute in little
                // enough time (on my machine) to cope with the 16ms period with which data
                // arrives. That would need around 60 processing cycles per second which this
                // algorithm seems to manage fine on this data size.
                pitchTracker.ProcessBuffer(audioPitchData);

                subPitchRecords.OnNext(pitchTracker.CurrentPitchRecord);
              }
              frame.SubFrames[i].Dispose();
            }
            frame.Dispose();
          }
        }
      }
    }
    internal static IObservable<bool> ObsSensorAvailable
    {
      get
      {
        return (subSensorAvailable);
      }
    }
    internal static IObservable<bool> ObsSingleUserPresent
    {
      get
      {
        return (subSingleUserInFrontOfSensor);
      }
    }
    internal static IObservable<PitchTracker.PitchRecord> ObsPitchRecords
    {
      get
      {
        return (subPitchRecords);
      }
    }
    internal static IObservable<bool> ObsUserHandsNearFace
    {
      get
      {
        return (subUserHandsNearFace);
      }
    }
    static PitchTracker pitchTracker;
    static byte[] audioFrameData;
    static float[] audioPitchData;
    static Body[] bodies;
    static Subject<bool> subSensorAvailable;
    static Subject<bool> subSingleUserInFrontOfSensor;
    static Subject<bool> subUserHandsNearFace;
    static Subject<PitchTracker.PitchRecord> subPitchRecords;
    static BodyFrameReader bodyReader;
    static AudioBeamFrameReader audioReader;
    static KinectSensor sensor;
    const float SAMPLE_RATE = 16000.0f;
    const float HANDS_NEAR_FACE_DISTANCE = 0.5f;
  }
}

In terms of topics that I’ve covered in previous posts, the only new bit here is using the KinectSensor.AudioSource to open up a reader and then handling its FrameArrived events. In handling that event, I take some care to make sure that I Dispose() of everything that I make use of because I found that, otherwise, I’d stop getting events (which seems fair enough to me).

That event handler tries to grab the frame of audio data from the event arguments; each sub-frame looks to contain 1024 bytes of data which I convert into 256 samples in float[] form (-1 to +1) before handing them on to the pitch processing algorithm.
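
As another aside, rather than converting one float at a time with BitConverter as the code above does, I believe the same conversion could be done in a single call with Buffer.BlockCopy since both arrays are of primitive types;

// Should be equivalent to the BitConverter loop above - copies the raw
// little-endian bytes of the sub-frame straight into the float array.
Buffer.BlockCopy(audioFrameData, 0, audioPitchData, 0, audioFrameData.Length);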

I then have a main UI in some XAML which is really just 4 user controls and a little code to show/hide those controls via the Visual State Manager;

<Page x:Class="App247.MainPage"
      xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
      xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
      xmlns:local="using:App247"
      xmlns:ctrl="using:App247.Controls"
      xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
      xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
      xmlns:k="using:Microsoft.Kinect.Xaml.Controls"
      mc:Ignorable="d">
  <Grid Background="{ThemeResource ApplicationPageBackgroundThemeBrush}">
    <VisualStateManager.VisualStateGroups>
      <VisualStateGroup x:Name="UserStateGroup">
        <VisualState x:Name="Sensor">
          <Storyboard>
            <ObjectAnimationUsingKeyFrames Storyboard.TargetProperty="(UIElement.Visibility)"
                                           Storyboard.TargetName="ctrlNoSensor">
              <DiscreteObjectKeyFrame KeyTime="0">
                <DiscreteObjectKeyFrame.Value>
                  <Visibility>Visible</Visibility>
                </DiscreteObjectKeyFrame.Value>
              </DiscreteObjectKeyFrame>
            </ObjectAnimationUsingKeyFrames>
          </Storyboard>
        </VisualState>
        <VisualState x:Name="User">
          <Storyboard>
            <ObjectAnimationUsingKeyFrames Storyboard.TargetProperty="(UIElement.Visibility)"
                                           Storyboard.TargetName="ctrlNoSingleUser">
              <DiscreteObjectKeyFrame KeyTime="0">
                <DiscreteObjectKeyFrame.Value>
                  <Visibility>Visible</Visibility>
                </DiscreteObjectKeyFrame.Value>
              </DiscreteObjectKeyFrame>
            </ObjectAnimationUsingKeyFrames>
          </Storyboard>
        </VisualState>
        <VisualState x:Name="Hands">
          <Storyboard>
            <ObjectAnimationUsingKeyFrames Storyboard.TargetProperty="(UIElement.Visibility)"
                                           Storyboard.TargetName="ctrlNoHandsNearFace">
              <DiscreteObjectKeyFrame KeyTime="0">
                <DiscreteObjectKeyFrame.Value>
                  <Visibility>Visible</Visibility>
                </DiscreteObjectKeyFrame.Value>
              </DiscreteObjectKeyFrame>
            </ObjectAnimationUsingKeyFrames>
          </Storyboard>
        </VisualState>
        <VisualState x:Name="Playing">
          <Storyboard>
            <ObjectAnimationUsingKeyFrames Storyboard.TargetProperty="(UIElement.Visibility)"
                                           Storyboard.TargetName="ctrlPlaying">
              <DiscreteObjectKeyFrame KeyTime="0">
                <DiscreteObjectKeyFrame.Value>
                  <Visibility>Visible</Visibility>
                </DiscreteObjectKeyFrame.Value>
              </DiscreteObjectKeyFrame>
            </ObjectAnimationUsingKeyFrames>
          </Storyboard>
        </VisualState>
      </VisualStateGroup>
    </VisualStateManager.VisualStateGroups>
    <Grid.RowDefinitions>
      <RowDefinition />
      <RowDefinition Height="8*" />
      <RowDefinition />
    </Grid.RowDefinitions>
    <Grid.ColumnDefinitions>
      <ColumnDefinition />
      <ColumnDefinition Width="8*" />
      <ColumnDefinition />
    </Grid.ColumnDefinitions>
    <Viewbox Stretch="Uniform"
             Grid.Row="1"
             Grid.Column="1">
      <ctrl:NoSensorControl Visibility="Collapsed"
                            x:Name="ctrlNoSensor"
                            Margin="0"
                            Grid.Row="1"
                            Grid.Column="1" />
    </Viewbox>
    <Viewbox Stretch="Uniform"
             Grid.Row="1"
             Grid.Column="1">
      <ctrl:NoSingleUserControl Visibility="Collapsed"
                                x:Name="ctrlNoSingleUser"
                                Margin="0"
                                Grid.Row="1"
                                Grid.Column="1" />
    </Viewbox>
    <Viewbox Stretch="Uniform"
             Grid.Row="1"
             Grid.Column="1">
      <ctrl:NoHandsPlayingControl Visibility="Collapsed"
                                  x:Name="ctrlNoHandsNearFace"
                                  Margin="0"
                                  Grid.Row="1"
                                  Grid.Column="1" />
    </Viewbox>
    <Viewbox Stretch="Uniform"
             Grid.Row="1"
             Grid.Column="1">
      <ctrl:PlayingControl Visibility="Collapsed"
                           x:Name="ctrlPlaying"
                           Margin="0"
                           Grid.Row="1"
                           Grid.Column="1" />
    </Viewbox>
  </Grid>
</Page>

with the code-behind for that looking like this;

namespace App247
{
  using Pitch;
  using System;
  using System.Diagnostics;
  using System.Reactive.Linq;
  using System.Threading;
  using Windows.UI.Xaml;
  using Windows.UI.Xaml.Controls;

  public sealed partial class MainPage : Page
  {
    [Flags]
    enum CurrentState : ushort
    {
      Sensor = 1,
      User = 2,
      Hands = 4,
      Playing = Sensor | User | Hands
    }
    public MainPage()
    {
      this.InitializeComponent();
      this.Loaded += OnLoaded;
    }
    void UpdateState(bool condition, CurrentState flag)
    {
      CurrentState visualState;

      if (condition)
      {
        this.state |= flag;
      }
      else
      {
        this.state &= ~flag;
      }
      if ((this.state & CurrentState.Sensor) == 0)
      {
        visualState = CurrentState.Sensor;
      }
      else if ((this.state & CurrentState.User) == 0)
      {
        visualState = CurrentState.User;
      }
      else if ((this.state & CurrentState.Hands) == 0)
      {
        visualState = CurrentState.Hands;
      }
      else
      {
        visualState = CurrentState.Playing;
      }
      VisualStateManager.GoToState(this, visualState.ToString(), true);
    }
    void OnLoaded(object sender, RoutedEventArgs e)
    {
      KinectDataSource.ObsSensorAvailable
        .DistinctUntilChanged()
        .Subscribe(
          available =>
          {
            this.UpdateState(available, CurrentState.Sensor);
          }
      );

      KinectDataSource.ObsSingleUserPresent
        .DistinctUntilChanged()
        .Subscribe(
          userPresent =>
          {
            this.UpdateState(userPresent, CurrentState.User);
          }
      );

      KinectDataSource.ObsUserHandsNearFace
        .DistinctUntilChanged()
        .Subscribe(
          nearFace =>
          {
            this.UpdateState(nearFace, CurrentState.Hands);
          }
      );

      // I'm undecided on this yet. If I leave in the pitch == 0.0f records then I
      // get a more accurate picture but things jump around a lot. If I filter them
      // out ( as below ) things are more steady but less accurate. Possibly add
      // some kind of windowing around the data?
      KinectDataSource.ObsPitchRecords
        .Where(p => p.Pitch > 0.0f)
        .DistinctUntilChanged(p => p.MidiNote)
        .Subscribe(
          pitchRecord =>
          {
            var noteName = PitchDsp.GetNoteName(pitchRecord.MidiNote, true, true);
            this.ctrlPlaying.ChangeNote(noteName, pitchRecord.MidiCents);
          }
      );
      KinectDataSource.Initialise();
    }
    CurrentState state;
  }
}
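
On the “windowing” idea in the comment inside OnLoaded above – I haven’t tried this yet, but one option with Rx might be to batch up the pitch records and report the most common note from each batch, trading a little latency for a steadier display. A sketch of what that subscription might look like (it would also need a using for System.Linq);

// Sketch only - batch up 16 pitch records (roughly a quarter of a second at
// one sub-frame every 16ms) and pick the note that occurs most often in each
// batch before updating the display.
KinectDataSource.ObsPitchRecords
  .Where(p => p.Pitch > 0.0f)
  .Buffer(16)
  .Select(batch =>
    batch
      .GroupBy(p => p.MidiNote)
      .OrderByDescending(g => g.Count())
      .First()
      .First())
  .DistinctUntilChanged(p => p.MidiNote)
  .Subscribe(
    pitchRecord =>
    {
      var noteName = PitchDsp.GetNoteName(pitchRecord.MidiNote, true, true);
      this.ctrlPlaying.ChangeNote(noteName, pitchRecord.MidiCents);
    }
);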

The four user controls in question are really quite simple things, typically displaying a picture and a progress wheel as this example shows;

[screenshot: one of the simple user controls – a picture plus a progress wheel]

but I also built this simple control which attempts to show the note being played;

[screenshot: the PlayingControl, showing the note being played against a row of green squares]

and the green squares are instances of another control called NoteControl. That control simply displays a rectangle, a piece of text to indicate the note and a green ellipse which it shows/hides in response to calls to its LightOn/LightOff methods in a cheap, visual-states kind of way.
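
I haven’t included NoteControl’s code here but, just to give an idea of what it boils down to, it’s something along these lines – this is a rough sketch rather than the actual control and the ellipseLight name is purely my own placeholder for whatever the ellipse is called in its XAML;

namespace App247.Controls
{
  using Windows.UI.Xaml;
  using Windows.UI.Xaml.Controls;

  // Rough sketch only - the real control's XAML contains a rectangle, a
  // TextBlock for the note name and a green ellipse which these methods
  // show/hide.
  public sealed partial class NoteControl : UserControl
  {
    public NoteControl()
    {
      this.InitializeComponent();
    }
    public void LightOn()
    {
      this.ellipseLight.Visibility = Visibility.Visible;
    }
    public void LightOff()
    {
      this.ellipseLight.Visibility = Visibility.Collapsed;
    }
  }
}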

The parent PlayingControl then simply keeps a dictionary mapping each note that the pitch detection algorithm can report to its NoteControl instance and it calls LightOn/LightOff on those NoteControls based on the note that it is passed as the currently playing note.

You can see how that works from the code below (NB: the Tag property is used as a cheap/cheerful way of storing the note that the NoteControl instance represents);

namespace App247.Controls
{
  using Windows.UI.Xaml;
  using Windows.UI.Xaml.Controls;
  using System.Linq;
  using System.Collections.Generic;
  using System;

  public sealed partial class PlayingControl : UserControl
  {
    public PlayingControl()
    {
      this.InitializeComponent();
      this.Loaded += OnLoaded;
    }
    void OnLoaded(object sender, RoutedEventArgs e)
    {
      this.BuildTagNoteMap();
    }
    void BuildTagNoteMap()
    {
      this.noteControlMap = new Dictionary<string, NoteControl>();

      foreach (NoteControl noteControl in
        this.canvasHarmonica.Children.Where(c => c is NoteControl))
      {
        this.noteControlMap[(string)noteControl.Tag] = noteControl;
      }
    }
    void BeforeChangeDisplayNote(string newValue)
    {
      if (this.IsDisplayableNote)
      {
        var noteControl = this.noteControlMap[this.note];
        noteControl.LightOff();
      }
      this.txtDisplayNote.Text = string.Empty;
      this.progLowCents.Value = 0;
      this.progHighCents.Value = 0;
    }
    bool IsDisplayableNote
    {
      get
      {
        return (!string.IsNullOrEmpty(this.note) &&
          this.noteControlMap.ContainsKey(this.note));
      }
    }
    void AfterChangeDisplayNote()
    {
      if (this.IsDisplayableNote)
      {
        this.txtDisplayNote.Text =
          string.IsNullOrEmpty(this.note) ? string.Empty : this.note.Split(' ')[0];

        var noteControl = this.noteControlMap[this.note];
        noteControl.LightOn();

        if (this.midiCents < 0)
        {
          this.progLowCents.Value = Math.Abs(this.midiCents);
        }
        else if (this.midiCents > 0)
        {
          this.progHighCents.Value = this.midiCents;
        }
      }
    }
    public void ChangeNote(string midiNote, int midiCents)
    {
      this.BeforeChangeDisplayNote(midiNote);
      this.note = midiNote;
      this.midiCents = midiCents;
      this.AfterChangeDisplayNote();
    }
    Dictionary<string, NoteControl> noteControlMap;
    string note;
    int midiCents;
  }
}

The only other thing going on in that code is that the pitch tracking algorithm gives me a –50 to +50 value (midiCents in the code) indicating how far the detected pitch sits from the centre of the note – i.e. how cleanly the note has been hit.

I use this to update a couple of simple progress bars as a way of trying to indicate the accuracy with which the note is being played.
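
For reference – and this is my own aside rather than anything taken from the library – the usual equal-temperament way of turning a detected frequency into a MIDI note plus a cents offset is along these lines, with MIDI note 69 being A4 at 440Hz and each semitone being 100 cents;

// My own illustration of the standard conversion - not necessarily how the
// library computes its MidiNote/MidiCents values. A cents value between -50
// and +50 says how far the pitch sits from the centre of the nearest note.
static void FrequencyToMidi(float frequencyHz, out int midiNote, out int midiCents)
{
  double exactNote = 69.0 + (12.0 * Math.Log(frequencyHz / 440.0, 2.0));

  midiNote = (int)Math.Round(exactNote);
  midiCents = (int)Math.Round((exactNote - midiNote) * 100.0);
}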

For now, that’s pretty much it – there are at least a couple of problems with what I have right now;

  1. The upper notes aren’t registering. When I play the high C and A, I find that they are being identified by the algorithm as C6, A6 whereas I think they are C7, A7.
  2. The algorithm is a bit “jittery” in that when playing notes the display flickers a little. I think I could do something with that from an Rx point of view by maybe windowing some of the data.
  3. I’m not sure if the algorithm will report any of the overblow bends because I don’t know how to play those so I can’t test it out :-)
  4. This only attempts to deal with a C harmonica. There are many other keys available :-)

But I had some fun playing with it and I may well revisit it. If the code’s of interest to you then it’s here for download (licensed under Ms-PL as above).