Conversations with the Language Understanding (LUIS) Service from Unity in Mixed Reality Apps

I’ve written quite a bit about speech interactions in the past, both on this blog and elsewhere, like these articles that I wrote for the Windows blog a couple of years ago;

Using speech in your UWP apps: It’s good to talk

Using speech in your UWP apps: From talking to conversing

Using speech in your UWP apps: Look who’s talking

which came out of earlier investigations that I did for this blog like this post;

Speech to Text (and more) with Windows 10 UWP & ‘Project Oxford’

and we talked about Speech in our Channel9 show again a couple of years ago now;

[image: our Channel9 show about speech]

and so I won’t rehash the whole topic of speech recognition and understanding here but, in the last week, I’ve been working on a fairly simple scenario that I thought I’d share the code from.

Backdrop – the Scenario

The scenario involved a Unity application, built against the “Stable .NET 3.5 Equivalent” scripting runtime and targeting both HoloLens and immersive Windows Mixed Reality headsets, where there was a need to use natural language instructions inside the app.

That is, there’s a need to;

  1. grab audio from the microphone.
  2. turn the audio into text.
  3. take the text and derive the user’s intent from the spoken text.
  4. drive some action inside of the application based on that intent.

It’s a fairly generic requirement (although the specific application is quite exciting) but in order to get this implemented there are some choices to make around technologies/APIs and around whether functionality happens in the cloud or at the edge.

Choices

When it comes to (2) there are a couple of choices in that there are layered Unity/UWP APIs that can make this happen. The preference in this scenario was to use the Unity APIs, which are the KeywordRecognizer and the DictationRecognizer for handling short and long chunks of speech respectively.

Those APIs are packaged so as to wait for a reasonable, configurable period of time for some speech to occur before delivering a ‘speech occurred’ type of event to the caller, passing the text that has been interpreted from the speech.

There’s no cost (beyond on-device resources) to using these APIs and so in a scenario which only went as far as speech-to-text it’d be quite reasonable to have these types of APIs running all the time gathering up text and then having the app decide what to do with it.
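To give a flavour of that, a bare-bones ‘always listening’ dictation loop in Unity might look something like the sketch below. This is purely illustrative rather than code from the scenario itself, and restarting the recognizer from its DictationComplete handler is just one simple way of keeping it listening;

using UnityEngine;
using UnityEngine.Windows.Speech;

public class AlwaysOnDictation : MonoBehaviour
{
    DictationRecognizer recognizer;

    void Start()
    {
        this.recognizer = new DictationRecognizer();

        // Hand any recognized text to whatever wants to consume it.
        this.recognizer.DictationResult += (text, confidence) =>
        {
            Debug.Log("Dictation heard: " + text);
        };

        // The recognizer completes when it times out on silence so simply
        // restart it in order to keep on listening.
        this.recognizer.DictationComplete += (cause) =>
        {
            this.recognizer.Start();
        };

        this.recognizer.Start();
    }
}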

However, when it comes to (3), the API of choice is LUIS which can take a piece of text like;

“I’d like to order a large pepperoni pizza please”

and can turn it into something like;

Intent: OrderPizza

Entity: PizzaType (Pepperoni)

Entity: Size (Large)

Confidence: 0.85

and so it’s a very useful thing as it takes the task of fathoming all the intricacies of natural language away from the developer.

This poses a bit of a challenge though for a ‘real time’ app in that it’s not reasonable to take every speech utterance that the user delivers and run it through the LUIS cloud service. There are a number of reasons for that including;

  1. The round-trip time from the client to the service is likely to be fairly long and so, without care, the app would have many calls in flight leading to problems with response time and complicating the code and user experience.
  2. The service has a financial cost.
  3. The user may not expect or want all of their utterances to be run through the cloud.

Consequently, it seems sensible to have some trigger in an app which signifies that the user is about to say something that is of meaning to the app and which should be sent off to the LUIS service for examination. In short, it’s the;

“Hey, Cortana”

type key phrase that lets the system know that the user has something to say.

This can be achieved in a Unity app targeting .NET 3.5 by having the KeywordRecognizer class work in conjunction with the DictationRecognizer class such that the former listens for the speech keyword (‘hey, Cortana!’) and the latter then springs into life and listens for the dictated phrase that the user wants to pass on to the app.

As an aside, it’s worth flagging that these classes are only supported by Unity on Windows 10 as detailed in the docs and that there is an isSupported flag to let the developer test this at runtime.
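For completeness, that check is pretty minimal. Something along the lines of the sketch below would do it, although note that I’m assuming the PhraseRecognitionSystem.isSupported flag here on the basis that the recognizers sit on top of that system;

using UnityEngine;
using UnityEngine.Windows.Speech;

public class SpeechSupportCheck : MonoBehaviour
{
    void Start()
    {
        // Speech recognition via these classes is only available on Windows 10.
        if (!PhraseRecognitionSystem.isSupported)
        {
            Debug.Log("Speech recognition isn't supported here, disabling voice commands");

            // e.g. disable the keyword/dictation components on this game object here.
        }
    }
}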

There’s another aside to using these two classes together in that the docs here note that different types of recognizer cannot run at the same time, that they rely on an underlying PhraseRecognitionSystem and that the system has to be Shutdown in order to switch between one type of recognizer and another.

Later on in the post, I’ll return to the idea of making a different choice around turning speech to text but for the moment, I moved forward with the DictationRecognizer.

Getting Something Built

Some of that took a little while to figure out but once it’s sorted it’s “fairly” easy to write some code in Unity which uses a KeywordRecognizer to switch on/off a DictationRecognizer in an event-driven loop so as to gather dictated text.

I chose to have the notion of a DictationSink which is just something that receives some text from somewhere. It could have been an interface but I thought that I’d bring in MonoBehaviour;

using UnityEngine;

public class DictationSink : MonoBehaviour
{
    public virtual void OnDictatedText(string text)
    {
    }
}

and so then I can write a DictationSource which surfaces a few properties from the underlying DictationRecognizer and passes on recognized text to a DictationSink;

using System;
using UnityEngine;
using UnityEngine.Windows.Speech;

public class DictationSource : MonoBehaviour
{
    public event EventHandler DictationStopped;

    public float initialSilenceSeconds;
    public float autoSilenceSeconds;
    public DictationSink dictationSink;
   
    // TODO: Think about whether this should be married with the notion of
    // a focused object rather than just some 'global' entity.

    void NewRecognizer()
    {
        this.recognizer = new DictationRecognizer();
        this.recognizer.InitialSilenceTimeoutSeconds = this.initialSilenceSeconds;
        this.recognizer.AutoSilenceTimeoutSeconds = this.autoSilenceSeconds;
        this.recognizer.DictationResult += OnDictationResult;
        this.recognizer.DictationError += OnDictationError;
        this.recognizer.DictationComplete += OnDictationComplete;
        this.recognizer.Start();
    }
    public void Listen()
    {
        this.NewRecognizer();
    }
    void OnDictationComplete(DictationCompletionCause cause)
    {
        this.FireStopped();
    }
    void OnDictationError(string error, int hresult)
    {
        this.FireStopped();
    }
    void OnDictationResult(string text, ConfidenceLevel confidence)
    {
        this.recognizer.Stop();

        if (((confidence == ConfidenceLevel.Medium) ||
            (confidence == ConfidenceLevel.High)) &&
            (this.dictationSink != null))
        {
            this.dictationSink.OnDictatedText(text);
        }
    }
    void FireStopped()
    {
        this.recognizer.DictationComplete -= this.OnDictationComplete;
        this.recognizer.DictationError -= this.OnDictationError;
        this.recognizer.DictationResult -= this.OnDictationResult;
        this.recognizer = null;

        // https://docs.microsoft.com/en-us/windows/mixed-reality/voice-input-in-unity
        // The challenge we have here is that we want to use both a KeywordRecognizer
        // and a DictationRecognizer at the same time or, at least, we want to stop
        // one, start the other and so on.
        // Unity does not like this. It seems that we have to shut down the 
        // PhraseRecognitionSystem that sits underneath them each time but the
        // challenge then is that this seems to stall the UI thread.
        // So far (following the doc link above) the best plan seems to be to
        // not call Stop() on the recognizer or Dispose() it but, instead, to
        // just tell the system to shutdown completely.
        PhraseRecognitionSystem.Shutdown();

        if (this.DictationStopped != null)
        {
            // And tell any friends that we are done.
            this.DictationStopped(this, EventArgs.Empty);
        }
    }
    DictationRecognizer recognizer;
}

Notice in that code my attempt to use PhraseRecognitionSystem.Shutdown() to really stop this recognizer when I’ve processed a single speech utterance from it.

I need to switch this recognition on/off in response to a keyword being spoken by the user and so I wrote a simple KeywordDictationSwitch class which tries to do this using KeywordRecognizer with a few keywords;

using System.Linq;
using UnityEngine;
using UnityEngine.Windows.Speech;

public class KeywordDictationSwitch : MonoBehaviour
{
    public string[] keywords = { "ok", "now", "hey", "listen" };
    public DictationSource dictationSource;

    void Start()
    {
        this.NewRecognizer();
        this.dictationSource.DictationStopped += this.OnDictationStopped;
    }
    void NewRecognizer()
    {
        this.recognizer = new KeywordRecognizer(this.keywords);
        this.recognizer.OnPhraseRecognized += this.OnPhraseRecognized;
        this.recognizer.Start();
    }
    void OnDictationStopped(object sender, System.EventArgs e)
    {
        this.NewRecognizer();
    }
    void OnPhraseRecognized(PhraseRecognizedEventArgs args)
    {
        if (((args.confidence == ConfidenceLevel.Medium) ||
            (args.confidence == ConfidenceLevel.High)) &&
            this.keywords.Contains(args.text.ToLower()) &&
            (this.dictationSource != null))
        {
            this.recognizer.OnPhraseRecognized -= this.OnPhraseRecognized;
            this.recognizer = null;

            // https://docs.microsoft.com/en-us/windows/mixed-reality/voice-input-in-unity
            // The challenge we have here is that we want to use both a KeywordRecognizer
            // and a DictationRecognizer at the same time or, at least, we want to stop
            // one, start the other and so on.
            // Unity does not like this. It seems that we have to shut down the 
            // PhraseRecognitionSystem that sits underneath them each time but the
            // challenge then is that this seems to stall the UI thread.
            // So far (following the doc link above) the best plan seems to be to
            // not call Stop() on the recognizer or Dispose() it but, instead, to
            // just tell the system to shutdown completely.
            PhraseRecognitionSystem.Shutdown();

            // And then start up the other system.
            this.dictationSource.Listen();
        }
        else
        {
            Debug.Log(string.Format("Dictation: Listening for keywords {0}, heard {1} with confidence {2}, ignored",
                string.Join(",", this.keywords),
                args.text,
                args.confidence));
        }
    }
    void StartDictation()
    {
        this.dictationSource.Listen();
    }
    KeywordRecognizer recognizer;
}

and once again I’m going through some steps to try and switch the KeywordRecognizer on/off here so that I can then switch the DictationRecognizer on/off as simply calling Stop() on a recognizer isn’t enough.

With this in place, I can now stack these components in Unity and have them use each other;

[image: the Unity editor showing the keyword, dictation and sink components wired together]

and so now I’ve got some code that listens for keywords, switches dictation on, listens for dictation and then passes that on to some DictationSink.

That’s a nice place to implement some LUIS functionality.

In doing so, I ended up writing perhaps more code than I’d have liked as I’m not sure whether there is a LUIS library that works from a Unity environment targeting the Stable .NET 3.5 subset. I’ve found this to be a challenge when calling a few Azure services from Unity and LUIS doesn’t seem to be an exception: there are client libraries on NuGet for most scenarios but I don’t think that they work in Unity (I could be wrong) and there aren’t generally examples/samples for Unity.

So…I rolled some small pieces of my own, which isn’t so hard given that the call we need to make to LUIS is just a REST call.
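For reference, the shape of that REST call is essentially just a GET with a few query parameters, something like the line below where the region, app id and key are placeholders rather than real values;

GET https://{region}.api.cognitive.microsoft.com/luis/v2.0/apps/{appId}?subscription-key={key}&verbose=true&q=I%20want%20to%20create%20a%20cube%202%20metres%20away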

Based on the most basic “GET” functionality as detailed in the LUIS docs here, I wrote some classes to represent the LUIS results;

using System;
using System.Linq;

namespace LUIS.Results
{
    [Serializable]
    public class QueryResultsIntent
    {
        public string intent;
        public float score;
    }
    [Serializable]
    public class QueryResultsResolution
    {
        public string[] values;

        public string FirstOrDefaultValue()
        {
            string value = string.Empty;
            
            if (this.values != null)
            {
                value = this.values.FirstOrDefault();
            }
            return (value);
        }
    }
    [Serializable]
    public class QueryResultsEntity
    {
        public string entity;
        public string type;
        public int startIndex;
        public int endIndex;
        public QueryResultsResolution resolution;

        public string FirstOrDefaultResolvedValue()
        {
            var value = string.Empty;

            if (this.resolution != null)
            {
                value = this.resolution.FirstOrDefaultValue();
            }

            return (value);
        }
        public string FirstOrDefaultResolvedValueOrEntity()
        {
            var value = this.FirstOrDefaultResolvedValue();

            if (string.IsNullOrEmpty(value))
            {
                value = this.entity;
            }
            return (value);
        }
    }
    [Serializable]
    public class QueryResults
    {
        public string query;
        public QueryResultsEntity[] entities;
        public QueryResultsIntent topScoringIntent;
    }
}
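For reference, the kind of (abbreviated) JSON that these classes are there to deserialize looks something like the snippet below. The values here are purely illustrative and the resolution block only shows up for certain entity types;

{
  "query": "I want to create a cube 2 metres away",
  "topScoringIntent": {
    "intent": "Create",
    "score": 0.95
  },
  "entities": [
    {
      "entity": "cube",
      "type": "shapeType",
      "startIndex": 19,
      "endIndex": 22,
      "resolution": {
        "values": [ "cube" ]
      }
    },
    {
      "entity": "2",
      "type": "builtin.number",
      "startIndex": 24,
      "endIndex": 24
    }
  ]
}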

and then wrote some code to represent a Query of the LUIS service. I wrote this on top of pieces that I borrowed from my colleague Dave’s repo over here on GitHub, which provides some Unity-compatible REST pieces with JSON serialization etc.

using LUIS.Results;
using RESTClient;
using System;
using System.Collections;

namespace LUIS
{
    public class Query
    {
        string serviceBaseUrl;
        string serviceKey;

        public Query(string serviceBaseUrl,
            string serviceKey)
        {
            this.serviceBaseUrl = serviceBaseUrl;
            this.serviceKey = serviceKey;
        }
        public IEnumerator Get(Action<IRestResponse<QueryResults>> callback)
        {
            var request = new RestRequest(this.serviceBaseUrl, Method.GET);

            request.AddQueryParam("subscription-key", this.serviceKey);
            request.AddQueryParam("q", this.Utterance);
            request.AddQueryParam("verbose", this.Verbose.ToString());
            request.UpdateRequestUrl();

            yield return request.Send();

            request.ParseJson<QueryResults>(callback);
        }        
        public bool Verbose
        {
            get;set;
        }
        public string Utterance
        {
            get;set;
        }
    }
}

and so now I can query LUIS and get results back, which makes it fairly easy to put this into a DictationSink which passes the dictated speech (in text form) off to LUIS;

using LUIS;
using LUIS.Results;
using System;
using System.Linq;
using UnityEngine.Events;

[Serializable]
public class QueryResultsEventType : UnityEvent<QueryResultsEntity[]>
{
}

[Serializable]
public class DictationSinkHandler
{
    public string intentName;
    public QueryResultsEventType intentHandler;
}

public class LUISDictationSink : DictationSink
{
    public float minimumConfidenceScore = 0.5f;
    public DictationSinkHandler[] intentHandlers;
    public string luisApiEndpoint;
    public string luisApiKey;

    public override void OnDictatedText(string text)
    {
        var query = new Query(this.luisApiEndpoint, this.luisApiKey);

        query.Utterance = text;

        StartCoroutine(query.Get(
            results =>
            {
                if (!results.IsError)
                {
                    var data = results.Data;

                    if ((data.topScoringIntent != null) &&
                        (data.topScoringIntent.score > this.minimumConfidenceScore))
                    {
                        var handler = this.intentHandlers.FirstOrDefault(
                            h => h.intentName == data.topScoringIntent.intent);

                        if (handler != null)
                        {
                            handler.intentHandler.Invoke(data.entities);
                        }
                    }
                }
            }
        ));
    }
}

and this is really just a lookup which takes the confidence score provided by LUIS, makes sure that it is high enough for our purposes and then maps the name of the top scoring LUIS intent to a function which handles that intent, set up here as a UnityEvent&lt;T&gt; so that it can be configured in the editor.

So, in use, if I have some LUIS model which has intents named Create, DeleteAll and DeleteType then I can configure an instance of this LUISDictationSink in Unity as below to map those intents to functions inside a class (named LUISIntentHandlers in this case);

[image: the Unity editor showing a LUISDictationSink with handlers configured for the Create, DeleteAll and DeleteType intents]

and then a handler for this type of interaction might look something like;

    public void OnIntentCreate(LUIS.Results.QueryResultsEntity[] entities)
    {
        // We need two pieces of information here - the shape type and
        // the distance.
        var entityShapeType = entities.FirstOrDefault(e => e.type == "shapeType");
        var entityDistance = entities.FirstOrDefault(e => e.type == "builtin.number");

        // ...
    }

and this all works fine and completes the route that goes from;

keyword recognition –> start dictation –> end dictation –> LUIS –> intent + entities –> handler in code –> action

Returning to Choices – Multi-Language & Dictation in the Cloud

I now have some code that works and it feels like the pieces are in the ‘best’ place in that I’m running as much as possible on the device and hopefully only calling the cloud when I need to. That said, if I could get the capabilities of LUIS offline and run them on the device then I’d like to do that too but it’s not something that I think you can do right now with LUIS.

However, there is one limitation to what I’m currently doing which isn’t immediately obvious: it doesn’t offer the possibility of non-English languages, specifically on HoloLens where (as far as I know) the recognizer classes only offer English support.

So, to support other languages I’d need to do my speech to text work via some other route – I can’t rely on the DictationRecognizer alone.

As an aside, it’s worth saying that I think multi-language support would need more work than just getting the speech to text to work in another language.

I think it would also require building a LUIS model in another language but that’s something that could be done.

An alternate way of performing speech-to-text that does support multiple languages would be to bring in a cloud-powered speech-to-text API like the Cognitive Services Speech API, and I could bring that into my code here by wrapping it up as a new type of DictationSource.

That speech-to-text API has some different ways of working. Specifically it can perform speech to text by;

  • Submitting an audio file in a specified format to a REST endpoint and getting back text.
  • Opening a websocket and sending chunks of streamed speech data up to the service to get back responses.

Of the two, the second has the advantage that it can be a bit smarter around detecting silence in the stream and it can also offer interim ‘hypotheses’ around what is being said before it delivers its ultimate view of what the utterance was. It can also support longer sections of speech than the file-based method.

So, this feels like a good way to go as an alternate DictationSource for my code.

However, making use of that API requires sending a stream of audio data to the cloud down a websocket in a format that is compatible with the service on the other end of the wire and that’s code I’d like to avoid writing. Ideally, it feels like the sort of code that one developer who was close to the service would write once and everyone would then re-use.

That work is already done if you’re using the service from .NET and you’re in a situation where you can make use of the client library that wraps up the service access but I don’t think that it’s going to work for me from Unity when targeting the “Stable .NET 3.5 Equivalent” scripting runtime.

So…for this post, I’m going to leave that as a potential ‘future exercise’ that I will try to return back to if time permits and I’ll update the post if I do so.

In the meantime, here’s the code.

Code

If you’re interested in the code then it’s wrapped up in a simple Unity project that’s here on github;

http://github.com/mtaulty/LUISPlayground

That code is coupled to a LUIS service which has some very basic intents and entities around creating simple Unity game objects (spheres and cubes) at a certain distance in front of the user. It’s very rough.

There are three intents inside of this service. One is intended to create objects with utterances like “I want to create a cube 2 metres away”

[image: the LUIS portal showing the intent that creates objects, with example utterances]

and then it’s possible to delete everything that’s been created with a simple utterance;

[image: the LUIS portal showing the intent that deletes everything]

and lastly it’s possible to get rid of just the spheres/cubes with a different intent such as “get rid of all the cubes”;

[image: the LUIS portal showing the intent that deletes a particular type of object]

If you wanted to make the existing code run then you’d need an API endpoint and a service key for such a service, so I’ve exported the service itself from LUIS as JSON into this file in the repo;

[image: the exported JSON file in the github repo]

so it should be possible to go to the LUIS portal and import that as a service;

[image: importing the exported JSON as an app in the LUIS portal]

and then plug in the endpoint and service key into the code here;

[image: the endpoint and key fields on the LUISDictationSink in the Unity editor]

Windows 10, UWP, IoT Core, SpeechSynthesizer, Raspberry PI and ‘Audio Popping’

A reader mailed in with a query about speech synthesis on Windows 10 and the Universal Windows Platform.

Essentially, they were doing something similar to things that I’d shown in demos in this Channel9 show about speech;

[image: our Channel9 show about speech]

and in this article on the Windows blog;

Using speech in your UWP apps: It’s good to talk

and the core of the code was to synthesize various pieces of text to speech and then play them one after another – something like the sample code below, which I wrote to try and reproduce the situation; it’s an event handler taken from a fairly blank UWP application;


    async void OnLoaded(object sender, RoutedEventArgs args)
    {
      using (var synth = new SpeechSynthesizer())
      {
        using (var mediaPlayer = new MediaPlayer())
        {
          TaskCompletionSource<bool> source = null;

          // Signal when the media player has finished playing the current piece of speech.
          mediaPlayer.MediaEnded += (s, e) =>
          {
            source.SetResult(true);
          };
          for (int i = 0; i < 100; i++)
          {
            var speechText = $"This is message number {i + 1}";
            source = new TaskCompletionSource<bool>();

            // Synthesize the text into an audio stream and hand it to the media player.
            using (var speechStream = await synth.SynthesizeTextToStreamAsync(speechText))
            {
              mediaPlayer.Source = MediaSource.CreateFromStream(speechStream, speechStream.ContentType);
              mediaPlayer.Play();
            }
            // Wait for playback to complete, then pause before the next message.
            await source.Task;
            await Task.Delay(1000);
          }
        }
      }
    }

Now, if I run that code on my PC then everything works as I would expect – I get 100 spoken messages separated by at least 1 second of silence.

However, as my reader pointed out – if I run this on Windows IoT Core on Raspberry PI (2 or 3) then each spoken message is preceded by a popping sound on the audio and it’s not something that you’d want to listen to in a real-world scenario.

I hadn’t come across this before and so did a bit of searching around and found this thread on the MSDN forums;

Clicking sound during start and stop of audio playback

and the upshot of that thread seems to be that the problem is caused by an issue in the firmware on the Raspberry PI that’s not going to be fixed and so there doesn’t seem to really be a solution there.

The thread does, though, suggest that this problem might be mitigated by using the AudioGraph APIs instead of using MediaPlayer as I’ve done in my code snippet above.

That proves to be a little more tricky though because the AudioGraph APIs seem to allow you to construct inputs from;

  • audio capture devices (CreateDeviceInputNodeAsync)
  • audio files (CreateFileInputNodeAsync)
  • audio frames generated by the app itself (CreateFrameInputNode)

and I don’t see an obvious way in which any of these can be used to model a stream of data which is what I get back when I perform Text To Speech using the SpeechSynthesizer class.

The only way to proceed would appear to be to copy the speech stream into some file stream and then have an AudioFileInputNode reading from that stream.

With that in mind, I tried to write code which would;

  1. Create a temporary file
  2. Create an audio graph consisting of a connection between
    1. An AudioFileInputNode representing my temporary file
    2. An AudioDeviceOutputNode for the default audio rendering device on the system
  3. Perform Text to Speech
  4. Write the resulting stream to the temporary file
  5. Have the AudioGraph notice that the input file had been written to, thereby causing it to play the media from that file out of the default audio rendering device on the system

and my aim here was to avoid;

  1. having to recreate either the entire AudioGraph or any of the two input/output nodes within it for each piece of speech
  2. having to create a separate temporary file for every piece of speech
  3. having to create an ever-growing temporary file containing all the pieces of speech concatenated together

and I had hoped to be able to rely on the ability of nodes in an AudioGraph (and the graph itself) all having Start/Stop/Reset methods.

In practice, I’ve yet to get this to really work. I can happily get an AudioFileInputNode to play audio from a file out through its connected output node. However, once that input node has finished playing I don’t seem to be able to find any combination of Start/Stop/Reset/Seek which will get it to play subsequent audio that arrives in the file when my code alters the file contents.

The closest that I’ve got to working code is what follows below where I create new AudioFileInputNode instances for each piece of speech that is to be spoken.

    async void OnLoaded(object sender, RoutedEventArgs args)
    {
      var temporaryFile = await TemporaryFileCreator.CreateTemporaryFileAsync();

      using (var speechSynthesizer = new SpeechSynthesizer())
      {
        var graphResult = await AudioGraph.CreateAsync(new AudioGraphSettings(AudioRenderCategory.Media));

        if (graphResult.Status == AudioGraphCreationStatus.Success)
        {
          using (var graph = graphResult.Graph)
          {
            var outputResult = await graph.CreateDeviceOutputNodeAsync();

            if (outputResult.Status == AudioDeviceNodeCreationStatus.Success)
            {
              graph.Start();

              using (var outputNode = outputResult.DeviceOutputNode)
              {
                for (int i = 0; i < 100; i++)
                {
                  var speechText = $"This is message number {i + 1}";

                  await speechSynthesizer.SynthesizeTextToFileAsync(speechText, temporaryFile);

                  // TBD: I want to avoid this creating of 100 input file nodes but
                  // I don't seem (yet) to be able to get away from it so right now
                  // I keep creating new input nodes over the same file which changes
                  // every iteration of the loop.
                  var inputResult = await graph.CreateFileInputNodeAsync(temporaryFile);

                  if (inputResult.Status == AudioFileNodeCreationStatus.Success)
                  {
                    using (var inputNode = inputResult.FileInputNode)
                    {
                      inputNode.AddOutgoingConnection(outputNode);
                      await inputNode.WaitForFileCompletedAsync();
                    }
                  }
                  await Task.Delay(1000);
                }
              }
              graph.Stop();
            }
          }
        }
      }
      await temporaryFile.DeleteAsync();
    }

and that code depends on a class that can create temporary files;

  public static class TemporaryFileCreator
  {
    public static async Task<StorageFile> CreateTemporaryFileAsync()
    {
      var fileName = $"{Guid.NewGuid()}.bin";

      var storageFile =
        await ApplicationData.Current.TemporaryFolder.CreateFileAsync(fileName);

      return (storageFile);
    }
  }

and also on an extension to the SpeechSynthesizer which will take the speech and write it to a file;

 public static class SpeechSynthesizerExtensions
  {
    public static async Task<StorageFile> SynthesizeTextToTemporaryFileAsync(this SpeechSynthesizer synthesizer, string text)
    {
      var storageFile = await TemporaryFileCreator.CreateTemporaryFileAsync();

      await SynthesizeTextToFileAsync(synthesizer, text, storageFile);

      return (storageFile);
    }
    public static async Task SynthesizeTextToFileAsync(this SpeechSynthesizer synthesizer, string text, StorageFile file)
    {
      using (var speechStream = await synthesizer.SynthesizeTextToStreamAsync(text))
      {
        using (var fileStream = await file.OpenAsync(FileAccessMode.ReadWrite))
        {
          await RandomAccessStream.CopyAndCloseAsync(speechStream, fileStream);
        }
      }
    }
  }

and also on an extension to the AudioFileInputNode class which takes the FileCompleted event that it fires and turns it into something that can be awaited;

  public static class AudioInputFileNodeExtensions
  {
    public static async Task WaitForFileCompletedAsync(this AudioFileInputNode inputNode)
    {
      TypedEventHandler<AudioFileInputNode, object> handler = null;
      TaskCompletionSource<bool> completed = new TaskCompletionSource<bool>();

      handler = (s, e) =>
      {
        s.FileCompleted -= handler;
        completed.SetResult(true);
      };
      inputNode.FileCompleted += handler;

      await completed.Task;
    }
  }

This code seems to work fine on both PC and Raspberry PI. On Raspberry PI I still get an audible ‘pop’ when the code first starts up but I don’t then get an audible ‘pop’ for every piece of speech – i.e. it feels like the situation is improved but not perfect. I’d ideally like to;

  • Get rid of the code that ends up creating N AudioFileInputNode instances rather than 1 and somehow make the Start/Stop/Reset/Seek approach work.
  • Get rid of that initial audible ‘pop’.

I’ll update the post if I manage to come up with a better solution and do feel very free to add comments below if you know of either a solution to the original problem or a better solution than I’ve found to date…

“Project Oxford”–Speaker Identification from a Windows 10/UWP App

Following up on this earlier post, I wanted to get a feel for what the speaker identification part of the “Project Oxford” speaker recognition APIs looks like, having toyed with verification in that previous post.

It’s interesting to see the difference between the capability of the two areas of functionality and how it shapes the APIs that the service offers.

For verification, a profile is built by capturing the user repeating one of a set of supported phrases 3 times over and submitting the captured audio. These are short phrases. Once the profile is built, the user can be prompted to submit a phrase that can be tested against the profile for a ‘yes/no’ match.

Identification is a different beast. The enrolment phase involves building a profile by capturing the user talking for 60 seconds and submitting that audio to the service for analysis. It’s worth saying that the 60 seconds doesn’t all have to be captured at once but the minimum duration is 20 seconds.

The service then processes that speech and provides a ‘call me back’ style endpoint which the client must poll to later gather the results. It’s possible that the results of processing will be a request for more speech to analyse in order to complete the profile and so there’s a possibility of looping to build the profile.

Once the profile is built, identification is achieved by submitting another 60 seconds of the user speaking along with (at the time of writing) a list of up to 10 profiles to check against.

So, while it’s possible to build up to 1000 profiles at the service, identification only runs against 10 of them at a time right now.

Again, this submission results in a ‘call me back’ URL which the client can return to later for results.
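As a rough sketch of how a client deals with that ‘call me back’ style of API, the pattern is essentially to poll the operation URL until the service reports a terminal status. Something like the code below would do it, although note that the status strings and the idea that the operation URL arrives via an Operation-Location style header are my reading of the docs rather than code lifted from my sample;

  using System;
  using System.Net.Http;
  using System.Threading.Tasks;

  public static class OxfordOperationPoller
  {
    // Polls an "Oxford" operation URL (handed back from the enrolment or
    // identification POST via an Operation-Location style header) until the
    // service reports a terminal status, then returns the raw JSON.
    public static async Task<string> PollForResultAsync(Uri operationUrl, string apiKey)
    {
      using (var client = new HttpClient())
      {
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", apiKey);

        while (true)
        {
          var json = await client.GetStringAsync(operationUrl);

          // In real code, deserialize this and check the status property rather
          // than doing the crude string matching I'm doing here.
          if (json.Contains("\"succeeded\"") || json.Contains("\"failed\""))
          {
            return (json);
          }
          await Task.Delay(TimeSpan.FromSeconds(5));
        }
      }
    }
  }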

Clearly, identification is a much harder problem to solve than verification and it’s reflected in the APIs here although I suspect that, over time, the amount of speech required and the number of profiles that can be checked in one call will change.

In terms of actually calling the APIs, it would be worth referring back to my previous post because it talks about where to find the official (non-UWP) samples and has links across to the “Oxford” documentation, whereas what I’m doing here is adapting my previous code to work with the identification APIs rather than the verification ones.

In doing that, I made my little test app speech-centric rather than mouse/keyboard-centric and it ended up working as shown in the video below (NB: the video has 2+ minutes of me reading from a script on the web, feel free to jump around to skip those bits);

In most of my tests, I found that I had to submit more than 1 batch of speech as part of the enrolment phase but I got a little lucky with this example that I recorded and enrolment happened in one go which surprised me.

Clearly, I’d need to go gather a slightly larger user community for this than 1 person to get a better test on it but it seems like it’s working reasonably here.

I’ve posted the code for this here for download – it’s fairly rough-and-ready and there’s precious little error handling in there plus it’s more of a code-behind sample lacking in much structure.

As before, if you want to build this out yourself you’ll need an API key for the “Oxford” API and you’ll need it to get the file named keys.cs to compile.