Speech to Text (and more) with Windows 10 UWP & ‘Project Oxford’

We’re increasingly talking to machines and, more importantly, they’re increasingly listening and even starting to understand.

In the world of the Windows 10 Universal Windows Platform, there’s the SpeechRecognizer class, which can do speech recognition on the device without necessarily calling off to the cloud, and it has a number of different capabilities.

The recognizer can be invoked either to recognise a discrete ‘piece’ of speech at a particular point in time or it can be invoked to continuously listen to speech. For the former case, it can also show standard UI that the user would recognise across apps or it can show custom UI of the developer’s choice (or none at all if that’s appropriate).

As a simple example, here’s a piece of code that sets a few options around timeouts and UI and then displays the system UI to recognise speech – assume that it’s invoked from a Button and that there is a TextBlock called txtResults to store the results into;

  public sealed partial class MainPage : Page
  {
    public MainPage()
    {
      this.InitializeComponent();
    }
    async void OnListenAsync(object sender, RoutedEventArgs e)
    {
      this.recognizer = new SpeechRecognizer();
      await this.recognizer.CompileConstraintsAsync();

      this.recognizer.Timeouts.InitialSilenceTimeout = TimeSpan.FromSeconds(5);
      this.recognizer.Timeouts.EndSilenceTimeout = TimeSpan.FromSeconds(20);
      this.recognizer.Timeouts.BabbleTimeout = TimeSpan.FromSeconds(5);

      this.recognizer.UIOptions.AudiblePrompt = "Say whatever you like, I'm listening";
      this.recognizer.UIOptions.ExampleText = "The quick brown fox jumps over the lazy dog";
      this.recognizer.UIOptions.ShowConfirmation = true;
      this.recognizer.UIOptions.IsReadBackEnabled = true;

      var result = await this.recognizer.RecognizeWithUIAsync();

      if (result != null)
      {
        StringBuilder builder = new StringBuilder();

        builder.AppendLine(
          $"I have {result.Confidence} confidence that you said [{result.Text}] " +
          $"and it took {result.PhraseDuration.TotalSeconds} seconds to say it " +
          $"starting at {result.PhraseStartTime:g}");

        var alternates = result.GetAlternates(10);

        builder.AppendLine(
          $"There were {alternates?.Count} alternates - listed below (if any)");

        if (alternates != null)
        {
          foreach (var alternate in alternates)
          {
            builder.AppendLine(
              $"Alternate {alternate.Confidence} confident you said [{alternate.Text}]");
          }
        }
        this.txtResults.Text = builder.ToString();
      }
    }
    SpeechRecognizer recognizer;
  }

and here’s that code running;

and it’s easy to take away the UIOptions and swap the call to RecognizeWithUIAsync() for a call to RecognizeAsync(), which has the system drop its UI and lets me, perhaps, provide my own in the form of a ProgressBar. Here’s that code;

  public sealed partial class MainPage : Page
  {
    public MainPage()
    {
      this.InitializeComponent();
    }
    async void OnListenAsync(object sender, RoutedEventArgs e)
    {
      this.recognizer = new SpeechRecognizer();
      await this.recognizer.CompileConstraintsAsync();

      this.recognizer.Timeouts.InitialSilenceTimeout = TimeSpan.FromSeconds(5);
      this.recognizer.Timeouts.EndSilenceTimeout = TimeSpan.FromSeconds(20);
      this.recognizer.Timeouts.BabbleTimeout = TimeSpan.FromSeconds(5);

      this.txtResults.Text = string.Empty;

      this.progressBar.Visibility = Visibility.Visible;

      var result = await this.recognizer.RecognizeAsync();

      if (result != null)
      {
        StringBuilder builder = new StringBuilder();

        builder.AppendLine(
          $"I have {result.Confidence} confidence that you said [{result.Text}] " +
          $"and it took {result.PhraseDuration.TotalSeconds} seconds to say it " +
          $"starting at {result.PhraseStartTime:g}");

        var alternates = result.GetAlternates(10);

        builder.AppendLine(
          $"There were {alternates?.Count} alternates - listed below (if any)");

        if (alternates != null)
        {
          foreach (var alternate in alternates)
          {
            builder.AppendLine(
              $"Alternate {alternate.Confidence} confident you said [{alternate.Text}]");
          }
        }
        this.txtResults.Text = builder.ToString();
      }
      this.progressBar.Visibility = Visibility.Collapsed;
    }
    SpeechRecognizer recognizer;
  }

and here it is running, although I think it’s fair to say that my “UI” is not really giving the user much of a clue about what they’re expected to do here.

Unless I’m building a dictation program, I might want to guide the speech recognition engine and use more of a ‘command’ mode.

For instance, if I wanted to start implementing the classic interactive programming environment of Logo with voice control then I might want commands like “left” or “right” or something along those lines. I’ve dropped an image into my UI with a RotateTransform (named rotateTransform) applied to it and I’ve changed my code;

  public sealed partial class MainPage : Page
  {
    public MainPage()
    {
      this.InitializeComponent();
    }
    async void OnListenAsync(object sender, RoutedEventArgs e)
    {
      this.recognizer = new SpeechRecognizer();

      var commands = new Dictionary<string, int>()
      {
        ["left"] = -90,
        ["right"] = 90
      };

      this.recognizer.Constraints.Add(new SpeechRecognitionListConstraint(
        commands.Keys));

      await this.recognizer.CompileConstraintsAsync();

      var result = await this.recognizer.RecognizeAsync();

      if ((result != null) && (commands.ContainsKey(result.Text)))
      {
        this.rotateTransform.Angle += commands[result.Text];
      }
    }
    SpeechRecognizer recognizer;
  }

and that lets me spin the turtle;

but I probably want to move away from having to press a button every time I want the SpeechRecognizer to listen to me and use more of a ‘continuous’ speech recognition session.

I can do that while preserving my constrained set of commands that are being listened for;

  public sealed partial class MainPage : Page
  {
    public MainPage()
    {
      this.InitializeComponent();
      this.Loaded += OnLoaded;
    }

    async void OnLoaded(object sender, RoutedEventArgs args)
    {
      this.recognizer = new SpeechRecognizer();

      var commands = new Dictionary<string, int>()
      {
        ["left"] = -90,
        ["right"] = 90
      };

      this.recognizer.Constraints.Add(new SpeechRecognitionListConstraint(
        commands.Keys));

      await this.recognizer.CompileConstraintsAsync();

      this.recognizer.ContinuousRecognitionSession.ResultGenerated +=
        async (s, e) =>
        {
          if ((e.Result != null) && (commands.ContainsKey(e.Result.Text)))
          {
            await this.Dispatcher.RunAsync(CoreDispatcherPriority.Normal,
              () =>
              {
                this.rotateTransform.Angle += commands[e.Result.Text];
              }
            );
          }
          // Resume unconditionally here - if an unrecognised phrase left the
          // session paused, it would never listen again.
          this.recognizer.ContinuousRecognitionSession.Resume();
        };

      await this.recognizer.ContinuousRecognitionSession.StartAsync(
        SpeechContinuousRecognitionMode.PauseOnRecognition);
    }
    SpeechRecognizer recognizer;
  }

and now the turtle just ‘obeys’ without me having to press any buttons etc;

but it might be nice to accept a more natural form of input here where the user might say something like;

“make the turtle turn left”

where the only bit of the speech that the code is really interested in is the “left” part but the presentation to the user is more natural.

This is perhaps where we can start to guide the recogniser with a bit of a grammar; the engine here understands SRGS (there’s a better guide on MSDN).

I made a simple grammar;

<?xml version="1.0" encoding="UTF-8"?>
<grammar 
  version="1.0" mode="voice" root="commands"
  xml:lang="en-US" tag-format="semantics/1.0"  
  xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="commands" scope="public">
    <item>make the turtle turn</item>
    <ruleref uri="#direction" />
    <tag> out.rotation=rules.latest(); </tag>
  </rule>
  <rule id="direction">
    <one-of>
      <item>
        <tag>out="left";</tag>
        <one-of>
          <item>left</item>
          <item>anticlockwise</item>
          <item>banana</item>
        </one-of>
      </item>
      <item>
        <tag>out="right";</tag>
        <one-of>
          <item>right</item>
          <item>clockwise</item>
          <item>cheese</item>
        </one-of>
      </item>
    </one-of>
  </rule>
</grammar>

and added it to my project as a file grammar.xml and then changed my code a little – I now have three terms defined in the grammar for each of “left” and “right” but in the code I only need to deal with “left” and “right” coming from the engine;

  public sealed partial class MainPage : Page
  {
    public MainPage()
    {
      this.InitializeComponent();
      this.Loaded += OnLoaded;
    }

    async void OnLoaded(object sender, RoutedEventArgs args)
    {
      this.recognizer = new SpeechRecognizer();

      var grammarFile = await StorageFile.GetFileFromApplicationUriAsync(
        new Uri("ms-appx:///grammar.xml"));

      this.recognizer.Constraints.Add(
        new SpeechRecognitionGrammarFileConstraint(grammarFile));

      await this.recognizer.CompileConstraintsAsync();

      this.recognizer.ContinuousRecognitionSession.ResultGenerated +=
        async (s, e) =>
        {
          // Properties is a dictionary, so use TryGetValue rather than risking
          // a missing "rotation" key throwing.
          IReadOnlyList<string> rotationList = null;

          e.Result?.SemanticInterpretation?.Properties?.TryGetValue(
            "rotation", out rotationList);

          var rotation = rotationList?.FirstOrDefault();

          if (!string.IsNullOrEmpty(rotation))
          {
            var angle = 0;

            switch (rotation)
            {
              case "left":
                angle = -90;
                break;
              case "right":
                angle = 90;
                break;
              default:
                break;
            }
            await this.Dispatcher.RunAsync(CoreDispatcherPriority.Normal,
              () =>
              {
                this.rotateTransform.Angle += angle;
              });
          }
          this.recognizer.ContinuousRecognitionSession.Resume();
        };

      await this.recognizer.ContinuousRecognitionSession.StartAsync(
        SpeechContinuousRecognitionMode.PauseOnRecognition);
    }
    SpeechRecognizer recognizer;
  }

and then I can try that out;

and that seems to work pretty well.

That’s a number of different options that can all be exercised on the client device without coding against a cloud service but, of course, those cloud services are out there and are cross-platform, so let’s try some of that functionality out…

Adding Cloud

In terms of the services that I want to add to the post here, I’m looking at what’s available under the banner of ‘Project Oxford’ and specifically the Speech APIs and the Speaker Recognition APIs.

All of these are in preview right now and there’s a need to grab an API key to make use of them which you can do on the site itself.

Speech Recognition

There’s a page that puts the Windows 10 and ‘Oxford’ speech capabilities together in one place but I must admit that I find it quite confusing as the examples given seem to be more about the UWP SpeechRecognizer than about ‘Oxford’.

With the right couple of clicks though you can jump to the class SpeechRecognitionServiceFactory in the docs which seems to provide the starting point for Oxford-based speech recognition.

That said, for a UWP developer there’s a bit of a blocker at the time of writing in that the Oxford client libraries don’t have UWP support;

Github thread about UWP support

and so you either need to drop down to the desktop and code for WPF or Windows Forms, or make the calls to the REST API yourself without a client library.

My choice was to go with WPF given that it looks like UWP support is on its way and (I’d imagine) the APIs will end up looking similar for UWP.

So, I made a WPF application and added in the NuGet package;


and I went back to my original scenario of recognising some freeform speech and made a UI with a Start button, a Stop button and a TextBlock to display the recognition results, and that gave me some code that looked like this;

  public partial class MainWindow : Window
  {
    public MainWindow()
    {
      InitializeComponent();
    }

    private void OnStart(object sender, RoutedEventArgs e)
    {
      this.client = SpeechRecognitionServiceFactory.CreateMicrophoneClient(
        SpeechRecognitionMode.ShortPhrase,
        "en-GB",
        Constants.Key);

      this.client.OnPartialResponseReceived += OnPartialResponse;
      this.client.OnResponseReceived += OnResponseReceived;
      this.client.OnConversationError += Client_OnConversationError;

      this.client.StartMicAndRecognition();
    }
    private void Client_OnConversationError(object sender, SpeechErrorEventArgs e)
    {
      this.Dispatch(() =>
      {
        this.txtResults.Text = $"Some kind of problem {e.SpeechErrorText}";
      });
    }
    void OnResponseReceived(object sender, SpeechResponseEventArgs e)
    {
      if (e.PhraseResponse.RecognitionStatus != RecognitionStatus.RecognitionSuccess)
      {
        this.Dispatch(() =>
        {
          this.txtResults.Text = $"Some kind of problem {e.PhraseResponse.RecognitionStatus}";
        });
      }
      else
      {
        StringBuilder builder = new StringBuilder();

        foreach (var response in e.PhraseResponse.Results)
        {
          builder.AppendLine(
            $"We have [{response.Confidence}] confidence that you said [{response.DisplayText}]");
        }
        this.Dispatch(() =>
        {
          this.txtResults.Background = Brushes.LimeGreen;
          this.txtResults.Text = builder.ToString();
        });
      }
    }

    void OnPartialResponse(object sender, PartialSpeechResponseEventArgs e)
    {
      this.Dispatch(() =>
      {
        this.txtResults.Background = Brushes.Orange;
        this.txtResults.Text = $"Partial result: {e.PartialResult}";
      });
    }
    void Dispatch(Action a)
    {
      this.Dispatcher.Invoke(a);
    }

    private void OnStop(object sender, RoutedEventArgs e)
    {
      this.client.EndMicAndRecognition();
      this.client.Dispose();
    }
    MicrophoneRecognitionClient client;
  }

and that produces results that look like;

One of the interesting things here is that there’s a sibling API to CreateMicrophoneClient called CreateDataClient which divorces the voice capture from the speech recognition such that you can bring your own audio streams/files to the API – that’s a nice thing and, as far as I know, it’s not there in the UWP APIs.
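
As a rough sketch of the pattern that CreateDataClient enables – reading an audio file in chunks and pushing each buffer at the recognizer – here’s the chunking loop on its own. I’ve hidden the Oxford client’s calls behind a delegate, and the SendAudio/EndAudio names in the comments are my reading of the docs rather than tested code;

```csharp
using System;
using System.IO;

class AudioStreamer
{
  // Reads a file in fixed-size chunks and hands each buffer to 'sendAudio'.
  // With the Oxford library, 'sendAudio' would wrap something like
  // client.SendAudio(buffer, bytesRead) - an assumed API name here.
  public static int StreamFile(string path, Action<byte[], int> sendAudio)
  {
    int chunks = 0;

    using (var stream = File.OpenRead(path))
    {
      var buffer = new byte[1024];
      int bytesRead;

      while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
      {
        sendAudio(buffer, bytesRead);
        chunks++;
      }
    }
    // ...and then the client would be told the utterance has ended,
    // e.g. client.EndAudio() - again, an assumed API name.
    return chunks;
  }

  static void Main()
  {
    // Stand-in for a real WAV file.
    var path = Path.GetTempFileName();
    File.WriteAllBytes(path, new byte[2500]);

    int totalBytes = 0;
    int chunks = AudioStreamer.StreamFile(path, (buffer, count) => totalBytes += count);

    Console.WriteLine($"{chunks} chunks, {totalBytes} bytes");
    File.Delete(path);
  }
}
```

The point of the shape is that the recognizer never needs to own the microphone – anything that can produce buffers of PCM audio can feed it.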

In the code above, I’m using the ShortPhrase speech mode, which limits me to 20 seconds of speech, but there is also the LongDictation mode which allows for up to 2 minutes.

The functionality here feels like it sits somewhere between the UWP SpeechRecognizer’s two modes of discrete and continuous recognition – I call Start, events are fired as recognition progresses and then I call End.

Unlike the UWP APIs, I don’t think there’s a way to guide these APIs in their recognition, either with a constraint list or with an SRGS grammar, but there is another piece of powerful functionality lurking here and that’s the ability to use a cloud service to analyse the intent of the language being used.

Adding Intent – LUIS

If I substitute the call that I made to CreateMicrophoneClient with one to CreateMicrophoneClientWithIntent then this allows me to link up with a language ‘model’ that has been built using the Language Understanding Intelligent Service (or LUIS).

There’s a video on how this works up here on the web;

LUIS Video

and I certainly found that to be a good thing to watch as I hadn’t tried out LUIS before.

Based on a couple of minutes of watching that video, I made myself a basic LUIS model based around my previous example of a Logo turtle and wanting to turn it left, right, clockwise, anti-clockwise and so on.

I only have one entity in the model (the turtle) and I have a couple of intents (left, right and none) and I published that model and grabbed the application Id and key such that I could insert it here into the client-side code;

using Microsoft.ProjectOxford.SpeechRecognition;
using Newtonsoft.Json.Linq;
using System;
using System.Text;
using System.Windows;
using System.Windows.Media;
using System.Linq;

namespace WpfApplication26
{
  public partial class MainWindow : Window
  {
    public MainWindow()
    {
      InitializeComponent();
    }

    private void OnStart(object sender, RoutedEventArgs e)
    {
      this.client = SpeechRecognitionServiceFactory.CreateMicrophoneClientWithIntent(
        "en-GB",
        Constants.Key,
        Constants.LuisAppId,
        Constants.LuisKey);

      this.client.OnIntent += OnIntent;
      this.client.OnPartialResponseReceived += OnPartialResponse;
      this.client.OnResponseReceived += OnResponseReceived;
      this.client.OnConversationError += Client_OnConversationError;

      this.client.StartMicAndRecognition();
    }
    void OnIntent(object sender, SpeechIntentEventArgs args)
    {
      this.Dispatch(() =>
      {
        this.txtIntent.Text = args.Intent?.Body;
      });

      // The service hands this back as raw JSON so we have to parse it. Ho hum.
      var body = args.Intent?.Body;

      if (string.IsNullOrEmpty(body))
      {
        return;
      }
      var result = JObject.Parse(body);
      var intents = result["intents"] as JArray;
      var entities = result["entities"] as JArray;
      var topIntent = intents?.OrderByDescending(i => (float)i["score"]).FirstOrDefault();
      var topEntity = entities?.OrderByDescending(e => (float)e["score"]).FirstOrDefault();

      if ((topIntent != null) && (topEntity != null))
      {
        var entityName = (string)topEntity["entity"];
        var intentName = (string)topIntent["intent"];

        if (entityName == "turtle")
        {
          int angle = 0;

          switch (intentName)
          {
            case "left":
              angle = -90;
              break;
            case "right":
              angle = 90;
              break;
            default:
              break;
          }
          this.Dispatch(() =>
          {
            this.rotateTransform.Angle += angle;
          });
        }
      }
    }

    private void Client_OnConversationError(object sender, SpeechErrorEventArgs e)
    {
      this.Dispatch(() =>
      {
        this.txtResults.Text = $"Some kind of problem {e.SpeechErrorText}";
      });
    }
    void OnResponseReceived(object sender, SpeechResponseEventArgs e)
    {
      if (e.PhraseResponse.RecognitionStatus != RecognitionStatus.RecognitionSuccess)
      {
        this.Dispatch(() =>
        {
          this.txtResults.Text = $"Some kind of problem {e.PhraseResponse.RecognitionStatus}";
        });
      }
      else
      {
        StringBuilder builder = new StringBuilder();

        foreach (var response in e.PhraseResponse.Results)
        {
          builder.AppendLine(
            $"We have [{response.Confidence}] confidence that you said [{response.DisplayText}]");
        }
        this.Dispatch(() =>
        {
          this.txtResults.Background = Brushes.LimeGreen;
          this.txtResults.Text = builder.ToString();
        });
      }
    }

    void OnPartialResponse(object sender, PartialSpeechResponseEventArgs e)
    {
      this.Dispatch(() =>
      {
        this.txtResults.Background = Brushes.Orange;
        this.txtResults.Text = $"Partial result: {e.PartialResult}";
      });
    }
    void Dispatch(Action a)
    {
      this.Dispatcher.Invoke(a);
    }

    private void OnStop(object sender, RoutedEventArgs e)
    {
      this.client.EndMicAndRecognition();
      this.client.Dispose();
    }
    MicrophoneRecognitionClient client;
  }
}

and the key functionality here is in the OnIntent handler, where the service returns a bunch of JSON telling me which intents and entities it thinks it has detected and with what confidence values.
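
For reference, that JSON has roughly this shape – the field names below (intents, entities, intent, entity, score) are the ones the parsing code relies on, though this is an illustrative payload rather than a verbatim capture from the service;

```json
{
  "query": "make the turtle turn left",
  "intents": [
    { "intent": "left", "score": 0.92 },
    { "intent": "right", "score": 0.05 },
    { "intent": "None", "score": 0.03 }
  ],
  "entities": [
    { "entity": "turtle", "score": 0.95 }
  ]
}
```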

I can then act upon that information to try and rotate the turtle as demonstrated in the video below;

In some ways, using LUIS here feels a little like using a much less formal SRGS grammar with the on-device UWP SpeechRecognizer, but the promise of LUIS is that it can be trained – over time it gets presented with more “utterances” (i.e. speech) and an administrator can go in and mark them up so as to improve the model.

Speaker Recognition

Something that’s definitely not present in the UWP is the ‘Project Oxford’ ability to try and identify a user from their voice in one of two modes;

    1. Confirm that a speaker is a ‘voice match’ with a previously enrolled speaker.
    2. Identify a speaker by ‘voice match’ against some set of previously enrolled speakers.

This all feels very ‘James Bond’ and I’m keen to try it out, but I’ll push it to a follow-on post because this post is getting long and, as far as I can tell, there’s no client library wrapper here to dress up the REST calls, so it’s likely to need a little bit of work.
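
Since those REST calls will need dressing up by hand, a first step would presumably be creating a profile to enrol a speaker against. As a sketch only – the spid endpoint and request body here are my reading of the preview documentation and may well be wrong – building such a request with HttpClient might look like this;

```csharp
using System;
using System.Net.Http;
using System.Text;

class SpeakerVerificationSketch
{
  // Assumed base endpoint for the preview Speaker Recognition API.
  const string BaseUri = "https://api.projectoxford.ai/spid/v1.0";

  // Builds (but does not send) a request to create a new verification profile.
  public static HttpRequestMessage BuildCreateProfileRequest(string apiKey)
  {
    var request = new HttpRequestMessage(
      HttpMethod.Post, $"{BaseUri}/verificationProfiles");

    // 'Project Oxford' services authenticate via this Azure API Management header.
    request.Headers.Add("Ocp-Apim-Subscription-Key", apiKey);

    // Assumed body shape - a locale for the new profile.
    request.Content = new StringContent(
      "{ \"locale\": \"en-US\" }", Encoding.UTF8, "application/json");

    return request;
  }

  static void Main()
  {
    var request = BuildCreateProfileRequest("my-api-key");
    Console.WriteLine($"{request.Method} {request.RequestUri}");
  }
}
```

Sending it would then just be a matter of awaiting HttpClient.SendAsync(request) and pulling the new profile’s Id out of the JSON response.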

Customising Language & Acoustic Models

‘Project Oxford’ also offers the possibility of customising language and acoustic models via its CRIS service, but that one is in “private preview” right now and I’m not on that preview, so there’s nothing I can say other than that I like these acronyms and I hope there’s a ‘Louis’ and a ‘Chris’ on the ‘Project Oxford’ team somewhere.

Summary

This became a long post and it’s mostly ‘Hello World’ type stuff, but there are some things in here about speech recognition working on-device in the UWP case and off-device (and cross-platform) in the ‘Project Oxford’ case. The main point I’d make is that these technologies are very much ‘out there’ and very much commodity, and yet I don’t see a lot of apps using them (yet) to great advantage.

Naturally, there’s also Cortana – I’ve written about ‘her’ before on the blog – but that’s another, related aspect of what I’ve written here.

7 thoughts on “Speech to Text (and more) with Windows 10 UWP & ‘Project Oxford’”

  1. Perfect timing! I am just starting a voice intensive app and was wondering where to start.

    Thanks very much for this one.

    1. Trying out the initial code in the first example: if I try it three times in a row, it immediately returns with a rejected response. It doesn’t matter whether I use the UI or not. Did that happen to anyone else?

  2. Fantastic post with lots of useful examples. I created a proof of concept for an app my wife said she’d find useful for work, using the UWP Speech Recognizer.

    Before this, I’d used Speech Recognition in a WPF Kinect (v1) project a good few years back; we used a Grammar for that to make the commands more flexible, but I haven’t done anything similar for quite a while now.

    I’m particularly interested in exploring the Project Oxford speech recognition to achieve the best possible accuracy for free-form dictation; I may revisit the POC I created that did the recognition locally and try sending that speech via Oxford.
