Windows Phone 8.1, Continuing a Conversation Started with Cortana

I wrote a little in this previous blog post about the basics of integrating Cortana into a Windows Phone 8.1 application. My specific example at the time was one of using the Bing synonym API service to bring back synonyms for a word that the user had spoken.

Ultimately, this boiled down to being able to hold down the search button on the phone to wake up Cortana and then being able to say something like;

“SYNONYMS FOR {WORD}”

where {WORD} is the placeholder for whatever word the user wanted to get a list of synonyms for and the app then goes off and hits the Bing synonym API to get that list and display it in front of the user.

That’s all good and the platform does 99% of the work for me in getting that done but there’s a scenario where I might want to continue the conversation.

As an example, I might want to have an app that took speech like;

“PICTURES OF {NOUN}” ( phrase 1 )

where {NOUN} is the subject of the pictures that the user wants to search for – for example, “cars” – but they may want to carry on this interaction with something like;

“EXCLUDE {FILTER}” ( phrase 2 )

where {FILTER} is the set of results that they want to take out of what’s already been brought back – for example, “red cars”.

It’s something that I’ve been wanting to experiment with but it has a number of implications including;

  1. How long the app continues to listen after ( phrase 1 ) above for a potential follow-on ( phrase 2 ). There’s more than likely a cost to listening in terms of battery usage and, if the user puts their phone down on the desk for 5 minutes, they are probably not expecting that the app is still listening to whatever they are doing.
  2. The UX – if the app is to continue listening then it probably needs some kind of visual cue to the user that it is still listening and also perhaps some way of cancelling and retrying that listening.
  3. The platform – i.e. I don’t believe that the pieces for interaction with Cortana currently handle this type of follow-on scenario. Cortana can be used to launch the app with voice-provided parameters but she doesn’t then provide a way of continuing a conversation once that initial interaction has taken place. Any follow-on dialog is going to have to work a different way.

In order to experiment with this, I put together a little app using my standard practice of searching for photos on the flickR online service. I thought the easiest way to explain the app would be to show a little screen capture of it running, so here it is below;

[Screen capture of the app running]

While it’s quite “rough and ready”, it has just enough in it to represent this idea of functionality that might be initiated by speech and then followed up with more speech, i.e.

  • “FLICKR SHOW ME PICTURES OF {TOPIC}”

to launch the application but then I might want to follow that on with;

  • “SHOW ME {SUBTOPIC}”

as a simple example of constraining the results that have been brought back but (as far as I know) Cortana doesn’t give me that functionality so I need to take different steps.
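
For reference, the launch side of this is handled by a VCD entry along the lines of the sketch below. This isn’t the exact file from the app – the command name, labels and Scenario value are just illustrative;

<?xml version="1.0" encoding="utf-8"?>
<VoiceCommands xmlns="http://schemas.microsoft.com/voicecommands/1.1">
  <CommandSet xml:lang="en-gb" Name="englishCommands">
    <CommandPrefix>flickr</CommandPrefix>
    <Example>show me pictures of cars</Example>
    <Command Name="showPictures">
      <Example>show me pictures of cars</Example>
      <ListenFor>show me pictures of {topic}</ListenFor>
      <Feedback>looking for pictures...</Feedback>
      <Navigate />
    </Command>
    <PhraseTopic Label="topic" Scenario="Search" />
  </CommandSet>
</VoiceCommands>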

One Thing to Flag…

One thing I’d flag because it always hits me when working with any of these speech recognition bits is this paragraph taken directly from MSDN;

To activate an app and initiate an action using a voice command, the app must register a VCD file that contains a CommandSet with a language that matches the speech language that the user selected on the phone. This language is set by the user on the phone's Settings > System > Speech > Speech Language screen.

I often find that I bang up against some language/culture mismatch between my phone, the defaults for my app and the specific language that I’ve put into some kind of speech/voice file. I also often find that the speech APIs don’t tell me this, so I spend a long time trying to debug something that’s as easy as changing an “en-US” string to an “en-GB” string or vice versa. I’m flagging that up here in case you do the same.
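
One quick diagnostic that helps is to dump out the phone’s current speech language and compare it with the language in the VCD file or the language that any SpeechRecognizer is being constructed with. A minimal sketch of that (not code from this particular app) would be;

// needs: using Windows.Media.SpeechRecognition;
var speechLanguage = SpeechRecognizer.SystemSpeechLanguage;

// e.g. "en-GB" - if this doesn't match the xml:lang on the CommandSet in the VCD
// file (or the language used for in-app recognition) then things tend to fail quietly.
System.Diagnostics.Debug.WriteLine(speechLanguage.LanguageTag);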

Continuing the Conversation

When the app lands on the page that displays a flip view control of search results, I’m going to need to do some manual speech recognition in order to accept filtering commands from the user.

Firstly, I should say that there’s proper guidance around integration of speech like this into phone applications on MSDN;

Speech Design Guidelines

Responding to Speech Interactions

and that details a few areas;

  • Voice Commands – supported by the system and accessed by the user from outside your app.
  • Speech Recognition – implemented in your app.
  • Text to Speech – more about your app talking to the user rather than listening to them.

and so I’m going to want to do speech recognition inside of my app here, and I can use the SpeechRecognizer class, which works on Windows Phone 8.1, to carry on an interaction that was initiated by talking to Cortana.

Side-Bar: Trying to Mix Grammar + Dictation

At this point I’ll admit that I dropped into a 4-5 hour “black hole” around trying to get what I wanted to work with the SpeechRecognizer.

What I want is to tell the speech recognition that it should listen for the user saying;

“SHOW ME {SUBTOPIC}”

with the intent being that the “SHOW ME” part of this is fixed and the “{SUBTOPIC}” part of this is freely dictated text that I need reported back to me in my code.

I want to guide that speech recognition such that it listens for “SHOW ME {SUBTOPIC}” and returns to me the value of “{SUBTOPIC}” so that I can pick it up in my code.

In terms of that guiding of the speech recognition, there’s the notion of constraints and I found that a little confusing because I see 4 different types of constraints here;

  • Topic
  • List
  • Grammar
  • Voice Command Definition

and the last one is clearly something that works on Windows Phone 8.1 because that’s the way that I feed commands for the system to use via Cortana in the first place.

Having already authored one VCD file to guide Cortana integration, I was more than happy to write another, even shorter, one to guide the SpeechRecognizer and yet I couldn’t seem to get it done.

The route seems to be to feed a SpeechRecognitionVoiceCommandDefinitionConstraint into the SpeechRecognizer but I don’t see any way that I can use that class to load up a voice command file (or some XML that represents voice commands or similar) outside of the scenario where I’m registering a VCD for Cortana integration. That seems to be confirmed on MSDN;

Adding and Compiling Constraints

which mentions only 3 ways of guiding the speech recognizer and excludes the Voice Command Definition. That leaves me thinking that I can’t use a VCD file in this scenario and that it’s perhaps only there to support the system integration via Cortana.

So, no using a VCD file here in so far as I can see and that lines up with that previous text about voice commands only being used from outside your app.

I don’t have a pre-defined list, nor do I want to do a completely free-form web search or dictation, and so the next best option here seemed to be to use a grammar to indicate that I want to listen for “SHOW ME {SUBTOPIC}” and pick up the “{SUBTOPIC}” in my code.

Again, trying to get this to work wasn’t nearly as simple as I’d hoped.
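
To be clear, the mechanics of feeding an SRGS file to the recognizer are simple enough – something like the sketch below, where the file name is purely illustrative – it was the contents of the grammar that caused me the trouble;

// needs: using Windows.Media.SpeechRecognition; using Windows.Storage;
var grammarFile = await StorageFile.GetFileFromApplicationUriAsync(
  new Uri("ms-appx:///Grammars/filter.grxml"));

var recognizer = new SpeechRecognizer();
recognizer.Constraints.Add(new SpeechRecognitionGrammarFileConstraint(grammarFile));

// this is the point where a grammar that the recognizer doesn't like shows up
var compilationResult = await recognizer.CompileConstraintsAsync();

if (compilationResult.Status != SpeechRecognitionResultStatus.Success)
{
  // the grammar didn't compile
}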

I tried to come up with an SRGS grammar that mixed structure + dictation and ended up having to dig into the SRGS spec and this link, which suggested that I might not be able to use SRGS to mix and match dictation and grammar-based recognition.

I came to that second link though from this StackOverflow post which suggested using the dictation value as a Uri, but I never managed to come up with a way of getting this to work with the SpeechRecognizer on Windows Phone 8.1. I’d always get compilation errors from the class at the point where I asked it to compile up any constraint referencing an SRGS file with that form of “dictation” in it, and so I’m not sure whether this is supported on Windows Phone 8.1 or not.
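
For what it’s worth, that suggestion boils down to a ruleref along the lines of the one below – note that this is SAPI’s special dictation rule rather than anything in the SRGS spec itself, which perhaps explains the compilation errors I was seeing;

  <ruleref uri="grammar:dictation" />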

I also tried to make use of a grammar that took this sort of form;

  <rule id="main" scope="public">
    <item>
      filter
      <ruleref special="GARBAGE" />
      <tag>out.FilterText=rules.latest().text;</tag>
    </item>
  </rule>

This was my attempt to tell the recognizer that I wanted the word “FILTER” followed by any speech at all that it could recognise, and it “kind of worked”. The irony is that I couldn’t find any way of grabbing hold of the “GARBAGE” portion of the recognised speech from my code, and I guess that might be because the recognizer sees little value in returning something that’s been identified as “GARBAGE”, so I’m not sure whether that’s feasible or not.

I backed off on that and hunted down another post on StackOverflow which suggested using a SAPI grammar, but I’m not sure whether that’s feasible with the recognition bits on Windows Phone 8.1 as they only mention using SRGS, and I didn’t have any success trying out a SAPI grammar.

At this point, I gave up and decided to stop trying to “do it right” and focus more on “just doing it”.

Getting Speech Recognised

The SpeechRecognizer is pre-configured to do natural language dictation recognition so, strictly speaking, I don’t need to constrain it in order to pick up a phrase such as “SHOW ME {SUBTOPIC}”. Instead, I can just tell it to recognise and then have a look at what has been heard after the event and see if I can unpick the string myself. It’s less elegant but less complex.
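
In other words, the core of the approach is nothing more than the sketch below – the regular expression here is just an example, the real control makes it configurable;

// needs: using Windows.Media.SpeechRecognition; using System.Text.RegularExpressions;
var recognizer = new SpeechRecognizer();

// no constraints added, so the recognizer falls back to its default dictation grammar
await recognizer.CompileConstraintsAsync();

var result = await recognizer.RecognizeAsync();

if (result.Status == SpeechRecognitionResultStatus.Success)
{
  var match = Regex.Match(result.Text.ToLower(), "show me (.*)");

  if (match.Success)
  {
    var subTopic = match.Groups[1].Value; // e.g. "red cars"
  }
}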

It’s worth saying that in order to do this, the device needs to have network connectivity but, in my little scenario, that’s fair because Cortana recognition needs connectivity anyway.

In using the SpeechRecognizer, it’s possible to let it control the UI, in which case it has options around whether to display confirmation text, what example text to display and so on. I thought that, for my pretend scenario, that UI was a bit too intrusive so I went with my own UI that I knocked together into a user control for my app called SpeechControl.
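
Just for completeness, letting the recognizer drive its own UI is a matter of setting a few options and calling RecognizeWithUIAsync – roughly as in the sketch below, where the prompt and example text are made up;

// needs: using Windows.Media.SpeechRecognition;
var recognizer = new SpeechRecognizer();

recognizer.UIOptions.AudiblePrompt = "what would you like to see?";
recognizer.UIOptions.ExampleText = "show me red cars";
recognizer.UIOptions.ShowConfirmation = true;

await recognizer.CompileConstraintsAsync();

var result = await recognizer.RecognizeWithUIAsync();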

The SpeechControl has a small UI with 3 visual states which represent what it is doing at any one time – shown below in Blend;

[Screenshots from Blend: the control’s Waiting, Listening and Editing visual states]

The idea here is that the control would usually sit in the “Waiting” state and if the user taps into the TextBox and enters text, it will transition into the “Editing” state and let them type their query and hit the arrow button.

However, if the control is told to “listen for speech” it will move to the “Listening” state where it will listen for some speech for a period of time before giving up and going back to the “Waiting” state, which it will also do if the “cancel” button is hit while it is listening.

The control I built out looks like this from a XAML perspective. It was built quickly so it’s not going to be perfect by any means;

<UserControl x:Class="FlickrConversation.Controls.SpeechControl"
             xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
             xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
             xmlns:local="using:FlickrConversation.Views"
             xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
             xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
             mc:Ignorable="d"
             d:DesignHeight="100"
             d:DesignWidth="300"
             x:Name="control"
             xmlns:cmn="using:FlickrConversation.Common">
  <UserControl.Resources>
    <Color x:Key="LocalBackgroundBrush">#FF181C18</Color>
  </UserControl.Resources>

  <Grid Height="Auto">
    <VisualStateManager.VisualStateGroups>
      <VisualStateGroup x:Name="ListeningStates">
      	<VisualState x:Name="Waiting"/>
      	<VisualState x:Name="Listening">
      		<Storyboard>
      			<ObjectAnimationUsingKeyFrames Storyboard.TargetProperty="(UIElement.Visibility)" Storyboard.TargetName="TypingTextBox">
      				<DiscreteObjectKeyFrame KeyTime="0">
      					<DiscreteObjectKeyFrame.Value>
      						<Visibility>Collapsed</Visibility>
      					</DiscreteObjectKeyFrame.Value>
      				</DiscreteObjectKeyFrame>
      			</ObjectAnimationUsingKeyFrames>
      			<ObjectAnimationUsingKeyFrames Storyboard.TargetProperty="(UIElement.Visibility)" Storyboard.TargetName="SpeechButton">
      				<DiscreteObjectKeyFrame KeyTime="0">
      					<DiscreteObjectKeyFrame.Value>
      						<Visibility>Collapsed</Visibility>
      					</DiscreteObjectKeyFrame.Value>
      				</DiscreteObjectKeyFrame>
      			</ObjectAnimationUsingKeyFrames>
      			<ObjectAnimationUsingKeyFrames Storyboard.TargetProperty="(UIElement.Visibility)" Storyboard.TargetName="SpeechGrid">
      				<DiscreteObjectKeyFrame KeyTime="0">
      					<DiscreteObjectKeyFrame.Value>
      						<Visibility>Visible</Visibility>
      					</DiscreteObjectKeyFrame.Value>
      				</DiscreteObjectKeyFrame>
      			</ObjectAnimationUsingKeyFrames>
      			<ObjectAnimationUsingKeyFrames Storyboard.TargetProperty="(UIElement.Visibility)" Storyboard.TargetName="CancelListenButton">
      				<DiscreteObjectKeyFrame KeyTime="0">
      					<DiscreteObjectKeyFrame.Value>
      						<Visibility>Visible</Visibility>
      					</DiscreteObjectKeyFrame.Value>
      				</DiscreteObjectKeyFrame>
      			</ObjectAnimationUsingKeyFrames>
      		</Storyboard>
      	</VisualState>
      	<VisualState x:Name="Editing">
      		<Storyboard>
      			<ObjectAnimationUsingKeyFrames Storyboard.TargetProperty="(UIElement.Visibility)" Storyboard.TargetName="SpeechButton">
      				<DiscreteObjectKeyFrame KeyTime="0">
      					<DiscreteObjectKeyFrame.Value>
      						<Visibility>Collapsed</Visibility>
      					</DiscreteObjectKeyFrame.Value>
      				</DiscreteObjectKeyFrame>
      			</ObjectAnimationUsingKeyFrames>
      			<ObjectAnimationUsingKeyFrames Storyboard.TargetProperty="(UIElement.Visibility)" Storyboard.TargetName="ArrowButton">
      				<DiscreteObjectKeyFrame KeyTime="0">
      					<DiscreteObjectKeyFrame.Value>
      						<Visibility>Visible</Visibility>
      					</DiscreteObjectKeyFrame.Value>
      				</DiscreteObjectKeyFrame>
      			</ObjectAnimationUsingKeyFrames>
      		</Storyboard>
      	</VisualState>
      </VisualStateGroup>
    </VisualStateManager.VisualStateGroups>
    <Grid.Background>
      <SolidColorBrush Color="{StaticResource LocalBackgroundBrush}" />
    </Grid.Background>
    <Grid.ColumnDefinitions>
      <ColumnDefinition />
      <ColumnDefinition Width="40" />
    </Grid.ColumnDefinitions>
    <Grid.RowDefinitions>
      <RowDefinition Height="Auto" />
    </Grid.RowDefinitions>
    <TextBox x:Name="TypingTextBox"
             Grid.Column="0"
             Margin="5"
             VerticalAlignment="Center"
             Text="{Binding Text,ElementName=control,Mode=TwoWay,UpdateSourceTrigger=Explicit}"
             TextChanged="OnTextChanged">

    </TextBox>
    <Grid x:Name="SpeechGrid"
          Visibility="Collapsed">
      <Grid.Background>
        <SolidColorBrush Color="{StaticResource LocalBackgroundBrush}" />
      </Grid.Background>
      <StackPanel VerticalAlignment="Center"
                  Margin="10,0,0,0">
        <TextBlock Text="Listening..."
                   Style="{StaticResource BaseTextBlockStyle}"
                   Foreground="{StaticResource PhoneAccentBrush}"
                   VerticalAlignment="Center" />
        <TextBlock Text="{Binding ExampleText,ElementName=control}"
                   Style="{StaticResource BaseTextBlockStyle}"
                   Foreground="Gray"
                   FontSize="12"/>
      </StackPanel>
    </Grid>
    <Button x:Name="ArrowButton"
            Grid.Column="1"
            MinWidth="0"
            Margin="5"
            Template="{x:Null}"
            Command="{Binding ElementName=control,Path=SubmitCommand,Mode=TwoWay}"
            VerticalAlignment="Center" Visibility="Collapsed">
      <Image Source="ms-appx:///Assets/rightarrow.png"
             Stretch="None" />
    </Button>
    <Button x:Name="SpeechButton"
            Grid.Column="1"
            MinWidth="0"
            Margin="5,5.25,0,5.25"
            Template="{x:Null}"
            VerticalAlignment="Center"
            Click="OnSpeechButton">
      <Image Source="ms-appx:///Assets/mic.png"
             Stretch="None" />
    </Button>
    <Button x:Name="CancelListenButton"
            Grid.Column="1"
            MinWidth="0"
            Margin="5,5.25,0,5.25"
            Template="{x:Null}"
            VerticalAlignment="Center"
            Visibility="Collapsed"
            Click="OnCancelSpeechButton">
      <Image Source="ms-appx:///Assets/cross.png"
             Stretch="None" />
    </Button>
  </Grid>
</UserControl>

and so perhaps the interesting thing here is that the controls inside of the user control bind to properties defined on the user control itself – specifically;

  • Text – the text entered into the text box for filtering
  • ExampleText – text displayed to hint the user as to what they can say while the control is in listening mode
  • SubmitCommand – the command to be executed when the user finishes speaking or if they manually hit the “go” button.

and the control defines these properties and a few more as in the code below;

namespace FlickrConversation.Controls
{
  using System;
  using System.Collections.Generic;
  using System.Linq;
  using System.Text.RegularExpressions;
  using System.Threading;
  using System.Threading.Tasks;
  using System.Windows.Input;
  using Windows.Media.SpeechRecognition;
  using Windows.UI.Xaml;
  using Windows.UI.Xaml.Controls;

  public sealed partial class SpeechControl : UserControl
  {
    enum VisualState
    {
      Listening,
      Waiting,
      Editing
    };

    public static DependencyProperty ExampleTextProperty =
      DependencyProperty.Register(
        "ExampleText", typeof(string), typeof(SpeechControl), null);

    public static DependencyProperty SingleCaptureMatchExpressionProperty =
      DependencyProperty.Register(
        "SingleCaptureMatchExpression", typeof(string), typeof(SpeechControl), null);

    public static DependencyProperty TextProperty =
      DependencyProperty.Register(
        "Text", typeof(string), typeof(SpeechControl),
        new PropertyMetadata(string.Empty, OnTextChangedCallback));

    public static DependencyProperty IsListeningProperty =
      DependencyProperty.Register(
        "IsListening", typeof(bool), typeof(SpeechControl),
        new PropertyMetadata(false, OnIsListeningChangedCallback));

    public static DependencyProperty SubmitCommandProperty =
      DependencyProperty.Register(
        "SubmitCommand", typeof(ICommand), typeof(SpeechControl), null);

    public SpeechControl()
    {
      this.InitializeComponent();
      this.currentState = VisualState.Waiting;
    }
    public string ExampleText
    {
      get
      {
        return ((string)base.GetValue(ExampleTextProperty));
      }
      set
      {
        base.SetValue(ExampleTextProperty, value);
      }
    }
    public string SingleCaptureMatchExpression
    {
      get
      {
        return ((string)base.GetValue(SingleCaptureMatchExpressionProperty));
      }
      set
      {
        base.SetValue(SingleCaptureMatchExpressionProperty, value);
      }
    }
    public string Text
    {
      get
      {
        return ((string)base.GetValue(TextProperty));
      }
      set
      {
        base.SetValue(TextProperty, value);
      }
    }
    public bool IsListening
    {
      get
      {
        return ((bool)base.GetValue(IsListeningProperty));
      }
      set
      {
        base.SetValue(IsListeningProperty, value);
      }
    }
    public ICommand SubmitCommand
    {
      get
      {
        return ((ICommand)base.GetValue(SubmitCommandProperty));
      }
      set
      {
        base.SetValue(SubmitCommandProperty, value);
      }
    }
    void ChangeState(VisualState newState)
    {
      if (newState != this.currentState)
      {
        this.currentState = newState;
        VisualStateManager.GoToState(this, this.currentState.ToString(), true);
      }
    }
    void Listen()
    {
      this.ChangeState(VisualState.Listening);
      this.Recognize();
    }
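    // Lazily create a single SpeechRecognizer with short timeouts. No constraints are
    // added before CompileConstraintsAsync, so it stays in its default dictation mode.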
    private async Task BuildRecognizer()
    {
      if (this.recognizer == null)
      {
        this.recognizer = new SpeechRecognizer();
        this.recognizer.Timeouts.BabbleTimeout = TimeSpan.FromSeconds(0);
        this.recognizer.Timeouts.EndSilenceTimeout = TimeSpan.FromSeconds(1);
        this.recognizer.Timeouts.InitialSilenceTimeout = TimeSpan.FromSeconds(5);
        await this.recognizer.CompileConstraintsAsync();
      }
    }
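    // Run one recognition pass. If the recognised text (or one of its alternates) matches
    // the SingleCaptureMatchExpression, push the captured value into Text, fire the
    // SubmitCommand and listen again; otherwise drop back to the Waiting state.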
    async void Recognize()
    {
      bool listenAgain = false;

      if (this.SubmitCommand != null)
      {
        SpeechRecognitionResult speechResult = null;

        await this.BuildRecognizer();
        this.tokenSource = new CancellationTokenSource();

        try
        {
          speechResult = await this.recognizer.RecognizeAsync().AsTask(this.tokenSource.Token);
        }
        catch (OperationCanceledException)
        {

        }
        this.IsListening = false;
        this.tokenSource.Dispose();
        this.tokenSource = null;

        if (
          (speechResult != null) &&
          (speechResult.Status == SpeechRecognitionResultStatus.Success) &&
          !string.IsNullOrEmpty(speechResult.Text) &&
          (this.currentState == VisualState.Listening) &&
          (!string.IsNullOrEmpty(this.SingleCaptureMatchExpression)))
        {
          var match = this.MatchSpeechResults(speechResult);
          if (!string.IsNullOrEmpty(match))
          {
            this.Text = match;

            if (this.SubmitCommand.CanExecute(null))
            {
              this.SubmitCommand.Execute(null);
              listenAgain = true;
            }
          }
        }
      }
      if (listenAgain)
      {
        this.IsListening = true;
      }
      else
      {
        this.ChangeState(VisualState.Waiting);
      }
    }
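    // Check the recognised text and up to 10 alternates against the regular expression,
    // returning the last capture group from the first candidate that matches.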
    string MatchSpeechResults(SpeechRecognitionResult results)
    {
      var match = string.Empty;
      var alternates = results.GetAlternates(10);
      var candidates = new List<string>();
      candidates.Add(results.Text.ToLower());
      if (alternates != null)
      {
        candidates.AddRange(alternates.Select(r => r.Text.ToLower()));
      }

      foreach (var candidate in candidates)
      {
        Regex expression = new Regex(this.SingleCaptureMatchExpression);
        Match exprMatch = expression.Match(candidate);

        if (exprMatch.Success && (exprMatch.Groups != null))
        {
          match = exprMatch.Groups[exprMatch.Groups.Count - 1].Value;
          break;
        }
      }
      return (match);
    }
    void OnCancelSpeechButton(object sender, RoutedEventArgs e)
    {
      if (this.tokenSource != null)
      {
        this.tokenSource.Cancel();
        this.ChangeState(VisualState.Waiting);
      }
    }
    void OnSpeechButton(object sender, RoutedEventArgs e)
    {
      this.Listen();
    }
    void OnTextChanged(object sender, TextChangedEventArgs e)
    {
      ((TextBox)sender).GetBindingExpression(TextBox.TextProperty).UpdateSource();
    }
    static void OnIsListeningChangedCallback(DependencyObject sender,
      DependencyPropertyChangedEventArgs args)
    {
      SpeechControl control = (SpeechControl)sender;

      if ((bool)args.NewValue)
      {
        control.Listen();
      }
    }
    static void OnTextChangedCallback(DependencyObject d,
      DependencyPropertyChangedEventArgs e)
    {
      string oldValue = (string)e.OldValue;
      string newValue = (string)e.NewValue;

      SpeechControl control = (SpeechControl)d;
      if (string.IsNullOrEmpty(oldValue) &&
        (!string.IsNullOrEmpty(newValue)))
      {
        control.ChangeState(VisualState.Editing);
      }
      if (string.IsNullOrEmpty(newValue))
      {
        control.ChangeState(VisualState.Waiting);
      }
    }
    VisualState currentState;
    SpeechRecognizer recognizer;
    CancellationTokenSource tokenSource;
  }
}

The idea here is that this control can be used in my UI with a few properties bound up. Here’s the usage of it in my image display page;

[Screenshot: the SpeechControl declared in the image display page’s XAML]
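
In rough terms, that markup looks something like the sketch below – the view model property names are assumptions rather than the exact ones from the download;

<!-- assumes an xmlns:ctrl="using:FlickrConversation.Controls" declaration on the page -->
<ctrl:SpeechControl
  ExampleText="say 'show me red cars'"
  SingleCaptureMatchExpression="{Binding FilterMatchExpression}"
  Text="{Binding FilterText, Mode=TwoWay}"
  IsListening="{Binding IsListening, Mode=TwoWay}"
  SubmitCommand="{Binding FilterCommand}" />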

and the idea is that the underlying view model can toggle the IsListening property to true in order to get the control to move into “speech recognition” mode. The control will then do that for a short time before timing out and toggling that value back to false, unless it recognises some speech, in which case it will fire the SubmitCommand that it is bound to.

The SingleCaptureMatchExpression is meant to be a regular expression that the control can use to identify whether the speech that it has heard is in line with what the user of the control is expecting, and in my view model this is just bound to a constant value. This is a bit inflexible in that the control assumes the regular expression will define one capture that it can extract so as to come up with the value spoken by the user, but it works for me here;

[Screenshot: the constant value in the view model that the SingleCaptureMatchExpression is bound to]
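
That constant amounts to something like the sketch below – again, the property name is illustrative;

// the view model just hands the control a fixed expression with a single capture group
public string FilterMatchExpression
{
  get
  {
    return ("show me (.*)");
  }
}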

Trying it Out

With that in place, I can try out this idea of carrying on the conversation that’s initiated by Cortana and bringing it into my app.

Here’s a little screen capture of it in action – watch out for my clearing my throat in order to be able to speak!

[Screen capture video of the app responding to speech]

and that seems to work “reasonably nicely”, and the control is built “reasonably generically” so it might offer some re-use elsewhere in the future.

Code

If you want the code for this – it’s downloadable from here – you’ll find that I have removed my API key for the flickR API so you’ll need to put one of those into the code in order to make it work (the code will error on compilation).