Windows 10, UWP, IoT Core, SpeechSynthesizer, Raspberry PI and ‘Audio Popping’

A reader mailed in with a query about speech synthesis on Windows 10 and the Universal Windows Platform.

Essentially, they were doing something similar to what I’d shown in the demos in this Channel 9 show about speech;


and in this article on the Windows blog;

Using speech in your UWP apps: It’s good to talk

and the core of the code was to synthesize various pieces of text to speech and then play them one after another – something like the sample code below, which I wrote to try and reproduce the situation. It’s an event handler taken from a fairly blank UWP application;


async void OnLoaded(object sender, RoutedEventArgs args)
{
  using (var synth = new SpeechSynthesizer())
  {
    using (var mediaPlayer = new MediaPlayer())
    {
      TaskCompletionSource<bool> source = null;

      // Complete the awaited task whenever the current piece of speech finishes playing.
      mediaPlayer.MediaEnded += (s, e) =>
      {
        source.SetResult(true);
      };
      for (int i = 0; i < 100; i++)
      {
        var speechText = $"This is message number {i + 1}";
        source = new TaskCompletionSource<bool>();

        // Synthesize the text and hand the resulting stream to the MediaPlayer.
        using (var speechStream = await synth.SynthesizeTextToStreamAsync(speechText))
        {
          mediaPlayer.Source = MediaSource.CreateFromStream(speechStream, speechStream.ContentType);
          mediaPlayer.Play();
        }
        // Wait for playback to finish, then pause before the next message.
        await source.Task;
        await Task.Delay(1000);
      }
    }
  }
}

Now, if I run that code on my PC then everything works as I would expect – I get 100 spoken messages separated by at least 1 second of silence.

However, as my reader pointed out, if I run this on Windows IoT Core on a Raspberry PI (2 or 3) then each spoken message is preceded by an audible popping sound, which isn’t something that you’d want to listen to in a real-world scenario.

I hadn’t come across this before and so did a bit of searching around and found this thread on the MSDN forums;

Clicking sound during start and stop of audio playback

and the upshot of that thread seems to be that the problem is caused by an issue in the Raspberry PI firmware which isn’t going to be fixed, so there doesn’t really seem to be a solution there.

The thread does, though, suggest that this problem might be mitigated by using the AudioGraph APIs instead of using MediaPlayer as I’ve done in my code snippet above.

That proves to be a little more tricky, though, because the AudioGraph APIs seem to allow you to construct inputs from;

  • audio capture devices (via an AudioDeviceInputNode)
  • audio files (via an AudioFileInputNode)
  • audio frames generated in code (via an AudioFrameInputNode)

and I don’t see an obvious way in which any of these can be used to model a stream of data, which is what I get back when I perform Text To Speech using the SpeechSynthesizer class.

The only way to proceed would appear to be to copy the speech stream out into a file and then have an AudioFileInputNode read from that file.

With that in mind, I tried to write code which would;

  1. Create a temporary file
  2. Create an audio graph consisting of a connection between
    1. An AudioFileInputNode representing my temporary file
    2. An AudioDeviceOutputNode for the default audio rendering device on the system
  3. Perform Text to Speech
  4. Write the resulting stream to the temporary file
  5. Have the AudioGraph notice that the input file had been written to, thereby causing it to play the media from that file out of the default audio rendering device on the system

and my aim here was to avoid;

  1. having to recreate the entire AudioGraph, or either of the two input/output nodes within it, for each piece of speech
  2. having to create a separate temporary file for every piece of speech
  3. having to create an ever-growing temporary file containing all the pieces of speech concatenated together

and I had hoped to be able to rely on the ability of nodes in an AudioGraph (and the graph itself) all having Start/Stop/Reset methods.

In practice, I’ve yet to get this to really work. I can happily get an AudioFileInputNode to play audio from a file out through its connected output node. However, once that input node has finished playing, I can’t seem to find any combination of Start/Stop/Reset/Seek which will get it to play subsequent audio that arrives in the file when my code alters the file contents.
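
To make that concrete, the sketch below shows roughly the sort of thing that I was hoping would work. I should stress that this is illustrative code for the approach that I could not get working (it leans on the SynthesizeTextToFileAsync and WaitForFileCompletedAsync extensions listed further down) rather than anything taken from a working sample;

// Illustrative only: try to reuse a single AudioFileInputNode across many pieces of
// speech by rewriting the underlying file and rewinding the node. In my testing the
// node does not pick up the new audio, which is why the code below ends up
// recreating the input node for every piece of speech instead.
async Task SpeakViaSharedInputNodeAsync(
  SpeechSynthesizer synthesizer,
  AudioFileInputNode inputNode,
  StorageFile temporaryFile,
  string text)
{
  inputNode.Stop();

  // Overwrite the temporary file with the newly synthesized speech.
  await synthesizer.SynthesizeTextToFileAsync(text, temporaryFile);

  // Attempt to get the node to play the new content from the start of the file.
  inputNode.Reset();
  inputNode.Seek(TimeSpan.Zero);
  inputNode.Start();

  await inputNode.WaitForFileCompletedAsync();
}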

The closest that I’ve got to working code is what follows below where I create new AudioFileInputNode instances for each piece of speech that is to be spoken.

    async void OnLoaded(object sender, RoutedEventArgs args)
    {
      var temporaryFile = await TemporaryFileCreator.CreateTemporaryFileAsync();

      using (var speechSynthesizer = new SpeechSynthesizer())
      {
        var graphResult = await AudioGraph.CreateAsync(new AudioGraphSettings(AudioRenderCategory.Media));

        if (graphResult.Status == AudioGraphCreationStatus.Success)
        {
          using (var graph = graphResult.Graph)
          {
            var outputResult = await graph.CreateDeviceOutputNodeAsync();

            if (outputResult.Status == AudioDeviceNodeCreationStatus.Success)
            {
              graph.Start();

              using (var outputNode = outputResult.DeviceOutputNode)
              {
                for (int i = 0; i < 100; i++)
                {
                  var speechText = $"This is message number {i + 1}";

                  await speechSynthesizer.SynthesizeTextToFileAsync(speechText, temporaryFile);

                  // TBD: I want to avoid creating 100 input file nodes but I don't
                  // (yet) seem to be able to get away from it, so for now I keep
                  // creating a new input node over the same temporary file, whose
                  // contents change on every iteration of the loop.
                  var inputResult = await graph.CreateFileInputNodeAsync(temporaryFile);

                  if (inputResult.Status == AudioFileNodeCreationStatus.Success)
                  {
                    using (var inputNode = inputResult.FileInputNode)
                    {
                      inputNode.AddOutgoingConnection(outputNode);
                      await inputNode.WaitForFileCompletedAsync();
                    }
                  }
                  await Task.Delay(1000);
                }
              }
              graph.Stop();
            }
          }
        }
      }
      await temporaryFile.DeleteAsync();
    }

and that code depends on a class that can create temporary files;

  public static class TemporaryFileCreator
  {
    public static async Task<StorageFile> CreateTemporaryFileAsync()
    {
      var fileName = $"{Guid.NewGuid()}.bin";

      var storageFile =
        await ApplicationData.Current.TemporaryFolder.CreateFileAsync(fileName);

      return (storageFile);
    }
  }

and also on an extension to the SpeechSynthesizer which will take the speech and write it to a file;

  public static class SpeechSynthesizerExtensions
  {
    public static async Task<StorageFile> SynthesizeTextToTemporaryFileAsync(this SpeechSynthesizer synthesizer, string text)
    {
      var storageFile = await TemporaryFileCreator.CreateTemporaryFileAsync();

      await SynthesizeTextToFileAsync(synthesizer, text, storageFile);

      return (storageFile);
    }
    public static async Task SynthesizeTextToFileAsync(this SpeechSynthesizer synthesizer, string text, StorageFile file)
    {
      using (var speechStream = await synthesizer.SynthesizeTextToStreamAsync(text))
      {
        using (var fileStream = await file.OpenAsync(FileAccessMode.ReadWrite))
        {
          // Truncate any previous content so that the file only ever contains the
          // latest piece of speech rather than trailing bytes from a longer,
          // earlier message.
          fileStream.Size = 0;

          await RandomAccessStream.CopyAndCloseAsync(speechStream, fileStream);
        }
      }
    }
  }

and also on an extension to the AudioFileInputNode class which takes the FileCompleted event that it fires and turns it into something that can be awaited;

  public static class AudioInputFileNodeExtensions
  {
    public static async Task WaitForFileCompletedAsync(this AudioFileInputNode inputNode)
    {
      TypedEventHandler<AudioFileInputNode, object> handler = null;
      TaskCompletionSource<bool> completed = new TaskCompletionSource<bool>();

      handler = (s, e) =>
      {
        s.FileCompleted -= handler;
        completed.SetResult(true);
      };
      inputNode.FileCompleted += handler;

      await completed.Task;
    }
  }

This code seems to work fine on both the PC and the Raspberry PI, but on the Raspberry PI I find that I still get an audible ‘pop’ when the code first starts up; I no longer get a ‘pop’ for every piece of speech though – i.e. the situation is improved but not perfect. I’d ideally like to;

  • Get rid of the code that ends up creating N AudioFileInputNode instances rather than 1 and somehow make the Start/Stop/Reset/Seek approach work.
  • Get rid of that initial audible ‘pop’.

I’ll update the post if I manage to come up with a better solution and do feel very free to add comments below if you know of either a solution to the original problem or a better solution than I’ve found to date…

“Project Oxford”–Speaker Identification from a Windows 10/UWP App

Following up on this earlier post, I wanted to get a feel for what the speaker identification part of the “Project Oxford” speaker recognition APIs looks like, having toyed with verification in that previous post.

It’s interesting to see the difference between the capability of the two areas of functionality and how it shapes the APIs that the service offers.

For verification, a profile is built by capturing the user repeating one of a set of supported phrases 3 times over and submitting the captured audio. These are short phrases. Once the profile is built, the user can be prompted to submit a phrase that can be tested against the profile for a ‘yes/no’ match.

Identification is a different beast. The enrolment phase involves building a profile by capturing the user talking for 60 seconds and submitting that audio to the service for analysis. It’s worth saying that the 60 seconds doesn’t all have to be captured in one go, but the minimum duration is 20 seconds.

The service then processes that speech and provides a ‘call me back’ style endpoint which the client must poll to later gather the results. It’s possible that the results of processing will be a request for more speech to analyse in order to complete the profile and so there’s a possibility of looping to build the profile.

Once the profile is built, the identification phase involves submitting another 60 seconds of the user speaking along with (at the time of writing) a list of up to 10 profiles to check against.

So, while it’s possible to build up to 1000 profiles at the service, identification only runs against 10 of them at a time right now.

Again, this submission results in a ‘call me back’ URL which the client can return to later for results.
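
To give a feel for that pattern, the sketch below shows roughly how the ‘submit and then poll’ flow might look using HttpClient. The endpoint URI, header name and crude JSON handling are my reading of the Speaker Recognition documentation rather than code taken from my sample, so treat it as illustrative;

// Illustrative only - the URI, headers and response handling here are assumptions
// about the "Oxford" Speaker Recognition REST API rather than verified code.
static async Task<string> IdentifySpeakerAsync(
  byte[] wavBytes, IEnumerable<Guid> candidateProfileIds, string apiKey)
{
  using (var client = new HttpClient())
  {
    client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", apiKey);

    // Submit up to 10 candidate profile ids along with ~60 seconds of speech.
    var uri =
      "https://api.projectoxford.ai/spid/v1.0/identify?identificationProfileIds=" +
      string.Join(",", candidateProfileIds);

    var response = await client.PostAsync(uri, new ByteArrayContent(wavBytes));

    // The service replies with a 'call me back' operation URL for the client to poll.
    var operationUrl = response.Headers.GetValues("Operation-Location").First();

    while (true)
    {
      await Task.Delay(TimeSpan.FromSeconds(5));

      var operationJson = await client.GetStringAsync(operationUrl);

      // The operation's JSON carries a status field; a real implementation would
      // parse it properly rather than string-matching as this sketch does.
      if (operationJson.Contains("\"succeeded\"") || operationJson.Contains("\"failed\""))
      {
        return (operationJson);
      }
    }
  }
}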

Clearly, identification is a much harder problem to solve than verification and it’s reflected in the APIs here although I suspect that, over time, the amount of speech required and the number of profiles that can be checked in one call will change.

In terms of actually calling the APIs, it’s worth referring back to my previous post, which covered where to find the official (non-UWP) samples and linked across to the “Oxford” documentation. What I’m doing here is adapting my previous code to work with the identification APIs rather than the verification ones.

In doing that, I made my little test app speech-centric rather than mouse/keyboard-centric and it ended up working as shown in the video below (NB: the video has 2+ minutes of me reading from a script on the web, so feel free to jump around to skip those bits);

In most of my tests, I found that I had to submit more than 1 batch of speech as part of the enrolment phase but I got a little lucky with this example that I recorded and enrolment happened in one go which surprised me.

Clearly, I’d need to gather a slightly larger user community than 1 person to test this properly, but it seems to be working reasonably well here.

I’ve posted the code for this here for download – it’s fairly rough-and-ready and there’s precious little error handling in there plus it’s more of a code-behind sample lacking in much structure.

As before, if you want to build this out yourself you’ll need an API key for the “Oxford” API and you’ll need it to get the file named keys.cs to compile.
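
For reference, keys.cs is just a tiny holder for that key, something along the lines of the sketch below, although the exact class and member names that the download expects may well differ from the illustrative ones I’ve used here;

// Illustrative only - the real keys.cs in the download may use different names; the
// point is simply that your "Oxford" API key lives in this one file.
static class Keys
{
  public const string OxfordSpeakerRecognitionKey = "YOUR-KEY-GOES-HERE";
}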

A Quick Parse of the //Build 2015 Session List

I just flicked through the session list for //Build and made a note of the sessions that (on first pass) line up with my particular set of interests, and I thought I’d publish the list here given that I was making it anyway.

Note that the list;

  • isn’t sorted or prioritised in any way
  • has been made without any visibility about what’s happening in any of these sessions
  • is long – looks like around 80-100 hours of material to watch here at least
  • would be likely to be revised once the material starts to show up

With a big conference like this, I always watch the keynotes and then I typically make a few folders on my disk;

  • queue
  • watched
    • vital
    • useful
    • other

and I download as many videos as I can into that queue folder and watch them; once watched, I move each one into one of the other 3 folders for future reference. It’s worth saying that the ‘other’ category is often a reflection of whether I think a session is of use to me rather than a judgement on the talk itself.

I do generally download the videos because I can then watch them when travelling and so on and also because I usually watch them at 1.4x speed or similar.

Here’s the big list (so far);

Windows App sessions

Cross Platform App sessions

IoT sessions

Visual Studio sessions

Web sessions

.NET sessions

Azure sessions

Security sessions

Some ‘mystery’ sounding sessions;