Windows 10, UWP, IoT Core, SpeechSynthesizer, Raspberry PI and ‘Audio Popping’

A reader mailed in with a query about speech synthesis on Windows 10 and the Universal Windows Platform.

Essentially, they were doing something similar to things that I’d shown in demos in this Channel9 show about speech;

and in this article on the Windows blog;

Using speech in your UWP apps: It’s good to talk

and the core of the code was to synthesize various pieces of text to speech and then play them one after another. The sample code below, which I wrote to try and reproduce the situation, is an event handler taken from a fairly blank UWP application;


    async void OnLoaded(object sender, RoutedEventArgs args)
    {
      using (var synth = new SpeechSynthesizer())
      {
        using (var mediaPlayer = new MediaPlayer())
        {
          TaskCompletionSource<bool> source = null;

          // Signal the loop below each time a piece of speech finishes playing.
          mediaPlayer.MediaEnded += (s, e) =>
          {
            source.SetResult(true);
          };
          for (int i = 0; i < 100; i++)
          {
            var speechText = $"This is message number {i + 1}";
            source = new TaskCompletionSource<bool>();

            using (var speechStream = await synth.SynthesizeTextToStreamAsync(speechText))
            {
              mediaPlayer.Source = MediaSource.CreateFromStream(speechStream, speechStream.ContentType);
              mediaPlayer.Play();
            }
            // Wait for MediaEnded to fire before moving on to the next message.
            await source.Task;
            await Task.Delay(1000);
          }
        }
      }
    }

Now, if I run that code on my PC then everything works as I would expect – I get 100 spoken messages separated by at least 1 second of silence.

However, as my reader pointed out – if I run this on Windows IoT Core on Raspberry PI (2 or 3) then each spoken message is preceded by a popping sound on the audio and it’s not something that you’d want to listen to in a real-world scenario.

I hadn’t come across this before and so did a bit of searching around and found this thread on the MSDN forums;

Clicking sound during start and stop of audio playback

and the upshot of that thread seems to be that the problem is caused by an issue in the firmware of the Raspberry PI which isn't going to be fixed and so there doesn't seem to be a real solution there.

The thread does, though, suggest that this problem might be mitigated by using the AudioGraph APIs instead of using MediaPlayer as I’ve done in my code snippet above.

That proves to be a little more tricky though because the AudioGraph APIs only seem to allow you to construct inputs from;

  • a capture device such as a microphone (via CreateDeviceInputNodeAsync)
  • an audio file (via CreateFileInputNodeAsync)
  • audio frames that the app generates and pushes in itself (via CreateFrameInputNode)

and I don't see an obvious way in which any of these can be used to model a stream of data, which is what I get back when I perform Text To Speech using the SpeechSynthesizer class.

The only way to proceed would appear to be to copy the speech stream out into a file and then have an AudioFileInputNode read from that file.

With that in mind, I tried to write code which would;

  1. Create a temporary file
  2. Create an audio graph consisting of a connection between
    1. An AudioFileInputNode representing my temporary file
    2. An AudioDeviceOutputNode for the default audio rendering device on the system
  3. Perform Text to Speech
  4. Write the resulting stream to the temporary file
  5. Have the AudioGraph notice that the input file had been written to, thereby causing it to play the media from that file out of the default audio rendering device on the system

and my aim here was to avoid;

  1. having to recreate the entire AudioGraph or either of the two input/output nodes within it for each piece of speech
  2. having to create a separate temporary file for every piece of speech
  3. having to create an ever-growing temporary file containing all the pieces of speech concatenated together

and I had hoped to be able to rely on the fact that nodes in an AudioGraph (and the graph itself) all have Start/Stop/Reset methods.

In practice, I’ve yet to get this to really work. I can happily get an AudioFileInputNode to play audio from a file out through its connected output node but, once that input node has finished playing, I haven’t found any combination of Start/Stop/Reset/Seek which will get it to play subsequent audio that arrives in the file when my code alters the file’s contents.
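
For what it’s worth, the kind of reuse I was attempting with that single input node looks something like the sketch below and, to be clear, I have not managed to make this work;

      // Sketch: inputNode is the one AudioFileInputNode I'd hoped to keep reusing.
      // New speech has just been written into the file behind it...
      inputNode.Stop();               // stop the node consuming from the file
      inputNode.Reset();              // discard any accumulated state
      inputNode.Seek(TimeSpan.Zero);  // rewind to the start of the rewritten file
      inputNode.Start();              // ...in the hope that it plays the new content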

The closest that I’ve got to working code is what follows below where I create new AudioFileInputNode instances for each piece of speech that is to be spoken.

    async void OnLoaded(object sender, RoutedEventArgs args)
    {
      var temporaryFile = await TemporaryFileCreator.CreateTemporaryFileAsync();

      using (var speechSynthesizer = new SpeechSynthesizer())
      {
        var graphResult = await AudioGraph.CreateAsync(new AudioGraphSettings(AudioRenderCategory.Media));

        if (graphResult.Status == AudioGraphCreationStatus.Success)
        {
          using (var graph = graphResult.Graph)
          {
            var outputResult = await graph.CreateDeviceOutputNodeAsync();

            if (outputResult.Status == AudioDeviceNodeCreationStatus.Success)
            {
              graph.Start();

              using (var outputNode = outputResult.DeviceOutputNode)
              {
                for (int i = 0; i < 100; i++)
                {
                  var speechText = $"This is message number {i + 1}";

                  await speechSynthesizer.SynthesizeTextToFileAsync(speechText, temporaryFile);

                  // TBD: I'd like to avoid creating 100 input file nodes but I don't
                  // seem (yet) to be able to get away from it so, right now, I keep
                  // creating new input nodes over the same file which changes on
                  // every iteration of the loop.
                  var inputResult = await graph.CreateFileInputNodeAsync(temporaryFile);

                  if (inputResult.Status == AudioFileNodeCreationStatus.Success)
                  {
                    using (var inputNode = inputResult.FileInputNode)
                    {
                      inputNode.AddOutgoingConnection(outputNode);
                      await inputNode.WaitForFileCompletedAsync();
                    }
                  }
                  await Task.Delay(1000);
                }
              }
              graph.Stop();
            }
          }
        }
      }
      await temporaryFile.DeleteAsync();
    }

and that code depends on a class that can create temporary files;

  public static class TemporaryFileCreator
  {
    public static async Task<StorageFile> CreateTemporaryFileAsync()
    {
      var fileName = $"{Guid.NewGuid()}.bin";

      var storageFile =
        await ApplicationData.Current.TemporaryFolder.CreateFileAsync(fileName);

      return (storageFile);
    }
  }

and also on an extension to the SpeechSynthesizer which will take the speech and write it to a file;

  public static class SpeechSynthesizerExtensions
  {
    public static async Task<StorageFile> SynthesizeTextToTemporaryFileAsync(this SpeechSynthesizer synthesizer, string text)
    {
      var storageFile = await TemporaryFileCreator.CreateTemporaryFileAsync();

      await SynthesizeTextToFileAsync(synthesizer, text, storageFile);

      return (storageFile);
    }
    public static async Task SynthesizeTextToFileAsync(this SpeechSynthesizer synthesizer, string text, StorageFile file)
    {
      using (var speechStream = await synthesizer.SynthesizeTextToStreamAsync(text))
      {
        using (var fileStream = await file.OpenAsync(FileAccessMode.ReadWrite))
        {
          await RandomAccessStream.CopyAndCloseAsync(speechStream, fileStream);
        }
      }
    }
  }

and also on an extension to the AudioFileInputNode class which takes the FileCompleted event that it fires and turns it into something that can be awaited;

  public static class AudioInputFileNodeExtensions
  {
    public static async Task WaitForFileCompletedAsync(this AudioFileInputNode inputNode)
    {
      TypedEventHandler<AudioFileInputNode, object> handler = null;
      TaskCompletionSource<bool> completed = new TaskCompletionSource<bool>();

      handler = (s, e) =>
      {
        s.FileCompleted -= handler;
        completed.SetResult(true);
      };
      inputNode.FileCompleted += handler;

      await completed.Task;
    }
  }

This code seems to work fine on both the PC and the Raspberry PI but I find that, on the Raspberry PI, I still get an audible ‘pop’ when the code first starts up. I don’t then get an audible ‘pop’ for every piece of speech – i.e. the situation feels improved but not perfect. I’d ideally like to;

  • Get rid of the code that ends up creating N AudioFileInputNode instances rather than 1 and somehow make the Start/Stop/Reset/Seek approach work.
  • Get rid of that initial audible ‘pop’.

I’ll update the post if I manage to come up with a better solution and do feel very free to add comments below if you know of either a solution to the original problem or a better solution than I’ve found to date…

More on Lightbulbs – Bit of Fun With Windows 10, UWP, IoT Core, AllJoyn

Partly for a demo but also a little bit for fun, I took the ideas that I was playing with in this post;

“Windows 10, UWP, Raspberry PI 2, AllJoyn and Lightbulb Demo”

and expanded them out into more of a mock home-automation scenario. I’ve got a simple app that you run on a device to define a ‘configuration’ of a building in terms of a few rooms and a few lightbulbs within them. The app then lets you turn the lights on/off, makes the same thing possible over the network by offering an AllJoyn service and, on IoT Core, it’ll also talk to GPIO pins to turn real lights on/off.
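
The GPIO piece is just the standard Windows.Devices.Gpio usage on IoT Core – as a minimal sketch of the idea, where the pin number is an arbitrary example rather than necessarily what the app uses;

    // Sketch: drive an LED (or a relay) from a GPIO pin on IoT Core.
    // Needs Windows.Devices.Gpio. Pin 5 is an arbitrary, example pin number.
    var gpioController = GpioController.GetDefault(); // null on devices with no GPIO (e.g. a PC)

    if (gpioController != null)
    {
      var pin = gpioController.OpenPin(5);
      pin.SetDriveMode(GpioPinDriveMode.Output);

      pin.Write(GpioPinValue.High);   // 'light' on
      pin.Write(GpioPinValue.Low);    // 'light' off
    }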

It’s perhaps better explained by a little demo;

and the code (which isn’t particularly well written but wouldn’t take huge effort to polish up) is on github.

Windows 10 IoT Core, Raspberry PI 3 and 14342

By accident today, my Raspberry PI 3 got upgraded to Windows 10 Build 14342 – I was in the process of working on it when I noticed that it wanted to ‘update and restart’. I’d had it on the preview build 14295 for quite a long time, so I thought ‘why not?’ and let it upgrade.

I’ve been keeping my PI 2s on build 10586 because I have some things that I know work on there which didn’t work for me on 14295 on the PI 3.

However, I was attracted to 14295 because it has the option to support remote display, which is nice for someone like me who often wants to show people what’s going on from a UI perspective on a PI device. To date, I’ve mostly been doing that by connecting the PI to either a projector or to a little portable LCD screen that I carry in my bag along with its power brick and HDMI cable.

A few less things to carry is always a bonus.

I’m also attracted to the PI 3 because it’s faster and because it has built in WiFi and Bluetooth but, again, they weren’t working yet in the IoT Core preview 14295.

I could switch my OS versions around a little more by swapping SD cards but I’ve tended to stick with the view of my PI 3 as the ‘preview’ device which has new features but isn’t quite working for me yet, and of my PI 2s as devices that are a bit behind the times but reliable.

With build 14342, though, things look to have improved quite a bit in that there is now support for the built-in WiFi/Bluetooth, although there is a known issue in the;

Release Notes

which says;

  • “Ethernet and WiFi on the Raspberry Pi 3 may fail on boot; a reboot is required to resolve the issue. This issue is more prominent on slower SD cards.”

This is currently blocking me from trying this out although I’m going to try some different SD cards and see if I can get around it.

I like where 14342 is going though because it means that I can simply put 3 devices on a desk;

  • Windows 10 PC
  • Windows 10 Mobile
  • Windows 10 IoT Core

and, without needing additional dongles, I can network them by setting up my phone as a wireless hotspot and then having all 3 devices join that network. I can then get all 3 devices to display output onto the PC’s screen by;

  • Using the ‘Project My Screen’ app to get my phone’s display onto my PC’s screen.
  • Using the ‘Windows IoT Remote Client’ app to get my Raspberry PI 3’s display onto my PC’s screen.

which makes it a lot easier to demo these things.

That said, at the time of writing I don’t have a version of the ‘Project My Screen’ app that works with the modern phones (specifically, my Lumia 950XL) which is a bit of a shame. It’s hard to know whether a fix is coming there because the Anniversary Update offers the new Connect app which would let me Miracast the phone to the PC screen anyway but, right now, that doesn’t help on my PC running 10586.

In a similar vein, I’d quite like to see IoT Core offering the same ‘Wireless Hotspot’ functionality that Windows Phone offers – i.e. it’d be nice to be able to use the PI 3 by plugging it into a wired internet connection and then sharing a network out to other devices.

Hopefully, that’ll show up in some future IoT Core version. In the meantime, I’ll update this post if I can get the built-in WiFi/Bluetooth working on a PI 3 running 14342…

Update 1

Shortly after writing this post, I did try another SD card and it worked out in that I could remove the WiFi dongle and see that there was still the built-in WiFi adapter for me to use and, sure enough, I can then use the remote client to connect to the device over WiFi on that built-in adapter.

I’ve yet to try Bluetooth on that adapter and I’ll update the post as/when I try that out, but this is definitely progress for me on the PI 3.

Update 2

I did get ‘Project My Screen’ working. I’d had issues with it across 3 PCs for a long time and, although I played with drivers and all kinds of things, I never got it to work.

It worked for me today on my Surface Book. How come? Because I’ve (very) recently got a replacement 950XL phone that I hadn’t yet tried it with, so I have to conclude that there was something about my previous 950XL that made it not work with ‘Project My Screen’. This new one worked first time on the same system where things had previously failed.