Windows 10, UWP, IoT Core, SpeechSynthesizer, Raspberry PI and ‘Audio Popping’

A reader mailed in with a query about speech synthesis on Windows 10 and the Universal Windows Platform.

Essentially, they were doing something similar to things that I’d shown in demos in this Channel9 show about speech;


and in this article on the Windows blog;

Using speech in your UWP apps: It’s good to talk

and the core of the code was to synthesize various pieces of text to speech and then play them one after another – something like the sample code below, which I wrote to try and reproduce the situation. It’s an event handler taken from a fairly blank UWP application;


    async void OnLoaded(object sender, RoutedEventArgs args)
    {
      using (var synth = new SpeechSynthesizer())
      {
        using (var mediaPlayer = new MediaPlayer())
        {
          TaskCompletionSource<bool> source = null;

          mediaPlayer.MediaEnded += (s, e) =>
          {
            source.SetResult(true);
          };
          for (int i = 0; i < 100; i++)
          {
            var speechText = $"This is message number {i + 1}";
            source = new TaskCompletionSource<bool>();

            using (var speechStream = await synth.SynthesizeTextToStreamAsync(speechText))
            {
              mediaPlayer.Source = MediaSource.CreateFromStream(speechStream, speechStream.ContentType);
              mediaPlayer.Play();

              // Keep the stream alive until the MediaEnded handler has fired.
              await source.Task;
            }
            await Task.Delay(1000);
          }
        }
      }
    }

Now, if I run that code on my PC then everything works as I would expect – I get 100 spoken messages separated by at least 1 second of silence.

However, as my reader pointed out – if I run this on Windows IoT Core on Raspberry PI (2 or 3) then each spoken message is preceded by a popping sound on the audio and it’s not something that you’d want to listen to in a real-world scenario.

I hadn’t come across this before and so did a bit of searching around and found this thread on the MSDN forums;

Clicking sound during start and stop of audio playback

and the upshot of that thread seems to be that the problem is caused by an issue in the firmware on the Raspberry PI which isn’t going to be fixed, so there doesn’t seem to be a solution there.

The thread does, though, suggest that this problem might be mitigated by using the AudioGraph APIs instead of using MediaPlayer as I’ve done in my code snippet above.

That proves to be a little more tricky though because the AudioGraph APIs seem to allow you to construct inputs from;

  • audio files (AudioFileInputNode)
  • audio capture devices (AudioDeviceInputNode)
  • raw audio frames generated in code (AudioFrameInputNode)

and I don’t see an obvious way in which any of these can be used to model a stream of data, which is what I get back when I perform Text To Speech using the SpeechSynthesizer class.

The only way to proceed would appear to be to copy the speech stream into some file stream and then have an AudioFileInputNode reading from that stream.

With that in mind, I tried to write code which would;

  1. Create a temporary file
  2. Create an audio graph consisting of a connection between
    1. An AudioFileInputNode representing my temporary file
    2. An AudioDeviceOutputNode for the default audio rendering device on the system
  3. Perform Text to Speech
  4. Write the resulting stream to the temporary file
  5. Have the AudioGraph notice that the input file had been written to, thereby causing it to play the media from that file out of the default audio rendering device on the system

and my aim here was to avoid;

  1. having to recreate either the entire AudioGraph or either of the two input/output nodes within it for each piece of speech
  2. having to create a separate temporary file for every piece of speech
  3. having to create an ever-growing temporary file containing all the pieces of speech concatenated together

and I had hoped to be able to rely on the ability of nodes in an AudioGraph (and the graph itself) all having Start/Stop/Reset methods.

In practice, I’ve yet to get this to really work. I can happily get an AudioFileInputNode to play audio from a file out through its connected output node. However, once that input node has finished playing, I don’t seem to be able to find any combination of Start/Stop/Reset/Seek which will get it to play subsequent audio that arrives in the file when my code alters the file contents.
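For illustration, the sort of reuse pattern that I’d hoped would work looks roughly like the sketch below – this is hypothetical (the names inputNode, speechSynthesizer and temporaryFile are assumed from the code later in the post) and, as just described, it doesn’t actually get the node to pick up the new file contents on subsequent iterations;

```csharp
// Hypothetical sketch of the reuse I was aiming for - a single
// AudioFileInputNode (inputNode) over a single temporary file, reused
// for every piece of speech. In practice, no combination of these
// calls got the node to play the newly-written audio.
for (int i = 0; i < 100; i++)
{
  await speechSynthesizer.SynthesizeTextToFileAsync(
    $"This is message number {i + 1}", temporaryFile);

  inputNode.Reset();               // IAudioNode.Reset
  inputNode.Seek(TimeSpan.Zero);   // rewind to the start of the file
  inputNode.Start();

  await inputNode.WaitForFileCompletedAsync();
  inputNode.Stop();

  await Task.Delay(1000);
}
```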

The closest that I’ve got to working code is what follows below where I create new AudioFileInputNode instances for each piece of speech that is to be spoken.

    async void OnLoaded(object sender, RoutedEventArgs args)
    {
      var temporaryFile = await TemporaryFileCreator.CreateTemporaryFileAsync();

      using (var speechSynthesizer = new SpeechSynthesizer())
      {
        var graphResult = await AudioGraph.CreateAsync(new AudioGraphSettings(AudioRenderCategory.Media));

        if (graphResult.Status == AudioGraphCreationStatus.Success)
        {
          using (var graph = graphResult.Graph)
          {
            var outputResult = await graph.CreateDeviceOutputNodeAsync();

            if (outputResult.Status == AudioDeviceNodeCreationStatus.Success)
            {
              graph.Start();

              using (var outputNode = outputResult.DeviceOutputNode)
              {
                for (int i = 0; i < 100; i++)
                {
                  var speechText = $"This is message number {i + 1}";

                  await speechSynthesizer.SynthesizeTextToFileAsync(speechText, temporaryFile);

                  // TBD: I want to avoid creating 100 input file nodes but
                  // I don't seem (yet) to be able to get away from it, so right
                  // now I keep creating new input nodes over the same file,
                  // which changes on every iteration of the loop.
                  var inputResult = await graph.CreateFileInputNodeAsync(temporaryFile);

                  if (inputResult.Status == AudioFileNodeCreationStatus.Success)
                  {
                    using (var inputNode = inputResult.FileInputNode)
                    {
                      inputNode.AddOutgoingConnection(outputNode);
                      await inputNode.WaitForFileCompletedAsync();
                    }
                  }
                  await Task.Delay(1000);
                }
              }
              graph.Stop();
            }
          }
        }
      }
      await temporaryFile.DeleteAsync();
    }

and that code depends on a class that can create temporary files;

  public static class TemporaryFileCreator
  {
    public static async Task<StorageFile> CreateTemporaryFileAsync()
    {
      var fileName = $"{Guid.NewGuid()}.bin";

      var storageFile =
        await ApplicationData.Current.TemporaryFolder.CreateFileAsync(fileName);

      return (storageFile);
    }
  }

and also on an extension to the SpeechSynthesizer which will take the speech and write it to a file;

 public static class SpeechSynthesizerExtensions
  {
    public static async Task<StorageFile> SynthesizeTextToTemporaryFileAsync(this SpeechSynthesizer synthesizer, string text)
    {
      var storageFile = await TemporaryFileCreator.CreateTemporaryFileAsync();

      await SynthesizeTextToFileAsync(synthesizer, text, storageFile);

      return (storageFile);
    }
    public static async Task SynthesizeTextToFileAsync(this SpeechSynthesizer synthesizer, string text, StorageFile file)
    {
      using (var speechStream = await synthesizer.SynthesizeTextToStreamAsync(text))
      {
        using (var fileStream = await file.OpenAsync(FileAccessMode.ReadWrite))
        {
          await RandomAccessStream.CopyAndCloseAsync(speechStream, fileStream);
        }
      }
    }
  }

and also on an extension to the AudioFileInputNode class which takes the FileCompleted event that it fires and turns it into something that can be awaited;

  public static class AudioFileInputNodeExtensions
  {
    public static async Task WaitForFileCompletedAsync(this AudioFileInputNode inputNode)
    {
      TypedEventHandler<AudioFileInputNode, object> handler = null;
      TaskCompletionSource<bool> completed = new TaskCompletionSource<bool>();

      handler = (s, e) =>
      {
        s.FileCompleted -= handler;
        completed.SetResult(true);
      };
      inputNode.FileCompleted += handler;

      await completed.Task;
    }
  }

This code seems to work fine on both PC and Raspberry PI. On Raspberry PI, though, I still get an audible ‘pop’ when the code first starts up, but I no longer get one for every piece of speech – i.e. the situation feels improved but not perfect. I’d ideally like to;

  • Get rid of the code that ends up creating N AudioFileInputNode instances rather than 1 and somehow make the Start/Stop/Reset/Seek approach work.
  • Get rid of that initial audible ‘pop’.

I’ll update the post if I manage to come up with a better solution and do feel very free to add comments below if you know of either a solution to the original problem or a better solution than I’ve found to date…

4 thoughts on “Windows 10, UWP, IoT Core, SpeechSynthesizer, Raspberry PI and ‘Audio Popping’”

  1. I think it’s a common problem with HDMI (“hdmi silent stream”). You can mitigate it by playing a waveform full of zeros (or some very silent noise if the drivers are being extra clever) in the background for the entire duration you want to be “pop-free”, and play your actual sound over that.
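     If that suggestion is right then I’d guess (an untested assumption on my part, reusing the graph and outputNode names from the code in the post) that it could be sketched with an AudioFrameInputNode which queues zero-filled frames for as long as the app wants to be ‘pop-free’;

     ```csharp
     // Untested sketch of the 'silent stream' suggestion above. An
     // AudioFrameInputNode keeps feeding zero-filled frames to the
     // output node so the render device never goes idle.
     var props = graph.EncodingProperties;
     var silentNode = graph.CreateFrameInputNode(props);
     silentNode.AddOutgoingConnection(outputNode);

     silentNode.QuantumStarted += (node, args) =>
     {
       if (args.RequiredSamples > 0)
       {
         // A newly created AudioFrame's buffer is zero-initialised (my
         // assumption), so queuing it unmodified renders silence.
         uint bufferSize =
           (uint)args.RequiredSamples * props.ChannelCount * sizeof(float);
         node.AddFrame(new AudioFrame(bufferSize));
       }
     };
     silentNode.Start();
     ```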

  2. Hi Mike,

    I found a better way to render the SpeechSynthesizerStream through AudioGraph without the need for temporary files. In short, I use the AudioFrameInputNode as outlined in the AudioCreation sample of the Windows-Universal-Samples collection.

    I’ve documented this approach on my blog, which you can find here: http://ian.bebbs.co.uk/posts/CombiningUwpSpeechSynthesizerWithAudioGraph

    Unfortunately it doesn’t solve the issue with the popping noise on a RaspberryPi when the application starts but I am still investigating this.

    Cheers,
    Ian

  1. Nice work Ian – I had wondered what it would take to go down the route of producing the audio frames, and I have played with WAV/RIFF files before (on this blog :-)) but I chickened out and went down the route of the temporary file, so it’s great to see you make this work (albeit with the caveat of the audio popping which still crops up).

      1. Quick update: Turns out the USB Sound Adapter I referred to in my blog does indeed prevent the ‘popping’ noises at application start-up. I assume it would also resolve the issue with popping noises when emitting speech through MediaPlayer (but haven’t tried this).

Comments are closed.