A reader mailed in with a query about speech synthesis on Windows 10 and the Universal Windows Platform.
Essentially, they were doing something similar to what I’d shown in demos in this Channel9 show about speech;
and in this article on the Windows blog;
and the core of the code was to synthesize various pieces of text to speech and then play them one after another – something like the sample code below, which I wrote to try and reproduce the situation; it’s an event handler taken from a fairly blank UWP application;
async void OnLoaded(object sender, RoutedEventArgs args)
{
  using (var synth = new SpeechSynthesizer())
  {
    using (var mediaPlayer = new MediaPlayer())
    {
      TaskCompletionSource<bool> source = null;

      mediaPlayer.MediaEnded += (s, e) =>
      {
        source.SetResult(true);
      };

      for (int i = 0; i < 100; i++)
      {
        var speechText = $"This is message number {i + 1}";

        source = new TaskCompletionSource<bool>();

        using (var speechStream = await synth.SynthesizeTextToStreamAsync(speechText))
        {
          mediaPlayer.Source = MediaSource.CreateFromStream(speechStream, speechStream.ContentType);
          mediaPlayer.Play();
        }
        await source.Task;

        await Task.Delay(1000);
      }
    }
  }
}
Now, if I run that code on my PC then everything works as I would expect – I get 100 spoken messages separated by at least 1 second of silence.
However, as my reader pointed out – if I run this on Windows IoT Core on a Raspberry Pi (2 or 3) then each spoken message is preceded by a popping sound on the audio, and that’s not something you’d want to listen to in a real-world scenario.
I hadn’t come across this before and so did a bit of searching around and found this thread on the MSDN forums;
and the upshot of that thread seems to be that the problem is thought to be caused by an issue in the firmware on the Raspberry Pi that isn’t going to be fixed, so there doesn’t really seem to be a solution there.
The thread does, though, suggest that this problem might be mitigated by using the AudioGraph APIs instead of using MediaPlayer as I’ve done in my code snippet above.
That proves to be a little more tricky, though, because the AudioGraph APIs only seem to allow you to construct inputs from;
- Files (via AudioFileInputNode)
- Devices (via AudioDeviceInputNode)
- Frames (via AudioFrameInputNode)
and I don’t see an obvious way in which any of these can be used to model a stream of data, which is what I get back when I perform text to speech using the SpeechSynthesizer class.
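To make that a little more concrete, the sketch below shows the shape of those creation calls (the graph and someFile names are purely illustrative and I’m ignoring the Status checks for brevity) – notably, none of them takes the kind of IRandomAccessStream that SynthesizeTextToStreamAsync hands back;

async Task CreateInputNodesAsync(AudioGraph graph, StorageFile someFile)
{
  // Files: the input node is created over an IStorageFile, not over a stream.
  var fileResult = await graph.CreateFileInputNodeAsync(someFile);
  AudioFileInputNode fileNode = fileResult.FileInputNode;

  // Devices: the input node captures from an audio capture device (e.g. a microphone).
  var deviceResult = await graph.CreateDeviceInputNodeAsync(MediaCategory.Other);
  AudioDeviceInputNode deviceNode = deviceResult.DeviceInputNode;

  // Frames: the input node is fed AudioFrame instances that the app generates itself.
  AudioFrameInputNode frameNode = graph.CreateFrameInputNode(graph.EncodingProperties);
}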
The only way to proceed would appear to be to copy the speech stream into some file stream and then have an AudioFileInputNode reading from that file.
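In outline, that copy looks something like the snippet below (again, the synthesizer, text, file, graph and outputNode names are illustrative, and this is essentially what the SynthesizeTextToFileAsync extension later in this post ends up doing);

// Sketch: copy the synthesized speech into a file, then create a file input
// node over that file (ignoring error handling for brevity).
using (var speechStream = await synthesizer.SynthesizeTextToStreamAsync(text))
using (var fileStream = await file.OpenAsync(FileAccessMode.ReadWrite))
{
  await RandomAccessStream.CopyAndCloseAsync(speechStream, fileStream);
}
var inputResult = await graph.CreateFileInputNodeAsync(file);
inputResult.FileInputNode.AddOutgoingConnection(outputNode);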
With that in mind, I tried to write code which would;
- Create a temporary file
- Create an audio graph consisting of a connection between
- An AudioFileInputNode representing my temporary file
- An AudioDeviceOutputNode for the default audio rendering device on the system
- Perform Text to Speech
- Write the resulting stream to the temporary file
- Have the AudioGraph notice that the input file had been written to, thereby causing it to play the media from that file out of the default audio rendering device on the system
and my aim here was to avoid;
having to recreate the entire AudioGraph, or either of the two input/output nodes within it, for each piece of speech
- having to create a separate temporary file for every piece of speech
- having to create an ever-growing temporary file containing all the pieces of speech concatenated together
and I had hoped to be able to rely on the ability of nodes in an AudioGraph (and the graph itself) all having Start/Stop/Reset methods.
In practice, I’ve yet to get this to really work. I can happily get an AudioFileInputNode to play audio from a file out through its connected output node but, once that input node has finished playing, I don’t seem to be able to find any combination of Start/Stop/Reset/Seek that will get it to play subsequent audio that arrives in the file when my code alters the file’s contents.
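For reference, the sort of code that I was hoping to be able to write is sketched below – it reuses the graph, outputNode, temporaryFile and speechSynthesizer pieces (and the extension methods) from the code later in this post, and it’s a sketch of the idea rather than working code, because I haven’t found a combination of these calls that actually makes the node play the new file contents;

// One file input node created up front over the temporary file and then
// reused for each piece of speech by rewriting the file and poking the node.
var inputResult = await graph.CreateFileInputNodeAsync(temporaryFile);
var inputNode = inputResult.FileInputNode;
inputNode.AddOutgoingConnection(outputNode);

for (int i = 0; i < 100; i++)
{
  // Overwrite the temporary file with the next piece of speech.
  await speechSynthesizer.SynthesizeTextToFileAsync($"This is message number {i + 1}", temporaryFile);

  // Try to persuade the node to pick up and play the new file contents - I've
  // not found a combination of these calls that reliably does that.
  inputNode.Stop();
  inputNode.Reset();
  inputNode.Seek(TimeSpan.Zero);
  inputNode.Start();

  await inputNode.WaitForFileCompletedAsync();
  await Task.Delay(1000);
}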
The closest that I’ve got to working code is what follows below, where I create a new AudioFileInputNode instance for each piece of speech that is to be spoken.
async void OnLoaded(object sender, RoutedEventArgs args)
{
  var temporaryFile = await TemporaryFileCreator.CreateTemporaryFileAsync();

  using (var speechSynthesizer = new SpeechSynthesizer())
  {
    var graphResult = await AudioGraph.CreateAsync(new AudioGraphSettings(AudioRenderCategory.Media));

    if (graphResult.Status == AudioGraphCreationStatus.Success)
    {
      using (var graph = graphResult.Graph)
      {
        var outputResult = await graph.CreateDeviceOutputNodeAsync();

        if (outputResult.Status == AudioDeviceNodeCreationStatus.Success)
        {
          graph.Start();

          using (var outputNode = outputResult.DeviceOutputNode)
          {
            for (int i = 0; i < 100; i++)
            {
              var speechText = $"This is message number {i + 1}";

              await speechSynthesizer.SynthesizeTextToFileAsync(speechText, temporaryFile);

              // TBD: I want to avoid this creating of 100 input file nodes but
              // I don't seem (yet) to be able to get away from it so right now
              // I keep creating new input nodes over the same file which changes
              // every iteration of the loop.
              var inputResult = await graph.CreateFileInputNodeAsync(temporaryFile);

              if (inputResult.Status == AudioFileNodeCreationStatus.Success)
              {
                using (var inputNode = inputResult.FileInputNode)
                {
                  inputNode.AddOutgoingConnection(outputNode);

                  await inputNode.WaitForFileCompletedAsync();
                }
              }
              await Task.Delay(1000);
            }
          }
          graph.Stop();
        }
      }
    }
  }
  await temporaryFile.DeleteAsync();
}
and that code depends on a class that can create temporary files;
public static class TemporaryFileCreator
{
  public static async Task<StorageFile> CreateTemporaryFileAsync()
  {
    var fileName = $"{Guid.NewGuid()}.bin";

    var storageFile = await ApplicationData.Current.TemporaryFolder.CreateFileAsync(fileName);

    return (storageFile);
  }
}
and also on an extension to the SpeechSynthesizer which will take the speech and write it to a file;
public static class SpeechSynthesizerExtensions
{
  public static async Task<StorageFile> SynthesizeTextToTemporaryFileAsync(this SpeechSynthesizer synthesizer, string text)
  {
    var storageFile = await TemporaryFileCreator.CreateTemporaryFileAsync();

    await SynthesizeTextToFileAsync(synthesizer, text, storageFile);

    return (storageFile);
  }
  public static async Task SynthesizeTextToFileAsync(this SpeechSynthesizer synthesizer, string text, StorageFile file)
  {
    using (var speechStream = await synthesizer.SynthesizeTextToStreamAsync(text))
    {
      using (var fileStream = await file.OpenAsync(FileAccessMode.ReadWrite))
      {
        await RandomAccessStream.CopyAndCloseAsync(speechStream, fileStream);
      }
    }
  }
}
and also on an extension to the AudioFileInputNode class which takes the FileCompleted event that it fires and turns it into something that can be awaited;
public static class AudioInputFileNodeExtensions
{
  public static async Task WaitForFileCompletedAsync(this AudioFileInputNode inputNode)
  {
    TypedEventHandler<AudioFileInputNode, object> handler = null;

    TaskCompletionSource<bool> completed = new TaskCompletionSource<bool>();

    handler = (s, e) =>
    {
      s.FileCompleted -= handler;
      completed.SetResult(true);
    };
    inputNode.FileCompleted += handler;

    await completed.Task;
  }
}
This code seems to work fine on both the PC and the Raspberry Pi but I find that, on the Raspberry Pi, I still get an audible ‘pop’ when the code first starts up; I don’t, though, get an audible ‘pop’ for every piece of speech – i.e. the situation feels improved but not perfect. I’d ideally like to;
- Get rid of the code that ends up creating N AudioFileInputNode instances rather than 1 and somehow make the Start/Stop/Reset/Seek approach work.
- Get rid of that initial audible ‘pop’.
I’ll update the post if I manage to come up with a better solution and do feel very free to add comments below if you know of either a solution to the original problem or a better workaround than the one I’ve found to date…