NB: The usual blog disclaimer for this site applies to posts around HoloLens. I am not on the HoloLens team and I have no details on HoloLens other than what is on the public web, so what I post here is just from my own experience experimenting with pieces that are publicly available. You should always check out the official developer site for the product documentation.
Voice is really important as both an input and output mechanism on the HoloLens and there’s a great section in the documentation over here on working with it;
and it’s fairly easy to have the device speak in that I can add the “Text To Speech Manager” script onto a game object in Unity;
and that script comes from the Utilities section of the HoloToolkit-Unity;
and once I’ve got that in place I can easily ask it to speak on my behalf with code such as this, which runs within a method on a script attached to my GameObject;
var textToSpeech = this.GetComponent&lt;TextToSpeechManager&gt;();

textToSpeech.SpeakText("One");
textToSpeech.SpeakText("Two");
textToSpeech.SpeakText("Three");
And you can see the code for the TextToSpeechManager over here on GitHub.
There’s one caveat with the code above, though: the model employed by the Text to Speech manager here is one of ‘fire and forget’, so my code, which is trying to have 3 distinct pieces of speech spoken, doesn’t really work. What I usually hear when I run it is a spoken output of;
“Three”
because the third call overruns the second call which overruns the first call.
I wanted to see if I could tweak that a little and so I wrote some exploratory code based on what I saw in a slightly earlier check-in of the Text to Speech manager. My main change was to alter the implementation such that calls to SpeakText or SpeakSsml were effectively queued, with a second call executing only once the first had completed playing.
I initially thought that this would be pretty easy but it turned out that the Unity AudioSource object which underpins this code doesn’t really seem to have a great way of letting you know when the audio has stopped playing. It seems that the options are to either;
- Call the Play() or PlayScheduled() method with some kind of delay so as to hold back a particular piece of speech until the pieces that have gone before it have finished playing.
- Poll the isPlaying property to see when playback has ended.
There may be other/better mechanisms but that’s all I found by doing a search around the web.
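To make those two options concrete, here’s a rough sketch of my own (not code from the toolkit) of what each might look like, assuming an audioSource field that has been assigned in the inspector. Note that with PlayScheduled a single AudioSource can’t overlap clips, because reassigning its clip cuts off whatever is currently playing;

using System.Collections;
using UnityEngine;

public class PlaybackCompletionSketch : MonoBehaviour
{
    [SerializeField]
    private AudioSource audioSource;

    double nextStartTime;

    // Option 1: schedule each clip at a DSP time computed from the lengths
    // of the clips that went before it. In practice this needs one
    // AudioSource per clip, since reassigning .clip on a busy source
    // stops the current playback.
    public void PlayAfterPrevious(AudioSource source, AudioClip clip)
    {
        var startTime = System.Math.Max(AudioSettings.dspTime, this.nextStartTime);

        source.clip = clip;
        source.PlayScheduled(startTime);

        this.nextStartTime = startTime + clip.length;
    }
    // Option 2: poll the isPlaying property until playback has ended.
    public IEnumerator WaitForPlaybackEnd()
    {
        while (this.audioSource.isPlaying)
        {
            yield return new WaitForSeconds(0.25f);
        }
        Debug.Log("Playback has finished");
    }
}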
The existing TextToSpeechManager code already does work (from line 239 onwards of the PlaySpeech function) to move most of the effort of generating speech from text via the UWP’s SpeechSynthesizer APIs onto a separate task but (as the comments around line 297 of PlaySpeech say) the actual playback of the audio has to happen on the main Unity thread, although I don’t believe that the call to Play is a blocking call that would halt that Unity thread.
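Distilled down, the pattern the existing code follows looks something like this sketch (my paraphrase rather than the toolkit’s exact code), where SynthesizeSpeechAsync is a hypothetical stand-in for the synthesis work and 16KHz is just an assumed sample rate;

Task.Run(
    async () =>
    {
        // Hypothetical helper: produce Unity-format samples off the main thread.
        float[] samples = await SynthesizeSpeechAsync(text);

        // Playback has to be marshalled back onto Unity's app (main) thread.
        UnityEngine.WSA.Application.InvokeOnAppThread(
            () =>
            {
                var clip = AudioClip.Create("Speech", samples.Length, 1, 16000, false);
                clip.SetData(samples, 0);

                audioSource.clip = clip;

                // Starts playback; it doesn't block the Unity thread.
                audioSource.Play();
            },
            false);
    });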
I wanted to leave as much of this code in place as possible while adding in my extra pieces that queued up speech, rather than always trying to play it even when the AudioSource was already busy, which is what the existing code seems to do. It felt like that meant implementing some kind of queuing mechanism which took into account;
- Needing to be able to deal with the idea that the production of the speech itself is done asynchronously.
- Needing to poll the isPlaying flag at some frequency once speech playback had started in order to determine completion.
Update? Coroutines? Tasks?
At this point, I could see a few different ways in which I might be able to implement this functionality with Unity and I figured that I could maybe;
- Do some work from a call to Update() to poll to see if any current speech had finished playing.
- Use Unity’s Coroutines in order to see if I could poll the speech status from there.
- Try and wrap up something that used a TaskCompletionSource and which presented the polling as something that could be awaited in C#.
The last one is perhaps the most elegant but, in the end, I went with using the InvokeRepeating method to schedule some work to be checked ‘every so often’. There are probably better ways of doing this but it’s all part of learning.
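For what it’s worth, here’s a speculative sketch (nothing like this is in the toolkit) of how that third option might look, using a coroutine to do the polling and a TaskCompletionSource to surface playback as something awaitable, assuming a scripting environment where System.Threading.Tasks is available (as it is in the UWP player);

using System.Collections;
using System.Threading.Tasks;
using UnityEngine;

public class AwaitablePlayback : MonoBehaviour
{
    [SerializeField]
    private AudioSource audioSource;

    // Returns a Task which completes once the clip has finished playing.
    public Task PlayAsync(AudioClip clip)
    {
        var completion = new TaskCompletionSource<bool>();

        this.audioSource.clip = clip;
        this.audioSource.Play();
        this.StartCoroutine(this.PollForCompletion(completion));

        return completion.Task;
    }
    IEnumerator PollForCompletion(TaskCompletionSource<bool> completion)
    {
        while (this.audioSource.isPlaying)
        {
            yield return new WaitForSeconds(0.25f);
        }
        completion.SetResult(true);
    }
}

A caller could then write await PlayAsync(clipOne); await PlayAsync(clipTwo); and get sequenced playback, but, as I say, I went the InvokeRepeating route below.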
In order to get something going, I took the existing code and did a few things.
1 – Refactoring into a UnityAudioHelper
I took some of the code from the existing TextToSpeechManager and refactored it into the ‘audio helper’ class below. Largely, this is just the original code moved into its own static class;
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License. See LICENSE in the project root for license information.

using System;
using System.Collections;
using System.Collections.Generic;
using UnityEngine;

#if WINDOWS_UWP
using System.Threading.Tasks;
using Windows.Foundation;
using Windows.Media.SpeechSynthesis;
using System.Runtime.InteropServices.WindowsRuntime;
using Windows.Storage.Streams;
#endif

namespace HoloToolkit.Unity
{
    public static class UnityAudioHelper
    {
        /// <summary>
        /// Converts two bytes to one float in the range -1 to 1
        /// </summary>
        /// <param name="firstByte">The first byte.</param>
        /// <param name="secondByte">The second byte.</param>
        /// <returns>The converted float.</returns>
        private static float BytesToFloat(byte firstByte, byte secondByte)
        {
            // Convert two bytes to one short (little endian)
            short s = (short)((secondByte << 8) | firstByte);

            // Convert to range from -1 to (just below) 1
            return s / 32768.0F;
        }
        /// <summary>
        /// Dynamically creates an <see cref="AudioClip"/> that represents raw Unity audio data.
        /// </summary>
        /// <param name="name">The name of the dynamically generated clip.</param>
        /// <param name="audioData">Raw Unity audio data.</param>
        /// <param name="sampleCount">The number of samples in the audio data.</param>
        /// <param name="frequency">The frequency of the audio data.</param>
        /// <returns>The <see cref="AudioClip"/>.</returns>
        internal static AudioClip ToClip(string name, float[] audioData, int sampleCount, int frequency)
        {
            // Create the audio clip
            var clip = AudioClip.Create(name, sampleCount, 1, frequency, false);

            // Set the data
            clip.SetData(audioData, 0);

            // Done
            return clip;
        }
        /// <summary>
        /// Converts raw WAV data into Unity formatted audio data.
        /// </summary>
        /// <param name="wavAudio">The raw WAV data.</param>
        /// <param name="sampleCount">The number of samples in the audio data.</param>
        /// <param name="frequency">The frequency of the audio data.</param>
        /// <returns>The Unity formatted audio data.</returns>
        internal static float[] ToUnityAudio(byte[] wavAudio, out int sampleCount, out int frequency)
        {
            // Determine if mono or stereo
            int channelCount = wavAudio[22]; // Speech audio data is always mono but read actual header value for processing

            // Get the frequency
            frequency = BitConverter.ToInt32(wavAudio, 24);

            // Get past all the other sub chunks to get to the data subchunk:
            int pos = 12; // First subchunk ID from 12 to 16

            // Keep iterating until we find the data chunk (i.e. 64 61 74 61 in hex, 100 97 116 97 in decimal)
            while (!(wavAudio[pos] == 100 && wavAudio[pos + 1] == 97 && wavAudio[pos + 2] == 116 && wavAudio[pos + 3] == 97))
            {
                pos += 4;
                int chunkSize = wavAudio[pos] + wavAudio[pos + 1] * 256 + wavAudio[pos + 2] * 65536 + wavAudio[pos + 3] * 16777216;
                pos += 4 + chunkSize;
            }
            pos += 8;

            // Pos is now positioned to start of actual sound data.
            sampleCount = (wavAudio.Length - pos) / 2; // 2 bytes per sample (16 bit sound mono)
            if (channelCount == 2)
            {
                sampleCount /= 2; // 4 bytes per sample (16 bit stereo)
            }

            // Allocate memory (supporting left channel only)
            float[] unityData = new float[sampleCount];

            // Write to the float array:
            int i = 0;
            while (pos < wavAudio.Length)
            {
                unityData[i] = BytesToFloat(wavAudio[pos], wavAudio[pos + 1]);
                pos += 2;
                if (channelCount == 2)
                {
                    pos += 2;
                }
                i++;
            }

            // Done
            return unityData;
        }
#if WINDOWS_UWP
        internal static async Task<byte[]> SynthesizeToUnityDataAsync(
            string text,
            Func<string, IAsyncOperation<SpeechSynthesisStream>> speakFunc)
        {
            byte[] buffer = null;

            // Speak and get stream
            using (var speechStream = await speakFunc(text))
            {
                // Create buffer
                buffer = new byte[speechStream.Size];

                // Get input stream and the size of the original stream
                using (var inputStream = speechStream.GetInputStreamAt(0))
                {
                    await inputStream.ReadAsync(buffer.AsBuffer(), (uint)buffer.Length, InputStreamOptions.None);
                }
            }
            return (buffer);
        }
#endif
    }
}
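As a quick, hypothetical illustration of how those helper pieces chain together (with wavBytes standing in for the WAV data that comes back from the synthesizer);

// Hypothetical usage: convert raw WAV bytes from the synthesizer into a clip.
int sampleCount;
int frequency;

float[] unityData = UnityAudioHelper.ToUnityAudio(wavBytes, out sampleCount, out frequency);

AudioClip clip = UnityAudioHelper.ToClip("Speech", unityData, sampleCount, frequency);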
2 – Add a ‘Queue Worker’
I added in my own abstract base class which tries to encapsulate the idea of a queue of work to be processed, with items taken off the queue and worked on in sequence. This is quite a generic problem to solve and you can get into aspects of multi-threading and so on which I’ve avoided here, and this little class isn’t as generic as it could be because I’m bending it to my specific requirements, in that;
- A queue is polled periodically rather than (e.g.) signalling some kind of synchronization object when work is available. This isn’t how I’d usually write this sort of class, but I have a specific requirement to poll the AudioSource and Unity’s model is quite amenable to polling.
- An item of work is de-queued.
- The item of work is executed and it is assumed that the item will take steps to avoid blocking.
- The completion of the item of work is determined by polling some method to check a ‘completed’ status.
and my little queue class ended up looking like this;
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License. See LICENSE in the project root for license information.

using System;
using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public abstract class IntervalWorkQueue : MonoBehaviour
{
    [Tooltip("The interval (sec) on which to check queued speech.")]
    [SerializeField]
    private float queueInterval = 0.25f;

    public enum WorkState
    {
        Idle,
        Starting,
        PollingForCompletion
    }
    public IntervalWorkQueue()
    {
        this.workState = WorkState.Idle;
        this.queueEntries = new Queue<object>();
    }
    public void AddWorkItem(object workItem)
    {
        this.queueEntries.Enqueue(workItem);
    }
    public void Start()
    {
        base.InvokeRepeating("ProcessQueue", queueInterval, queueInterval);
    }
    void ProcessQueue()
    {
        if ((this.workState == WorkState.Starting) && (this.WorkIsInProgress))
        {
            this.workState = WorkState.PollingForCompletion;
        }
        if ((this.workState == WorkState.PollingForCompletion) && (!this.WorkIsInProgress))
        {
            this.workState = WorkState.Idle;
        }
        if ((this.workState == WorkState.Idle) && (this.WorkedIsQueued))
        {
            this.workState = WorkState.Starting;

            object workEntry = this.queueEntries.Dequeue();
            this.DoWorkItem(workEntry);
        }
    }
    protected bool WorkedIsQueued
    {
        get
        {
            return (this.queueEntries.Count > 0);
        }
    }
    protected abstract void DoWorkItem(object item);

    protected abstract bool WorkIsInProgress { get; }

    WorkState workState;
    Queue<object> queueEntries;
}
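One thing to note is that the state machine above assumes a started item will be seen as ‘in progress’ on at least one poll before it completes; work that finished instantly would leave the queue stuck in the Starting state. That’s fine for my speech scenario where isPlaying becomes true shortly after playback starts. As a trivial, made-up example of the contract a derived class signs up to;

using UnityEngine;

// Made-up example: logs each queued string and pretends the 'work' takes a
// second, so that the base class's polling sees it in progress at least once.
public class LoggingWorkQueue : IntervalWorkQueue
{
    float finishTime;

    protected override void DoWorkItem(object item)
    {
        Debug.Log((string)item);
        this.finishTime = Time.time + 1.0f;
    }
    protected override bool WorkIsInProgress
    {
        get
        {
            return (Time.time < this.finishTime);
        }
    }
}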
3 – Derive a Text to Speech Manager
I derive a new variant of the original TextToSpeechManager from my new IntervalWorkQueue, as in the code snippet below, and this class makes calls out to the refactored UnityAudioHelper which I listed earlier. The main ‘features’ here are that the Start() method calls base.Start() in order to get the interval work queue up and running, that the DoWorkItem method and WorkIsInProgress property have been overridden to call into the original code, and that the original PlaySpeech method has been reworked to simply call base.AddWorkItem to add an entry onto the queue.
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License. See LICENSE in the project root for license information.

using System;
using UnityEngine;
using System.Collections;

#if WINDOWS_UWP
using Windows.Foundation;
using Windows.Media.SpeechSynthesis;
using Windows.Storage.Streams;
using System.Linq;
using System.Threading.Tasks;
using System.Collections.Generic;
using System.Runtime.InteropServices.WindowsRuntime;
#endif // WINDOWS_UWP

namespace HoloToolkit.Unity
{
    /// <summary>
    /// The well-known voices that can be used by <see cref="TextToSpeechManager"/>.
    /// </summary>
    public enum TextToSpeechVoice
    {
        /// <summary>The default system voice.</summary>
        Default,

        /// <summary>Microsoft David Mobile</summary>
        David,

        /// <summary>Microsoft Mark Mobile</summary>
        Mark,

        /// <summary>Microsoft Zira Mobile</summary>
        Zira
    }
    public class TextToSpeechManager : IntervalWorkQueue
    {
        [Tooltip("The audio source where speech will be played.")]
        [SerializeField]
        private AudioSource audioSource;

        [Tooltip("The voice that will be used to generate speech.")]
        [SerializeField]
        private TextToSpeechVoice voice;

        public AudioSource AudioSource
        {
            get { return (this.audioSource); }
            set { this.audioSource = value; }
        }
        public TextToSpeechVoice Voice
        {
            get { return (this.voice); }
            set { this.voice = value; }
        }
        /// <summary>
        /// Speaks the specified SSML markup using text-to-speech.
        /// </summary>
        /// <param name="ssml">The SSML markup to speak.</param>
        public void SpeakSsml(string ssml)
        {
            // Make sure there's something to speak
            if (string.IsNullOrEmpty(ssml))
            {
                return;
            }
            // Pass to helper method
#if WINDOWS_UWP
            PlaySpeech(ssml, this.voice, synthesizer.SynthesizeSsmlToStreamAsync);
#else
            LogSpeech(ssml);
#endif
        }
        /// <summary>
        /// Speaks the specified text using text-to-speech.
        /// </summary>
        /// <param name="text">The text to speak.</param>
        public void SpeakText(string text)
        {
            // Make sure there's something to speak
            if (string.IsNullOrEmpty(text))
            {
                return;
            }
            // Pass to helper method
#if WINDOWS_UWP
            PlaySpeech(text, this.voice, synthesizer.SynthesizeTextToStreamAsync);
#else
            LogSpeech(text);
#endif
        }
        /// <summary>
        /// Logs speech text that normally would have been played.
        /// </summary>
        /// <param name="text">The speech text.</param>
        void LogSpeech(string text)
        {
            Debug.LogFormat("Speech not supported in editor. \"{0}\"", text);
        }
        public new void Start()
        {
            base.Start();

            try
            {
                if (audioSource == null)
                {
                    Debug.LogError("An AudioSource is required and should be assigned to 'Audio Source' in the inspector.");
                }
                else
                {
#if WINDOWS_UWP
                    this.synthesizer = new SpeechSynthesizer();
#endif
                }
            }
            catch (Exception ex)
            {
                Debug.LogError("Could not start Speech Synthesis");
                Debug.LogException(ex);
            }
        }
        protected override void DoWorkItem(object item)
        {
#if WINDOWS_UWP
            try
            {
                SpeechEntry speechEntry = item as SpeechEntry;

                // Need await, so most of this will be run as a new Task in its own thread.
                // This is good since it frees up Unity to keep running anyway.
                Task.Run(async () =>
                {
                    // Use the voice that was captured when this entry was queued
                    this.ChangeVoice(speechEntry.Voice);

                    var buffer = await UnityAudioHelper.SynthesizeToUnityDataAsync(
                        speechEntry.Text,
                        speechEntry.SpeechGenerator);

                    // Convert raw WAV data into Unity audio data
                    int sampleCount = 0;
                    int frequency = 0;

                    float[] unityData = null;

                    unityData = UnityAudioHelper.ToUnityAudio(
                        buffer, out sampleCount, out frequency);

                    // The remainder must be done back on Unity's main thread
                    UnityEngine.WSA.Application.InvokeOnAppThread(
                        () =>
                        {
                            // Convert to an audio clip
                            var clip = UnityAudioHelper.ToClip(
                                "Speech", unityData, sampleCount, frequency);

                            // Assign the clip to the audio source
                            audioSource.clip = clip;

                            // Play audio
                            audioSource.Play();
                        },
                        false);
                });
            }
            catch (Exception ex)
            {
                Debug.LogErrorFormat("Speech generation problem: \"{0}\"", ex.Message);
            }
#endif
        }
        protected override bool WorkIsInProgress
        {
            get
            {
#if WINDOWS_UWP
                return (this.audioSource.isPlaying);
#else
                return (false);
#endif
            }
        }
#if WINDOWS_UWP
        class SpeechEntry
        {
            public string Text { get; set; }
            public TextToSpeechVoice Voice { get; set; }
            public Func<string, IAsyncOperation<SpeechSynthesisStream>> SpeechGenerator { get; set; }
        }
        private SpeechSynthesizer synthesizer;
        private VoiceInformation voiceInfo;

        /// <summary>
        /// Executes a function that generates a speech stream and then converts and plays it in Unity.
        /// </summary>
        /// <param name="text">A raw text version of what's being spoken for use in debug messages when speech isn't supported.</param>
        /// <param name="voice">The voice with which to speak.</param>
        /// <param name="speakFunc">The actual function that will be executed to generate speech.</param>
        void PlaySpeech(
            string text,
            TextToSpeechVoice voice,
            Func<string, IAsyncOperation<SpeechSynthesisStream>> speakFunc)
        {
            // Make sure there's something to speak
            if (speakFunc == null)
            {
                throw new ArgumentNullException(nameof(speakFunc));
            }
            if (synthesizer != null)
            {
                base.AddWorkItem(
                    new SpeechEntry()
                    {
                        Text = text,
                        Voice = voice,
                        SpeechGenerator = speakFunc
                    }
                );
            }
            else
            {
                Debug.LogErrorFormat("Speech not initialized. \"{0}\"", text);
            }
        }
        void ChangeVoice(TextToSpeechVoice voice)
        {
            // Change voice?
            if (voice != TextToSpeechVoice.Default)
            {
                // Get name
                var voiceName = Enum.GetName(typeof(TextToSpeechVoice), voice);

                // See if it's never been found or is changing
                if ((voiceInfo == null) || (!voiceInfo.DisplayName.Contains(voiceName)))
                {
                    // Search for voice info
                    voiceInfo = SpeechSynthesizer.AllVoices.Where(
                        v => v.DisplayName.Contains(voiceName)).FirstOrDefault();

                    // If found, select
                    if (voiceInfo != null)
                    {
                        synthesizer.Voice = voiceInfo;
                    }
                    else
                    {
                        Debug.LogErrorFormat("TTS voice {0} could not be found.", voiceName);
                    }
                }
            }
        }
#endif // WINDOWS_UWP
    }
}
Wrapping Up
That’s pretty much it for this post. If I now add an instance of my TextToSpeechManager to a game object, with the interval set so that the queue is polled at an (aggressive!) 250ms, then I find that code like this;
var textToSpeech = this.GetComponent&lt;TextToSpeechManager&gt;();

textToSpeech.SpeakText("One");
textToSpeech.SpeakText("Two");
textToSpeech.SpeakText("Three");
now plays 3 distinct pieces of speech rather than just one, which is what I had been hoping for when I started the post.