Windows 10, UWP, IoT Core, SpeechSynthesizer, Raspberry Pi and ‘Audio Popping’

A reader mailed in with a query about speech synthesis on Windows 10 and the Universal Windows Platform.

Essentially, they were doing something similar to things that I’d shown in demos in this Channel9 show about speech;

image

and in this article on the Windows blog;

Using speech in your UWP apps: It’s good to talk

and the core of the code was to synthesize various pieces of text to speech and then play them one after another. The sample code below is something that I wrote to try and reproduce the situation – it’s an event handler taken from a fairly blank UWP application;


    async void OnLoaded(object sender, RoutedEventArgs args)
    {
      using (var synth = new SpeechSynthesizer())
      {
        using (var mediaPlayer = new MediaPlayer())
        {
          TaskCompletionSource<bool> source = null;

          // Complete the pending task whenever the player finishes the current piece of speech.
          mediaPlayer.MediaEnded += (s, e) =>
          {
            source.SetResult(true);
          };
          for (int i = 0; i < 100; i++)
          {
            var speechText = $"This is message number {i + 1}";
            source = new TaskCompletionSource<bool>();

            // Synthesize the text to an in-memory stream and hand it to the MediaPlayer.
            using (var speechStream = await synth.SynthesizeTextToStreamAsync(speechText))
            {
              mediaPlayer.Source = MediaSource.CreateFromStream(speechStream, speechStream.ContentType);
              mediaPlayer.Play();
            }
            // Wait for MediaEnded to fire, then leave a second of silence before the next message.
            await source.Task;
            await Task.Delay(1000);
          }
        }
      }
    }

Now, if I run that code on my PC then everything works as I would expect – I get 100 spoken messages separated by at least 1 second of silence.

However, as the reader pointed out, if I run this on Windows IoT Core on a Raspberry Pi (2 or 3) then each spoken message is preceded by a popping sound on the audio and it’s not something that you’d want to listen to in a real-world scenario.

I hadn’t come across this before and so did a bit of searching around and found this thread on the MSDN forums;

Clicking sound during start and stop of audio playback

and the upshot of that thread seems to be that the problem is caused by an issue in the Raspberry Pi firmware which isn’t going to be fixed, so there doesn’t seem to be a real solution there.

The thread does, though, suggest that this problem might be mitigated by using the AudioGraph APIs instead of using MediaPlayer as I’ve done in my code snippet above.

That proves to be a little more tricky though because the AudioGraph APIs seem to allow you to construct inputs from;

  • an audio capture device (AudioDeviceInputNode)
  • an audio file (AudioFileInputNode)
  • frames of audio data generated in code (AudioFrameInputNode)

and I don’t see an obvious way in which any of these can be used to model a stream of data which is what I get back when I perform Text To Speech using the SpeechSynthesizer class.

The only way to proceed would appear to be to copy the speech stream into some file stream and then have an AudioFileInputNode reading from that stream.

With that in mind, I tried to write code which would;

  1. Create a temporary file
  2. Create an audio graph consisting of a connection between
    1. An AudioFileInputNode representing my temporary file
    2. An AudioDeviceOutputNode for the default audio rendering device on the system
  3. Perform Text to Speech
  4. Write the resulting stream to the temporary file
  5. Have the AudioGraph notice that the input file had been written to, thereby causing it to play the media from that file out of the default audio rendering device on the system

and my aim here was to avoid;

  1. having to recreate the entire AudioGraph, or either of the two input/output nodes within it, for each piece of speech
  2. having to create a separate temporary file for every piece of speech
  3. having to create an ever-growing temporary file containing all the pieces of speech concatenated together

and I had hoped to be able to rely on the fact that the nodes in an AudioGraph (and the graph itself) all have Start/Stop/Reset methods.

In practice, I’ve yet to get this to really work. I can happily get an AudioFileInputNode to play audio from a file out through its connected output node. However, once that input node has finished playing I don’t seem to be able to find any combination of Start/Stop/Reset/Seek which will get it to play subsequent audio that arrives in the file when my code alters the file contents.
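
For concreteness, the sketch below shows the kind of reuse that I was hoping for. It leans on the SynthesizeTextToFileAsync and WaitForFileCompletedAsync extension methods that are listed later in this post and it uses the same graph/outputNode/temporaryFile/speechSynthesizer variables as the working code that follows – and it assumes (wrongly, as far as my testing to date goes) that some mix of Stop/Reset/Seek/Start will pick up content that is newly written to the file;

    // NB: a sketch of the approach that I could NOT get to work, included only to illustrate
    // the idea - it reuses a single AudioFileInputNode over the same temporary file.
    await speechSynthesizer.SynthesizeTextToFileAsync("This is message number 1", temporaryFile);

    var inputResult = await graph.CreateFileInputNodeAsync(temporaryFile);

    if (inputResult.Status == AudioFileNodeCreationStatus.Success)
    {
      using (var inputNode = inputResult.FileInputNode)
      {
        inputNode.AddOutgoingConnection(outputNode);
        await inputNode.WaitForFileCompletedAsync();

        for (int i = 1; i < 100; i++)
        {
          await Task.Delay(1000);

          // Overwrite the temporary file with the next piece of speech...
          await speechSynthesizer.SynthesizeTextToFileAsync(
            $"This is message number {i + 1}", temporaryFile);

          // ...and hope that rewinding and restarting the node picks up the new content
          // (it doesn't seem to, hence the code that follows in the post).
          inputNode.Stop();
          inputNode.Reset();
          inputNode.Seek(TimeSpan.Zero);
          inputNode.Start();

          await inputNode.WaitForFileCompletedAsync();
        }
      }
    }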

The closest that I’ve got to working code is what follows below where I create new AudioFileInputNode instances for each piece of speech that is to be spoken.

    async void OnLoaded(object sender, RoutedEventArgs args)
    {
      var temporaryFile = await TemporaryFileCreator.CreateTemporaryFileAsync();

      using (var speechSynthesizer = new SpeechSynthesizer())
      {
        var graphResult = await AudioGraph.CreateAsync(new AudioGraphSettings(AudioRenderCategory.Media));

        if (graphResult.Status == AudioGraphCreationStatus.Success)
        {
          using (var graph = graphResult.Graph)
          {
            var outputResult = await graph.CreateDeviceOutputNodeAsync();

            if (outputResult.Status == AudioDeviceNodeCreationStatus.Success)
            {
              graph.Start();

              using (var outputNode = outputResult.DeviceOutputNode)
              {
                for (int i = 0; i < 100; i++)
                {
                  var speechText = $"This is message number {i + 1}";

                  await speechSynthesizer.SynthesizeTextToFileAsync(speechText, temporaryFile);

                  // TBD: I want to avoid creating 100 input file nodes like this but I
                  // don't (yet) seem to be able to get away from it so, right now, I keep
                  // creating new input nodes over the same file, which changes on every
                  // iteration of the loop.
                  var inputResult = await graph.CreateFileInputNodeAsync(temporaryFile);

                  if (inputResult.Status == AudioFileNodeCreationStatus.Success)
                  {
                    using (var inputNode = inputResult.FileInputNode)
                    {
                      inputNode.AddOutgoingConnection(outputNode);
                      await inputNode.WaitForFileCompletedAsync();
                    }
                  }
                  await Task.Delay(1000);
                }
              }
              graph.Stop();
            }
          }
        }
      }
      await temporaryFile.DeleteAsync();
    }

and that code depends on a class that can create temporary files;

  public static class TemporaryFileCreator
  {
    public static async Task<StorageFile> CreateTemporaryFileAsync()
    {
      var fileName = $"{Guid.NewGuid()}.bin";

      var storageFile =
        await ApplicationData.Current.TemporaryFolder.CreateFileAsync(fileName);

      return (storageFile);
    }
  }

and also on an extension to the SpeechSynthesizer which will take the speech and write it to a file;

  public static class SpeechSynthesizerExtensions
  {
    public static async Task<StorageFile> SynthesizeTextToTemporaryFileAsync(this SpeechSynthesizer synthesizer, string text)
    {
      var storageFile = await TemporaryFileCreator.CreateTemporaryFileAsync();

      await SynthesizeTextToFileAsync(synthesizer, text, storageFile);

      return (storageFile);
    }
    public static async Task SynthesizeTextToFileAsync(this SpeechSynthesizer synthesizer, string text, StorageFile file)
    {
      using (var speechStream = await synthesizer.SynthesizeTextToStreamAsync(text))
      {
        using (var fileStream = await file.OpenAsync(FileAccessMode.ReadWrite))
        {
          await RandomAccessStream.CopyAndCloseAsync(speechStream, fileStream);
        }
      }
    }
  }
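
As an aside, the code in this post only ends up calling the second of those two methods – the first is just a convenience that creates its own temporary file and (hypothetically) would be used along these lines;

    using (var synthesizer = new SpeechSynthesizer())
    {
      // Synthesize straight into a brand new temporary file...
      var file = await synthesizer.SynthesizeTextToTemporaryFileAsync("This is a test message");

      // ...use the file (e.g. as the source for an AudioFileInputNode) and then tidy up.
      await file.DeleteAsync();
    }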

and also on an extension to the AudioFileInputNode class which takes the FileCompleted event that it fires and turns it into something that can be awaited;

  public static class AudioFileInputNodeExtensions
  {
    public static async Task WaitForFileCompletedAsync(this AudioFileInputNode inputNode)
    {
      TypedEventHandler<AudioFileInputNode, object> handler = null;
      TaskCompletionSource<bool> completed = new TaskCompletionSource<bool>();

      handler = (s, e) =>
      {
        s.FileCompleted -= handler;
        completed.SetResult(true);
      };
      inputNode.FileCompleted += handler;

      await completed.Task;
    }
  }

This code seems to work fine on both PC and Raspberry Pi but I find that on the Raspberry Pi I still get an audible ‘pop’ when the code first starts up, although I don’t then get an audible ‘pop’ for every piece of speech – i.e. it feels like the situation is improved but not perfect. I’d ideally like to;

  • Get rid of the code that ends up creating N AudioFileInputNode instances rather than 1 and somehow make the Start/Stop/Reset/Seek approach work.
  • Get rid of that initial audible ‘pop’.

I’ll update the post if I manage to come up with a better solution and do feel very free to add comments below if you know of either a solution to the original problem or a better solution than I’ve found to date…

Hitchhiking the HoloToolkit-Unity, Leg 9–Holes in the Walls

NB: The usual blog disclaimer for this site applies to posts around HoloLens. I am not on the HoloLens team. I have no details on HoloLens other than what is on the public web and so what I post here is just from my own experience experimenting with pieces that are publicly available and you should always check out the official developer site for the product documentation.

I must admit that two words which can sometimes strike fear into my heart are the words;

Case Study ;-)

I’m partially kidding but I’m not a huge fan of case studies which can sometimes be fairly dry write-ups of the form;

“Company C took technology T and solved problem P in time T and saved D dollars”

Of course, that sort of stuff is really important and it is always going to depend on the write-up but I don’t generally put reading case studies at the top of my to-do list.

Against that backdrop, I’ve been very pleasantly surprised by the really interesting HoloLens developer case studies that are published on this site;

HoloLens Developer Case Studies

and I’ve been working my way through them because they are much more at the level of;

“developer wanted to achieve X, this is how they went about it and the challenges they came across in doing it”

and one of the entries that I read quite a long time ago was this one about how to make holes in your walls, ceilings and floors;

“Case study – Looking through holes in your reality”

and it’s brilliant because this idea of being able to “look through surfaces” in HoloLens apps is one of the things that I’ve found to be truly magical across apps like Fragments, RoboRaid and others and the technique isn’t really a complicated one so it’s great to see some of the magic revealed here.

If you haven’t seen aliens coming through your walls in an app like RoboRaid then there’s a video here which shows it in action and also talks a little about what the device and the app developer are doing to pull off the illusion;

Inside of the HoloToolkit, there are some pieces that can help with experimenting with this type of effect and so I went off and got the toolkit (as per this video) and I brought the sections Build, Input, UI, Utilities and SpatialMapping into my project and set it up (as per this video).

Within the Utilities section of the toolkit there is a pre-baked scene called WindowOcclusion;

image

and that has a camera, 4 quads and 5 cubes set up to provide the ‘looking through a window’ effect that the Case Study talks about;

image

and those 4 quads are kind of ‘interesting’ in that they aren’t immediately visible here;

image

but they are being shaded so as to occlude the content behind them;

image

and so the user gets the illusion of looking through the window – the window is essentially the hole left by the 4 surrounding quads, which occlude everything else.

For me, that illusion works best if that window is positioned on a wall whereas this pre-baked scene places the window approx 1.7m in front of wherever the user was looking when the app started. It might be relatively simple though to use spatial mapping and the ‘tap to place’ behaviour to change that and have the user position the window onto a surface.

I thought I’d give that a spin…

Adding Tap to Place

I took the two components from that scene, collected them into an empty game object and then made a prefab from that in Unity called WindowAndContent.

image

and then I added a simple blank placeholder (at the origin) and a quad 2m in front of the origin into my scene. The idea of the quad is to give me something that I can tap and place onto a wall in order to position where I want my window (and the content beyond the window) to appear;

image

and so the intended process is going to be something like;

  • Create quad 2m in front of the user.
  • Allow the user to tap on the quad.
  • Have the quad follow the user’s gaze around their walls (this is the tap to place behaviour).
  • Allow the user to tap.
  • Remove the quad and replace it with the WindowAndContent prefab positioned at the same place and oriented the same way.

Positioning the quad and the window onto a wall involves spatial mapping and so I made sure that I had spatial perception switched on as a UWP capability and I added the SpatialMapping prefab from the HoloToolkit as you can see in the screenshot above.

I then gave my placeholder object a bunch of behaviours in order to facilitate using the Tap to Place script on my quad;

image

and then I added the Tap to Place script to that quad;

image

but I hacked that script ever so slightly in order to change two things;

  1. By default, the script makes the spatial mapping mesh visible when the object has been tapped and is following the user’s gaze but I didn’t want this so I took it out.
  2. I added a line or two of code such that when the object is placed, the script would fire a Placed event – there’s a sketch of that change below.
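
I won’t reproduce the whole modified TapToPlace script here but, purely as an illustration (the member names inside the toolkit’s script vary between versions so treat this as a sketch rather than the actual code), the change amounted to something like;

    // Sketch only - these members were added into the toolkit's existing TapToPlace script
    // (which needs a 'using System;' for EventHandler), not into a new component.
    public event EventHandler Placed;

    // Called at the point where the second air-tap finishes placing the object. Written for
    // the older C# compiler that Unity 5.5 uses, hence no null-conditional operator.
    void FirePlaced()
    {
      var handler = this.Placed;

      if (handler != null)
      {
        handler(this, EventArgs.Empty);
      }
    }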

I wanted that Placed event so that I could add a script to my quad as shown below;

image

and that script handles the Placed event on my modified TapToPlace component in order to try and get rid of the quad and replace it with the WindowAndContent prefab so that the window would appear where the quad had been positioned. There’s probably a better way of doing this but here’s the script that I used;

using HoloToolkit.Unity;
using UnityEngine;

public class QuadScript : MonoBehaviour
{
  // The WindowAndContent prefab to swap in once the quad has been placed.
  public Transform prefab;

  void Start()
  {
    // Hook the Placed event that I added to the toolkit's TapToPlace script.
    this.tapToPlace = this.GetComponent<TapToPlace>();
    this.tapToPlace.Placed += this.OnPlaced;
  }

  void OnPlaced(object sender, System.EventArgs e)
  {
    // We're done now.
    this.tapToPlace.Placed -= this.OnPlaced;
    this.tapToPlace = null;

    // Swap the quad for the window + content, keeping the position and orientation
    // that the user gave the quad when they placed it on the wall.
    var windowAndContent = Instantiate(prefab);
    windowAndContent.transform.parent = this.transform.parent;
    windowAndContent.transform.localPosition = this.transform.localPosition;
    windowAndContent.transform.forward = this.transform.forward;
    this.GetComponent<MeshRenderer>().enabled = false;
  }
  TapToPlace tapToPlace;
}

and, sure enough, I can now display a quad and then tap to position it on a wall in my environment and when I tap again it gets replaced with a window that I can look through into a ‘virtual world’ on the other side of that wall.

But that ‘virtual world’ is just a cube, I need something more interesting when I look through my window…

Making the View from the Window more Interesting

I figured that I’d make the window a bit larger and then would look at seeing if I could put some more interesting content on the other side of it.

I went out to the Unity Asset Store and found this set of town models and materials;

image

and it came with a nice scene or two demonstrating lots of the models and so I chopped one of those down in terms of the size of the scene and positioned it on the other side of my window.

Below is a screenshot of what I ended up with – you can see the scene and the relative position of my ‘window’ positioned such that everything in the scene is ‘in front’ of the window;

image

and this screenshot shows the reverse view of the quads shaded to occlude the buildings from the viewer on the other side of the window so that the window provides the only view;

image

and I baked all of this into my WindowAndContent prefab (replacing the existing cube) such that when I placed my quad on a wall this set of buildings would be instantiated on the other side of my window.

Trying this out, it all works surprisingly well.

What doesn’t work quite so well is showing how this looks in captured screenshots but hopefully the pictures below give an idea of two views through the same window.

It’s a very convincing effect when you’re actually using it – here’s a view taken from the right of the window. Note that you can see that I’ve left a little gap to the upper left of the window frame which needs closing in the Unity designer – i.e. that’s an artefact that I can fix rather than something from the device;

20161231_172822_HoloLens

and here’s a view when I’m standing more to the centre of the window;

20161231_172842_HoloLens

This is such a clever effect – I’m going to experiment some more but I’m also going to work my way through some more of those case studies – there’s a lot to learn…

HoloLens, Unity and Recognition with Vuforia (Part 2)

NB: The usual blog disclaimer for this site applies to posts around HoloLens. I am not on the HoloLens team. I have no details on HoloLens other than what is on the public web and so what I post here is just from my own experience experimenting with pieces that are publicly available and you should always check out the official developer site for the product documentation.

Following up from this post;

HoloLens, Unity and Recognition with Vuforia (Part 1)

I wanted to see if I could do some type of ‘custom’ object recognition using HoloLens and Vuforia starting from scratch rather than starting from the pre-baked Vuforia sample and following the steps outlined here;

Developing-Vuforia-Apps-for-HoloLens

I have this scenario in my head where I could combine HoloLens and Sphero as I did in this post;

Windows 10, UWP and Sphero–Bringing 2D UWP Demo Code to HoloLens

but then bring in Vuforia such that I could try and use Vuforia to locate the Sphero ball within a scene and then that would seem to give me the pieces such that I could have a HoloLens app where the Sphero ball did things like follow the user around the room.

That’s where I’m trying to head but I’m not yet sure whether Vuforia can recognise spheres for me so, in the meantime, I should perhaps just focus on seeing whether I can get a custom use of Vuforia to work – it’s always best to ‘start small’ and work towards the final goal :-)

Picking an Object to Recognise

First off, I needed to decide which object I wanted to recognise in my scene and so I went to the kitchen and found this box of biscuits (it’s Xmas!) which are very nice by the way;

WP_20161228_15_10_22_Pro

and so I thought I’d see if I could get Vuforia to identify that box of biscuits for me.

Creating a Vuforia Target Database

My first step here was to go to Vuforia’s license manager page and make sure that I’d created an app;

image

which I’d done for my previous post and I then went to the ‘Target Manager’ tab and created a target database;

image

and you’re then presented with a choice of whether you’re doing [device/cloud/VuMark] recognition – I chose device. Once I’d got a database, I could add targets to it of type [image/cuboid/cylinder/object] and so I chose cuboid and provided some details of my biscuits;

image

What I really wasn’t sure of here was what dimensions I was supposed to be using and whether they needed to relate to the real-world size of the box. I read one or two forum posts (like this one) but they still didn’t really leave me with clarity around what I was meant to do in terms of units. My box is 8cm (wide) by 7cm (deep) by 15cm (tall) but I wasn’t confident that I was telling the Target Manager this correctly.

Having done this part, my biscuit box details are flagged as ‘incomplete’;

image

and so I went and provided more details. I then got a bit bogged down because the uploader seemed to want images that matched the aspect ratios that I had given it – e.g. 8/15 = 0.533 whereas my image was 1417/2701 = 0.524 – and so I resized the images a little to try and come closer to the right aspect ratio and the tool ultimately backed down and let me win ;-)

image

At that point, it looks like I can use the website to download the database containing this set of (1) targets;

image

and the download here gave me a Unity package (called TestDevice) and so it’s time to perhaps move across to Unity and see if I can do something with it.

I need to admit at this point that I went around a ‘loop or two’ coming up with the walkthrough below – it maybe took me 4-6 hours as I was finding that my projects weren’t working. Some of that was due to me being new to using Vuforia and some of it was just one of those classic situations where you have one lump of code that works, another lump of code that doesn’t, and you’re trying to figure out the delta between the two.

Ultimately, the following set of (seemingly simple!) steps is what dropped out of my experimenting…

Making a Unity Project and Importing Packages

I made a blank HoloLens project in Unity 5.5 and then imported 3 different Unity packages as below;

  1. HoloToolkit-Unity: I imported the Build, Input, UI, Spatial Mapping and Utilities pieces.
  2. Vuforia SDK 6-2-6: I imported everything other than the pieces clearly labelled iOS and Android much as I did in my previous blog post.
  3. The unity package called TestDevice that I’d just downloaded from the Vuforia site containing my Shortbread model.

I then set up my project as I do at the start of this video so as to configure the project and the scene for HoloLens development.

I also made sure that I had switched on the UWP capabilities to allow internet connection, spatial perception and webcam although at the time of writing I’m unsure whether I need them all.

I also made sure that ‘Virtual Reality Supported’ was switched on (as usual);

image

Setting up the Vuforia Configuration

I then went and used the (added) Vuforia menu in Unity to open up the configuration and I changed the highlighted options below;

image

Setting up the Vuforia Camera

I then dragged out the Vuforia prefab ARCamera to my scene and made sure that I’d set it up as recommended by altering the highlighted options below;

image

and so that tells the Vuforia camera about the HoloLens camera.

Setting up a Multi Target Behaviour

At this point, I got a little stuck :-)

The essence of this was really a ‘category error’ on my part – because I thought that I was doing ‘object recognition’, I kept adding an ‘Object Target Behaviour’ script and attempting to point it at my database, and the editor didn’t like me trying to do it;

image

So I went off and read the doc page here and realised that while I may be thinking of my box of biscuits as ‘object recognition’ Vuforia doesn’t really think of it that way and reserves ‘object recognition’ for scenarios where I’ve actually scanned a 3D object (the example on the doc page is a toy).

I then spent a little time wondering whether Vuforia might want me to do ‘image recognition’ but that didn’t seem to fit my cuboid (box of biscuits) scenario. Reading this doc page made me realise that Vuforia calls this cuboid scenario a ‘multi target’ and so I was supposed to be adding a ‘Multi Target Behaviour’ onto my object. Things then became a bit clearer and I later realised that if I’d read this document beforehand then I might have had an easier time :-)

I dragged out a ‘Multi Target’ prefab onto my scene;

image

and configured it to use my database and my object and switched on the extended tracking;

image

Comparing with the Vuforia Unity Sample

I didn’t get everything working ‘first time’ :-) and so I had reason to compare what I was doing here with the original Vuforia sample that I looked at in my previous blog post, and I noticed that the Vuforia camera had a script on it;

image

which seems to set the frame rate down to 30fps. I’m unsure whether this is necessary or not but, at the time of writing, my demo seems to be running without having to include this script – although perhaps it could run better with it? That’s still ‘To Be Determined’.
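
I haven’t dug into what that script actually does but the effect described would amount to something like the snippet below in standard Unity API terms – this is a sketch of the general idea rather than the Vuforia sample’s own code;

using UnityEngine;

// Sketch only - roughly what a 'cap rendering at 30fps' script would look like;
// the script in the Vuforia sample may well do more than this.
public class ThirtyFpsCap : MonoBehaviour
{
  void Start()
  {
    // Disable vsync so that targetFrameRate is respected...
    QualitySettings.vSyncCount = 0;

    // ...and ask Unity to render at no more than 30 frames per second.
    Application.targetFrameRate = 30;
  }
}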

First Trial

With that setup I deployed to the device and watched the debug output in Visual Studio to see if I was able to track my Shortbread target.

Note – at this point, I’d actually been around this loop 4-5 times as I’d messed things up a few times and tried out various different routes before settling on the set of steps that I’ve written up here, which now feel fairly simple and obvious compared to some of the things that I somewhat randomly tried to get this ‘hello world’ up and running :-)

And, sure enough, I spotted the debug spew that seemed to show that Vuforia was spotting my box of shortbread;

image

This comes from the Default Trackable Event Handler script that comes as part of the MultiTarget prefab;

image

which picks up the TrackableBehaviour component and has some default event handlers which switch any child renderers and colliders on/off as the object is tracked/lost, and it seemed to make sense to use those to add something into the scene to visualise what Vuforia was tracking.
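
For anyone who hasn’t looked inside that script, the pattern it follows is roughly the one sketched below – I’m writing this from memory of the Vuforia 6.2 scripts so treat the exact names as an approximation; the real DefaultTrackableEventHandler toggles child renderers/colliders whereas this sketch just logs the transitions;

using UnityEngine;
using Vuforia;

// Sketch of the pattern - register with the TrackableBehaviour and react to status changes.
public class MyTrackableHandler : MonoBehaviour, ITrackableEventHandler
{
  TrackableBehaviour trackableBehaviour;

  void Start()
  {
    this.trackableBehaviour = this.GetComponent<TrackableBehaviour>();

    if (this.trackableBehaviour != null)
    {
      this.trackableBehaviour.RegisterTrackableEventHandler(this);
    }
  }
  public void OnTrackableStateChanged(
    TrackableBehaviour.Status previousStatus,
    TrackableBehaviour.Status newStatus)
  {
    if ((newStatus == TrackableBehaviour.Status.DETECTED) ||
        (newStatus == TrackableBehaviour.Status.TRACKED) ||
        (newStatus == TrackableBehaviour.Status.EXTENDED_TRACKED))
    {
      Debug.Log("Trackable " + this.trackableBehaviour.TrackableName + " found");
    }
    else
    {
      Debug.Log("Trackable " + this.trackableBehaviour.TrackableName + " lost");
    }
  }
}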

Highlighting the Tracked Object

I went back into Unity and added a simple cube to surround the Shortbread box making it slightly larger than the biscuit box that it is meant to surround;

image

and I went and borrowed a wireframe shader from the UCLA Game Lab;

image

and used that to shade my cube;

image

and then tried that out on HoloLens, capturing the output below where you can see that the biscuit box is picked up pretty well;

So, in the end, getting that basic scenario up and running wasn’t too difficult at all. I’d like to try something ‘more imaginative’ as a follow-on but that’ll have to be in another post…