Conversations with the Language Understanding (LUIS) Service from Unity in Mixed Reality Apps

I’ve written quite a bit on this blog about speech interactions in the past and elsewhere, like these articles that I wrote for the Windows blog a couple of years ago;

Using speech in your UWP apps – It’s good to talk

Using speech in your UWP apps – From talking to conversing

Using speech in your UWP apps – Look who’s talking

which came out of earlier investigations that I did for this blog like this post;

Speech to Text (and more) with Windows 10 UWP & ‘Project Oxford’

and we talked about Speech in our Channel9 show again a couple of years ago now;

image

and so I won’t rehash the whole topic of speech recognition and understanding here but, in the last week, I’ve been working on a fairly simple scenario that I thought I would share the code from.

Backdrop – the Scenario

The scenario involved a Unity application built against the “Stable .NET 3.5 Equivalent” scripting runtime which targets both HoloLens and immersive Windows Mixed Reality headsets where there was a need to use natural language instructions inside of the app.

That is, there’s a need to;

  1. grab audio from the microphone.
  2. turn the audio into text.
  3. take the text and derive the user’s intent from the spoken text.
  4. drive some action inside of the application based on that intent.

It’s fairly generic (although the specific application is quite exciting) but, in order to get this implemented, there are some choices to make around technologies/APIs and around whether functionality happens in the cloud or at the edge.

Choices

When it comes to (2) there are a couple of choices in that there are layered Unity/UWP APIs that can make this happen, and the preference in this scenario would be to use the Unity APIs – the KeywordRecognizer and the DictationRecognizer – for handling short/long chunks of speech respectively.

Those APIs are packaged so as to wait for a reasonable, configurable period of time for some speech to occur before delivering a ‘speech occurred’ type event to the caller, passing the text that has been interpreted from the speech.

There’s no cost (beyond on-device resources) to using these APIs and so in a scenario which only went as far as speech-to-text it’d be quite reasonable to have these types of APIs running all the time gathering up text and then having the app decide what to do with it.

However, when it comes to (3), the API of choice is LUIS which can take a piece of text like;

“I’d like to order a large pepperoni pizza please”

and can turn it into something like;

Intent: OrderPizza

Entity: PizzaType (Pepperoni)

Entity: Size (Large)

Confidence: 0.85

and so it’s a very useful thing as it takes the task of fathoming all the intricacies of natural language away from the developer.

This poses a bit of a challenge though for a ‘real time’ app in that it’s not reasonable to take every speech utterance that the user delivers and run it through the LUIS cloud service. There are a number of reasons for that, including;

  1. The round-trip time from the client to the service is likely to be fairly long and so, without care, the app would have many calls in flight leading to problems with response time and complicating the code and user experience.
  2. The service has a financial cost.
  3. The user may not expect or want all of their utterances to be run through the cloud.

Consequently, it seems sensible to have some trigger in an app which signifies that the user is about to say something that is of meaning to the app and which should be sent off to the LUIS service for examination. In short, it’s the;

“Hey, Cortana”

type key phrase that lets the system know that the user has something to say.

This can be achieved in a Unity app targeting .NET 3.5 by having the KeywordRecognizer class work in conjunction with the DictationRecognizer class such that the former listens for the speech keyword (‘hey, Cortana!’) and the latter then springs into life and listens for the dictated phrase that the user wants to pass on to the app.

As an aside, it’s worth flagging that these classes are only supported by Unity on Windows 10 as detailed in the docs and that there is an isSupported flag to let the developer test this at runtime.
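
As a minimal sketch of that runtime check (the class name here is mine), something like the following can guard against trying to use the recognizers where they aren’t supported;

using UnityEngine;
using UnityEngine.Windows.Speech;

public class SpeechSupportCheck : MonoBehaviour
{
    void Start()
    {
        // PhraseRecognitionSystem sits underneath both KeywordRecognizer and
        // DictationRecognizer so its isSupported flag is a reasonable guard
        // to check before creating either recognizer.
        if (!PhraseRecognitionSystem.isSupported)
        {
            Debug.Log("Speech recognition isn't supported here, voice commands disabled.");
        }
    }
}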

There’s another aside to using these two classes together in that the docs here note that different types of recognizer cannot be instantiated at once – they rely on an underlying PhraseRecognitionSystem and that system has to be Shutdown in order to switch between one type of recognizer and another.

Later on in the post, I’ll return to the idea of making a different choice around turning speech to text but for the moment, I moved forward with the DictationRecognizer.

Getting Something Built

Some of that took a little while to figure out but once it’s sorted it’s “fairly” easy to write some code in Unity which uses a KeywordRecognizer to switch on/off a DictationRecognizer in an event-driven loop so as to gather dictated text.

I chose to have the notion of a DictationSink which is just something that receives some text from somewhere. It could have been an interface but I thought that I’d bring in MonoBehaviour;

using UnityEngine;

public class DictationSink : MonoBehaviour
{
    public virtual void OnDictatedText(string text)
    {
    }
}

and so then I can write a DictationSource which surfaces a few properties from the underlying DictationRecognizer and passes on recognized text to a DictationSink;

using System;
using UnityEngine;
using UnityEngine.Windows.Speech;

public class DictationSource : MonoBehaviour
{
    public event EventHandler DictationStopped;

    public float initialSilenceSeconds;
    public float autoSilenceSeconds;
    public DictationSink dictationSink;
   
    // TODO: Think about whether this should be married with the notion of
    // a focused object rather than just some 'global' entity.

    void NewRecognizer()
    {
        this.recognizer = new DictationRecognizer();
        this.recognizer.InitialSilenceTimeoutSeconds = this.initialSilenceSeconds;
        this.recognizer.AutoSilenceTimeoutSeconds = this.autoSilenceSeconds;
        this.recognizer.DictationResult += OnDictationResult;
        this.recognizer.DictationError += OnDictationError;
        this.recognizer.DictationComplete += OnDictationComplete;
        this.recognizer.Start();
    }
    public void Listen()
    {
        this.NewRecognizer();
    }
    void OnDictationComplete(DictationCompletionCause cause)
    {
        this.FireStopped();
    }
    void OnDictationError(string error, int hresult)
    {
        this.FireStopped();
    }
    void OnDictationResult(string text, ConfidenceLevel confidence)
    {
        this.recognizer.Stop();

        if (((confidence == ConfidenceLevel.Medium) ||
            (confidence == ConfidenceLevel.High)) &&
            (this.dictationSink != null))
        {
            this.dictationSink.OnDictatedText(text);
        }
    }
    void FireStopped()
    {
        this.recognizer.DictationComplete -= this.OnDictationComplete;
        this.recognizer.DictationError -= this.OnDictationError;
        this.recognizer.DictationResult -= this.OnDictationResult;
        this.recognizer = null;

        // https://docs.microsoft.com/en-us/windows/mixed-reality/voice-input-in-unity
        // The challenge we have here is that we want to use both a KeywordRecognizer
        // and a DictationRecognizer at the same time or, at least, we want to stop
        // one, start the other and so on.
        // Unity does not like this. It seems that we have to shut down the 
        // PhraseRecognitionSystem that sits underneath them each time but the
        // challenge then is that this seems to stall the UI thread.
        // So far (following the doc link above) the best plan seems to be to
        // not call Stop() on the recognizer or Dispose() it but, instead, to
        // just tell the system to shutdown completely.
        PhraseRecognitionSystem.Shutdown();

        if (this.DictationStopped != null)
        {
            // And tell any friends that we are done.
            this.DictationStopped(this, EventArgs.Empty);
        }
    }
    DictationRecognizer recognizer;
}

Notice in that code my attempt to use PhraseRecognitionSystem.Shutdown() to really stop this recognizer when I’ve processed a single speech utterance from it.

I need to switch this recognition on/off in response to a keyword being spoken by the user and so I wrote a simple KeywordDictationSwitch class which tries to do this using KeywordRecognizer with a few keywords;

using System.Linq;
using UnityEngine;
using UnityEngine.Windows.Speech;

public class KeywordDictationSwitch : MonoBehaviour
{
    public string[] keywords = { "ok", "now", "hey", "listen" };
    public DictationSource dictationSource;

    void Start()
    {
        this.NewRecognizer();
        this.dictationSource.DictationStopped += this.OnDictationStopped;
    }
    void NewRecognizer()
    {
        this.recognizer = new KeywordRecognizer(this.keywords);
        this.recognizer.OnPhraseRecognized += this.OnPhraseRecognized;
        this.recognizer.Start();
    }
    void OnDictationStopped(object sender, System.EventArgs e)
    {
        this.NewRecognizer();
    }
    void OnPhraseRecognized(PhraseRecognizedEventArgs args)
    {
        if (((args.confidence == ConfidenceLevel.Medium) ||
            (args.confidence == ConfidenceLevel.High)) &&
            this.keywords.Contains(args.text.ToLower()) &&
            (this.dictationSource != null))
        {
            this.recognizer.OnPhraseRecognized -= this.OnPhraseRecognized;
            this.recognizer = null;

            // https://docs.microsoft.com/en-us/windows/mixed-reality/voice-input-in-unity
            // The challenge we have here is that we want to use both a KeywordRecognizer
            // and a DictationRecognizer at the same time or, at least, we want to stop
            // one, start the other and so on.
            // Unity does not like this. It seems that we have to shut down the 
            // PhraseRecognitionSystem that sits underneath them each time but the
            // challenge then is that this seems to stall the UI thread.
            // So far (following the doc link above) the best plan seems to be to
            // not call Stop() on the recognizer or Dispose() it but, instead, to
            // just tell the system to shutdown completely.
            PhraseRecognitionSystem.Shutdown();

            // And then start up the other system.
            this.dictationSource.Listen();
        }
        else
        {
            Debug.Log(string.Format("Dictation: Listening for keyword {0}, heard {1} with confidence {2}, ignored",
                this.keywords,
                args.text,
                args.confidence));
        }
    }
    void StartDictation()
    {
        this.dictationSource.Listen();
    }
    KeywordRecognizer recognizer;
}

and once again I’m going through some steps to try and switch the KeywordRecognizer on/off here so that I can then switch the DictationRecognizer on/off as simply calling Stop() on a recognizer isn’t enough.

With this in place, I can now stack these components in Unity and have them use each other;

image

and so now I’ve got some code that listens for keywords, switches dictation on, listens for dictation and then passes that on to some DictationSink.

That’s a nice place to implement some LUIS functionality.

In doing so, I ended up writing perhaps more code than I’d have liked as I’m not sure whether there is a LUIS client library that works from a Unity environment targeting the Stable .NET 3.5 subset. I’ve found this to be a challenge with calling a few Azure services from Unity and LUIS doesn’t seem to be an exception – there are client libraries on NuGet for most scenarios but I don’t think that they work in Unity (I could be wrong) and there aren’t generally examples/samples for Unity.

So…I rolled some small pieces of my own here, which isn’t so hard given that the call we need to make to LUIS is just a REST call.

Based on the documentation around the most basic “GET” functionality as detailed in the LUIS docs here,  I wrote some classes to represent the LUIS results;

using System;
using System.Linq;

namespace LUIS.Results
{
    [Serializable]
    public class QueryResultsIntent
    {
        public string intent;
        public float score;
    }
    [Serializable]
    public class QueryResultsResolution
    {
        public string[] values;

        public string FirstOrDefaultValue()
        {
            string value = string.Empty;
            
            if (this.values != null)
            {
                value = this.values.FirstOrDefault();
            }
            return (value);
        }
    }
    [Serializable]
    public class QueryResultsEntity
    {
        public string entity;
        public string type;
        public int startIndex;
        public int endIndex;
        public QueryResultsResolution resolution;

        public string FirstOrDefaultResolvedValue()
        {
            var value = string.Empty;

            if (this.resolution != null)
            {
                value = this.resolution.FirstOrDefaultValue();
            }

            return (value);
        }
        public string FirstOrDefaultResolvedValueOrEntity()
        {
            var value = this.FirstOrDefaultResolvedValue();

            if (string.IsNullOrEmpty(value))
            {
                value = this.entity;
            }
            return (value);
        }
    }
    [Serializable]
    public class QueryResults
    {
        public string query;
        public QueryResultsEntity[] entities;
        public QueryResultsIntent topScoringIntent;
    }
}

and then wrote some code to represent a Query of the LUIS service. I wrote this on top of pieces that I borrowed from my colleague, Dave’s, repo over here in github which provides some Unity compatible REST pieces with JSON serialization etc.

using LUIS.Results;
using RESTClient;
using System;
using System.Collections;

namespace LUIS
{
    public class Query
    {
        string serviceBaseUrl;
        string serviceKey;

        public Query(string serviceBaseUrl,
            string serviceKey)
        {
            this.serviceBaseUrl = serviceBaseUrl;
            this.serviceKey = serviceKey;
        }
        public IEnumerator Get(Action<IRestResponse<QueryResults>> callback)
        {
            var request = new RestRequest(this.serviceBaseUrl, Method.GET);

            request.AddQueryParam("subscription-key", this.serviceKey);
            request.AddQueryParam("q", this.Utterance);
            request.AddQueryParam("verbose", this.Verbose.ToString());
            request.UpdateRequestUrl();

            yield return request.Send();

            request.ParseJson<QueryResults>(callback);
        }        
        public bool Verbose
        {
            get;set;
        }
        public string Utterance
        {
            get;set;
        }
    }
}

and so now I can Query LUIS and get results back and so it’s fairly easy to put this into a DictationSink which passes the dictated speech in text form off to LUIS;

using LUIS;
using LUIS.Results;
using System;
using System.Linq;
using UnityEngine.Events;

[Serializable]
public class QueryResultsEventType : UnityEvent<QueryResultsEntity[]>
{
}

[Serializable]
public class DictationSinkHandler
{
    public string intentName;
    public QueryResultsEventType intentHandler;
}

public class LUISDictationSink : DictationSink
{
    public float minimumConfidenceScore = 0.5f;
    public DictationSinkHandler[] intentHandlers;
    public string luisApiEndpoint;
    public string luisApiKey;

    public override void OnDictatedText(string text)
    {
        var query = new Query(this.luisApiEndpoint, this.luisApiKey);

        query.Utterance = text;

        StartCoroutine(query.Get(
            results =>
            {
                if (!results.IsError)
                {
                    var data = results.Data;

                    if ((data.topScoringIntent != null) &&
                        (data.topScoringIntent.score > this.minimumConfidenceScore))
                    {
                        var handler = this.intentHandlers.FirstOrDefault(
                            h => h.intentName == data.topScoringIntent.intent);

                        if (handler != null)
                        {
                            handler.intentHandler.Invoke(data.entities);
                        }
                    }
                }
            }
        ));
    }
}

and this is really just a map – it takes a look at the confidence score provided by LUIS, makes sure that it is high enough for our purposes and then looks up the name of the top-scoring LUIS intent to find a function which handles that intent, set up here as a UnityEvent&lt;T&gt; so that it can be configured in the editor.

So, in use if I have some LUIS model which has intents named Create, DeleteAll and DeleteType then I can configure up an instance of this LUISDictationSink in Unity as below to map these to functions inside of a class named LUISIntentHandlers in this case;

image

and then a handler for this type of interaction might look something like;

    public void OnIntentCreate(LUIS.Results.QueryResultsEntity[] entities)
    {
        // We need two pieces of information here - the shape type and
        // the distance.
        var entityShapeType = entities.FirstOrDefault(e => e.type == "shapeType");
        var entityDistance = entities.FirstOrDefault(e => e.type == "builtin.number");

	// ...
    }
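
Purely as a sketch of how those entity values might then be consumed (the helper class and the 2 metre fallback below are mine – the entity type names just mirror the ones above), the string values can be pulled out via the FirstOrDefaultResolvedValueOrEntity() method defined earlier and parsed;

using System.Linq;
using LUIS.Results;

public static class EntityParsing
{
    // Pulls a shape type and a distance (in metres) out of the entities that
    // LUIS returned for the 'Create' intent. Falls back to 2 metres if no
    // usable number was recognised - an arbitrary choice for this sketch.
    public static void GetShapeAndDistance(
        QueryResultsEntity[] entities, out string shapeType, out float distance)
    {
        var entityShapeType = entities.FirstOrDefault(e => e.type == "shapeType");
        var entityDistance = entities.FirstOrDefault(e => e.type == "builtin.number");

        shapeType = entityShapeType == null ?
            string.Empty : entityShapeType.FirstOrDefaultResolvedValueOrEntity();

        distance = 2.0f;

        if (entityDistance != null)
        {
            float parsed;

            if (float.TryParse(entityDistance.FirstOrDefaultResolvedValueOrEntity(), out parsed))
            {
                distance = parsed;
            }
        }
    }
}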

and this all works fine and completes the route that goes from;

keyword recognition –> start dictation –> end dictation –> LUIS –> intent + entities –> handler in code –> action

Returning to Choices – Multi-Language & Dictation in the Cloud

I now have some code that works and it feels like the pieces are in the ‘best’ place in that I’m running as much as possible on the device and hopefully only calling the cloud when I need to. That said, if I could get the capabilities of LUIS offline and run them on the device then I’d like to do that too but it’s not something that I think you can do right now with LUIS.

However, there is one limit to what I’m currently doing which isn’t immediately obvious: it doesn’t offer the possibility of non-English languages and, specifically, on HoloLens (as far as I know) the recognizer classes only offer English support.

So, to support other languages I’d need to do my speech to text work via some other route – I can’t rely on the DictationRecognizer alone.

As an aside, it’s worth saying that I think multi-language support would need more work than just getting the speech to text to work in another language.

I think it would also require building a LUIS model in another language but that’s something that could be done.

An alternate way of performing speech-to-text that does support multiple languages would be to bring in a cloud-powered speech-to-text API like the Cognitive Services Speech API, and I could bring that into my code here by wrapping it up as a new type of DictationSource.

That speech-to-text API has some different ways of working. Specifically, it can perform speech-to-text by;

  • Submitting an audio file in a specified format to a REST endpoint and getting back text.
  • Opening a websocket and sending chunks of streamed speech data up to the service to get back responses.

Of the two, the second has the advantage that it can be a bit smarter around detecting silence in the stream and it can also offer interim ‘hypotheses’ around what is being said before it delivers its ultimate view of what the utterance was. It can also support longer sections of speech than the file-based method.

So, this feels like a good way to go as an alternate DictationSource for my code.
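
Because of the DictationSource/DictationSink split above, a cloud-powered route would just be another source feeding the same sink. As a skeleton only (the class name is mine and the actual calls to the Speech API are deliberately left out – that’s the piece I haven’t written), it might look like;

using UnityEngine;

public class CloudDictationSource : MonoBehaviour
{
    public DictationSink dictationSink;

    public void Listen()
    {
        // TODO: capture audio from the microphone and stream it to the
        // cloud speech-to-text service (REST or websocket), then call
        // OnCloudRecognitionResult with the text that comes back.
    }

    // Invoked when the service hands back its final text for the utterance.
    void OnCloudRecognitionResult(string text)
    {
        if (this.dictationSink != null)
        {
            this.dictationSink.OnDictatedText(text);
        }
    }
}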

However, making use of that API requires sending a stream of audio data to the cloud down a websocket in a format that is compatible with the service on the other end of the wire and that’s code I’d like to avoid writing. Ideally, it feels like the sort of code that one developer who was close to the service would write once and everyone would then re-use.

That work is already done if you’re using the service from .NET and you’re in a situation where you can make use of the client library that wraps up the service access but I don’t think that it’s going to work for me from Unity when targeting the “Stable .NET 3.5 Equivalent” scripting runtime.

So…for this post, I’m going to leave that as a potential ‘future exercise’ that I will try to return back to if time permits and I’ll update the post if I do so.

In the meantime, here’s the code.

Code

If you’re interested in the code then it’s wrapped up in a simple Unity project that’s here on github;

http://github.com/mtaulty/LUISPlayground

That code is coupled to a LUIS service which has some very basic intents and entities around creating simple Unity game objects (spheres and cubes) at a certain distance in front of the user. It’s very rough.

There are three intents inside of this service. One is intended to create objects with utterances like “I want to create a cube 2 metres away”;

image

and then it’s possible to delete everything that’s been created with a simple utterance;

image

and lastly it’s possible to get rid of just the spheres/cubes with a different intent such as “get rid of all the cubes”;

image

If you wanted to make the existing code run then you’d need an API endpoint and a service key for such a service and so I’ve exported the service itself from LUIS as a JSON export into this file in the repo;

image

so it should be possible to go to the LUIS portal and import that as a service;

image

and then plug in the endpoint and service key into the code here;

image

Rough Notes on UWP and webRTC (Part 4–Adding some Unity and a little HoloLens)

Following up on my previous post, I wanted to take the very basic test code that I’d got working ‘reasonably’ on UWP on my desktop PC and see if I could move it to HoloLens running inside of a Unity application.

The intention would be to preserve the very limited functionality that I have which goes something like;

  • The app runs up, is given the details of the signalling service (from the PeerCC sample) to connect to and it then connects to it
  • The app finds a peer on the signalling service and tries to get a two-way audio/video call going with that peer displaying local/remote video and capturing local audio while playing remote audio.

That’s what I currently have in the signalling branch here and the previous blog post was about abstracting some of that out such that I could use it in a different environment like Unity.

Now it’s time to see if that works out…

Getting Some Early Bits to Work With

In order to even think about this I needed to try and pick up a version of UWP webRTC that works in that environment and which has some extra pieces to help me out. As far as I know, at the time of writing that involves picking up bits that are mentioned in this particular issue over on github by the UWP webRTC team;

Expose “raw” MediaCapture Object #11

and there are instructions in that post around how to get hold of some bits;

Instructions for Getting Bits

and so I followed those instructions and built the code from that branch of that repo.

From there, I’ve been working with my colleague Pete to put together some of those pieces with the pieces that I already had from the previous blog posts.

First, a quick look around the bits that the repo gives us…

Exploring the WebRtcUnity PeerCC Sample Solution

As is often the case, this process looks like it is going to involve standing on the shoulder of some other giants because there’s already code in the UWP webRTC repo that I pointed to above that shows how to put this type of app together.

The code in question is surfaced through this solution in the repo;

image

Inside of that solution, there’s a project which builds out the equivalent of the original XAML+MediaElement PeerCC sample but in a modified way which doesn’t have to use MediaElement to render, and that shift in the code is represented by its additional Unity dependencies;

image

This confused me for a little while – I was wondering why this XAML-based application suddenly had a big dependency on Unity until I realised that, to show that media can be rendered by Unity, the original sample code has been modified such that (dependent on the conditional compilation constant UNITY) this app can render media streams either;

  1. Using MediaElement as it did previously
  2. Using Unity rendering pieces which are then hosted inside of a SwapChainPanel inside of the XAML UI.

Now, I’ve failed to get this sample to run on my machine (which I think is down to the versions of Unity that I’m running) and so I had to go through a process of picking through the code a little ‘cold’ but, in so far as I can see, there are a couple of subprojects involved in making this work…

The Org.WebRtc.Uwp Project

This project was already present in the original XAML-based solution and in my mind this is involved with wrapping some C++/CX code around the webrtc.lib library in order to bring types into a UWP environment. I haven’t done a delta to try and see how much/little is different in this branch of this project over the original sample so there may be differences.

image

The MediaEngineUWP and WebRtcScheme Projects

Then there are 2 projects within the Unity sample’s MediaEngine folder which I don’t think were present in the original, purely XAML-based PeerCC sample;

image

The MediaEngineUWP and WebRtcScheme projects build out DLLs which seem to take on a couple of roles, although I’m more than willing to admit that I don’t have this all worked out in my head at the time of writing. I think they are about bridging between the Unity platform, the Windows Media platform and webRTC and I think they do this by;

  • The existing work in the Org.WebRtc.Uwp project which integrates webRTC pieces into the Windows UWP media pipeline. I think this is done by adding a webRTC VideoSinkInterface which then surfaces the webRTC pieces as the UWP IMediaSource and IMediaStreamSource types.
  • The MediaEngineUWP.dll exporting a UnityPluginLoad function which grabs an IUnityGraphics, and offering a number of other exports that can be called via PInvoke from Unity to set up the textures that code inside this DLL then uses to render local/remote video frames in Unity.
    • There’s a class in this project named MediaEnginePlayer which is instanced per video stream and which seems to do the work of grabbing frames from the incoming Windows media pipeline and transferring them into Unity textures.
    • The same class looks to use the IMFMediaEngineNotify callback interface to be notified of state changes for the media stream and responds by playing/stopping etc.

The wiring together of this MediaEnginePlayer into the media pipeline is a little opaque to me but I think that it follows what is documented here and under the topic Source Resolver here. This seems to involve the code associating a URL (of the form webrtc:GUID) with each IMediaStream and having an activatable class which the media pipeline then invokes with the URL so as to be linked up to the right instance of the player.

That may be a ‘much less than perfect’ description of what goes on in these projects as I haven’t stepped through all of that code.

What I think it does mean though is that the code inside of the WebRtcScheme project requires that the .appxmanifest for an app that consumes it needs to include a section that looks like;

 <Extensions>
    <Extension Category="windows.activatableClass.inProcessServer">
      <InProcessServer>
        <Path>WebRtcScheme.dll</Path>
        <ActivatableClass ActivatableClassId="WebRtcScheme.SchemeHandler" ThreadingModel="both" />
      </InProcessServer>
    </Extension>
  </Extensions>

I don’t know of a way of setting this up inside of a Unity project so I ended up just letting Unity build the Visual Studio solution and then manually hacking the manifest to include this section.
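
One route I haven’t tried (so treat this purely as a sketch – the manifest path and the idea of splicing the XML in as text are assumptions on my part) would be to patch the generated manifest from a Unity post-build callback rather than hand-editing it after every build;

#if UNITY_EDITOR
using System.IO;
using UnityEditor;
using UnityEditor.Callbacks;

public static class ManifestPatcher
{
    // The same registration shown above, collapsed into a single string.
    const string ExtensionsXml =
        "<Extensions>" +
        "<Extension Category=\"windows.activatableClass.inProcessServer\">" +
        "<InProcessServer>" +
        "<Path>WebRtcScheme.dll</Path>" +
        "<ActivatableClass ActivatableClassId=\"WebRtcScheme.SchemeHandler\" ThreadingModel=\"both\" />" +
        "</InProcessServer>" +
        "</Extension>" +
        "</Extensions>";

    // Runs after Unity generates the Visual Studio solution. The manifest path
    // below is an assumption - it depends on the product name and the layout
    // of the generated project.
    [PostProcessBuild]
    public static void OnPostprocessBuild(BuildTarget target, string pathToBuiltProject)
    {
        if (target != BuildTarget.WSAPlayer)
        {
            return;
        }
        var manifestPath = Path.Combine(
            Path.Combine(pathToBuiltProject, PlayerSettings.productName),
            "Package.appxmanifest");

        if (File.Exists(manifestPath))
        {
            var manifest = File.ReadAllText(manifestPath);

            // Splice the extension in just before the closing Application element
            // if it isn't already present.
            if (!manifest.Contains("WebRtcScheme.SchemeHandler"))
            {
                manifest = manifest.Replace("</Application>", ExtensionsXml + "</Application>");
                File.WriteAllText(manifestPath, manifest);
            }
        }
    }
}
#endif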

Exploring the Video Control Solution

I looked into another project within that github repo which is a Unity project contained within this folder;

image

There’s a Unity scene which has a (UI) Canvas and a couple of Unity Raw Image objects which can be used to render to;

image

and a Control script which is set up to PInvoke into the MediaEngineUWP to pass the pieces from the Unity environment into the DLL. That script looks like this;

using System;
using System.Runtime.InteropServices;
using UnityEngine;
using UnityEngine.UI;

#if !UNITY_EDITOR
using Org.WebRtc;
using Windows.Media.Core;
#endif

public class ControlScript : MonoBehaviour
{
    public uint LocalTextureWidth = 160;
    public uint LocalTextureHeight = 120;
    public uint RemoteTextureWidth = 640;
    public uint RemoteTextureHeight = 480;
    
    public RawImage LocalVideoImage;
    public RawImage RemoteVideoImage;

	void Awake()
    {
    }
    
    void Start()
    {
	}

    private void OnInitialized()
    {
    }

    private void OnEnable()
    {
    }

    private void OnDisable()
    {
    }

    void Update()
    {
    }

    public void CreateLocalMediaStreamSource(object track, string type, string id)
    {
        Plugin.CreateLocalMediaPlayback();
        IntPtr nativeTex = IntPtr.Zero;
        Plugin.GetLocalPrimaryTexture(LocalTextureWidth, LocalTextureHeight, out nativeTex);
        var primaryPlaybackTexture = Texture2D.CreateExternalTexture((int)LocalTextureWidth, (int)LocalTextureHeight, TextureFormat.BGRA32, false, false, nativeTex);
        LocalVideoImage.texture = primaryPlaybackTexture;
#if !UNITY_EDITOR
        MediaVideoTrack videoTrack = (MediaVideoTrack)track;
        var source = Media.CreateMedia().CreateMediaStreamSource(videoTrack, type, id);
        Plugin.LoadLocalMediaStreamSource((MediaStreamSource)source);
        Plugin.LocalPlay();
#endif
    }

    public void DestroyLocalMediaStreamSource()
    {
        LocalVideoImage.texture = null;
        Plugin.ReleaseLocalMediaPlayback();
    }

    public void CreateRemoteMediaStreamSource(object track, string type, string id)
    {
        Plugin.CreateRemoteMediaPlayback();
        IntPtr nativeTex = IntPtr.Zero;
        Plugin.GetRemotePrimaryTexture(RemoteTextureWidth, RemoteTextureHeight, out nativeTex);
        var primaryPlaybackTexture = Texture2D.CreateExternalTexture((int)RemoteTextureWidth, (int)RemoteTextureHeight, TextureFormat.BGRA32, false, false, nativeTex);
        RemoteVideoImage.texture = primaryPlaybackTexture;
#if !UNITY_EDITOR
        MediaVideoTrack videoTrack = (MediaVideoTrack)track;
        var source = Media.CreateMedia().CreateMediaStreamSource(videoTrack, type, id);
        Plugin.LoadRemoteMediaStreamSource((MediaStreamSource)source);
        Plugin.RemotePlay();
#endif
    }

    public void DestroyRemoteMediaStreamSource()
    {
        RemoteVideoImage.texture = null;
        Plugin.ReleaseRemoteMediaPlayback();
    }

    private static class Plugin
    {
        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "CreateLocalMediaPlayback")]
        internal static extern void CreateLocalMediaPlayback();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "CreateRemoteMediaPlayback")]
        internal static extern void CreateRemoteMediaPlayback();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "ReleaseLocalMediaPlayback")]
        internal static extern void ReleaseLocalMediaPlayback();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "ReleaseRemoteMediaPlayback")]
        internal static extern void ReleaseRemoteMediaPlayback();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "GetLocalPrimaryTexture")]
        internal static extern void GetLocalPrimaryTexture(UInt32 width, UInt32 height, out System.IntPtr playbackTexture);

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "GetRemotePrimaryTexture")]
        internal static extern void GetRemotePrimaryTexture(UInt32 width, UInt32 height, out System.IntPtr playbackTexture);

#if !UNITY_EDITOR
        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "LoadLocalMediaStreamSource")]
        internal static extern void LoadLocalMediaStreamSource(MediaStreamSource IMediaSourceHandler);

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "LoadRemoteMediaStreamSource")]
        internal static extern void LoadRemoteMediaStreamSource(MediaStreamSource IMediaSourceHandler);
#endif

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "LocalPlay")]
        internal static extern void LocalPlay();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "RemotePlay")]
        internal static extern void RemotePlay();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "LocalPause")]
        internal static extern void LocalPause();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "RemotePause")]
        internal static extern void RemotePause();
    }
}

and so it’s essentially giving me the pieces that I need to wire up local/remote media streams coming from webRTC into the pieces that can render them in Unity.

It feels like across these projects are the pieces needed to plug together with my basic library project in order to rebuild the app that I had in the previous blog post and have it run inside of a 3D Unity app rather than a 2D XAML app…

Plugging Together the Pieces

Pete put together a regular Unity project targeting UWP for HoloLens and in the scene at the moment we have only 2 quads that we try to render the local and remote video to.

image

and then there’s an empty GameObject named Control with a script on it configured as below;

image

and you can see that this configuration is being used to do a couple of things (there’s a rough sketch of the corresponding fields in code after the list);

  • Set up the properties that my conversation library code from the previous blog post needed to try and start a conversation over webRTC
    • The signalling server IP address, port number, whether to initiate a conversation or not and, if so, whether there’s a particular peer name to initiate that conversation with.
  • Set up some properties that will facilitate rendering of the video into the materials texturing the 2 quads in the scene.
    • Widths, heights to use.
    • The GameObjects that we want to render our video streams to.
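
Mapping those onto a script, the serialized fields might look roughly like the following – the field names and defaults here are my guesses (the texture sizes just mirror the earlier sample’s ControlScript) rather than the real ones in Pete’s project;

using UnityEngine;

public class ControlConfiguration : MonoBehaviour
{
    // Details handed to the conversation library from the previous post.
    public string signallingServerIpAddress;
    public int signallingServerPort = 8888;
    public bool isInitiator = true;
    public string peerName;

    // Sizes used when creating the textures that the video streams render to.
    public uint localTextureWidth = 160;
    public uint localTextureHeight = 120;
    public uint remoteTextureWidth = 640;
    public uint remoteTextureHeight = 480;

    // The quads in the scene that the local/remote video get rendered onto.
    public GameObject localVideoQuad;
    public GameObject remoteVideoQuad;
}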

Pete re-worked the original sample code to render to a texture of a material applied to a quad rather than the original rendering to a 2D RawImage.

Now, it’s fairly easy to then add my conversation library into this Unity project so that we can make use of that code. We simply drop it into the Assets of the project and configure up the appropriate build settings for Unity;

image

and also drop in the MediaEngineUWP, Org.WebRtc.dll and WebRtcScheme.dlls;

image

and the job then becomes one of adapting the code that I wrote in the previous blog post to suit the Unity environment which means being able to implement the IMediaManager interface that I came up with for Unity rather than for XAML.

How to go about that? Firstly, we took those PInvoke signatures from the VideoControlSample and put them into a separate static class named Plugin.

Secondly, we implemented that IMediaManager interface on top of the pieces that originated in the sample;

#if ENABLE_WINMD_SUPPORT

using ConversationLibrary.Interfaces;
using ConversationLibrary.Utility;
using Org.WebRtc;
using System;
using System.Linq;
using System.Threading.Tasks;
using UnityEngine;
using UnityEngine.WSA;
using Windows.Media.Core;

public class MediaManager : IMediaManager
{
    // This constructor will be used by the cheap IoC container
    public MediaManager()
    {
        this.textureDetails = CheapContainer.Resolve<ITextureDetailsProvider>();
    }
    // The idea is that this constructor would be used by a real IoC container.
    public MediaManager(ITextureDetailsProvider textureDetails)
    {
        this.textureDetails = textureDetails;
    }
    public Media Media => this.media;

    public MediaStream UserMedia => this.userMedia;

    public MediaVideoTrack RemoteVideoTrack { get => remoteVideoTrack; set => remoteVideoTrack = value; }

    public async Task AddLocalStreamAsync(MediaStream stream)
    {
        var track = stream?.GetVideoTracks()?.FirstOrDefault();

        if (track != null)
        {
            // TODO: stop hardcoding I420?.
            this.InvokeOnUnityMainThread(
                () => this.CreateLocalMediaStreamSource(track, LOCAL_VIDEO_FRAME_FORMAT, "SELF"));
        }
    }

    public async Task AddRemoteStreamAsync(MediaStream stream)
    {
        var track = stream?.GetVideoTracks()?.FirstOrDefault();

        if (track != null)
        {
            // TODO: stop hardcoding I420?.
            this.InvokeOnUnityMainThread(
                () => this.CreateRemoteMediaStreamSource(track, REMOTE_VIDEO_FRAME_FORMAT, "PEER"));
        }
    }
    void InvokeOnUnityMainThread(AppCallbackItem callback)
    {
        UnityEngine.WSA.Application.InvokeOnAppThread(callback,false);
    }
    void InvokeOnUnityUIThread(AppCallbackItem callback)
    {
        UnityEngine.WSA.Application.InvokeOnUIThread(callback, false);
    }
    public async Task CreateAsync(bool audioEnabled = true, bool videoEnabled = true)
    {
        this.media = Media.CreateMedia();

        // TODO: for the moment, turning audio off as I get an access violation in
        // some piece of code that'll take some debugging.
        RTCMediaStreamConstraints constraints = new RTCMediaStreamConstraints()
        {
            // TODO: switch audio back on, fix the crash.
            audioEnabled = false,
            videoEnabled = true
        };
        this.userMedia = await media.GetUserMedia(constraints);
    }

    public void RemoveLocalStream()
    {
        // TODO: is this ever getting called?
        this.InvokeOnUnityMainThread(
            () => this.DestroyLocalMediaStreamSource());
    }

    public void RemoveRemoteStream()
    {
        this.DestroyRemoteMediaStreamSource();
    }

    public void Shutdown()
    {
        if (this.media != null)
        {
            if (this.localVideoTrack != null)
            {
                this.localVideoTrack.Dispose();
                this.localVideoTrack = null;
            }
            if (this.RemoteVideoTrack != null)
            {
                this.RemoteVideoTrack.Dispose();
                this.RemoteVideoTrack = null;
            }
            this.userMedia = null;
            this.media.Dispose();
            this.media = null;
        }
    }
    void CreateLocalMediaStreamSource(object track, string type, string id)
    {
        Plugin.CreateLocalMediaPlayback();
        IntPtr playbackTexture = IntPtr.Zero;
        Plugin.GetLocalPrimaryTexture(
            this.textureDetails.Details.LocalTextureWidth, 
            this.textureDetails.Details.LocalTextureHeight, 
            out playbackTexture);

        this.textureDetails.Details.LocalTexture.GetComponent<Renderer>().sharedMaterial.mainTexture = 
            (Texture)Texture2D.CreateExternalTexture(
                (int)this.textureDetails.Details.LocalTextureWidth, 
                (int)this.textureDetails.Details.LocalTextureHeight, 
                (TextureFormat)14, false, false, playbackTexture);

#if ENABLE_WINMD_SUPPORT
        Plugin.LoadLocalMediaStreamSource(
            (MediaStreamSource)Org.WebRtc.Media.CreateMedia().CreateMediaStreamSource((MediaVideoTrack)track, type, id));
#endif
        Plugin.LocalPlay();
    }

    void DestroyLocalMediaStreamSource()
    {
        this.textureDetails.Details.LocalTexture.GetComponent<Renderer>().sharedMaterial.mainTexture = null;
        Plugin.ReleaseLocalMediaPlayback();
    }

    void CreateRemoteMediaStreamSource(object track, string type, string id)
    {
        Plugin.CreateRemoteMediaPlayback();

        IntPtr playbackTexture = IntPtr.Zero;

        Plugin.GetRemotePrimaryTexture(
            this.textureDetails.Details.RemoteTextureWidth, 
            this.textureDetails.Details.RemoteTextureHeight, 
            out playbackTexture);

        // NB: creating textures and calling GetComponent<> has thread affinity for Unity
        // in so far as I can tell.
        var texture = (Texture)Texture2D.CreateExternalTexture(
           (int)this.textureDetails.Details.RemoteTextureWidth,
           (int)this.textureDetails.Details.RemoteTextureHeight,
           (TextureFormat)14, false, false, playbackTexture);

        this.textureDetails.Details.RemoteTexture.GetComponent<Renderer>().sharedMaterial.mainTexture = texture;

#if ENABLE_WINMD_SUPPORT
        Plugin.LoadRemoteMediaStreamSource(
            (MediaStreamSource)Org.WebRtc.Media.CreateMedia().CreateMediaStreamSource((MediaVideoTrack)track, type, id));
#endif
        Plugin.RemotePlay();
    }

    void DestroyRemoteMediaStreamSource()
    {
        this.textureDetails.Details.RemoteTexture.GetComponent<Renderer>().sharedMaterial.mainTexture = null;
        Plugin.ReleaseRemoteMediaPlayback();
    }
    Media media;
    MediaStream userMedia;
    MediaVideoTrack remoteVideoTrack;
    MediaVideoTrack localVideoTrack;
    ITextureDetailsProvider textureDetails;

    // TODO: temporary hard coding...
    static readonly string LOCAL_VIDEO_FRAME_FORMAT = "I420";
    static readonly string REMOTE_VIDEO_FRAME_FORMAT = "H264";
}
#endif

Naturally, this is very “rough” code right now and there’s some hard-coding going on in there but it didn’t take too much effort to plug these pieces under that interface that I’d brought across from my original, minimal XAML-based project.

So…with all of that said…

Does It Work?

Sort of. Firstly, you might notice in the code above that audio is hard-coded to be switched off because we currently have a crash if we switch audio on and it’s some release of some smart pointer in the webRTC pieces that we haven’t yet tracked down.

Minus audio, it’s possible to run the Unity app here on HoloLens and have it connect via the sample-provided signalling service to the original XAML-based PeerCC sample running (e.g.) on my Surface Book and video streams flow and are visible in both directions.

Here’s a screenshot of that “in action” from the point of view of the desktop app receiving video stream from HoloLens;

image

and that screenshot is displaying 4 things;

  • Bottom right is the local PC’s video stream off its webcam – me wearing a HoloLens.
  • Upper left 75% is the remote stream coming from the webcam on the HoloLens, including its holographic content which currently includes;
    • Upper left mid section is the remote video stream from the PC replayed on the HoloLens.
    • Upper right mid section is the local HoloLens video stream replayed on the HoloLens which looked to disappear when I was taking this screenshot.

You might see some numbers in there that suggest 30fps but I think that was a temporary thing – at the time of writing the performance so far is fairly bad but we’ve not had a look at what’s going on there just yet; this ‘play’ sample needs some more investigation.

Where’s the Code?

If you’re interested in following these experiments along as we go forward then the code is in a different location to the previous repo as it’s over here on Pete’s github account;

https://github.com/peted70/web-rtc-test

Feel free to feedback but, of course, apply the massive caveat that this is very rough experimentation at the moment – there’s a long way to go!

Rough Notes on UWP and webRTC (Part 3)

This is a follow-on from my previous post around taking small steps with webRTC and UWP.

At the end of that post, I had some scrappy code which was fairly fixed in function in that it was a small UWP app which would use the UWP webRTC library to connect to a signalling service and then could begin a conversation with a peer that was also connected to the same signalling service.

The signalling service in question had to be the one provided with the UWP webRTC bits and the easiest way to test that my app was doing something was to run it against the PeerCC sample which also ships with the UWP webRTC bits and does way more than my app does by demonstrating lots of functionality that’s present in UWP webRTC.

The links to all the webRTC pieces that I’m referring to are in the previous 2 posts on this topic.

Tidying Up

The code that I had in the signalling branch of this github repo at the end of the previous post was quite messy and not really in a position to be re-used, so I spent a little time pulling that code apart, refactoring some of the functionality behind interfaces and reducing the implicit dependencies in order to try and move the code towards being a little bit more re-usable (even if the functionality it currently implements isn’t of much actual use to a real user – I’m just experimenting).

What I was trying to move towards was some code that I knew sort of worked in this XAML based UWP app that I could then lift out of the app and re-use in a non-XAML based UWP app (i.e. a Unity app) so that I would have some control over the knowns and unknowns in trying out that process.

What I needed to do then was make sure that in refactoring things, I ended up with code that was clearly abstracted from its dependencies on anything in the XAML layer.

Firstly, I refactored the solution into two projects to make for a class library and an app project which referenced it;

image

and then I took some of the pieces of functionality that I had in there and abstracted it out into a set of interfaces;

image

with a view to making the dependencies between these interfaces explicit and the implementation pluggable.

This included putting the code which provides signalling by invoking the signalling service supplied with the original sample behind an interface. Note that I’m not at all trying to come up with a generic interface that could generally represent the notion of signalling in webRTC but, instead, I’m just trying to put an interface on to the existing signalling code that I took (almost) entirely from the PeerCC sample project in the UWP webRTC bits.

image

The other interfaces/services that I added here are hopefully named ‘reasonably well’ in terms of the functionality that they represent, with perhaps the one that’s not quite so obvious being the IConversationManager.

This interface is just my attempt to codify the minimum functionality that I need to bring the other interface implementations together in order to get any kind of conversation over webRTC up and running from my little sample app as it stands. That IConversationManager interface right now just looks as below;

image

and so the idea here is that a consumer of an IConversationManager can simply;

  • Tell the manager whether it is meant to initiate conversations or simply wait for a remote peer to begin a conversation with it
    • In terms of initiating conversations – the code is ‘aggressive’ in that it simply finds the first peer that it sees provided by the signalling service and attempts to begin a conversation with it.
  • Call InitialiseAsync providing the name that the local peer wants to be represented by.
  • Call ConnectToSignallingAsync with the IP Address and port where the signalling service is to be found.

From there, the implementation jumps in and tries to bring together all the right pieces to get a conversation flowing.
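
Pulling that description together, a rough sketch of the shape of the interface would be something like the following (the exact signatures in the repo may differ a little);

using System.Threading.Tasks;

public interface IConversationManager
{
    // Whether this peer should aggressively start a conversation with the
    // first peer it sees on the signalling service.
    bool IsInitiator { get; set; }

    // Initialise with the name that the local peer wants to be known by.
    Task InitialiseAsync(string localPeerName);

    // Connect to the signalling service and report whether that worked.
    Task<bool> ConnectToSignallingAsync(string ipAddress, int port);
}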

In making these abstractions, I found two places where I had to apply a little bit of thought and that was where;

  • The UWP webRTC pieces need initialising with a Dispatcher object and so I abstracted that out into an interface so that an implementation can be injected into the underlying layer.
  • There is a need at some point to do some work with UI objects to represent media streams. In the code to date, this has meant working with XAML MediaElements but in other scenarios (e.g. Unity UI) that wouldn’t work.

In order to try and abstract the library code from these media pieces, I made an IMediaManager interface with the intention being to write a different implementation for the different UI layers. So, to bring this library up inside of a Unity app I’d at least need to provide a Unity version of the highlighted implementation pieces below which are about IMediaManager in a XAML UI world;

image
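
For reference, the IMediaManager interface itself ends up with a shape along these lines – this is pieced together from the implementations rather than copied verbatim, so treat the repo as the definitive version;

using System.Threading.Tasks;
using Org.WebRtc;

public interface IMediaManager
{
    Media Media { get; }
    MediaStream UserMedia { get; }
    MediaVideoTrack RemoteVideoTrack { get; set; }

    // Create the underlying media object and get hold of the user's media.
    Task CreateAsync(bool audioEnabled = true, bool videoEnabled = true);

    // Wire local/remote streams into whatever the UI layer uses to render them.
    Task AddLocalStreamAsync(MediaStream stream);
    Task AddRemoteStreamAsync(MediaStream stream);
    void RemoveLocalStream();
    void RemoveRemoteStream();

    void Shutdown();
}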

My main project took a dependency on autofac to provide a container from which to serve up the implementations of my interfaces, and I did a cheap trick of providing my own “container” embedded into the library and named CheapContainer in case the library was going to be used in a situation where autofac or some other IoC container wasn’t immediately available.
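
There’s nothing clever about CheapContainer – as a sketch of the sort of thing it is (the real one in the repo may well differ), it’s essentially a static map from interface type to a lazily created singleton instance;

using System;
using System.Collections.Generic;

public static class CheapContainer
{
    // Factory per interface type plus a cache of the single instance created.
    static readonly Dictionary<Type, Func<object>> factories =
        new Dictionary<Type, Func<object>>();

    static readonly Dictionary<Type, object> instances =
        new Dictionary<Type, object>();

    public static void Register<TInterface, TImplementation>()
        where TImplementation : TInterface, new()
    {
        factories[typeof(TInterface)] = () => new TImplementation();
    }
    public static TInterface Resolve<TInterface>() where TInterface : class
    {
        object instance;

        if (!instances.TryGetValue(typeof(TInterface), out instance))
        {
            instance = factories[typeof(TInterface)]();
            instances[typeof(TInterface)] = instance;
        }
        return ((TInterface)instance);
    }
}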

Configuration of the container then moves into my App.xaml.cs file and is fairly simple – I wrote it twice, once for autofac and once using my own CheapContainer;

#if !USE_CHEAP_CONTAINER
        Autofac.IContainer Container
        {
            get
            {
                if (this.iocContainer == null)
                {
                    this.BuildContainer();
                }
                return (this.iocContainer);
            }
        }
#endif
        void BuildContainer()
        {
#if USE_CHEAP_CONTAINER
            CheapContainer.Register<ISignallingService, Signaller>();
            CheapContainer.Register<IDispatcherProvider, XamlMediaElementProvider>();
            CheapContainer.Register<IXamlMediaElementProvider, XamlMediaElementProvider>();
            CheapContainer.Register<IMediaManager, XamlMediaElementMediaManager>();
            CheapContainer.Register<IPeerManager, PeerManager>();
            CheapContainer.Register<IConversationManager, ConversationManager>();
#else
            var builder = new ContainerBuilder();
            builder.RegisterType<Signaller>().As<ISignallingService>().SingleInstance();

            builder.RegisterType<XamlMediaElementProvider>().As<IXamlMediaElementProvider>().As<IDispatcherProvider>().SingleInstance();

            builder.RegisterType<XamlMediaElementMediaManager>().As<IMediaManager>().SingleInstance();
            builder.RegisterType<PeerManager>().As<IPeerManager>().SingleInstance();
            builder.RegisterType<ConversationManager>().As<IConversationManager>().SingleInstance();
            builder.RegisterType<MainPage>().AsSelf().SingleInstance();
            this.iocContainer = builder.Build();
#endif
        }
#if !USE_CHEAP_CONTAINER
        Autofac.IContainer iocContainer;
#endif

and the code which now lives inside of my MainPage.xaml.cs file involved in actually getting the webRTC conversation up and running is reduced down to almost nothing;

        async void OnConnectToSignallingAsync()
        {
            await this.conversationManager.InitialiseAsync(this.addressDetails.HostName);

            this.conversationManager.IsInitiator = this.isInitiator;

            this.HasConnected = await this.conversationManager.ConnectToSignallingAsync(
                this.addressDetails.IPAddress, this.addressDetails.Port);
        }

and so that seems a lot simpler, neater and more re-usable than what I’d had at the end of the previous blog post.

In subsequent posts, I’m going to see if I can now re-use this library inside of other environments (e.g. Unity) so as to bring this same (very limited) webRTC functionality that I’ve been playing with to that environment.