Rough Notes on UWP and webRTC (Part 4–Adding some Unity and a little HoloLens)

Following up on my previous post, I wanted to take the very basic test code that I’d got working ‘reasonably’ on UWP on my desktop PC and see if I could move it to HoloLens running inside of a Unity application.

The intention would be to preserve the very limited functionality that I have which goes something like;

  • The app runs up, is given the details of the signalling service (from the PeerCC sample) to connect to, and then connects to it
  • The app finds a peer on the signalling service and tries to get a two-way audio/video call going with that peer displaying local/remote video and capturing local audio while playing remote audio.

That’s what I currently have in the signalling branch here and the previous blog post was about abstracting some of that out such that I could use it in a different environment like Unity.

Now it’s time to see if that works out…

Getting Some Early Bits to Work With

In order to even think about this, I needed to pick up a version of UWP webRTC that works in that environment and which has some extra pieces to help me out. As far as I know, at the time of writing, that involves picking up bits that are mentioned in this particular issue over on github by the UWP webRTC team;

Expose “raw” MediaCapture Object #11

and there are instructions in that post around how to get hold of some bits;

Instructions for Getting Bits

and so I followed those instructions and built the code from that branch of that repo.

From there, I’ve been working with my colleague Pete to put together some of those pieces with the pieces that I already had from the previous blog posts.

First, a quick look around the bits that the repo gives us…

Exploring the WebRtcUnity PeerCC Sample Solution

As is often the case, this process looks like it is going to involve standing on the shoulders of some other giants because there's already code in the UWP webRTC repo that I pointed to above which shows how to put this type of app together.

The code in question is surfaced through this solution in the repo;

image

Inside of that solution, there's a project which builds out the equivalent of the original XAML+MediaElement PeerCC sample but in a modified form which doesn't have to use MediaElement to render, and that shift in the code is reflected in its additional Unity dependencies;

image

This confused me for a little while – I was wondering why this XAML-based application suddenly had a big dependency on Unity until I realised that, in order to show that media can be rendered by Unity, the original sample code has been modified so that, depending on the conditional compilation constant UNITY, the app can render media streams in one of two ways (there's a small sketch of the pattern just after this list);

  1. Using MediaElement as it did previously
  2. Using Unity rendering pieces which are then hosted inside of a SwapChainPanel inside of the XAML UI.
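To make that concrete, the pattern is something like the little sketch below. This isn't the actual sample code – it's just my paraphrase of the approach and the class/method names here are mine rather than names taken from the repo;

class RenderingSwitchSketch
{
    public void RenderRemoteVideo(object videoTrack)
    {
#if UNITY
        // Hand the track over to the Unity rendering pieces, which end up hosted
        // inside a SwapChainPanel within the XAML UI.
        this.RenderViaUnitySwapChainPanel(videoTrack);
#else
        // Original behaviour - wire the track up to a XAML MediaElement.
        this.RenderViaMediaElement(videoTrack);
#endif
    }

    void RenderViaUnitySwapChainPanel(object videoTrack) { /* Unity pieces... */ }

    void RenderViaMediaElement(object videoTrack) { /* MediaElement pieces... */ }
}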

Now, I've failed to get this sample to run on my machine – I think that's down to the versions of Unity that I'm running – so I've had to pick through the code a little 'cold', but as far as I can see there are a couple of sub-projects involved in making this work…

The Org.WebRtc.Uwp Project

This project was already present in the original XAML-based solution and, as I understand it, it's involved with wrapping some C++/CX code around the webrtc.lib library in order to bring those types into a UWP environment. I haven't done a delta to see how much/little is different in this branch of the project versus the original sample, so there may be differences I'm not aware of.

image

The MediaEngineUWP and WebRtcScheme Projects

Then there are 2 projects within the Unity sample's MediaEngine folder which I don't think were present in the original, purely XAML-based PeerCC sample;

image

The MediaEngineUWP and WebRtcScheme projects build out DLLs which seem to take on a couple of roles. I'm more than willing to admit that I don't have this all worked out in my head at the time of writing, but I think they are about bridging between the Unity platform, the Windows media platform and webRTC, and I think they do this by;

  • Relying on the existing work in the Org.WebRtc.Uwp project which integrates the webRTC pieces into the Windows UWP media pipeline. I think this is done by adding a webRTC VideoSinkInterface which then surfaces the webRTC pieces as the UWP IMediaSource and IMediaStreamSource types.
  • The MediaEngineUWP.dll exporting a UnityPluginLoad function which grabs an IUnityGraphics, plus a number of other exports that can be called via PInvoke from Unity to set up the textures into which code inside this DLL renders the local/remote video frames for Unity.
    • There’s a class in this project named MediaEnginePlayer which is instanced per video stream and which seems to do the work of grabbing frames from the incoming Windows media pipeline and transferring them into Unity textures.
    • The same class looks to use the IMFMediaEngineNotify callback interface to be notified of state changes for the media stream and responds by playing/stopping etc.

The wiring together of this MediaEnginePlayer into the media pipeline is a little opaque to me but I think that it follows what is documented here and under the topic Source Resolver here. This seems to involve the code associating a URL (of form webrtc:GUID) with each IMediaStream and having an activatable class which the media pipeline then invokes with the URL to be linked up to the right instance of the player.

That may be a ‘much less than perfect’ description of what goes on in these projects as I haven’t stepped through all of that code.
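Purely to illustrate the idea, the association seems to amount to something conceptually like the rough C# sketch below – the real code is C++ inside those two projects and all of the names here are mine, not the repo's;

using System;
using System.Collections.Generic;

// Conceptual sketch only - a registry associating a webrtc:GUID style URL with
// the player instance that should handle it, so that the scheme handler invoked
// by the media pipeline can find its way back to the right player.
static class SchemeRegistrySketch
{
    static readonly Dictionary<string, object> playersByUrl = new Dictionary<string, object>();

    public static string RegisterPlayer(object mediaEnginePlayer)
    {
        var url = $"webrtc:{Guid.NewGuid()}";
        playersByUrl[url] = mediaEnginePlayer;
        return (url);
    }

    public static object ResolvePlayer(string url) => playersByUrl[url];
}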

What I think it does mean, though, is that the code inside of the WebRtcScheme project requires that the .appxmanifest for an app that consumes it includes a section that looks like;

 <Extensions>
    <Extension Category="windows.activatableClass.inProcessServer">
      <InProcessServer>
        <Path>WebRtcScheme.dll</Path>
        <ActivatableClass ActivatableClassId="WebRtcScheme.SchemeHandler" ThreadingModel="both" />
      </InProcessServer>
    </Extension>
  </Extensions>

I don't know of a way of setting this up inside of a Unity project, so I ended up just letting Unity build the Visual Studio solution and then manually hacking the generated manifest to include this section.

Exploring the Video Control Solution

I looked into another project within that github repo which is a Unity project contained within this folder;

image

There’s a Unity scene which has a (UI) Canvas and a couple of Unity Raw Image objects which can be used to render to;

image

and a Control script which is set up to PInvoke into the MediaEngineUWP to pass the pieces from the Unity environment into the DLL. That script looks like this;

using System;
using System.Runtime.InteropServices;
using UnityEngine;
using UnityEngine.UI;

#if !UNITY_EDITOR
using Org.WebRtc;
using Windows.Media.Core;
#endif

public class ControlScript : MonoBehaviour
{
    public uint LocalTextureWidth = 160;
    public uint LocalTextureHeight = 120;
    public uint RemoteTextureWidth = 640;
    public uint RemoteTextureHeight = 480;
    
    public RawImage LocalVideoImage;
    public RawImage RemoteVideoImage;

    void Awake()
    {
    }

    void Start()
    {
    }

    private void OnInitialized()
    {
    }

    private void OnEnable()
    {
    }

    private void OnDisable()
    {
    }

    void Update()
    {
    }

    public void CreateLocalMediaStreamSource(object track, string type, string id)
    {
        Plugin.CreateLocalMediaPlayback();
        IntPtr nativeTex = IntPtr.Zero;
        Plugin.GetLocalPrimaryTexture(LocalTextureWidth, LocalTextureHeight, out nativeTex);
        var primaryPlaybackTexture = Texture2D.CreateExternalTexture((int)LocalTextureWidth, (int)LocalTextureHeight, TextureFormat.BGRA32, false, false, nativeTex);
        LocalVideoImage.texture = primaryPlaybackTexture;
#if !UNITY_EDITOR
        MediaVideoTrack videoTrack = (MediaVideoTrack)track;
        var source = Media.CreateMedia().CreateMediaStreamSource(videoTrack, type, id);
        Plugin.LoadLocalMediaStreamSource((MediaStreamSource)source);
        Plugin.LocalPlay();
#endif
    }

    public void DestroyLocalMediaStreamSource()
    {
        LocalVideoImage.texture = null;
        Plugin.ReleaseLocalMediaPlayback();
    }

    public void CreateRemoteMediaStreamSource(object track, string type, string id)
    {
        Plugin.CreateRemoteMediaPlayback();
        IntPtr nativeTex = IntPtr.Zero;
        Plugin.GetRemotePrimaryTexture(RemoteTextureWidth, RemoteTextureHeight, out nativeTex);
        var primaryPlaybackTexture = Texture2D.CreateExternalTexture((int)RemoteTextureWidth, (int)RemoteTextureHeight, TextureFormat.BGRA32, false, false, nativeTex);
        RemoteVideoImage.texture = primaryPlaybackTexture;
#if !UNITY_EDITOR
        MediaVideoTrack videoTrack = (MediaVideoTrack)track;
        var source = Media.CreateMedia().CreateMediaStreamSource(videoTrack, type, id);
        Plugin.LoadRemoteMediaStreamSource((MediaStreamSource)source);
        Plugin.RemotePlay();
#endif
    }

    public void DestroyRemoteMediaStreamSource()
    {
        RemoteVideoImage.texture = null;
        Plugin.ReleaseRemoteMediaPlayback();
    }

    private static class Plugin
    {
        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "CreateLocalMediaPlayback")]
        internal static extern void CreateLocalMediaPlayback();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "CreateRemoteMediaPlayback")]
        internal static extern void CreateRemoteMediaPlayback();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "ReleaseLocalMediaPlayback")]
        internal static extern void ReleaseLocalMediaPlayback();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "ReleaseRemoteMediaPlayback")]
        internal static extern void ReleaseRemoteMediaPlayback();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "GetLocalPrimaryTexture")]
        internal static extern void GetLocalPrimaryTexture(UInt32 width, UInt32 height, out System.IntPtr playbackTexture);

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "GetRemotePrimaryTexture")]
        internal static extern void GetRemotePrimaryTexture(UInt32 width, UInt32 height, out System.IntPtr playbackTexture);

#if !UNITY_EDITOR
        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "LoadLocalMediaStreamSource")]
        internal static extern void LoadLocalMediaStreamSource(MediaStreamSource IMediaSourceHandler);

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "LoadRemoteMediaStreamSource")]
        internal static extern void LoadRemoteMediaStreamSource(MediaStreamSource IMediaSourceHandler);
#endif

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "LocalPlay")]
        internal static extern void LocalPlay();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "RemotePlay")]
        internal static extern void RemotePlay();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "LocalPause")]
        internal static extern void LocalPause();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "RemotePause")]
        internal static extern void RemotePause();
    }
}

and so it’s essentially giving me the pieces that I need to wire up local/remote media streams coming from webRTC into the pieces that can render them in Unity.

It feels like across these projects are the pieces needed to plug together with my basic library project in order to rebuild the app that I had in the previous blog post and have it run inside of a 3D Unity app rather than a 2D XAML app…

Plugging Together the Pieces

Pete put together a regular Unity project targeting UWP for HoloLens and in the scene at the moment we have only 2 quads that we try to render the local and remote video to.

image

and then there’s an empty GameObject named Control with a script on it configured as below;

image

and you can see that this configuration is being used to do a couple of things;

  • Set up the properties that my conversation library code from the previous blog post needed to try and start a conversation over webRTC
    • The signalling server IP address, port number, whether to initiate a conversation or not and, if so, whether there’s a particular peer name to initiate that conversation with.
  • Set up some properties that will facilitate rendering of the video into the materials texturing the 2 quads in the scene.
    • Widths, heights to use.
    • The GameObjects that we want to render our video streams to.

Pete re-worked the original sample code to render to a texture of a material applied to a quad, rather than to a 2D RawImage as the original did.
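To give a feel for what that configuration amounts to, a minimal sketch of the sort of fields such a control script exposes to the Unity editor might look something like the below – the field names (and the default port) here are my guesses rather than the exact ones from Pete's project;

using UnityEngine;

public class ConversationControlSketch : MonoBehaviour
{
    // Details handed to my conversation library from the previous post.
    public string SignallingServerIpAddress = "192.168.0.1";   // placeholder address
    public int SignallingServerPort = 8888;                    // placeholder port
    public bool IsInitiator = false;
    public string PeerName = string.Empty;                     // optional, specific peer to call

    // Texture sizes to use for the local/remote video streams.
    public uint LocalTextureWidth = 160;
    public uint LocalTextureHeight = 120;
    public uint RemoteTextureWidth = 640;
    public uint RemoteTextureHeight = 480;

    // The quads whose material textures the local/remote video get rendered into.
    public GameObject LocalVideoQuad;
    public GameObject RemoteVideoQuad;
}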

Now, it’s fairly easy to then add my conversation library into this Unity project so that we can make use of that code. We simply drop it into the Assets of the project and configure up the appropriate build settings for Unity;

image

and also drop in the MediaEngineUWP, Org.WebRtc.dll and WebRtcScheme.dlls;

image

and the job then becomes one of adapting the code that I wrote in the previous blog post to suit the Unity environment which means being able to implement the IMediaManager interface that I came up with for Unity rather than for XAML.

How to go about that? Firstly, we took those PInvoke signatures from the VideoControlSample and put them into a separate static class named Plugin.

Secondly, we implemented that IMediaManager interface on top of the pieces that originated in the sample;

#if ENABLE_WINMD_SUPPORT

using ConversationLibrary.Interfaces;
using ConversationLibrary.Utility;
using Org.WebRtc;
using System;
using System.Linq;
using System.Threading.Tasks;
using UnityEngine;
using UnityEngine.WSA;
using Windows.Media.Core;

public class MediaManager : IMediaManager
{
    // This constructor will be used by the cheap IoC container
    public MediaManager()
    {
        this.textureDetails = CheapContainer.Resolve<ITextureDetailsProvider>();
    }
    // The idea is that this constructor would be used by a real IoC container.
    public MediaManager(ITextureDetailsProvider textureDetails)
    {
        this.textureDetails = textureDetails;
    }
    public Media Media => this.media;

    public MediaStream UserMedia => this.userMedia;

    public MediaVideoTrack RemoteVideoTrack { get => remoteVideoTrack; set => remoteVideoTrack = value; }

    public async Task AddLocalStreamAsync(MediaStream stream)
    {
        var track = stream?.GetVideoTracks()?.FirstOrDefault();

        if (track != null)
        {
            // TODO: stop hardcoding I420?.
            this.InvokeOnUnityMainThread(
                () => this.CreateLocalMediaStreamSource(track, LOCAL_VIDEO_FRAME_FORMAT, "SELF"));
        }
    }

    public async Task AddRemoteStreamAsync(MediaStream stream)
    {
        var track = stream?.GetVideoTracks()?.FirstOrDefault();

        if (track != null)
        {
            // TODO: stop hardcoding I420?.
            this.InvokeOnUnityMainThread(
                () => this.CreateRemoteMediaStreamSource(track, REMOTE_VIDEO_FRAME_FORMAT, "PEER"));
        }
    }
    void InvokeOnUnityMainThread(AppCallbackItem callback)
    {
        UnityEngine.WSA.Application.InvokeOnAppThread(callback,false);
    }
    void InvokeOnUnityUIThread(AppCallbackItem callback)
    {
        UnityEngine.WSA.Application.InvokeOnUIThread(callback, false);
    }
    public async Task CreateAsync(bool audioEnabled = true, bool videoEnabled = true)
    {
        this.media = Media.CreateMedia();

        // TODO: for the moment, turning audio off as I get an access violation in
        // some piece of code that'll take some debugging.
        RTCMediaStreamConstraints constraints = new RTCMediaStreamConstraints()
        {
            // TODO: switch audio back on, fix the crash.
            audioEnabled = false,
            videoEnabled = true
        };
        this.userMedia = await media.GetUserMedia(constraints);
    }

    public void RemoveLocalStream()
    {
        // TODO: is this ever getting called?
        this.InvokeOnUnityMainThread(
            () => this.DestroyLocalMediaStreamSource());
    }

    public void RemoveRemoteStream()
    {
        this.DestroyRemoteMediaStreamSource();
    }

    public void Shutdown()
    {
        if (this.media != null)
        {
            if (this.localVideoTrack != null)
            {
                this.localVideoTrack.Dispose();
                this.localVideoTrack = null;
            }
            if (this.RemoteVideoTrack != null)
            {
                this.RemoteVideoTrack.Dispose();
                this.RemoteVideoTrack = null;
            }
            this.userMedia = null;
            this.media.Dispose();
            this.media = null;
        }
    }
    void CreateLocalMediaStreamSource(object track, string type, string id)
    {
        Plugin.CreateLocalMediaPlayback();
        IntPtr playbackTexture = IntPtr.Zero;
        Plugin.GetLocalPrimaryTexture(
            this.textureDetails.Details.LocalTextureWidth, 
            this.textureDetails.Details.LocalTextureHeight, 
            out playbackTexture);

        this.textureDetails.Details.LocalTexture.GetComponent<Renderer>().sharedMaterial.mainTexture = 
            (Texture)Texture2D.CreateExternalTexture(
                (int)this.textureDetails.Details.LocalTextureWidth, 
                (int)this.textureDetails.Details.LocalTextureHeight, 
                (TextureFormat)14, false, false, playbackTexture);

#if ENABLE_WINMD_SUPPORT
        Plugin.LoadLocalMediaStreamSource(
            (MediaStreamSource)Org.WebRtc.Media.CreateMedia().CreateMediaStreamSource((MediaVideoTrack)track, type, id));
#endif
        Plugin.LocalPlay();
    }

    void DestroyLocalMediaStreamSource()
    {
        this.textureDetails.Details.LocalTexture.GetComponent<Renderer>().sharedMaterial.mainTexture = null;
        Plugin.ReleaseLocalMediaPlayback();
    }

    void CreateRemoteMediaStreamSource(object track, string type, string id)
    {
        Plugin.CreateRemoteMediaPlayback();

        IntPtr playbackTexture = IntPtr.Zero;

        Plugin.GetRemotePrimaryTexture(
            this.textureDetails.Details.RemoteTextureWidth, 
            this.textureDetails.Details.RemoteTextureHeight, 
            out playbackTexture);

        // NB: creating textures and calling GetComponent<> has thread affinity for Unity
        // in so far as I can tell.
        var texture = (Texture)Texture2D.CreateExternalTexture(
           (int)this.textureDetails.Details.RemoteTextureWidth,
           (int)this.textureDetails.Details.RemoteTextureHeight,
           (TextureFormat)14, false, false, playbackTexture);

        this.textureDetails.Details.RemoteTexture.GetComponent<Renderer>().sharedMaterial.mainTexture = texture;

#if ENABLE_WINMD_SUPPORT
        Plugin.LoadRemoteMediaStreamSource(
            (MediaStreamSource)Org.WebRtc.Media.CreateMedia().CreateMediaStreamSource((MediaVideoTrack)track, type, id));
#endif
        Plugin.RemotePlay();
    }

    void DestroyRemoteMediaStreamSource()
    {
        this.textureDetails.Details.RemoteTexture.GetComponent<Renderer>().sharedMaterial.mainTexture = null;
        Plugin.ReleaseRemoteMediaPlayback();
    }
    Media media;
    MediaStream userMedia;
    MediaVideoTrack remoteVideoTrack;
    MediaVideoTrack localVideoTrack;
    ITextureDetailsProvider textureDetails;

    // TODO: temporary hard coding...
    static readonly string LOCAL_VIDEO_FRAME_FORMAT = "I420";
    static readonly string REMOTE_VIDEO_FRAME_FORMAT = "H264";
}
#endif

Naturally, this is very “rough” code right now and there’s some hard-coding going on in there but it didn’t take too much effort to plug these pieces under that interface that I’d brought across from my original, minimal XAML-based project.

So…with all of that said…

Does It Work?

Sort of! Firstly, you might notice in the code above that audio is hard-coded to be switched off because we currently have a crash if we switch audio on – it's the release of some smart pointer in the webRTC pieces that we haven't yet tracked down.

Minus audio, it’s possible to run the Unity app here on HoloLens and have it connect via the sample-provided signalling service to the original XAML-based PeerCC sample running (e.g.) on my Surface Book and video streams flow and are visible in both directions.

Here's a screenshot of that "in action" from the point of view of the desktop app receiving the video stream from the HoloLens;

image

and that screenshot shows 4 things;

  • Bottom right is the local PC’s video stream off its webcam – me wearing a HoloLens.
  • Upper left 75% is the remote stream coming from the webcam on the HoloLens, including its holographic content, which currently includes;
    • Upper left mid section is the remote video stream from the PC replayed on the HoloLens.
    • Upper right mid section is the local HoloLens video stream replayed on the HoloLens, which seems to have disappeared while I was taking this screenshot.

You might see some numbers in there that suggest 30fps but I think that was a temporary thing; at the time of writing the performance so far is fairly bad, but we've not yet had a look at what's going on there – this 'play' sample needs some more investigation.

Where’s the Code?

If you’re interested in following these experiments along as we go forward then the code is in a different location to the previous repo as it’s over here on Pete’s github account;

https://github.com/peted70/web-rtc-test

Feel free to feed back but, of course, apply the massive caveat that this is very rough experimentation at the moment – there's a long way to go!

Rough Notes on UWP and webRTC (Part 3)

This is a follow-on from my previous post around taking small steps with webRTC and UWP.

At the end of that post, I had some scrappy code which was fairly fixed in function in that it was a small UWP app which would use the UWP webRTC library to connect to a signalling service and then could begin a conversation with a peer that was also connected to the same signalling service.

The signalling service in question had to be the one provided with the UWP webRTC bits and the easiest way to test that my app was doing something was to run it against the PeerCC sample which also ships with the UWP webRTC bits and does way more than my app does by demonstrating lots of functionality that’s present in UWP webRTC.

The links to all the webRTC pieces that I’m referring to are in the previous 2 posts on this topic.

Tidying Up

The code that I had in the signalling branch of this github repo at the end of the previous post was quite messy and not really in a position to be re-used and so I spent a little time just pulling that code apart, refactoring some of the functionality behind interfaces and reducing the implicit dependencies in order to try and move the code towards being a little bit more re-usable (even if the functionality it currently implements isn’t of much actual use to a real user – I’m just experimenting).

What I was trying to move towards was some code that I knew sort of worked in this XAML based UWP app that I could then lift out of the app and re-use in a non-XAML based UWP app (i.e. a Unity app) so that I would have some control over the knowns and unknowns in trying out that process.

What I needed to do then was make sure that in refactoring things, I ended up with code that was clearly abstracted from its dependencies on anything in the XAML layer.

Firstly, I refactored the solution into two projects – a class library and an app project which references it;

image

and then I took some of the pieces of functionality that I had in there and abstracted it out into a set of interfaces;

image

with a view to making the dependencies between these interfaces explicit and the implementation pluggable.

This included putting the code which provides signalling (by invoking the signalling service supplied with the original sample) behind an interface. Note that I'm not at all trying to come up with a generic interface that could represent the notion of signalling in webRTC generally; instead, I'm just trying to put an interface onto the existing signalling code that I took (almost) entirely from the PeerCC sample project in the UWP webRTC bits.

image
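I won't paste the whole interface here but, to give a flavour, a signalling interface wrapped around that sample Signaller ends up being roughly the shape of the sketch below – the member names, delegate types and return types here are illustrative assumptions rather than an exact copy of what's in the repo;

using System;
using System.Threading.Tasks;

public interface ISignallingService
{
    // Raised when the local peer has signed in to the signalling server.
    event Action OnSignedIn;

    // Raised as remote peers appear on/disappear from the signalling server.
    event Action<int, string> OnPeerConnected;
    event Action<int> OnPeerDisconnected;

    // Raised when a remote peer sends us a message (SDP offers/answers, ICE candidates).
    event Action<int, string> OnMessageFromPeer;

    Task<bool> ConnectAsync(string ipAddress, string port, string localPeerName);
    Task<bool> SendToPeerAsync(int peerId, string message);
    Task SignOutAsync();
}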

The other interfaces/services that I added here are hopefully named 'reasonably well' in terms of the functionality that they represent, with perhaps the one that's not quite so obvious being the IConversationManager.

This interface is just my attempt to codify the minimum functionality that I need in order to bring the other interface implementations together and get any kind of conversation over webRTC up and running from my little sample app as it stands. That IConversationManager interface right now just looks as below;

image

and so the idea here is that a consumer of an IConversationManager can simply;

  • Tell the manager whether it is meant to initiate conversations or simply wait for a remote peer to begin a conversation with it
    • In terms of initiating conversations – the code is 'aggressive' in that it simply finds the first peer that it sees provided by the signalling service and attempts to begin a conversation with it.
  • Call InitialiseAsync providing the name that the local peer wants to be represented by.
  • Call ConnectToSignallingAsync with the IP Address and port where the signalling service is to be found.

From there, the implementation jumps in and tries to bring together all the right pieces to get a conversation flowing.
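Pieced together from that description (and from the MainPage code further down), the shape of the interface comes out roughly as below – the exact parameter and return types are my assumptions;

using System.Threading.Tasks;

public interface IConversationManager
{
    // Whether this peer aggressively initiates a conversation with the first
    // peer it sees on the signalling service, or just waits to be called.
    bool IsInitiator { get; set; }

    // Provide the name that the local peer will be represented by.
    Task InitialiseAsync(string localPeerName);

    // Connect to the signalling service at the given address/port and, if we're
    // the initiator, try to get a conversation going; returns whether we connected.
    Task<bool> ConnectToSignallingAsync(string ipAddress, string port);
}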

In making these abstractions, I found two places where I had to apply a little bit of thought and that was where;

  • The UWP webRTC pieces need initialising with a Dispatcher object and so I abstracted that out into an interface so that an implementation can be injected into the underlying layer.
  • There is a need at some point to do some work with UI objects to represent media streams. In the code to date, this has meant working with XAML MediaElements but in other scenarios (e.g. Unity UI) that wouldn’t work.

In order to try and abstract the library code from these media pieces, I made an IMediaManager interface with the intention of writing a different implementation for each UI layer. So, to bring this library up inside of a Unity app, I'd at least need to provide a Unity version of the highlighted implementation pieces below, which are about IMediaManager in a XAML UI world;

image
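To give an idea of its shape, the members that my implementations of IMediaManager end up providing look roughly like the sketch below – treat this as an approximation inferred from the implementation code rather than a verbatim copy of the interface;

using System.Threading.Tasks;
using Org.WebRtc;

public interface IMediaManager
{
    Media Media { get; }
    MediaStream UserMedia { get; }
    MediaVideoTrack RemoteVideoTrack { get; set; }

    // Create the underlying Media object and ask for the local user media
    // (camera/microphone) subject to the audio/video flags.
    Task CreateAsync(bool audioEnabled = true, bool videoEnabled = true);

    // Hand local/remote streams over to whatever UI layer (XAML MediaElement,
    // Unity textures, ...) the implementation knows how to render with.
    Task AddLocalStreamAsync(MediaStream stream);
    Task AddRemoteStreamAsync(MediaStream stream);

    void RemoveLocalStream();
    void RemoveRemoteStream();

    void Shutdown();
}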

My main project took a dependency on autofac to provide a container from which to serve up the implementations of my interfaces, and I did a cheap trick of providing my own 'container' embedded into the library, named CheapContainer, in case the library ends up being used in a situation where autofac or some other IoC container isn't immediately available.
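That 'cheap container' really is cheap – conceptually it's little more than the sketch below (the real class in the repo may differ in detail): you register an interface against an implementation type and it hands back a singleton instance when asked to resolve;

using System;
using System.Collections.Generic;

public static class CheapContainer
{
    static readonly Dictionary<Type, Type> registrations = new Dictionary<Type, Type>();
    static readonly Dictionary<Type, object> instances = new Dictionary<Type, object>();

    public static void Register<TInterface, TImplementation>()
        where TImplementation : TInterface, new()
    {
        registrations[typeof(TInterface)] = typeof(TImplementation);
    }

    public static TInterface Resolve<TInterface>()
    {
        object instance;

        if (!instances.TryGetValue(typeof(TInterface), out instance))
        {
            // Single-instance semantics, mirroring the autofac registrations below.
            instance = Activator.CreateInstance(registrations[typeof(TInterface)]);
            instances[typeof(TInterface)] = instance;
        }
        return ((TInterface)instance);
    }
}

Resolution is then just a call like CheapContainer.Resolve<IConversationManager>() from wherever an implementation is needed.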

Configuration of the container then moves into my App.xaml.cs file and is fairly simple and I wrote it twice, once for autofac and once using my own CheapContainer;

#if !USE_CHEAP_CONTAINER
        Autofac.IContainer Container
        {
            get
            {
                if (this.iocContainer == null)
                {
                    this.BuildContainer();
                }
                return (this.iocContainer);
            }
        }
#endif
        void BuildContainer()
        {
#if USE_CHEAP_CONTAINER
            CheapContainer.Register<ISignallingService, Signaller>();
            CheapContainer.Register<IDispatcherProvider, XamlMediaElementProvider>();
            CheapContainer.Register<IXamlMediaElementProvider, XamlMediaElementProvider>();
            CheapContainer.Register<IMediaManager, XamlMediaElementMediaManager>();
            CheapContainer.Register<IPeerManager, PeerManager>();
            CheapContainer.Register<IConversationManager, ConversationManager>();
#else
            var builder = new ContainerBuilder();
            builder.RegisterType<Signaller>().As<ISignallingService>().SingleInstance();

            builder.RegisterType<XamlMediaElementProvider>().As<IXamlMediaElementProvider>().As<IDispatcherProvider>().SingleInstance();

            builder.RegisterType<XamlMediaElementMediaManager>().As<IMediaManager>().SingleInstance();
            builder.RegisterType<PeerManager>().As<IPeerManager>().SingleInstance();
            builder.RegisterType<ConversationManager>().As<IConversationManager>().SingleInstance();
            builder.RegisterType<MainPage>().AsSelf().SingleInstance();
            this.iocContainer = builder.Build();
#endif
        }
#if USE_CHEAP_CONTAINER
#else
        Autofac.IContainer iocContainer;
#endif

and the code which now lives inside of my MainPage.xaml.cs file involved in actually getting the webRTC conversation up and running is reduced down to almost nothing;

        async void OnConnectToSignallingAsync()
        {
            await this.conversationManager.InitialiseAsync(this.addressDetails.HostName);

            this.conversationManager.IsInitiator = this.isInitiator;

            this.HasConnected = await this.conversationManager.ConnectToSignallingAsync(
                this.addressDetails.IPAddress, this.addressDetails.Port);
        }

and so that seems a lot simpler, neater and more re-usable than what I’d had at the end of the previous blog post.

In subsequent posts, I’m going to see if I can now re-use this library inside of other environments (e.g. Unity) so as to bring this same (very limited) webRTC functionality that I’ve been playing with to that environment.

First Experiment with Image Classification on Windows ML from UWP

There are a broad set of scenarios that are enabled by making calls into the intelligent cloud-services offered by Cognitive Services around vision, speech, knowledge, search and language.

I’ve written quite a lot about those services in the past on this blog and I showed them at events and in use on the Context show that I used to make for Channel 9 around ever more personal and contextual computing.

In the show, we often talked about what could be done in the cloud alongside what might be done locally on a device and specifically we looked at UWP (i.e. on device) support for speech and facial detection and we dug into using depth cameras and runtimes like the Intel RealSense cameras and Kinect sensors for face, hand and body tracking. Some of those ‘special camera’ capabilities have most recently been surfaced again by Project Gesture (formerly ‘Prague’) and I’ve written about some of that too.

I’m interested in these types of technologies and, against that backdrop, I was very excited to see the announcement of the;

AI Platform for Windows Developers

which brings to the UWP the capability to run pre-trained learning models inside an app running on Windows devices including (as the blog post that I referenced says) on HoloLens and IoT where you can think of a tonne of potential scenarios. I’m particularly keen to think about this on HoloLens where the device is making decisions around the user’s context in near-real-time and so being able to make low-latency calls for intelligence is likely to be very powerful.

The announcement was also quite timely for me as recently I'd got a bit frustrated around the UWP's lack of support for this type of workload – a little background…

Recent UK University Hacks on Cognitive Services

I took part in a couple of hack events at UK universities in the past couple of months that were themed around cognitive services and had a great time watching and helping students hack on the services and especially the vision services.

As part of preparing for those hacks, I made use of the “Custom Vision” service for image classification for the first time;

image

and I found it to be a really accessible service to make use of. I very quickly managed to build an image classification model which I trained over a number of iterations to differentiate between pictures containing dachshund dogs, ponies or some other type of dog, although I didn't train on too many non-dachshunds and so the model is a little weak in that area.

Here’s the portal where I have my dachshund recognition project going on;

image

and it works really well. I found it very easy to put together and you could build your own classifier by following the tutorial here;

Overview of building a classifier with Custom Vision

and as part of the hack I watched a lot of participating students make use of the Custom Vision service, realise that they wanted this functionality available on their device rather than just in the cloud, and then follow the guidance here;

Export your model to mobile

to take the model that had been produced and export it so that they could make use of it locally inside apps running on Android or iOS via the export function;

image

and my frustration in looking at these pieces was that I had absolutely no idea how I would export one of these models and use it within an app running on the Universal Windows Platform.

Naturally, it's easy to understand why iOS and Android support was added here, but I was really pleased to see that announcement around Windows ML and I thought that I'd try it out by taking my existing dachshund classification model, built and trained in the cloud, and seeing if I could run it against a video stream inside of a Windows 10 UWP app.

Towards that end, I produced a new iteration of my model trained on the “General (compact) domain” so that it could be exported;

image

and then I used the “Export” menu to save it to my desktop in CoreML format named dachshund.mlmodel.

Checking out the Docs

I had a good look through the documentation around Windows Machine Learning here;

Machine Learning

and set about trying to get the right bits of software together to see if I could make an app.

Getting the Right Bits

Operating System and SDK

At the time of writing, I'm running Windows 10 Fall Creators Update (16299) as my main operating system and support for these new Windows ML capabilities is coming in the next update, which is in preview right now.

Consequently, I had some work to do to get the right OS and SDKs;

  • I moved a machine to the Windows Insider Preview 17115.1 via Windows Update
  • I grabbed the Windows 10 SDK 17110 Preview from the Insiders site.

Python and Machine Learning

I didn't have a Python installation on the machine in question so I went and grabbed Python 2.7 from https://www.python.org/. I did initially try 3.6 but had some problems with scripts on that and, as a non-Python person, I came to the conclusion that the best plan might be to try 2.7, which did seem to work for me.

I knew that I needed to convert my model from CoreML to ONNX and so I followed the document here;

Convert a model

to set about that process and that involved doing some pip installs – specifically, I ended up running;

pip install coremltools

pip install onnxmltools

pip install winmltools

and that seemed to give me all that I needed to try and convert my model.

Converting the Model to ONNX

Just as the docs described, I ended up running these commands to do that conversion in a python environment;


from coremltools.models.utils import load_spec
from winmltools import convert_coreml

from winmltools.utils import save_model

model_coreml = load_spec('c:\users\mtaul\desktop\dachshunds.mlmodel')

model_onnx = convert_coreml(model_coreml)

save_model(model_onnx, 'c:\users\mtaul\desktop\dachshunds.onnx')

and that all seemed to work quite nicely. I also took a look at my original model_coreml.description which gave me;

input {
   name: "data"
   type {
     imageType {
       width: 227
       height: 227
       colorSpace: BGR
     }
   }
}
output {
   name: "loss"
   type {
     dictionaryType {
       stringKeyType {
       }
     }
   }
}
output {
   name: "classLabel"
   type {
     stringType {
     }
   }
}
predictedFeatureName: "classLabel"
predictedProbabilitiesName: "loss"

which seemed reasonable, but I'm not really qualified to know whether it was exactly right or not – the mechanics of these models are a bit beyond me at the time of writing.

Having converted my model, though, I thought that I’d see if I could write some code against it.

Generating Code

I’d read about a code generation step in the document here;

Automatic Code Generation

and so I tried to use the mlgen tool on my .onnx model to generate some code. This was pretty easy and I just ran the command line;

"c:\Program Files (x86)\Windows Kits\10\bin\10.0.17110.0\x86\mlgen.exe" -i dachshunds.onnx -l CS -n "dachshunds.model" -o dachshunds.cs

and it spat out some C# code (it also does CPPCX) which is fairly short and which you could fairly easily construct yourself by looking at the types in the Windows.AI.MachineLearning.Preview namespace.

The C# code contained some machine generated names and so I replaced those and this is the code which I ended up with;

namespace daschunds.model
{
    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Windows.AI.MachineLearning.Preview;
    using Windows.Media;
    using Windows.Storage;

    // MIKET: I renamed the auto generated long number class names to be 'Daschund'
    // to make it easier for me as a human to deal with them 🙂
    public sealed class DacshundModelInput
    {
        public VideoFrame data { get; set; }
    }

    public sealed class DacshundModelOutput
    {
        public IList<string> classLabel { get; set; }
        public IDictionary<string, float> loss { get; set; }
        public DacshundModelOutput()
        {
            this.classLabel = new List<string>();
            this.loss = new Dictionary<string, float>();

            // MIKET: I added these 3 lines of code here after spending *quite some time* 🙂
            // Trying to debug why I was getting a binding excption at the point in the
            // code below where the call to LearningModelBindingPreview.Bind is called
            // with the parameters ("loss", output.loss) where output.loss would be
            // an empty Dictionary<string,float>.
            //
            // The exception would be 
            // "The binding is incomplete or does not match the input/output description. (Exception from HRESULT: 0x88900002)"
            // And I couldn't find symbols for Windows.AI.MachineLearning.Preview to debug it.
            // So...this could be wrong but it works for me and the 3 values here correspond
            // to the 3 classifications that my classifier produces.
            //
            this.loss.Add("daschund", float.NaN);
            this.loss.Add("dog", float.NaN);
            this.loss.Add("pony", float.NaN);
        }
    }

    public sealed class DacshundModel
    {
        private LearningModelPreview learningModel;
        public static async Task<DacshundModel> CreateDaschundModel(StorageFile file)
        {
            LearningModelPreview learningModel = await LearningModelPreview.LoadModelFromStorageFileAsync(file);
            DacshundModel model = new DacshundModel();
            model.learningModel = learningModel;
            return model;
        }
        public async Task<DacshundModelOutput> EvaluateAsync(DacshundModelInput input) {
            DacshundModelOutput output = new DacshundModelOutput();
            LearningModelBindingPreview binding = new LearningModelBindingPreview(learningModel);

            binding.Bind("data", input.data);
            binding.Bind("classLabel", output.classLabel);

            // MIKET: this generated line caused me trouble. See MIKET comment above.
            binding.Bind("loss", output.loss);

            LearningModelEvaluationResultPreview evalResult = await learningModel.EvaluateAsync(binding, string.Empty);
            return output;
        }
    }
}

There's a big comment in that code where I changed what had been generated for me. In short, I found that my model seems to take an input parameter of type VideoFrame and to produce output parameters of two 'shapes';

  • List<string> called “classLabel”
  • Dictionary<string,float> called “loss”

I spent quite a bit of time debugging an exception that I got by passing an empty Dictionary<string,float> as the variable called “loss” as I would see an exception thrown from the call to LearningModelBindingPreview.Bind() saying that the “binding is incomplete”.

It took a while but I finally figured out that I was supposed to pass a Dictionary<string,float> with some entries already in it, and you'll notice in the code above that I pass in 3 floats which I think relate to the 3 tags that my model can categorise against – namely dachshunds, dogs and ponies. I'm not at all sure that this is 100% right but it got me past that exception so I went with it.

With that, I had some generated code that I thought I could build into an app.

Making a ‘Hello World’ App

I made a very simple UWP app targeting SDK 17110 and made a UI which had a few TextBlocks and a CaptureElement within it.

<Page
    x:Class="App1.MainPage"
    xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
    xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
    xmlns:local="using:App1"
    xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
    mc:Ignorable="d">

    <Grid Background="{ThemeResource ApplicationPageBackgroundThemeBrush}">
        <CaptureElement x:Name="captureElement"/>
        <StackPanel HorizontalAlignment="Center" VerticalAlignment="Bottom">
            <StackPanel.Resources>
                <Style TargetType="TextBlock">
                    <Setter Property="Foreground" Value="White"/>
                    <Setter Property="FontSize" Value="18"/>
                    <Setter Property="Margin" Value="5"/>
                </Style>
            </StackPanel.Resources>
            <TextBlock Text="Category " HorizontalTextAlignment="Center"><Run Text="{x:Bind Category,Mode=OneWay}"/></TextBlock>
            <StackPanel Orientation="Horizontal" HorizontalAlignment="Center">
                <TextBlock Text="Dacshund "><Run Text="{x:Bind Dacshund,Mode=OneWay}"/></TextBlock>
                <TextBlock Text="Dog "><Run Text="{x:Bind Dog,Mode=OneWay}"/></TextBlock>
                <TextBlock Text="Pony "><Run Text="{x:Bind Pony,Mode=OneWay}"/></TextBlock>
            </StackPanel>
        </StackPanel>
    </Grid>
</Page>

and then I wrote some code which would get hold of a camera on the device (I went for the back panel camera), wire it up to the CaptureElement in the UI and also to make use of a MediaFrameReader to get preview video frames off the camera which I’m hoping to run through the classification model.

That code is here – there’s some discussion to come in a moment about the RESIZE constant;

//#define RESIZE
namespace App1
{
    using daschunds.model;
    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Linq;
    using System.Runtime.InteropServices;
    using System.Threading;
    using System.Threading.Tasks;
    using Windows.Devices.Enumeration;
    using Windows.Graphics.Imaging;
    using Windows.Media.Capture;
    using Windows.Media.Capture.Frames;
    using Windows.Media.Devices;
    using Windows.Storage;
    using Windows.Storage.Streams;
    using Windows.UI.Xaml;
    using Windows.UI.Xaml.Controls;
    using Windows.UI.Xaml.Media.Imaging;
    using Windows.Media;
    using System.ComponentModel;
    using System.Runtime.CompilerServices;
    using Windows.UI.Core;
    using System.Runtime.InteropServices.WindowsRuntime;

    public sealed partial class MainPage : Page, INotifyPropertyChanged
    {
        public event PropertyChangedEventHandler PropertyChanged;

        public MainPage()
        {
            this.InitializeComponent();
            this.inputData = new DacshundModelInput();
            this.Loaded += OnLoaded;
        }
        public string Dog
        {
            get => this.dog;
            set => this.SetProperty(ref this.dog, value);
        }
        string dog;
        public string Pony
        {
            get => this.pony;
            set => this.SetProperty(ref this.pony, value);
        }
        string pony;
        public string Dacshund
        {
            get => this.daschund;
            set => this.SetProperty(ref this.daschund, value);
        }
        string daschund;
        public string Category
        {
            get => this.category;
            set => this.SetProperty(ref this.category, value);
        }
        string category;
        async Task LoadModelAsync()
        {
            var file = await StorageFile.GetFileFromApplicationUriAsync(
                new Uri("ms-appx:///Model/daschunds.onnx"));

            this.learningModel = await DacshundModel.CreateDaschundModel(file);
        }
        async Task<DeviceInformation> GetFirstBackPanelVideoCaptureAsync()
        {
            var devices = await DeviceInformation.FindAllAsync(
                DeviceClass.VideoCapture);

            var device = devices.FirstOrDefault(
                d => d.EnclosureLocation.Panel == Windows.Devices.Enumeration.Panel.Back);

            return (device);
        }
        async void OnLoaded(object sender, RoutedEventArgs e)
        {
            await this.LoadModelAsync();

            var device = await this.GetFirstBackPanelVideoCaptureAsync();

            if (device != null)
            {
                await this.CreateMediaCaptureAsync(device);
                await this.mediaCapture.StartPreviewAsync();

                await this.CreateMediaFrameReaderAsync();
                await this.frameReader.StartAsync();
            }
        }

        async Task CreateMediaFrameReaderAsync()
        {
            var frameSource = this.mediaCapture.FrameSources.Where(
                source => source.Value.Info.SourceKind == MediaFrameSourceKind.Color).First();

            this.frameReader =
                await this.mediaCapture.CreateFrameReaderAsync(frameSource.Value);

            this.frameReader.FrameArrived += OnFrameArrived;
        }

        async Task CreateMediaCaptureAsync(DeviceInformation device)
        {
            this.mediaCapture = new MediaCapture();

            await this.mediaCapture.InitializeAsync(
                new MediaCaptureInitializationSettings()
                {
                    VideoDeviceId = device.Id
                }
            );
            // Try and set auto focus but on the Surface Pro 3 I'm running on, this
            // won't work.
            if (this.mediaCapture.VideoDeviceController.FocusControl.Supported)
            {
                await this.mediaCapture.VideoDeviceController.FocusControl.SetPresetAsync(FocusPreset.AutoNormal);
            }
            else
            {
                // Nor this.
                this.mediaCapture.VideoDeviceController.Focus.TrySetAuto(true);
            }
            this.captureElement.Source = this.mediaCapture;
        }

        async void OnFrameArrived(MediaFrameReader sender, MediaFrameArrivedEventArgs args)
        {
            if (Interlocked.CompareExchange(ref this.processingFlag, 1, 0) == 0)
            {
                try
                {
                    using (var frame = sender.TryAcquireLatestFrame())
                    using (var videoFrame = frame.VideoMediaFrame?.GetVideoFrame())
                    {
                        if (videoFrame != null)
                        {
                            // From the description (both visible in Python and through the
                            // properties of the model that I can interrogate with code at
                            // runtime here) my image seems to be 227 by 227 which is an 
                            // odd size but I'm assuming that I should resize the frame here to 
                            // suit that. I'm also assuming that what I'm doing here is 
                            // expensive 

#if RESIZE
                            using (var resizedBitmap = await ResizeVideoFrame(videoFrame, IMAGE_SIZE, IMAGE_SIZE))
                            using (var resizedFrame = VideoFrame.CreateWithSoftwareBitmap(resizedBitmap))
                            {
                                this.inputData.data = resizedFrame;
#else       
                                this.inputData.data = videoFrame;
#endif // RESIZE

                                var evalOutput = await this.learningModel.EvaluateAsync(this.inputData);

                                await this.ProcessOutputAsync(evalOutput);

#if RESIZE
                            }
#endif // RESIZE
                        }
                    }
                }
                finally
                {
                    Interlocked.Exchange(ref this.processingFlag, 0);
                }
            }
        }
        string BuildOutputString(DacshundModelOutput evalOutput, string key)
        {
            var result = "no";

            if (evalOutput.loss[key] > 0.25f)
            {
                result = $"{evalOutput.loss[key]:N2}";
            }
            return (result);
        }
        async Task ProcessOutputAsync(DacshundModelOutput evalOutput)
        {
            string category = evalOutput.classLabel.FirstOrDefault() ?? "none";
            string dog = $"{BuildOutputString(evalOutput, "dog")}";
            string pony = $"{BuildOutputString(evalOutput, "pony")}";
            string dacshund = $"{BuildOutputString(evalOutput, "daschund")}";

            await this.Dispatcher.RunAsync(CoreDispatcherPriority.Normal,
                () =>
                {
                    this.Dog = dog;
                    this.Pony = pony;
                    this.Dacshund = dacshund;
                    this.Category = category;
                }
            );
        }

        /// <summary>
        /// This is horrible - I am trying to resize a VideoFrame and I haven't yet
        /// found a good way to do it so this function goes through a tonne of
        /// stuff to try and resize it but it's not pleasant at all.
        /// </summary>
        /// <param name="frame"></param>
        /// <param name="width"></param>
        /// <param name="height"></param>
        /// <returns></returns>
        async static Task<SoftwareBitmap> ResizeVideoFrame(VideoFrame frame, int width, int height)
        {
            SoftwareBitmap bitmapFromFrame = null;
            bool ownsFrame = false;

            if (frame.Direct3DSurface != null)
            {
                bitmapFromFrame = await SoftwareBitmap.CreateCopyFromSurfaceAsync(
                    frame.Direct3DSurface,
                    BitmapAlphaMode.Ignore);

                ownsFrame = true;
            }
            else if (frame.SoftwareBitmap != null)
            {
                bitmapFromFrame = frame.SoftwareBitmap;
            }

            // We now need it in a pixel format that an encoder is happy with
            var encoderBitmap = SoftwareBitmap.Convert(
                bitmapFromFrame, BitmapPixelFormat.Bgra8, BitmapAlphaMode.Premultiplied);

            if (ownsFrame)
            {
                bitmapFromFrame.Dispose();
            }

            // We now need an encoder, should we keep creating it?
            var memoryStream = new MemoryStream();

            var encoder = await BitmapEncoder.CreateAsync(
                BitmapEncoder.JpegEncoderId, memoryStream.AsRandomAccessStream());

            encoder.SetSoftwareBitmap(encoderBitmap);
            encoder.BitmapTransform.ScaledWidth = (uint)width;
            encoder.BitmapTransform.ScaledHeight = (uint)height;

            await encoder.FlushAsync();

            var decoder = await BitmapDecoder.CreateAsync(memoryStream.AsRandomAccessStream());

            var resizedBitmap = await decoder.GetSoftwareBitmapAsync(
                BitmapPixelFormat.Bgra8, BitmapAlphaMode.Premultiplied);

            memoryStream.Dispose();

            encoderBitmap.Dispose();

            return (resizedBitmap);
        }
        void SetProperty<T>(ref T storage, T value, [CallerMemberName] string propertyName = null)
        {
            storage = value;
            this.PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(propertyName));
        }
        DacshundModelInput inputData;
        int processingFlag;
        MediaFrameReader frameReader;
        MediaCapture mediaCapture;
        DacshundModel learningModel;

        static readonly int IMAGE_SIZE = 227;
    }
}

In doing that, the main thing that I was unclear about was whether I needed to resize the VideoFrames to fit with my model or whether I could leave them alone and have the code in between me and the model "do the right thing" with the VideoFrame.

Partly, that confusion comes from my model's description seeming to say that it was expecting frames at a resolution of 227 x 227 in BGR format, and that feels like a very odd resolution to me.

Additionally, I found that trying to resize a VideoFrame seemed to be a bit of a painful task and I didn’t find a better way than going through a SoftwareBitmap with a BitmapEncoder, BitmapDecoder and a BitmapTransform.

The code that I ended up with got fairly ugly and I was never quite sure whether I needed it or not and so, for the moment, I conditionally compiled that code into my little test app so that I can switch between two modes of;

  • Pass the VideoFrame untouched to the underlying evaluation layer
  • Attempt to resize the VideoFrame to 227 x 227 before passing it to the underlying evaluation layer.

I’ve a feeling that it’s ok to leave the VideoFrame untouched but I’m about 20% sure on that at the time of writing and the follow on piece here assumes that I’m running with that version of the code.

Does It Work?

How does the app work out? I'm not yet sure, and there are a couple of things where I'm uncertain.

  • I’m running on a Surface Pro 3 where the camera has a fixed focus and it doesn’t do a great job of focusing on my images (given that I’ve no UI to control the focus) and so it’s hard to tell at times how good an image the camera is getting. I’ve tried it with both the front and back cameras on that device but I don’t see too much difference.
  • I’m unsure of whether the way in which I’m passing the VideoFrame to the model is right or not.

But I did run the app and presented it with 3 pictures – one of a dachshund, one of an alsatian (which it should understand is a dog but not a dachshund) and one of a pony.

Here’s some examples showing the sort of output that the app displayed;

dacs

I’m not sure about the category of ‘dog’ here but the app seems fairly confident that this is both a dog and a dachshund so that seems good to me.

Here’s another (the model has been trained on alsatian images to some extent);

als

and so that seems like a good result and then I held up my phone to the video stream displaying an image of a pony;

pony

and so that seems to work reasonably well. That's with the code which does not resize the image down to 227×227 – I found that the code which did resize didn't seem to work the same way, so maybe my notion of resizing (or the actual code which does the resizing) isn't right.

Wrapping Up

First impressions here are very good in that I managed to get something working in a very short time.

Naturally, it’d be interesting to try and build a better understanding around the binding of parameters and I’d also be interested to try this out with a camera that was doing a better job of focusing.

It'd also be interesting to point the camera at real-world objects rather than at 2D pictures of those objects, and so perhaps I need to build a model that classifies something a little more 'household' than dogs and ponies to make it easier to test without going out into a field.

I’d also like to try some of this out on other types of devices including HoloLens as/when that becomes possible.

Code

If you want the code that I put together here then it’s in this github repo;

https://github.com/mtaulty/WindowsMLExperiment

Keep in mind that this is just my first experiment and I'm muddling my way through – it looks like the code conditionally compiled out with the RESIZE constant can be ignored unless I hear otherwise, and I'll update the post if I do.

Lastly, you've probably noticed many different spellings of the word dachshund in the code and in the blog post – I should have stuck with poodles!