First Experiment with Image Classification on Windows ML from UWP

There is a broad set of scenarios enabled by making calls into the intelligent cloud services offered by Cognitive Services around vision, speech, knowledge, search and language.

I’ve written quite a lot about those services in the past on this blog and I showed them at events and in use on the Context show that I used to make for Channel 9 around ever more personal and contextual computing.

In the show, we often talked about what could be done in the cloud alongside what might be done locally on a device. Specifically, we looked at UWP (i.e. on-device) support for speech and facial detection, and we dug into using depth cameras and runtimes like the Intel RealSense cameras and Kinect sensors for face, hand and body tracking. Some of those ‘special camera’ capabilities have most recently been surfaced again by Project Gesture (formerly ‘Prague’) and I’ve written about some of that too.

I’m interested in these types of technologies and, against that backdrop, I was very excited to see the announcement of the;

AI Platform for Windows Developers

which brings to the UWP the capability to run pre-trained learning models inside an app running on Windows devices including (as the blog post that I referenced says) on HoloLens and IoT where you can think of a tonne of potential scenarios. I’m particularly keen to think about this on HoloLens where the device is making decisions around the user’s context in near-real-time and so being able to make low-latency calls for intelligence is likely to be very powerful.

The announcement was also quite timely for me as I’d recently got a bit frustrated 😉 by the UWP’s lack of support for this type of workload – a little background…

Recent UK University Hacks on Cognitive Services

I took part in a couple of hack events at UK universities over the past couple of months that were themed around Cognitive Services, and I had a great time watching and helping students hack on the services – especially the vision services.

As part of preparing for those hacks, I made use of the “Custom Vision” service for image classification for the first time;


and I found it to be a really accessible service to make use of. I very quickly managed to build an image classification model, trained over a number of iterations, to differentiate between pictures containing dachshund dogs, ponies or some other type of dog – although I didn’t train on too many non-dachshunds and so the model is a little weak in that area.

Here’s the portal where I have my dachshund recognition project going on;


and it works really well. I found it very easy to put together and you could build your own classifier by following the tutorial here;

Overview of building a classifier with Custom Vision

and as part of the hacks I watched a lot of participating students make use of the Custom Vision service, realise that they wanted this functionality available on their device rather than just in the cloud, and then follow the guidance here;

Export your model to mobile

to take the model that had been produced and export it so that they could make use of it locally inside apps running on Android or iOS via the export function;


and my frustration in looking at these pieces was that I had absolutely no idea how I would export one of these models and use it within an app running on the Universal Windows Platform.

Naturally, it’s easy to understand why iOS and Android support was added here, but I was really pleased to see that announcement around Windows ML 🙂 and I thought that I’d try it out by taking my existing dachshund classification model, built and trained in the cloud, and seeing if I could run it against a video stream inside a Windows 10 UWP app.

Towards that end, I produced a new iteration of my model trained on the “General (compact) domain” so that it could be exported;


and then I used the “Export” menu to save it to my desktop in CoreML format named dachshund.mlmodel.

Checking out the Docs

I had a good look through the documentation around Windows Machine Learning here;

Machine Learning

and set about trying to get the right bits of software together to see if I could make an app.

Getting the Right Bits

Operating System and SDK

At the time of writing, I’m running Windows 10 Fall Creators Update (16299) as my main operating system and support for these new Windows ML capabilities is coming in the next update, which is in preview right now.

Consequently, I had some work to do to get the right OS and SDKs;

  • I moved a machine to the Windows Insider Preview 17115.1 via Windows Update
  • I grabbed the Windows 10 SDK 17110 Preview from the Insiders site.

Python and Machine Learning

I didn’t have a Python installation on the machine in question so I went and grabbed Python 2.7 from https://www.python.org/. I did initially try 3.6 but had some problems with scripts on that and, as a non-Python person, I came to the conclusion that the best plan might be to try 2.7, which did seem to work for me.

I knew that I needed to convert my model from CoreML to ONNX and so I followed the document here;

Convert a model

to set about that process, which involved doing some pip installs; specifically, I ended up running;

pip install coremltools

pip install onnxmltools

pip install winmltools

and that seemed to give me all that I needed to try and convert my model.

Converting the Model to ONNX

Just as the docs described, I ended up running these commands to do that conversion in a Python environment;


from coremltools.models.utils import load_spec
from winmltools import convert_coreml
from winmltools.utils import save_model

model_coreml = load_spec(r'c:\users\mtaul\desktop\dachshunds.mlmodel')

model_onnx = convert_coreml(model_coreml)

save_model(model_onnx, r'c:\users\mtaul\desktop\dachshunds.onnx')

and that all seemed to work quite nicely. I also took a look at my original model_coreml.description which gave me;

input {
  name: "data"
  type {
    imageType {
      width: 227
      height: 227
      colorSpace: BGR
    }
  }
}
output {
  name: "loss"
  type {
    dictionaryType {
      stringKeyType {
      }
    }
  }
}
output {
  name: "classLabel"
  type {
    stringType {
    }
  }
}
predictedFeatureName: "classLabel"
predictedProbabilitiesName: "loss"

which seemed reasonable but I’m not really qualified to know whether it was exactly right or not – the mechanics of these models are a bit beyond me at the time of writing 🙂

Having converted my model, though, I thought that I’d see if I could write some code against it.

Generating Code

I’d read about a code generation step in the document here;

Automatic Code Generation

and so I tried to use the mlgen tool on my .onnx model to generate some code. This was pretty easy and I just ran the command line;

"c:\Program Files (x86)\Windows Kits\10\bin\10.0.17110.0\x86\mlgen.exe" -i dachshunds.onnx -l CS -n "dachshunds.model" -o dachshunds.cs

and it spat out some C# code (it also does CPPCX) which is fairly short and which you could fairly easily construct yourself by looking at the types in the Windows.AI.MachineLearning.Preview namespace.

The C# code contained some machine generated names and so I replaced those and this is the code which I ended up with;

namespace daschunds.model
{
    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Windows.AI.MachineLearning.Preview;
    using Windows.Media;
    using Windows.Storage;

    // MIKET: I renamed the auto generated long number class names to be 'Daschund'
    // to make it easier for me as a human to deal with them 🙂
    public sealed class DacshundModelInput
    {
        public VideoFrame data { get; set; }
    }

    public sealed class DacshundModelOutput
    {
        public IList<string> classLabel { get; set; }
        public IDictionary<string, float> loss { get; set; }
        public DacshundModelOutput()
        {
            this.classLabel = new List<string>();
            this.loss = new Dictionary<string, float>();

            // MIKET: I added these 3 lines of code here after spending *quite some time* 🙂
            // Trying to debug why I was getting a binding exception at the point in the
            // code below where the call to LearningModelBindingPreview.Bind is called
            // with the parameters ("loss", output.loss) where output.loss would be
            // an empty Dictionary<string,float>.
            //
            // The exception would be 
            // "The binding is incomplete or does not match the input/output description. (Exception from HRESULT: 0x88900002)"
            // And I couldn't find symbols for Windows.AI.MachineLearning.Preview to debug it.
            // So...this could be wrong but it works for me and the 3 values here correspond
            // to the 3 classifications that my classifier produces.
            //
            this.loss.Add("daschund", float.NaN);
            this.loss.Add("dog", float.NaN);
            this.loss.Add("pony", float.NaN);
        }
    }

    public sealed class DacshundModel
    {
        private LearningModelPreview learningModel;
        public static async Task<DacshundModel> CreateDaschundModel(StorageFile file)
        {
            LearningModelPreview learningModel = await LearningModelPreview.LoadModelFromStorageFileAsync(file);
            DacshundModel model = new DacshundModel();
            model.learningModel = learningModel;
            return model;
        }
        public async Task<DacshundModelOutput> EvaluateAsync(DacshundModelInput input) {
            DacshundModelOutput output = new DacshundModelOutput();
            LearningModelBindingPreview binding = new LearningModelBindingPreview(learningModel);

            binding.Bind("data", input.data);
            binding.Bind("classLabel", output.classLabel);

            // MIKET: this generated line caused me trouble. See MIKET comment above.
            binding.Bind("loss", output.loss);

            LearningModelEvaluationResultPreview evalResult = await learningModel.EvaluateAsync(binding, string.Empty);
            return output;
        }
    }
}

There’s a big comment 🙂 in that code where I changed what had been generated for me. In short, I found that my model seems to take an input parameter here of type VideoFrame and it seems to produce output parameters of two ‘shapes’;

  • List<string> called “classLabel”
  • Dictionary<string,float> called “loss”

I spent quite a bit of time debugging an exception that I got by passing an empty Dictionary<string,float> as the variable called “loss” as I would see an exception thrown from the call to LearningModelBindingPreview.Bind() saying that the “binding is incomplete”.

It took a while but I finally figured out that I was supposed to pass a Dictionary<string,float> with some entries already in it, and you’ll notice in the code above that I pass in 3 entries which I think relate to the 3 tags that my model can categorise against – namely dachshund, dog and pony. I’m not at all sure that this is 100% right but it got me past that exception so I went with it 🙂
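As an aside, once the evaluation has run, that ‘loss’ dictionary is just a map of tag → confidence, so picking out the model’s best guess is a small piece of LINQ. Here’s a minimal sketch against the generated DacshundModelOutput type above (assuming, as I do elsewhere, that the evaluation really does overwrite those NaN placeholder values);

using System.Linq;
using daschunds.model;

static class DacshundOutputExtensions
{
    // Returns the tag ("daschund", "dog" or "pony" in my model) that the
    // evaluation scored most highly for the current frame.
    public static string TopCategory(this DacshundModelOutput output)
    {
        return (output.loss.OrderByDescending(entry => entry.Value).First().Key);
    }
}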

With that, I had some generated code that I thought I could build into an app.

Making a ‘Hello World’ App

I made a very simple UWP app targeting SDK 17110 and made a UI which had a few TextBlocks and a CaptureElement within it.

<Page
    x:Class="App1.MainPage"
    xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
    xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
    xmlns:local="using:App1"
    xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
    mc:Ignorable="d">

    <Grid Background="{ThemeResource ApplicationPageBackgroundThemeBrush}">
        <CaptureElement x:Name="captureElement"/>
        <StackPanel HorizontalAlignment="Center" VerticalAlignment="Bottom">
            <StackPanel.Resources>
                <Style TargetType="TextBlock">
                    <Setter Property="Foreground" Value="White"/>
                    <Setter Property="FontSize" Value="18"/>
                    <Setter Property="Margin" Value="5"/>
                </Style>
            </StackPanel.Resources>
            <TextBlock Text="Category " HorizontalTextAlignment="Center"><Run Text="{x:Bind Category,Mode=OneWay}"/></TextBlock>
            <StackPanel Orientation="Horizontal" HorizontalAlignment="Center">
                <TextBlock Text="Dacshund "><Run Text="{x:Bind Dacshund,Mode=OneWay}"/></TextBlock>
                <TextBlock Text="Dog "><Run Text="{x:Bind Dog,Mode=OneWay}"/></TextBlock>
                <TextBlock Text="Pony "><Run Text="{x:Bind Pony,Mode=OneWay}"/></TextBlock>
            </StackPanel>
        </StackPanel>
    </Grid>
</Page>

and then I wrote some code which would get hold of a camera on the device (I went for the back panel camera), wire it up to the CaptureElement in the UI and also make use of a MediaFrameReader to get preview video frames off the camera, which I’m hoping to run through the classification model.

That code is here – there’s some discussion to come in a moment about the RESIZE constant;

//#define RESIZE
namespace App1
{
    using daschunds.model;
    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Linq;
    using System.Runtime.InteropServices;
    using System.Threading;
    using System.Threading.Tasks;
    using Windows.Devices.Enumeration;
    using Windows.Graphics.Imaging;
    using Windows.Media.Capture;
    using Windows.Media.Capture.Frames;
    using Windows.Media.Devices;
    using Windows.Storage;
    using Windows.Storage.Streams;
    using Windows.UI.Xaml;
    using Windows.UI.Xaml.Controls;
    using Windows.UI.Xaml.Media.Imaging;
    using Windows.Media;
    using System.ComponentModel;
    using System.Runtime.CompilerServices;
    using Windows.UI.Core;
    using System.Runtime.InteropServices.WindowsRuntime;

    public sealed partial class MainPage : Page, INotifyPropertyChanged
    {
        public event PropertyChangedEventHandler PropertyChanged;

        public MainPage()
        {
            this.InitializeComponent();
            this.inputData = new DacshundModelInput();
            this.Loaded += OnLoaded;
        }
        public string Dog
        {
            get => this.dog;
            set => this.SetProperty(ref this.dog, value);
        }
        string dog;
        public string Pony
        {
            get => this.pony;
            set => this.SetProperty(ref this.pony, value);
        }
        string pony;
        public string Dacshund
        {
            get => this.daschund;
            set => this.SetProperty(ref this.daschund, value);
        }
        string daschund;
        public string Category
        {
            get => this.category;
            set => this.SetProperty(ref this.category, value);
        }
        string category;
        async Task LoadModelAsync()
        {
            var file = await StorageFile.GetFileFromApplicationUriAsync(
                new Uri("ms-appx:///Model/daschunds.onnx"));

            this.learningModel = await DacshundModel.CreateDaschundModel(file);
        }
        async Task<DeviceInformation> GetFirstBackPanelVideoCaptureAsync()
        {
            var devices = await DeviceInformation.FindAllAsync(
                DeviceClass.VideoCapture);

            var device = devices.FirstOrDefault(
                d => d.EnclosureLocation.Panel == Windows.Devices.Enumeration.Panel.Back);

            return (device);
        }
        async void OnLoaded(object sender, RoutedEventArgs e)
        {
            await this.LoadModelAsync();

            var device = await this.GetFirstBackPanelVideoCaptureAsync();

            if (device != null)
            {
                await this.CreateMediaCaptureAsync(device);
                await this.mediaCapture.StartPreviewAsync();

                await this.CreateMediaFrameReaderAsync();
                await this.frameReader.StartAsync();
            }
        }

        async Task CreateMediaFrameReaderAsync()
        {
            var frameSource = this.mediaCapture.FrameSources.Where(
                source => source.Value.Info.SourceKind == MediaFrameSourceKind.Color).First();

            this.frameReader =
                await this.mediaCapture.CreateFrameReaderAsync(frameSource.Value);

            this.frameReader.FrameArrived += OnFrameArrived;
        }

        async Task CreateMediaCaptureAsync(DeviceInformation device)
        {
            this.mediaCapture = new MediaCapture();

            await this.mediaCapture.InitializeAsync(
                new MediaCaptureInitializationSettings()
                {
                    VideoDeviceId = device.Id
                }
            );
            // Try and set auto focus but on the Surface Pro 3 I'm running on, this
            // won't work.
            if (this.mediaCapture.VideoDeviceController.FocusControl.Supported)
            {
                await this.mediaCapture.VideoDeviceController.FocusControl.SetPresetAsync(FocusPreset.AutoNormal);
            }
            else
            {
                // Nor this.
                this.mediaCapture.VideoDeviceController.Focus.TrySetAuto(true);
            }
            this.captureElement.Source = this.mediaCapture;
        }

        async void OnFrameArrived(MediaFrameReader sender, MediaFrameArrivedEventArgs args)
        {
            if (Interlocked.CompareExchange(ref this.processingFlag, 1, 0) == 0)
            {
                try
                {
                    using (var frame = sender.TryAcquireLatestFrame())
                    using (var videoFrame = frame.VideoMediaFrame?.GetVideoFrame())
                    {
                        if (videoFrame != null)
                        {
                            // From the description (both visible in Python and through the
                            // properties of the model that I can interrogate with code at
                            // runtime here) my image seems to be 227 by 227 which is an 
                            // odd size but I'm assuming that I should resize the frame here to 
                            // suit that. I'm also assuming that what I'm doing here is 
                            // expensive 

#if RESIZE
                            using (var resizedBitmap = await ResizeVideoFrame(videoFrame, IMAGE_SIZE, IMAGE_SIZE))
                            using (var resizedFrame = VideoFrame.CreateWithSoftwareBitmap(resizedBitmap))
                            {
                                this.inputData.data = resizedFrame;
#else       
                                this.inputData.data = videoFrame;
#endif // RESIZE

                                var evalOutput = await this.learningModel.EvaluateAsync(this.inputData);

                                await this.ProcessOutputAsync(evalOutput);

#if RESIZE
                            }
#endif // RESIZE
                        }
                    }
                }
                finally
                {
                    Interlocked.Exchange(ref this.processingFlag, 0);
                }
            }
        }
        string BuildOutputString(DacshundModelOutput evalOutput, string key)
        {
            var result = "no";

            if (evalOutput.loss[key] > 0.25f)
            {
                result = $"{evalOutput.loss[key]:N2}";
            }
            return (result);
        }
        async Task ProcessOutputAsync(DacshundModelOutput evalOutput)
        {
            string category = evalOutput.classLabel.FirstOrDefault() ?? "none";
            string dog = $"{BuildOutputString(evalOutput, "dog")}";
            string pony = $"{BuildOutputString(evalOutput, "pony")}";
            string dacshund = $"{BuildOutputString(evalOutput, "daschund")}";

            await this.Dispatcher.RunAsync(CoreDispatcherPriority.Normal,
                () =>
                {
                    this.Dog = dog;
                    this.Pony = pony;
                    this.Dacshund = dacshund;
                    this.Category = category;
                }
            );
        }

        /// <summary>
        /// This is horrible - I am trying to resize a VideoFrame and I haven't yet
        /// found a good way to do it so this function goes through a tonne of
        /// stuff to try and resize it but it's not pleasant at all.
        /// </summary>
        /// <param name="frame"></param>
        /// <param name="width"></param>
        /// <param name="height"></param>
        /// <returns></returns>
        async static Task<SoftwareBitmap> ResizeVideoFrame(VideoFrame frame, int width, int height)
        {
            SoftwareBitmap bitmapFromFrame = null;
            bool ownsFrame = false;

            if (frame.Direct3DSurface != null)
            {
                bitmapFromFrame = await SoftwareBitmap.CreateCopyFromSurfaceAsync(
                    frame.Direct3DSurface,
                    BitmapAlphaMode.Ignore);

                ownsFrame = true;
            }
            else if (frame.SoftwareBitmap != null)
            {
                bitmapFromFrame = frame.SoftwareBitmap;
            }

            // We now need it in a pixel format that an encoder is happy with
            var encoderBitmap = SoftwareBitmap.Convert(
                bitmapFromFrame, BitmapPixelFormat.Bgra8, BitmapAlphaMode.Premultiplied);

            if (ownsFrame)
            {
                bitmapFromFrame.Dispose();
            }

            // We now need an encoder, should we keep creating it?
            var memoryStream = new MemoryStream();

            var encoder = await BitmapEncoder.CreateAsync(
                BitmapEncoder.JpegEncoderId, memoryStream.AsRandomAccessStream());

            encoder.SetSoftwareBitmap(encoderBitmap);
            encoder.BitmapTransform.ScaledWidth = (uint)width;
            encoder.BitmapTransform.ScaledHeight = (uint)height;

            await encoder.FlushAsync();

            var decoder = await BitmapDecoder.CreateAsync(memoryStream.AsRandomAccessStream());

            var resizedBitmap = await decoder.GetSoftwareBitmapAsync(
                BitmapPixelFormat.Bgra8, BitmapAlphaMode.Premultiplied);

            memoryStream.Dispose();

            encoderBitmap.Dispose();

            return (resizedBitmap);
        }
        void SetProperty<T>(ref T storage, T value, [CallerMemberName] string propertyName = null)
        {
            storage = value;
            this.PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(propertyName));
        }
        DacshundModelInput inputData;
        int processingFlag;
        MediaFrameReader frameReader;
        MediaCapture mediaCapture;
        DacshundModel learningModel;

        static readonly int IMAGE_SIZE = 227;
    }
}

In doing that, the main thing that I was unclear about was whether I needed to resize the VideoFrames to fit my model or whether I could leave them alone and have the code in between me and the model “do the right thing” with the VideoFrame.

Partly, that confusion comes from my model’s description seeming to say that it was expecting frames at a resolution of 227 x 227 in BGR format, and that feels like a very odd resolution to me.

Additionally, I found that trying to resize a VideoFrame seemed to be a bit of a painful task and I didn’t find a better way than going through a SoftwareBitmap with a BitmapEncoder, BitmapDecoder and a BitmapTransform.

The code that I ended up with got fairly ugly and I was never quite sure whether I needed it or not and so, for the moment, I conditionally compiled that code into my little test app so that I can switch between two modes of;

  • Pass the VideoFrame untouched to the underlying evaluation layer
  • Attempt to resize the VideoFrame to 227 x 227 before passing it to the underlying evaluation layer.

I’ve a feeling that it’s ok to leave the VideoFrame untouched but I’m only about 20% sure of that at the time of writing, and the follow-on piece here assumes that I’m running with that version of the code.
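One alternative that I want to try at some point is to skip the encoder/decoder dance altogether and let VideoFrame.CopyToAsync do the work by copying each incoming frame into a pre-allocated 227x227 frame. I’m assuming (and haven’t yet verified) that CopyToAsync will scale as well as convert formats, so this is very much a sketch of a member that could sit inside MainPage rather than something I know to be correct;

    // Pre-allocated destination frame at the size that the model seems to want.
    VideoFrame modelSizedFrame = new VideoFrame(BitmapPixelFormat.Bgra8, IMAGE_SIZE, IMAGE_SIZE);

    async Task<VideoFrame> ResizeByCopyAsync(VideoFrame sourceFrame)
    {
        // Assumption: CopyToAsync converts (and, I hope, scales) the source frame
        // into the destination frame regardless of whether the source is backed
        // by a Direct3D surface or a SoftwareBitmap.
        await sourceFrame.CopyToAsync(this.modelSizedFrame);
        return (this.modelSizedFrame);
    }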

Does It Work?

How does the app work out? I’m not yet sure 🙂 and there are a couple of things where I’m not certain;

  • I’m running on a Surface Pro 3 where the camera has a fixed focus and it doesn’t do a great job of focusing on my images (given that I’ve no UI to control the focus) and so it’s hard to tell at times how good an image the camera is getting. I’ve tried it with both the front and back cameras on that device but I don’t see too much difference.
  • I’m unsure of whether the way in which I’m passing the VideoFrame to the model is right or not.

But I did run the app and presented it with 3 pictures – one of a dachshund, one of an alsatian (which it should understand is a dog but not a dachshund) and one of a pony.

Here’s some examples showing the sort of output that the app displayed;

[screenshot: the dachshund test picture]

I’m not sure about the category of ‘dog’ here but the app seems fairly confident that this is both a dog and a dachshund so that seems good to me.

Here’s another (the model has been trained on alsatian images to some extent);

[screenshot: the alsatian test picture]

and so that seems like a good result and then I held up my phone to the video stream displaying an image of a pony;

[screenshot: the pony test picture]

and so that seems to work reasonably well. That’s with the code which does not resize the image down to 227×227 – I found that the code which did resize didn’t seem to work the same way, so maybe my notion of resizing (or the actual code which does the resizing) isn’t right.

Wrapping Up

First impressions here are very good 🙂 in that I managed to get something working in a very short time.

Naturally, it’d be interesting to try and build a better understanding around the binding of parameters and I’d also be interested to try this out with a camera that was doing a better job of focusing.

It’d also be interesting to point the camera at real world objects rather than 2D pictures of those objects and so perhaps I need to build a model that classifies something a little more ‘household’ than dogs and ponies to make it easier to test without going out into a field 🙂

I’d also like to try some of this out on other types of devices including HoloLens as/when that becomes possible.

Code

If you want the code that I put together here then it’s in this github repo;

https://github.com/mtaulty/WindowsMLExperiment

Keep in mind that this is just my first experiment and I’m muddling my way through. It looks like the code conditionally compiled out with the RESIZE constant can be ignored – unless I hear otherwise, in which case I’ll update the post.

Lastly, you’ve probably noticed many different spellings of the word dachshund in the code and in the blog post – I should have stuck with poodles 🙂

Rough Notes on UWP and webRTC (Part 2)

Following up on my previous post and very definitely staying in the realm of ‘rough notes’ I wanted to add a little more to the basic sample that I’d cooked up around UWP/webRTC.

In the previous sample, I’d gone to great lengths to make a UI that was almost impossible to use because I was putting the burden of signalling onto the user of the UI, and so the user had to copy around long strings containing details of session descriptions, ICE candidates and so on. The user was the ‘signalling server’ and it was a bit tedious to pretend to be a server but it did work.

This was useful to me as it let me try to understand what was going on, what a signalling server had to do for webRTC to work and also to reduce some of the code in the sample.

Ultimately, though, if I’m going to make progress I need a signalling server, and I decided to simply re-use the console server which sits in this project within the UWP/webRTC PeerCC sample.

https://github.com/webrtc-uwp/PeerCC-Sample/tree/master/Server

Which also meant that I could re-use the code which talks to that server which is contained in the PeerCC sample in a couple of places;

Signalling.cs

Conductor.cs

I mostly didn’t take code from the Conductor class, only the Signalling class, which I moved lock, stock and barrel into my project after creating a new branch (Signalling) and removing most of the existing ‘UI’ that I had, as it was largely related to manually copying around SDP strings and so on.

I hosted the console signalling server on a VM in Azure.

My new ‘UI’ simply contains boxes where I can type in the details of my signalling server;


and the intention of this little UI is to be as ‘simple’ as possible rather than ‘comprehensive’ in any way and so the intended flow is as below;

  • Enter the IP address and port of the signalling service
  • Click the ‘Connect to Signalling…’ button.
  • The app will then connect to the signalling service and, if successful, it replaces the UI with 2 MediaElements, one for local video and one for remote.
  • The app then goes through the process of creating a Media object and using GetUserMedia() and associating the first local video track that it gets with a MediaElement in the UI so that the user can see their local video.
  • The app then waits for the signalling service to either
    • Deliver an offer from some other peer
      • In this case, the app creates the RTCPeerConnection, accepts the offer, creates its answer, sends it back via the signalling service and adds the first remote video track that comes in to a MediaElement on the UI so that the user can see the remote video.
    • Deliver a message telling it that some other peer is connected to the signalling service
      • If the ‘Is Initiator’ CheckBox is checked, the app will then go and create the RTCPeerConnection, create an offer and send it over the signalling service to the remote peer.
    • Deliver an answer from some other peer
      • This is assumed to be the response to the offer made above and so it is accepted.

And so the app is very simplistic in its approach: based on the ‘Is Initiator’ CheckBox, it will aggressively try and begin a session with the first peer that it sees connected to the signalling service, which might not be a very realistic thing to do but it works for a basic sample.
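To make that concrete, the ‘aggressive initiator’ logic amounts to something like the sketch below. This is a fragment in the same spirit as the other snippets in this post – the event and method names on the signaller are placeholders from my memory of the Signalling class rather than its exact API, and CreatePeerConnectionAsync stands in for the connection set-up code from the previous post;

    // Sketch only: the signaller event/method names here are placeholders, not
    // the exact Signalling class API from the PeerCC sample.
    void WireUpSignaller()
    {
        this.signaller.OnPeerConnected += async (peerId, peerName) =>
        {
            // Aggressively start a session with the first peer that shows up,
            // but only if this instance is flagged as the initiator and we're
            // not already in a call.
            if (this.IsInitiator && !this.inCall)
            {
                this.inCall = true;

                await this.CreatePeerConnectionAsync();

                var offer = await this.peerConnection.CreateOffer();
                await this.peerConnection.SetLocalDescription(offer);

                // Hand the offer SDP to the signalling service to deliver to that peer.
                await this.signaller.SendToPeerAsync(peerId, offer.Sdp);
            }
        };
    }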

With that in play, I can run two instances of this app on two machines, tell one to be the ‘Initiator’ and tell the other one not to be ‘Initiator’ and I get video and audio streams flowing between them.

Here’s a screenshot of my app doing this;


but because I re-used the signalling server that the original PeerCC sample used and because I didn’t do anything to the protocol that it uses on that server (other than add an extra piece of data to it) I can use the original PeerCC sample to also communicate with my app and so here’s that sample sending/receiving video and audio to an instance of my app on another PC;


Now, it’s not exactly ‘an achievement’ to put someone else’s signalling server and their protocol code back into my sample here because my first sample was mostly about how to make webRTC work without that signalling server but, again, it’s a learning experience to take things apart and put them back together again.

I’d now like to refactor some code within that application as I’ve let the code-behind file ‘mushroom’ a little and I think I could make it better, so I’ll perhaps revisit that ‘Signalling’ branch and improve things in the coming days.

Rough Notes on UWP and webRTC

In the last couple of days, I’ve been experimenting with webRTC as a means of getting live real-time-communication (voice, video, data) flowing between two Universal Windows Platform apps and I thought I’d start to share my experiments here.

There’s a big caveat in that these are rough notes; I’m very new to these pieces and so there are probably quite a few mistakes in these posts that I’ll realise when I’ve spent more time on it, but I quite like the approach of ‘learning in public’.

Why look at webRTC from the web as a technology for communications between native applications?

I think it comes down to;

  • it already exists and there’s lots of folks using it on the web so there’s a strong re-use argument.
  • there’s the chance of interoperability.

and last, but by no means least, there is already an implementation out there for webRTC in UWP 🙂

Working out the webRTC Basics

One of the advantages around well-used web technologies is that they come with a tonne of resources and the primary one that I’ve been reading is this one;

Getting Started With WebRTC

which tells me about the architecture of webRTC and its use in implementing RTC in a browser context, and there is no shortage of code labs to show you how to implement things in JavaScript in a browser.

Once a technology like this works on the web it’s not unnatural to want to try and make use of it in other contexts and so there are also lots of tutorials that talk about making use of webRTC inside of an Android app or an iOS app but I didn’t find so much around Windows with/without UWP.

The other thing that’s great about that ‘Getting Started’ page is that it tells you about the core pieces of webRTC;

  • MediaStream – a stream of media, synchronized audio and video
  • RTCPeerConnection – what seems to be the main object in the API involved in shifting media streams between peers
  • RTCDataChannel – the channel for data that doesn’t represent media (e.g. chat messages)

and that led me to lots of samples on the web such as this one;

A Dead Simple webRTC Example

and these samples are great on the one hand because they got me used to the idea of a ‘flow’ that happens between 2 browsers that want to use webRTC, which in my head runs something like this (there’s a code sketch of the same sequence just after the list);

  • both browsers take a look at their media capabilities in terms of audio, video streams via the getUserMedia API.
  • browser one can now make ‘an offer’ to browser two around the capabilities that it has (streams, codecs, bitrates, etc) and it does this via the RTCPeerConnection.CreateOffer() API with the results represented via SDP (Session Description Protocol)
    • browser one uses the PeerConnection.SetLocalDescription(Type: Offer) API to store this SDP as its local description.
  • browser two can import that ‘offer’ and can create ‘an answer’ via the RTCPeerConnection.CreateAnswer() API with another lump of SDP to describe its own capabilities.
    • browser two uses the PeerConnection.SetRemoteDescription(Type: Offer) API with what it received from its peer
    • browser two uses the PeerConnection.SetLocalDescription() API with the results from CreateAnswer()
  • browser one can import that ‘answer’ and perhaps the two endpoints can agree on some common means of communicating the audio/video streams.
    • browser one uses the PeerConnection.SetRemoteDescription(Type: Answer) API to store the answer that it got from the peer.

So, there’s this little dance between the two endpoints and one of the initially confusing things for me was that webRTC doesn’t dictate things like;

  • how browser one discovers that browser two might exist or be open to communicating with it in the first place.
  • how browser one and two swap address details so that they can ‘directly’ communicate with each other.
  • how browser one and two swap these ‘offers’ and ‘answers’ back and forth before they have figured out address details for each other – how they ‘talk’ when they can’t yet ‘talk’!

Instead, the specification calls that signalling and it’s left to the developer who is using webRTC to figure out how to implement it and if I go back to this article again;

A Dead Simple webRTC Example

then it took me a little time to figure out that when this article talks about ‘the server’, it is really talking about a specific implementation of a signalling server for webRTC – it uses a web socket server running on node.js to provide signalling, but webRTC isn’t tied to that server or its implementation in any way; it just needs some implementation of signalling to work.

That seems like quite a lot to get your head around but there’s more details to get this type of communications working over the public internet.

More ‘Basics’ – webRTC and ICE, STUN, TURN 😕

In a simple world, two browsers that wanted to send audio/video streams back and forth would just be able to exchange IP addresses and port numbers and set up sockets to do the communications but that’s not likely to be possible on the internet.

That’s where the article…

WebRTC in the real world: STUN, TURN and signaling

comes in and does a great job of explaining what signalling is for and how additional protocols come into play trying to make this happen on the internet where devices are likely to be behind firewalls and NATs.

Specifically, the article explains that the ICE Framework is used to try and figure out the ‘most direct’ way for the two peers to talk to each other.

If the two peers were somehow able to make a direct host<->host connection (e.g. on a common network) then that’s what ICE seems to prefer to choose.

If it needs to, it can use Session Traversal Utilities for NAT (STUN) to deal with a host that has its address hidden behind a NAT.

Additionally, if it needs to, it can use Traversal Using Relays around NAT (TURN) for scenarios where it is not possible to do point<->point communication between the two hosts; a ‘relay’ (or man-in-the-middle) server on the internet can be used to relay the messages between the two although, naturally, copying around media streams for lots of clients is likely to lead to a busy server and there’s the question of finding such a server and someone to pay for hosting it.

UWP and webRTC

With some of that background coming together for me, I turned my attention to the github project which has an implementation of webRTC for the UWP;

webRTC for UWP on Github

and I found it surprisingly approachable.

I cloned down the entire repository, installed Strawberry Perl in order to help me build it and followed the simple instructions of running the prepare.bat file to build it all out and then, as instructed, I opened up the solution;


and so the first thing that surprised me here is that the various API pieces that I’d been reading about (RTCPeerConnection etc) look to be directly represented here and I think they are built out as a WinRT library by the Org.WebRtc project – i.e. that project seems to take the x-platform C++ pieces and wrap them up for UWP use.

Then there’s the PeerCC folder which contains two samples: a server and a client.

It took me a little while to figure out that the server is just a simple socket server which runs as a signalling server;


and I think it’s a standalone executable – I copied it to a virtual machine and ran it in the cloud and it ‘just worked’, simply producing some command line output.


There’s then the client (UWP) side of this sample, which is in the other project and runs up an interface;


this app then lets you enter the details of your signalling server (I don’t think 127.0.0.1 loopback will work so I didn’t try that) and then if you run the same app on another PC and point it at the same signalling server then you can very quickly get video & voice running between those two machines using this sample.

It’s important to say that I have deliberately removed the ICE servers that the sample runs with by default from the app’s UI – it uses stun[1234].l.google.com:19302 when you run it up for the first time. I only removed them because I wanted to prove to myself that the sample didn’t need them in the case where the participants could make a direct connection to each other.
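For reference, putting those defaults back in code would mean handing the peer connection an RTCConfiguration with some ICE servers in it, along the lines of the sketch below – I’m going from memory of the Org.WebRtc wrapper here, so treat the IceServers/Url/Username/Credential property names as assumptions rather than gospel;

    // Sketch, with the ICE server property names assumed from memory of the
    // Org.WebRtc wrapper rather than checked against its API.
    var configuration = new RTCConfiguration()
    {
        BundlePolicy = RTCBundlePolicy.Balanced,
        IceTransportPolicy = RTCIceTransportPolicy.All,
        IceServers = new List<RTCIceServer>()
        {
            // A STUN server lets each peer discover its public address from
            // behind a NAT.
            new RTCIceServer() { Url = "stun:stun.l.google.com:19302" }

            // A TURN server (with Username/Credential set) would go in this
            // list too if relaying were needed - I don't use one in this post.
        }
    };

    var peerConnection = new RTCPeerConnection(configuration);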

So, it was pretty easy to get hold of these bits and find a sample that worked but there’s a lot of code in that sample and I felt that I needed to unpick it a little as it seemed to be showcasing all the features but not really giving me an indication of the minimum amount of code to get this working.

Unpicking the Sample

I spent some time reading, running, debugging this sample and it’s well structured and the bits that I found most interesting were in 3 places;


The code in the Signalling folder represents two classes that do quite a lot of the work. The Signalling class knows how to send messages back/forth to the signalling socket server and it uses a really simple HTTP protocol operating on a ‘long poll’ such that;

  • New clients announce themselves to the server.
  • Each client polls the server on a long timeout, waiting to be told about other clients arriving/leaving and any new messages.
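The protocol here is just plain HTTP, so the ‘waiting’ side of it boils down to a loop along the lines of the sketch below – note that the endpoint name and query string are placeholders to show the shape of a long poll, not the sample server’s exact protocol;

using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

static class LongPollSketch
{
    // Illustrative long-poll loop - the 'wait' endpoint and peer_id parameter
    // are placeholders rather than the PeerCC server's exact protocol.
    public static async Task PollAsync(Uri serverBaseUri, int myPeerId,
        Action<string> onMessage, CancellationToken cancellation)
    {
        using (var httpClient = new HttpClient())
        {
            while (!cancellation.IsCancellationRequested)
            {
                // The server holds this request open until another peer arrives,
                // leaves or sends this peer a message, then responds.
                var response = await httpClient.GetAsync(
                    new Uri(serverBaseUri, $"wait?peer_id={myPeerId}"), cancellation);

                if (response.IsSuccessStatusCode)
                {
                    // Hand the message (e.g. an SDP offer/answer or an ICE candidate
                    // from a remote peer) off to whoever deals with it, then poll again.
                    onMessage(await response.Content.ReadAsStringAsync());
                }
            }
        }
    }
}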

The Conductor class is a form of ‘controller’ which largely centralises the webRTC API calls for the app and is used a lot by the MainViewModel which takes parameters to/from the UI and passes them onto that Conductor to get things done.

This is really great but I still wanted a ‘simpler’ sample that captured more of the ‘essence’ of what was necessary here without getting bogged down in the details of signalling and ICE Servers and so on.

And so I wrote my own.

Making My Own Sample (using the NuGet package)

I made my own sample which is quite difficult to use but which allowed me to get started with the basics by taking away the need for signalling servers and ICE servers.

I felt that if I could do this to get more of a basic understanding of the essentials then I could add the other pieces afterwards and layer on the complexity.

In writing that sample, I initially worked in a project where I had my own C# project code alongside the C++/CX code for the webRTC SDK so that I could use mixed-mode debugging and step through the underlying source as I made mistakes in using the APIs, and that proved to be quite a productive approach.

However, as my C# code got a little closer to ‘working’ I switched from using the source code and/or the binaries I’d built from it and, instead, started using the webRTC SDK via the NuGet package that is shipped for it;


as that seemed to give Visual Studio less to think about when doing a rebuild on the project and simplified my dependencies.

While I’m trying to avoid having to use some kind of signalling service for my example, I still need something to transfer data between my two apps that want to communicate and so I figured that I would simply put the necessary data onto the screen and then copy it manually back and forth between the two apps so that I become a form of human signalling service.

I made a new UWP project and switched on a number of capabilities to ensure that I didn’t bang up against problems (e.g. webcam, microphone, internet client/server and private network client/server).

I then constructed the ‘UI from hell’. The application runs up with an Initialise button only;


and, once initialised, it presents this confusing choice of buttons where only I know as the developer that there is a single, ‘safe’ path through pressing them 😉


and I can then use the Create Offer button to create an offer from this machine and populate the text block above with it;


now, as the ‘human signalling server’ I now have a responsibility to take this offer information (by copying it out of the text block) along with the list of Ice Candidates over to a copy of this same app on another machine.

I can do this via a networked clipboard or a file share or similar.

On that machine, I paste the offer as a ‘Remote Description’ as below;


and when I click the button, the app creates an answer for the offer;


and I can go back to the original machine and paste this as the answer to the offer;


and then I just need to swap over the details of the ICE candidates and I’ve got buttons to write/read these from a file;


and I create that file, copy it to the second machine and then use the ‘Add Remotes from File…’ button on that machine to add those remote candidates.

Now, I kind of expected to have to copy the ICE candidates in both directions but I find that once I have copied them from one app to the other, things seem to get figured out and, sure enough, here’s my hand waving at me from my other device;


and I’m getting both audio and video over that connection 🙂

Now, clearly, manually moving these bits around over a network isn’t likely to be a realistic solution and I suspect that I still have a lot to learn here around the basics but I found it helpful as a way of exploring some of what was going on.

What’s surprising is how little code there is in my sample.

What Does the Code Look Like?

The ‘UI’ that I made here is largely just Buttons, TextBlocks, TextBoxes and a single MediaElement and I set the RealTimePlayback property on the MediaElement to True.

A lot of my code is then just property getters/setters and some callback functions for the UI but the main pieces of code end up looking something like this.

Initialisation

To get things going, I make use of a Media instance and an RTCPeerConnection instance and the code runs as below;

                // I find that if I don't do this before Initialize() then I crash.
                await WebRTC.RequestAccessForMediaCapture();

                WebRTC.Initialize(this.Dispatcher);

                RTCMediaStreamConstraints constraints = new RTCMediaStreamConstraints()
                {
                    audioEnabled = true,
                    videoEnabled = true
                };

                this.peerConnection = new RTCPeerConnection(
                    new RTCConfiguration()
                    {
                        // Hard-coding these for now...
                        BundlePolicy = RTCBundlePolicy.Balanced,

                        // I got this wrong for a long time. Because I am not using ICE servers
                        // I thought this should be 'NONE' but it shouldn't. Even though I am
                        // not going to add any ICE servers, I still need ICE in order to
                        // get candidates for how the 2 ends should talk to each other.
                        // Lesson learned, took a few hours to realise it 🙂
                        IceTransportPolicy = RTCIceTransportPolicy.All
                    }
                );

                this.media = Media.CreateMedia();
                this.userMedia = await media.GetUserMedia(constraints);

                this.peerConnection.AddStream(this.userMedia);
                this.peerConnection.OnAddStream += OnRemoteStreamAdded;
                this.peerConnection.OnIceCandidate += OnIceCandidate;

and so this is pretty simple – I use the Media.CreateMedia() function and then call GetUserMedia telling it that I want audio+video. I then create a RTCPeerConnection and I use AddStream to add my one stream and I handle a couple of events.

Creating an Offer

Once initialised, creating an offer is really simple. The code’s as below;

          // Create the offer.
                var description = await this.peerConnection.CreateOffer();

                // We filter some pieces out of the SDP based on what I think
                // aren't supported Codecs. I largely took it from the original sample
                // when things didn't work for me without it.
                var filteredDescriptionSdp = FilterToSupportedCodecs(description.Sdp);

                description.Sdp = filteredDescriptionSdp;

                // Set that filtered offer description as our local description.
                await this.peerConnection.SetLocalDescription(description);

                // Put it on the UI so someone can copy it.
                this.LocalOfferSdp = description.Sdp;

and so it’s very much like the JavaScript examples out there. The only thing I’d add is that I’m filtering out some of the codecs because I saw the original sample do this too.

Accepting an Offer

When a remote offer is pasted into the UI and the button pressed, it’s imported/accepted by code;

           // Take the description from the UI and set it as our Remote Description
            // of type 'offer'
            await this.SetSessionDescription(RTCSdpType.Offer, this.RemoteDescriptionSdp);

            // And create our answer
            var answer = await this.peerConnection.CreateAnswer();

            // And set that as our local description
            await this.peerConnection.SetLocalDescription(answer);

            // And put it back into the UI
            this.LocalAnswerSdp = answer.Sdp;

and so again there’s little code here beyond the flow.

Accepting an Answer

There’s very little code involved in taking an ‘answer’ from the screen and dealing with it – it’s essentially a call to RTCPeerConnection.SetRemoteDescription with the SDP of the answer and a type set to Answer, so it boils down to something like the sketch below.
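Here, RemoteAnswerSdp is just whichever UI-bound property holds the pasted answer, and the RTCSessionDescription construction is my assumption about the wrapper – my real code hides it behind a little helper;

            // Take the answer SDP that was pasted into the UI and store it as
            // the remote description of this peer connection.
            await this.peerConnection.SetRemoteDescription(
                new RTCSessionDescription(RTCSdpType.Answer, this.RemoteAnswerSdp));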

Dealing with ICE Candidates

I spent quite some time with these APIs assuming, incorrectly, that if I wasn’t using some ICE server on the internet then I didn’t need to think about ICE at all here. It turns out that ICE is the mechanism via which all the potential means for communication between the two endpoints are worked out, and so not handling the ICE candidates meant that I never got any communication.

I handle the ICE candidates very simply here. There’s an event on the RTCPeerConnection which fires when it comes up with an ICE candidate and all I do is handle that event and put the details into a string in the UI with some separators which (hopefully) don’t naturally show up in the strings that I’m using them to separate 😕

        void OnIceCandidate(RTCPeerConnectionIceEvent args)
        {
            this.IceCandidates += $"{args.Candidate.Candidate}|{args.Candidate.SdpMid}|{args.Candidate.SdpMLineIndex}\n";
        }

and I have some code which writes this string to a file when the UI asks it to and some more code which reads back from a file and adds the remote candidates. That code’s not very interesting so here’s the relevant piece in taking the text lines from the file it’s just read and reconstructing them into instances of RTCIceCandidate before adding them to the RTCPeerConnection;

            foreach (var line in lines)
            {
                var pieces = line.Split('|');

                if (pieces.Length == 3)
                {
                    RTCIceCandidate candidate = new RTCIceCandidate(
                        pieces[0], pieces[1], ushort.Parse(pieces[2]));

                    await this.peerConnection.AddIceCandidate(candidate);
                }
            }
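For completeness, the writing side really is just a call into Windows.Storage.FileIO – roughly the sketch below, where how the StorageFile gets created or picked is glossed over;

        // Sketch: write the accumulated candidate strings out to a file for the
        // user to carry over to the other machine.
        async Task SaveCandidatesAsync(StorageFile file)
        {
            await FileIO.WriteTextAsync(file, this.IceCandidates);
        }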

Handling New Media Streams

Last but by no means least here is the act of handling a media stream when it ‘arrives’ from the remote peer.

I think that if I poked into the underlying APIs there’s some mechanism for getting hold of the raw streams here, but it seems that the SDK has done some heavy lifting to at least make this very easy in the simple case: there’s a method which looks to do the work of pairing up a media stream from webRTC with a MediaElement so that it takes it as a source.

So, the handler for the RTCPeerConnection.OnAddStream event just becomes;

        void OnRemoteStreamAdded(MediaStreamEvent args)
        {
            if (this.mediaElement.Source == null)
            {
                // Get the first video track that's present if any
                var firstTrack = args?.Stream?.GetVideoTracks().FirstOrDefault();

                if (firstTrack != null)
                {
                    // Link it up with the MediaElement that we have in the UI.
                    this.media.AddVideoTrackMediaElementPair(firstTrack, this.mediaElement, "a label");
                }
            }
        }

and I never imagined that would be quite as simple as it seems it could be.

Wrapping Up & Next Steps

As I said at the start of the post, this is just some rough notes as I try and figure my way around webRTC and the UWP webRTC SDK that’s out there on github.

It’s very possible that I’ve messed things up in the text above so feel free to tell me as I’m quite new to webRTC and this UWP SDK.

Writing this up though has been a useful exercise for me as I feel that I’ve got a handle on at least how to put together a ‘hello world’ demo with this SDK and I can perhaps now move on to look at some other topics with it.

The code for what I put together here is on github – keep in mind that the UI is ‘not so easy’ to use but if you follow the flow described above then you can probably make it work if you have the motivation to do all the copying/pasting of information back/forth between different applications.

In terms of next steps, there’s some things I’d like to try;

  • As far as I know, this code would only work if the two apps communicating were able to directly communicate with each other but I don’t think it’s more than a line of code or two in order to enable them to connect over different networks including the internet and I want to try that out soon.
  • Naturally, I need to reinstate a signalling service here and perhaps I can do some work to come up with a signalling service abstraction which can then be implemented using whatever technology might suit.
  • I’d like to try and get one end of this communication working on a non-PC device and, particularly, a HoloLens.

But those are all for another post, this one is long enough already 🙂