Third Experiment with Image Classification on Windows ML from UWP (on HoloLens in Unity)

NB: The usual blog disclaimer for this site applies to posts around HoloLens. I am not on the HoloLens team. I have no details on HoloLens other than what is on the public web and so what I post here is just from my own experience experimenting with pieces that are publicly available and you should always check out the official developer site for the product documentation.

Following up from this earlier post;

Second Experiment with Image Classification on Windows ML from UWP (on HoloLens)

I’d finished up that post by flagging that the 2D UI felt a little weird – I was looking through my HoloLens at a 2D app which was displaying the contents of the HoloLens webcam back to me and, while things seemed to work fine, it felt like a hall of mirrors.

Moving the UI to an immersive 3D app built in something like Unity would make this a little easier to try out and that’s what this post is about.

Moving the code as I had it across to Unity hasn’t proved difficult at all.

I spun up a new Unity project and set it up for HoloLens development by setting the typical settings like;

  • Switching the target platform to UWP (I also switched to the .NET backend and its 4.6 support)
  • Switching on support for the Windows Mixed Reality SDK
  • Moving the camera to the origin, changing its clear flags to solid black and changing the near clipping plane to 0.85
  • Switching on the capabilities that let my app access the camera and the microphone

and, from there, I brought across the .onnx file containing my model and placed it as a resource in Unity;

image

and then I brought across as much of the code from the XAML based UWP project as I could, conditionally compiling most of it with the ENABLE_WINMD_SUPPORT constant because most of the code that I’m trying to run here is entirely UWP dependent and isn’t going to run in the Unity Editor and so on.

In terms of code, I ended up with only 2 code files;

image

the dachshund file started life in the first post in this series as output generated for me by the mlgen tool, although I did have to alter it to get it to work after it had been generated.

The code uses the underlying LearningModelPreview class, which claims to be able to load a model either from a storage file or from a stream. In this instance, inside of Unity, I’m loading the model via Unity’s Resources.Load() mechanism and so I end up with a byte[] for the model, which I wanted to feed into the LoadModelFromStreamAsync() method. However, that method didn’t seem to be implemented yet and so I had to do a minor hack and write the byte array out to a file before feeding it to the LoadModelFromStorageFileAsync() method.

That left this piece of code looking as below;

#if ENABLE_WINMD_SUPPORT
namespace dachshunds.model
{
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Runtime.InteropServices.WindowsRuntime;
    using System.Threading.Tasks;

    using Windows.AI.MachineLearning.Preview;
    using Windows.Media;
    using Windows.Storage;
    using Windows.Storage.Streams;

    // MIKET: I renamed the auto generated long number class names to be 'Dachshund'
    // to make it easier for me as a human to deal with them 🙂
    public sealed class DachshundModelInput
    {
        public VideoFrame data { get; set; }
    }

    public sealed class DachshundModelOutput
    {
        public IList<string> classLabel { get; set; }
        public IDictionary<string, float> loss { get; set; }

        public DachshundModelOutput()
        {
            this.classLabel = new List<string>();
            this.loss = new Dictionary<string, float>();

            // MIKET: I added these 3 lines of code here after spending *quite some time* 🙂
            // Trying to debug why I was getting a binding exception at the point in the
            // code below where the call to LearningModelBindingPreview.Bind is called
            // with the parameters ("loss", output.loss) where output.loss would be
            // an empty Dictionary<string,float>.
            //
            // The exception would be 
            // "The binding is incomplete or does not match the input/output description. (Exception from HRESULT: 0x88900002)"
            // And I couldn't find symbols for Windows.AI.MachineLearning.Preview to debug it.
            // So...this could be wrong but it works for me and the 3 values here correspond
            // to the 3 classifications that my classifier produces.
            //
            this.loss.Add("daschund", float.NaN);
            this.loss.Add("dog", float.NaN);
            this.loss.Add("pony", float.NaN);
        }
    }

    public sealed class DachshundModel
    {
        private LearningModelPreview learningModel;

        public static async Task<DachshundModel> CreateDachshundModel(byte[] bits)
        {
            // Note - there is a method on LearningModelPreview which seems to
            // load from a stream but I got a 'not implemented' exception and
            // hence using a temporary file.
            IStorageFile file = null;
            var fileName = "model.bin";

            try
            {
                file = await ApplicationData.Current.TemporaryFolder.GetFileAsync(
                    fileName);
            }
            catch (FileNotFoundException)
            {
            }
            if (file == null)
            {
                file = await ApplicationData.Current.TemporaryFolder.CreateFileAsync(
                    fileName);

                await FileIO.WriteBytesAsync(file, bits);
            }

            var model = await DachshundModel.CreateDachshundModel((StorageFile)file);

            return (model);
        }
        public static async Task<DachshundModel> CreateDachshundModel(StorageFile file)
        {
            LearningModelPreview learningModel = await LearningModelPreview.LoadModelFromStorageFileAsync(file);
            DachshundModel model = new DachshundModel();
            model.learningModel = learningModel;
            return model;
        }
        public async Task<DachshundModelOutput> EvaluateAsync(DachshundModelInput input) {
            DachshundModelOutput output = new DachshundModelOutput();
            LearningModelBindingPreview binding = new LearningModelBindingPreview(learningModel);
            binding.Bind("data", input.data);
            binding.Bind("classLabel", output.classLabel);

            // MIKET: this generated line caused me trouble. See MIKET comment above.
            binding.Bind("loss", output.loss);

            LearningModelEvaluationResultPreview evalResult = await learningModel.EvaluateAsync(binding, string.Empty);
            return output;
        }
    }
}
#endif // ENABLE_WINMD_SUPPORT

and then I made a few minor modifications to the code which had previously formed my ‘code behind’ in my XAML based app to move it into this MainScript.cs file where it performs pretty much the same function as it did in the XAML based app – getting frames from the webcam, passing them to the model for evaluation and then displaying the results. That code now looks like;

using System;
using System.Linq;
using System.Collections;
using System.Collections.Generic;
using UnityEngine;

#if ENABLE_WINMD_SUPPORT
using System.Threading.Tasks;
using Windows.Devices.Enumeration;
using Windows.Media.Capture;
using Windows.Media.Capture.Frames;
using Windows.Media.Devices;
using Windows.Storage;
using dachshunds.model;
using System.Diagnostics;
using System.Threading;
#endif // ENABLE_WINMD_SUPPORT

public class MainScript : MonoBehaviour
{
    public TextMesh textDisplay;

#if ENABLE_WINMD_SUPPORT
    public MainScript ()
	{
        this.inputData = new DachshundModelInput();
        this.timer = new Stopwatch();
	}
    async void Start()
    {
        await this.LoadModelAsync();

        var device = await this.GetFirstBackPanelVideoCaptureAsync();

        if (device != null)
        {
            await this.CreateMediaCaptureAsync(device);

            await this.CreateMediaFrameReaderAsync();
            await this.frameReader.StartAsync();
        }
    }    
    async Task LoadModelAsync()
    {
        // Get the bits from Unity's resource system :-S
        var modelBits = Resources.Load(DACHSHUND_MODEL_NAME) as TextAsset;

        this.learningModel = await DachshundModel.CreateDachshundModel(
            modelBits.bytes);
    }
    async Task<DeviceInformation> GetFirstBackPanelVideoCaptureAsync()
    {
        var devices = await DeviceInformation.FindAllAsync(
            DeviceClass.VideoCapture);

        var device = devices.FirstOrDefault(
            d => d.EnclosureLocation.Panel == Windows.Devices.Enumeration.Panel.Back);

        return (device);
    }
    async Task CreateMediaFrameReaderAsync()
    {
        var frameSource = this.mediaCapture.FrameSources.Where(
            source => source.Value.Info.SourceKind == MediaFrameSourceKind.Color).First();

        this.frameReader =
            await this.mediaCapture.CreateFrameReaderAsync(frameSource.Value);

        this.frameReader.FrameArrived += OnFrameArrived;
    }

    async Task CreateMediaCaptureAsync(DeviceInformation device)
    {
        this.mediaCapture = new MediaCapture();

        await this.mediaCapture.InitializeAsync(
            new MediaCaptureInitializationSettings()
            {
                VideoDeviceId = device.Id
            }
        );
        // Try and set auto focus but on the Surface Pro 3 I'm running on, this
        // won't work.
        if (this.mediaCapture.VideoDeviceController.FocusControl.Supported)
        {
            await this.mediaCapture.VideoDeviceController.FocusControl.SetPresetAsync(FocusPreset.AutoNormal);
        }
        else
        {
            // Nor this.
            this.mediaCapture.VideoDeviceController.Focus.TrySetAuto(true);
        }
    }

    async void OnFrameArrived(MediaFrameReader sender, MediaFrameArrivedEventArgs args)
    {
        if (Interlocked.CompareExchange(ref this.processingFlag, 1, 0) == 0)
        {
            try
            {
                using (var frame = sender.TryAcquireLatestFrame())
                using (var videoFrame = frame.VideoMediaFrame?.GetVideoFrame())
                {
                    if (videoFrame != null)
                    {
                        // From the description (both visible in Python and through the
                        // properties of the model that I can interrogate with code at
                        // runtime here) my image seems to be 227 by 227 which is an 
                        // odd size but I'm assuming the underlying pieces do that work
                        // for me.
                        // If you've read the blog post, I took out the conditional
                        // code which attempted to resize the frame as it seemed
                        // unnecessary and confused the issue!
                        this.inputData.data = videoFrame;

                        this.timer.Start();
                        var evalOutput = await this.learningModel.EvaluateAsync(this.inputData);
                        this.timer.Stop();
                        this.frameCount++;

                        await this.ProcessOutputAsync(evalOutput);
                    }
                }
            }
            finally
            {
                Interlocked.Exchange(ref this.processingFlag, 0);
            }
        }
    }
    string BuildOutputString(DachshundModelOutput evalOutput, string key)
    {
        var result = "no";

        if (evalOutput.loss[key] > 0.25f)
        {
            result = $"{evalOutput.loss[key]:N2}";
        }
        return (result);
    }
    async Task ProcessOutputAsync(DachshundModelOutput evalOutput)
    {
        string category = evalOutput.classLabel.FirstOrDefault() ?? "none";
        string dog = $"{BuildOutputString(evalOutput, "dog")}";
        string pony = $"{BuildOutputString(evalOutput, "pony")}";

        // NB: Spelling mistake is built into model!
        string dachshund = $"{BuildOutputString(evalOutput, "daschund")}";
        string averageFrameDuration =
            this.frameCount == 0 ? "n/a" : $"{(this.timer.ElapsedMilliseconds / this.frameCount):N0}";

        UnityEngine.WSA.Application.InvokeOnAppThread(
            () =>
            {
                this.textDisplay.text = 
                    $"dachshund {dachshund} dog {dog} pony {pony}\navg time {averageFrameDuration}";
            },
            false
        );
    }
    DachshundModelInput inputData;
    int processingFlag;
    MediaFrameReader frameReader;
    MediaCapture mediaCapture;
    DachshundModel learningModel;
    Stopwatch timer;
    int frameCount;
    static readonly string DACHSHUND_MODEL_NAME = "dachshunds"; // .bytes file in Unity

#endif // ENABLE_WINMD_SUPPORT
}

While experimenting with this code, it occurred to me that I could move to more of a “pull” model inside of Unity by grabbing frames in an Update() method rather than handling frame events separately and pushing the results back to the app thread – there’s a rough sketch of that idea below. It also occurred to me that the code is very single threaded and simply drops frames when it is ‘busy’ whereas it could be smarter and process them on another thread, perhaps one from the thread pool. There are lots of possibilities.
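
By way of a rough sketch only (this is not code from the repo and the member names simply mirror the ones in MainScript above), a ‘pull’ version might look something like this, polling the MediaFrameReader from Update() with the FrameArrived subscription removed;

#if ENABLE_WINMD_SUPPORT
    // Hypothetical alternative to OnFrameArrived - poll for the latest frame
    // from Unity's Update() loop instead of reacting to FrameArrived events.
    async void Update()
    {
        // Skip this frame if there's no reader yet or a previous evaluation
        // is still in flight.
        if ((this.frameReader == null) ||
            (Interlocked.CompareExchange(ref this.processingFlag, 1, 0) != 0))
        {
            return;
        }
        try
        {
            using (var frame = this.frameReader.TryAcquireLatestFrame())
            using (var videoFrame = frame?.VideoMediaFrame?.GetVideoFrame())
            {
                if (videoFrame != null)
                {
                    this.inputData.data = videoFrame;

                    var evalOutput = await this.learningModel.EvaluateAsync(this.inputData);

                    // ProcessOutputAsync already marshals back to the app thread
                    // so it can be reused unchanged here.
                    await this.ProcessOutputAsync(evalOutput);
                }
            }
        }
        finally
        {
            Interlocked.Exchange(ref this.processingFlag, 0);
        }
    }
#endif // ENABLE_WINMD_SUPPORT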

In terms of displaying the results inside of Unity, I no longer need to display a preview from the webcam because my eyes already see what the camera sees, so I’m just left with the challenge of displaying some text. For that, I added a 3D Text object into the scene and made it accessible via a public field that can be set up in the editor.

image

and the ScriptHolder there is just a place to put my MainScript and pass it this TextMesh to display text in;

image

and that’s pretty much it.

I still see a fairly low processing rate when running on the device and I haven’t yet looked at that, but here are some screenshots of me looking at photos from Bing search on my 2nd monitor while running the app on HoloLens.

In this case, the device (on my head) is around 40cm from the 24-inch monitor, I’ve got the Bing search results displaying quite large, and the model seems to do a decent job of spotting dachshunds…

image

image

image

and dogs in general (although it has only really been trained on alsatians so it knows that they are dogs but not dachshunds);

image

and, for reasons I can’t quite explain, I also trained it on ponies so it’s quite good at spotting those;

image

image

This works pretty well for me. I need to revisit both the processing speed and the problem that I flagged in my previous post around not being able to run a release build but, otherwise, it feels like progress.

The code is in the same repo as it was before – I just added a Unity project to the repo.

https://github.com/mtaulty/WindowsMLExperiment

Conversations with the Language Understanding (LUIS) Service from Unity in Mixed Reality Apps

I’ve written quite a bit about speech interactions in the past, both on this blog and elsewhere, like these articles that I wrote for the Windows blog a couple of years ago;

Using speech in your UWP apps- It’s good to talk

Using speech in your UWP apps- From talking to conversing

Using speech in your UWP apps- Look who’s talking

which came out of earlier investigations that I did for this blog like this post;

Speech to Text (and more) with Windows 10 UWP & ‘Project Oxford’

and we talked about Speech in our Channel9 show again a couple of years ago now;

image

and so I won’t rehash the whole topic of speech recognition and understanding here but, in the last week, I’ve been working on a fairly simple scenario that I thought I’d share the code from.

Backdrop – the Scenario

The scenario involved a Unity application built against the “Stable .NET 3.5 Equivalent” scripting runtime which targets both HoloLens and immersive Windows Mixed Reality headsets where there was a need to use natural language instructions inside of the app.

That is, there’s a need to;

  1. grab audio from the microphone.
  2. turn the audio into text.
  3. take the text and derive the user’s intent from the spoken text.
  4. drive some action inside of the application based on that intent.

It’s fairly generic (although the specific application is quite exciting) but, in order to get this implemented, there are some choices to make around technologies/APIs and whether functionality happens in the cloud or at the edge.

Choices

When it comes to (2), there are a couple of choices in that there are layered Unity/UWP APIs that can make this happen and the preference in this scenario would be to use the Unity APIs – the KeywordRecognizer and the DictationRecognizer – for handling short and long chunks of speech respectively.

Those APIs are packaged so as to wait for a reasonable, configurable period of time for some speech to occur before delivering a ‘speech occurred’ type event to the caller passing the text that has been interpreted from the speech. 

There’s no cost (beyond on-device resources) to using these APIs and so in a scenario which only went as far as speech-to-text it’d be quite reasonable to have these types of APIs running all the time gathering up text and then having the app decide what to do with it.

However, when it comes to (3), the API of choice is LUIS which can take a piece of text like;

“I’d like to order a large pepperoni pizza please”

and can turn it into something like;

Intent: OrderPizza

Entity: PizzaType (Pepperoni)

Entity: Size (Large)

Confidence: 0.85

and so it’s a very useful thing as it takes the task of fathoming all the intricacies of natural language away from the developer.

This poses a bit of a challenge though for a ‘real time’ app in that it’s not reasonable to take every speech utterance that the user delivers and run it through the LUIS cloud service. There are a number of reasons for that including;

  1. The round-trip time from the client to the service is likely to be fairly long and so, without care, the app would have many calls in flight leading to problems with response time and complicating the code and user experience.
  2. The service has a financial cost.
  3. The user may not expect or want all of their utterances to be run through the cloud.

Consequently, it seems sensible to have some trigger in an app which signifies that the user is about to say something that is of meaning to the app and which should be sent off to the LUIS service for examination. In short, it’s the;

“Hey, Cortana”

type key phrase that lets the system know that the user has something to say.

This can be achieved in a Unity app targeting .NET 3.5 by having the KeywordRecognizer class work in conjunction with the DictationRecognizer class such that the former listens for the speech keyword (‘hey, Cortana!’) and the latter then springs into life and listens for the dictated phrase that the user wants to pass on to the app.

As an aside, it’s worth flagging that these classes are only supported by Unity on Windows 10 as detailed in the docs and that there is an isSupported flag to let the developer test this at runtime.
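
By way of illustration (a minimal sketch rather than anything from the project), that runtime check might look something like;

using UnityEngine;
using UnityEngine.Windows.Speech;

public class SpeechSupportCheck : MonoBehaviour
{
    void Start()
    {
        // Keyword/dictation recognition is only available to Unity on Windows 10.
        if (!PhraseRecognitionSystem.isSupported)
        {
            Debug.Log("Phrase recognition isn't supported on this platform");
        }
    }
}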

There’s another aside to using these two classes together: the docs here note that different types of recognizer cannot be instantiated at once, that they rely on an underlying PhraseRecognitionSystem, and that the system has to be Shutdown in order to switch between one type of recognizer and another.

Later on in the post, I’ll return to the idea of making a different choice around turning speech to text but for the moment, I moved forward with the DictationRecognizer.

Getting Something Built

Some of that took a little while to figure out but once it’s sorted it’s “fairly” easy to write some code in Unity which uses a KeywordRecognizer to switch on/off a DictationRecognizer in an event-driven loop so as to gather dictated text.

I chose to have the notion of a DictationSink which is just something that receives some text from somewhere. It could have been an interface but I thought that I’d bring in MonoBehaviour;

using UnityEngine;

public class DictationSink : MonoBehaviour
{
    public virtual void OnDictatedText(string text)
    {
    }
}

and so then I can write a DictationSource which surfaces a few properties from the underlying DictationRecognizer and passes on recognized text to a DictationSink;

using System;
using UnityEngine;
using UnityEngine.Windows.Speech;

public class DictationSource : MonoBehaviour
{
    public event EventHandler DictationStopped;

    public float initialSilenceSeconds;
    public float autoSilenceSeconds;
    public DictationSink dictationSink;
   
    // TODO: Think about whether this should be married with the notion of
    // a focused object rather than just some 'global' entity.

    void NewRecognizer()
    {
        this.recognizer = new DictationRecognizer();
        this.recognizer.InitialSilenceTimeoutSeconds = this.initialSilenceSeconds;
        this.recognizer.AutoSilenceTimeoutSeconds = this.autoSilenceSeconds;
        this.recognizer.DictationResult += OnDictationResult;
        this.recognizer.DictationError += OnDictationError;
        this.recognizer.DictationComplete += OnDictationComplete;
        this.recognizer.Start();
    }
    public void Listen()
    {
        this.NewRecognizer();
    }
    void OnDictationComplete(DictationCompletionCause cause)
    {
        this.FireStopped();
    }
    void OnDictationError(string error, int hresult)
    {
        this.FireStopped();
    }
    void OnDictationResult(string text, ConfidenceLevel confidence)
    {
        this.recognizer.Stop();

        if (((confidence == ConfidenceLevel.Medium) ||
            (confidence == ConfidenceLevel.High)) &&
            (this.dictationSink != null))
        {
            this.dictationSink.OnDictatedText(text);
        }
    }
    void FireStopped()
    {
        this.recognizer.DictationComplete -= this.OnDictationComplete;
        this.recognizer.DictationError -= this.OnDictationError;
        this.recognizer.DictationResult -= this.OnDictationResult;
        this.recognizer = null;

        // https://docs.microsoft.com/en-us/windows/mixed-reality/voice-input-in-unity
        // The challenge we have here is that we want to use both a KeywordRecognizer
        // and a DictationRecognizer at the same time or, at least, we want to stop
        // one, start the other and so on.
        // Unity does not like this. It seems that we have to shut down the 
        // PhraseRecognitionSystem that sits underneath them each time but the
        // challenge then is that this seems to stall the UI thread.
        // So far (following the doc link above) the best plan seems to be to
        // not call Stop() on the recognizer or Dispose() it but, instead, to
        // just tell the system to shutdown completely.
        PhraseRecognitionSystem.Shutdown();

        if (this.DictationStopped != null)
        {
            // And tell any friends that we are done.
            this.DictationStopped(this, EventArgs.Empty);
        }
    }
    DictationRecognizer recognizer;
}

notice in that code my attempt to use PhraseRecognitionSystem.Shutdown() to really stop this recognizer when I’ve processed a single speech utterance from it.

I need to switch this recognition on/off in response to a keyword being spoken by the user and so I wrote a simple KeywordDictationSwitch class which tries to do this using KeywordRecognizer with a few keywords;

using System.Linq;
using UnityEngine;
using UnityEngine.Windows.Speech;

public class KeywordDictationSwitch : MonoBehaviour
{
    public string[] keywords = { "ok", "now", "hey", "listen" };
    public DictationSource dictationSource;

    void Start()
    {
        this.NewRecognizer();
        this.dictationSource.DictationStopped += this.OnDictationStopped;
    }
    void NewRecognizer()
    {
        this.recognizer = new KeywordRecognizer(this.keywords);
        this.recognizer.OnPhraseRecognized += this.OnPhraseRecognized;
        this.recognizer.Start();
    }
    void OnDictationStopped(object sender, System.EventArgs e)
    {
        this.NewRecognizer();
    }
    void OnPhraseRecognized(PhraseRecognizedEventArgs args)
    {
        if (((args.confidence == ConfidenceLevel.Medium) ||
            (args.confidence == ConfidenceLevel.High)) &&
            this.keywords.Contains(args.text.ToLower()) &&
            (this.dictationSource != null))
        {
            this.recognizer.OnPhraseRecognized -= this.OnPhraseRecognized;
            this.recognizer = null;

            // https://docs.microsoft.com/en-us/windows/mixed-reality/voice-input-in-unity
            // The challenge we have here is that we want to use both a KeywordRecognizer
            // and a DictationRecognizer at the same time or, at least, we want to stop
            // one, start the other and so on.
            // Unity does not like this. It seems that we have to shut down the 
            // PhraseRecognitionSystem that sits underneath them each time but the
            // challenge then is that this seems to stall the UI thread.
            // So far (following the doc link above) the best plan seems to be to
            // not call Stop() on the recognizer or Dispose() it but, instead, to
            // just tell the system to shutdown completely.
            PhraseRecognitionSystem.Shutdown();

            // And then start up the other system.
            this.dictationSource.Listen();
        }
        else
        {
            Debug.Log(string.Format("Dictation: Listening for keywords {0}, heard {1} with confidence {2}, ignored",
                string.Join(",", this.keywords),
                args.text,
                args.confidence));
        }
    }
    void StartDictation()
    {
        this.dictationSource.Listen();
    }
    KeywordRecognizer recognizer;
}

and once again I’m going through some steps to try and switch the KeywordRecognizer on/off here so that I can then switch the DictationRecognizer on/off as simply calling Stop() on a recognizer isn’t enough.

With this in place, I can now stack these components in Unity and have them use each other;

image

and so now I’ve got some code that listens for keywords, switches dictation on, listens for dictation and then passes that on to some DictationSink.

That’s a nice place to implement some LUIS functionality.

In doing so, I ended up writing perhaps more code than I’d have liked as I’m not sure whether there is a LUIS client library that works from a Unity environment targeting the Stable .NET 3.5 subset. I’ve found this to be a challenge when calling a few Azure services from Unity and LUIS doesn’t seem to be an exception – there are client libraries on NuGet for most scenarios but I don’t think that they work in Unity (I could be wrong) and there aren’t generally examples/samples for Unity.

So…I rolled some small pieces of my own here, which isn’t so hard given that the call we need to make to LUIS is just a REST call.
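
To give a flavour of that (the endpoint format is my recollection of the docs at the time and the app id/key/score values are illustrative placeholders rather than anything from my service), the query is just an HTTP GET of the form;

https://REGION.api.cognitive.microsoft.com/luis/v2.0/apps/APP_ID?subscription-key=YOUR_KEY&verbose=true&q=create a cube 2 metres away

which comes back with JSON roughly along these lines;

{
  "query": "create a cube 2 metres away",
  "topScoringIntent": { "intent": "Create", "score": 0.97 },
  "entities": [
    { "entity": "cube", "type": "shapeType", "startIndex": 9, "endIndex": 12 },
    { "entity": "2", "type": "builtin.number", "startIndex": 14, "endIndex": 14 }
  ]
}

with some entity types also carrying a ‘resolution’ block of resolved values, and it’s that shape that the classes below are there to deserialise.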

Based on the documentation around the most basic “GET” functionality as detailed in the LUIS docs here,  I wrote some classes to represent the LUIS results;

using System;
using System.Linq;

namespace LUIS.Results
{
    [Serializable]
    public class QueryResultsIntent
    {
        public string intent;
        public float score;
    }
    [Serializable]
    public class QueryResultsResolution
    {
        public string[] values;

        public string FirstOrDefaultValue()
        {
            string value = string.Empty;
            
            if (this.values != null)
            {
                value = this.values.FirstOrDefault();
            }
            return (value);
        }
    }
    [Serializable]
    public class QueryResultsEntity
    {
        public string entity;
        public string type;
        public int startIndex;
        public int endIndex;
        public QueryResultsResolution resolution;

        public string FirstOrDefaultResolvedValue()
        {
            var value = string.Empty;

            if (this.resolution != null)
            {
                value = this.resolution.FirstOrDefaultValue();
            }

            return (value);
        }
        public string FirstOrDefaultResolvedValueOrEntity()
        {
            var value = this.FirstOrDefaultResolvedValue();

            if (string.IsNullOrEmpty(value))
            {
                value = this.entity;
            }
            return (value);
        }
    }
    [Serializable]
    public class QueryResults
    {
        public string query;
        public QueryResultsEntity[] entities;
        public QueryResultsIntent topScoringIntent;
    }
}

and then wrote some code to represent a Query of the LUIS service. I wrote this on top of pieces that I borrowed from my colleague Dave’s repo over here on GitHub, which provides some Unity-compatible REST pieces with JSON serialization etc.

using LUIS.Results;
using RESTClient;
using System;
using System.Collections;

namespace LUIS
{
    public class Query
    {
        string serviceBaseUrl;
        string serviceKey;

        public Query(string serviceBaseUrl,
            string serviceKey)
        {
            this.serviceBaseUrl = serviceBaseUrl;
            this.serviceKey = serviceKey;
        }
        public IEnumerator Get(Action<IRestResponse<QueryResults>> callback)
        {
            var request = new RestRequest(this.serviceBaseUrl, Method.GET);

            request.AddQueryParam("subscription-key", this.serviceKey);
            request.AddQueryParam("q", this.Utterance);
            request.AddQueryParam("verbose", this.Verbose.ToString());
            request.UpdateRequestUrl();

            yield return request.Send();

            request.ParseJson<QueryResults>(callback);
        }        
        public bool Verbose
        {
            get;set;
        }
        public string Utterance
        {
            get;set;
        }
    }
}

and so now I can query LUIS and get results back, making it fairly easy to build a DictationSink which passes the dictated speech (in text form) off to LUIS;

using LUIS;
using LUIS.Results;
using System;
using System.Linq;
using UnityEngine.Events;

[Serializable]
public class QueryResultsEventType : UnityEvent<QueryResultsEntity[]>
{
}

[Serializable]
public class DictationSinkHandler
{
    public string intentName;
    public QueryResultsEventType intentHandler;
}

public class LUISDictationSink : DictationSink
{
    public float minimumConfidenceScore = 0.5f;
    public DictationSinkHandler[] intentHandlers;
    public string luisApiEndpoint;
    public string luisApiKey;

    public override void OnDictatedText(string text)
    {
        var query = new Query(this.luisApiEndpoint, this.luisApiKey);

        query.Utterance = text;

        StartCoroutine(query.Get(
            results =>
            {
                if (!results.IsError)
                {
                    var data = results.Data;

                    if ((data.topScoringIntent != null) &&
                        (data.topScoringIntent.score > this.minimumConfidenceScore))
                    {
                        var handler = this.intentHandlers.FirstOrDefault(
                            h => h.intentName == data.topScoringIntent.intent);

                        if (handler != null)
                        {
                            handler.intentHandler.Invoke(data.entities);
                        }
                    }
                }
            }
        ));
    }
}

and this is really just a lookup – the code checks the confidence score provided by LUIS, makes sure that it is high enough for our purposes and then looks in a map between the names of the LUIS intents and the functions which handle those intents, set up here as UnityEvent<T> instances so that they can be configured in the editor.

So, in use if I have some LUIS model which has intents named Create, DeleteAll and DeleteType then I can configure up an instance of this LUISDictationSink in Unity as below to map these to functions inside of a class named LUISIntentHandlers in this case;

image

and then a handler for this type of interaction might look something like;

    public void OnIntentCreate(LUIS.Results.QueryResultsEntity[] entities)
    {
        // We need two pieces of information here - the shape type and
        // the distance.
        var entityShapeType = entities.FirstOrDefault(e => e.type == "shapeType");
        var entityDistance = entities.FirstOrDefault(e => e.type == "builtin.number");

	// ...
    }
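
Purely as an illustrative sketch of where a handler like that might go next (this isn’t the code in the repo – the entity type names come from my LUIS model but the body here is hypothetical), it could end up doing something like;

    public void OnIntentCreate(LUIS.Results.QueryResultsEntity[] entities)
    {
        var entityShapeType = entities.FirstOrDefault(e => e.type == "shapeType");
        var entityDistance = entities.FirstOrDefault(e => e.type == "builtin.number");

        float distance;

        if ((entityShapeType != null) &&
            (entityDistance != null) &&
            float.TryParse(entityDistance.FirstOrDefaultResolvedValueOrEntity(), out distance))
        {
            // Make a sphere or a cube...
            var primitive =
                entityShapeType.FirstOrDefaultResolvedValueOrEntity() == "sphere" ?
                    PrimitiveType.Sphere : PrimitiveType.Cube;

            var gameObject = GameObject.CreatePrimitive(primitive);

            // ...and place it the spoken distance along the camera's gaze.
            gameObject.transform.position =
                Camera.main.transform.position +
                (Camera.main.transform.forward * distance);
        }
    }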

and this all works fine and completes the route that goes from;

keyword recognition –> start dictation –> end dictation –> LUIS –> intent + entities –> handler in code –> action

Returning to Choices – Multi-Language & Dictation in the Cloud

I now have some code that works and it feels like the pieces are in the ‘best’ place in that I’m running as much as possible on the device and hopefully only calling the cloud when I need to. That said, if I could get the capabilities of LUIS offline and run them on the device then I’d like to do that too, but it’s not something that I think you can do right now with LUIS.

However, there is one limit to what I’m currently doing which isn’t immediately obvious: it doesn’t really offer the possibility of non-English languages and, specifically, on HoloLens (as far as I know) the recognizer classes only offer English support.

So, to support other languages I’d need to do my speech to text work via some other route – I can’t rely on the DictationRecognizer alone.

As an aside, it’s worth saying that I think multi-language support would need more work than just getting the speech to text to work in another language.

I think it would also require building a LUIS model in another language but that’s something that could be done.

An alternate way of performing speech-to-text that does support multiple languages would be to bring in a cloud-powered speech-to-text API like the Cognitive Services Speech API, and I could bring that into my code here by wrapping it up as a new type of DictationSource.

That speech-to-text API has some different ways of working. Specifically it can perform speech to text by;

  • Submitting an audio file in a specified format to a REST endpoint and getting back text.
  • Opening a websocket and sending chunks of streamed speech data up to the service to get back responses.

Of the two, the second has the advantage that it can be a bit smarter around detecting silence in the stream and it can also offer interim ‘hypotheses’ around what is being said before it delivers its ultimate view of what the utterance was. It can also support longer sections of speech than the file-based method.

So, this feels like a good way to go as an alternate DictationSource for my code.
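
If I do come back to it, the shape of the wrapper would probably just mirror the surface area of DictationSource so that the rest of the code doesn’t care where the text came from – something like the skeleton below, where the hard part (capturing audio and talking to the speech service) is deliberately left as comments because I haven’t written it;

using System;
using UnityEngine;

// Hypothetical skeleton only - none of the actual calls to the cloud
// speech-to-text service are implemented here.
public class CloudDictationSource : MonoBehaviour
{
    public event EventHandler DictationStopped;
    public DictationSink dictationSink;

    public void Listen()
    {
        // 1. Capture audio from the microphone in a format the service accepts.
        // 2. Send it to the speech service (file upload or websocket stream).
        // 3. Pass any recognised text on via OnRecognizedText().
    }

    void OnRecognizedText(string text)
    {
        if (this.dictationSink != null)
        {
            this.dictationSink.OnDictatedText(text);
        }
        if (this.DictationStopped != null)
        {
            this.DictationStopped(this, EventArgs.Empty);
        }
    }
}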

However, making use of that API requires sending a stream of audio data to the cloud down a websocket in a format that is compatible with the service on the other end of the wire and that’s code I’d like to avoid writing. Ideally, it feels like the sort of code that one developer who was close to the service would write once and everyone would then re-use.

That work is already done if you’re using the service from .NET and you’re in a situation where you can make use of the client library that wrappers up the service access but I don’t think that it’s going to work for me from Unity when targeting the “Stable .NET 3.5 Equivalent” scripting runtime.

So…for this post, I’m going to leave that as a potential ‘future exercise’ that I will try to return back to if time permits and I’ll update the post if I do so.

In the meantime, here’s the code.

Code

If you’re interested in the code then it’s wrapped up in a simple Unity project that’s here on github;

http://github.com/mtaulty/LUISPlayground

That code is coupled to a LUIS service which has some very basic intents and entities around creating simple Unity game objects (spheres and cubes) at a certain distance in front of the user. It’s very rough.

There are three intents inside of this service. One is intended to create objects with utterances like “I want to create a cube 2 metres away”

image

and then it’s possible to delete everything that’s been created with a simple utterance;

image

and lastly it’s possible to get rid of just the spheres/cubes with a different intent such as “get rid of all the cubes”;

image

If you wanted to make the existing code run then you’d need an API endpoint and a service key for such a service and so I’ve exported the service itself from LUIS as a JSON export into this file in the repo;

image

so it should be possible to go to the LUIS portal and import that as a service;

image

and then plug in the endpoint and service key into the code here;

image

First Experiment with Image Classification on Windows ML from UWP

There’s a broad set of scenarios enabled by making calls into the intelligent cloud services offered by Cognitive Services around vision, speech, knowledge, search and language.

I’ve written quite a lot about those services in the past on this blog and I showed them at events and in use on the Context show that I used to make for Channel 9 around ever more personal and contextual computing.

In the show, we often talked about what could be done in the cloud alongside what might be done locally on a device and specifically we looked at UWP (i.e. on device) support for speech and facial detection and we dug into using depth cameras and runtimes like the Intel RealSense cameras and Kinect sensors for face, hand and body tracking. Some of those ‘special camera’ capabilities have most recently been surfaced again by Project Gesture (formerly ‘Prague’) and I’ve written about some of that too.

I’m interested in these types of technologies and, against that backdrop, I was very excited to see the announcement of the;

AI Platform for Windows Developers

which brings to the UWP the capability to run pre-trained learning models inside an app running on Windows devices including (as the blog post that I referenced says) on HoloLens and IoT where you can think of a tonne of potential scenarios. I’m particularly keen to think about this on HoloLens where the device is making decisions around the user’s context in near-real-time and so being able to make low-latency calls for intelligence is likely to be very powerful.

The announcement was also quite timely for me as I’d recently got a bit frustrated around the UWP’s lack of support for this type of workload – a little background …

Recent UK University Hacks on Cognitive Services

I took part in a couple of hack events at UK universities in the past couple of months that were themed around cognitive services and had a great time watching and helping students hack on the services and especially the vision services.

As part of preparing for those hacks, I made use of the “Custom Vision” service for image classification for the first time;

image

and I found it to be a really accessible service. I very quickly managed to build an image classification model which I trained over a number of iterations to differentiate between pictures containing dachshunds, ponies or some other type of dog, although I didn’t train on too many non-dachshund dogs and so the model is a little weak in that area.

Here’s the portal where I have my dachshund recognition project going on;

image

and it works really well. I found it very easy to put together and you could build your own classifier by following the tutorial here;

Overview of building a classifier with Custom Vision

and, as part of the hack, I watched a lot of participating students make use of the Custom Vision service, realise that they wanted this functionality available on their device rather than just in the cloud, and follow the guidance here;

Export your model to mobile

to take the model that had been produced and export it so that they could make use of it locally inside apps running on Android or iOS via the export function;

image

and my frustration in looking at these pieces was that I had absolutely no idea how I would export one of these models and use it within an app running on the Universal Windows Platform.

Naturally, it’s easy to understand why iOS and Android support was added here, but I was really pleased to see that announcement around Windows ML and I thought that I’d try it out by taking my existing dachshund classification model, built and trained in the cloud, and seeing if I could run it against a video stream inside of a Windows 10 UWP app.

Towards that end, I produced a new iteration of my model trained on the “General (compact) domain” so that it could be exported;

image

and then I used the “Export” menu to save it to my desktop in CoreML format named dachshund.mlmodel.

Checking out the Docs

I had a good look through the documentation around Windows Machine Learning here;

Machine Learning

and set about trying to get the right bits of software together to see if I could make an app.

Getting the Right Bits

Operating System and SDK

At the time of writing, I’m running Windows 10 Fall Creators Update (16299) as my main operating system and support for these new Windows ML capabilities is coming in the next update, which is in preview right now.

Consequently, I had some work to do to get the right OS and SDKs;

  • I moved a machine to the Windows Insider Preview 17115.1 via Windows Update
  • I grabbed the Windows 10 SDK 17110 Preview from the Insiders site.

Python and Machine Learning

I didn’t have a Python installation on the machine in question so I went and grabbed Python 2.7 from https://www.python.org/. I did initially try 3.6 but had some problems with scripts on that and, as a non-Python person, I came to the conclusion that the best plan might be to try 2.7, which did seem to work for me.

I knew that I needed to convert my model from CoreML to ONNX and so I followed the document here;

Convert a model

to set about that process, which involved doing some pip installs and, specifically, for me I ended up running;

pip install coremltools

pip install onnxmltools

pip install winmltools

and that seemed to give me all that I needed to try and convert my model.

Converting the Model to ONNX

Just as the docs described, I ended up running these commands to do that conversion in a python environment;


from coremltools.models.utils import load_spec
from winmltools import convert_coreml
from winmltools.utils import save_model

model_coreml = load_spec('c:\users\mtaul\desktop\dachshunds.mlmodel')

model_onnx = convert_coreml(model_coreml)

save_model(model_onnx, 'c:\users\mtaul\desktop\dachshunds.onnx')

and that all seemed to work quite nicely. I also took a look at my original model_coreml.description which gave me;

input {
   name: "data"
   type {
     imageType {
       width: 227
       height: 227
       colorSpace: BGR
     }
   }
}
output {
   name: "loss"
   type {
     dictionaryType {
       stringKeyType {
       }
     }
   }
}
output {
   name: "classLabel"
   type {
     stringType {
     }
   }
}
predictedFeatureName: "classLabel"
predictedProbabilitiesName: "loss"

which seemed reasonable but I’m not really qualified to know whether it was exactly right or not – the mechanics of these models are a bit beyond me at the time of writing.

Having converted my model, though, I thought that I’d see if I could write some code against it.

Generating Code

I’d read about a code generation step in the document here;

Automatic Code Generation

and so I tried to use the mlgen tool on my .onnx model to generate some code. This was pretty easy and I just ran the command line;

"c:\Program Files (x86)\Windows Kits\10\bin\10.0.17110.0\x86\mlgen.exe" -i dachshunds.onnx -l CS -n "dachshunds.model" -o dachshunds.cs

and it spat out some C# code (it also does CPPCX) which is fairly short and which you could fairly easily construct yourself by looking at the types in the Windows.AI.MachineLearning.Preview namespace.

The C# code contained some machine generated names and so I replaced those and this is the code which I ended up with;

namespace daschunds.model
{
    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Windows.AI.MachineLearning.Preview;
    using Windows.Media;
    using Windows.Storage;

    // MIKET: I renamed the auto generated long number class names to be 'Daschund'
    // to make it easier for me as a human to deal with them 🙂
    public sealed class DacshundModelInput
    {
        public VideoFrame data { get; set; }
    }

    public sealed class DacshundModelOutput
    {
        public IList<string> classLabel { get; set; }
        public IDictionary<string, float> loss { get; set; }
        public DacshundModelOutput()
        {
            this.classLabel = new List<string>();
            this.loss = new Dictionary<string, float>();

            // MIKET: I added these 3 lines of code here after spending *quite some time* 🙂
            // Trying to debug why I was getting a binding exception at the point in the
            // code below where the call to LearningModelBindingPreview.Bind is called
            // with the parameters ("loss", output.loss) where output.loss would be
            // an empty Dictionary<string,float>.
            //
            // The exception would be 
            // "The binding is incomplete or does not match the input/output description. (Exception from HRESULT: 0x88900002)"
            // And I couldn't find symbols for Windows.AI.MachineLearning.Preview to debug it.
            // So...this could be wrong but it works for me and the 3 values here correspond
            // to the 3 classifications that my classifier produces.
            //
            this.loss.Add("daschund", float.NaN);
            this.loss.Add("dog", float.NaN);
            this.loss.Add("pony", float.NaN);
        }
    }

    public sealed class DacshundModel
    {
        private LearningModelPreview learningModel;
        public static async Task<DacshundModel> CreateDaschundModel(StorageFile file)
        {
            LearningModelPreview learningModel = await LearningModelPreview.LoadModelFromStorageFileAsync(file);
            DacshundModel model = new DacshundModel();
            model.learningModel = learningModel;
            return model;
        }
        public async Task<DacshundModelOutput> EvaluateAsync(DacshundModelInput input) {
            DacshundModelOutput output = new DacshundModelOutput();
            LearningModelBindingPreview binding = new LearningModelBindingPreview(learningModel);

            binding.Bind("data", input.data);
            binding.Bind("classLabel", output.classLabel);

            // MIKET: this generated line caused me trouble. See MIKET comment above.
            binding.Bind("loss", output.loss);

            LearningModelEvaluationResultPreview evalResult = await learningModel.EvaluateAsync(binding, string.Empty);
            return output;
        }
    }
}

There’s a big comment in that code where I changed what had been generated for me. In short, I found that my model seems to take an input parameter of type VideoFrame and produce output parameters of two ‘shapes’;

  • List<string> called “classLabel”
  • Dictionary<string,float> called “loss”

I spent quite a bit of time debugging an exception that I got by passing an empty Dictionary<string,float> as the variable called “loss” as I would see an exception thrown from the call to LearningModelBindingPreview.Bind() saying that the “binding is incomplete”.

It took a while but I finally figured out that I was supposed to pass a Dictionary<string,float> with some entries already in it and you’ll notice in the code above that I pass in 3 floats which I think relate to the 3 tags that my model can categorise against – namely dachshunds, dogs and ponies. I’m not at all sure that this is 100% right but it got me past that exception so I went with it.

With that, I had some generated code that I thought I could build into an app.

Making a ‘Hello World’ App

I made a very simple UWP app targeting SDK 17110 and made a UI which had a few TextBlocks and a CaptureElement within it.

<Page
    x:Class="App1.MainPage"
    xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
    xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
    xmlns:local="using:App1"
    xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
    mc:Ignorable="d">

    <Grid Background="{ThemeResource ApplicationPageBackgroundThemeBrush}">
        <CaptureElement x:Name="captureElement"/>
        <StackPanel HorizontalAlignment="Center" VerticalAlignment="Bottom">
            <StackPanel.Resources>
                <Style TargetType="TextBlock">
                    <Setter Property="Foreground" Value="White"/>
                    <Setter Property="FontSize" Value="18"/>
                    <Setter Property="Margin" Value="5"/>
                </Style>
            </StackPanel.Resources>
            <TextBlock Text="Category " HorizontalTextAlignment="Center"><Run Text="{x:Bind Category,Mode=OneWay}"/></TextBlock>
            <StackPanel Orientation="Horizontal" HorizontalAlignment="Center">
                <TextBlock Text="Dacshund "><Run Text="{x:Bind Dacshund,Mode=OneWay}"/></TextBlock>
                <TextBlock Text="Dog "><Run Text="{x:Bind Dog,Mode=OneWay}"/></TextBlock>
                <TextBlock Text="Pony "><Run Text="{x:Bind Pony,Mode=OneWay}"/></TextBlock>
            </StackPanel>
        </StackPanel>
    </Grid>
</Page>

and then I wrote some code to get hold of a camera on the device (I went for the back panel camera), wire it up to the CaptureElement in the UI and also make use of a MediaFrameReader to get preview video frames off the camera, which I’m hoping to run through the classification model.

That code is here – there’s some discussion to come in a moment about the RESIZE constant;

//#define RESIZE
namespace App1
{
    using daschunds.model;
    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Linq;
    using System.Runtime.InteropServices;
    using System.Threading;
    using System.Threading.Tasks;
    using Windows.Devices.Enumeration;
    using Windows.Graphics.Imaging;
    using Windows.Media.Capture;
    using Windows.Media.Capture.Frames;
    using Windows.Media.Devices;
    using Windows.Storage;
    using Windows.Storage.Streams;
    using Windows.UI.Xaml;
    using Windows.UI.Xaml.Controls;
    using Windows.UI.Xaml.Media.Imaging;
    using Windows.Media;
    using System.ComponentModel;
    using System.Runtime.CompilerServices;
    using Windows.UI.Core;
    using System.Runtime.InteropServices.WindowsRuntime;

    public sealed partial class MainPage : Page, INotifyPropertyChanged
    {
        public event PropertyChangedEventHandler PropertyChanged;

        public MainPage()
        {
            this.InitializeComponent();
            this.inputData = new DacshundModelInput();
            this.Loaded += OnLoaded;
        }
        public string Dog
        {
            get => this.dog;
            set => this.SetProperty(ref this.dog, value);
        }
        string dog;
        public string Pony
        {
            get => this.pony;
            set => this.SetProperty(ref this.pony, value);
        }
        string pony;
        public string Dacshund
        {
            get => this.daschund;
            set => this.SetProperty(ref this.daschund, value);
        }
        string daschund;
        public string Category
        {
            get => this.category;
            set => this.SetProperty(ref this.category, value);
        }
        string category;
        async Task LoadModelAsync()
        {
            var file = await StorageFile.GetFileFromApplicationUriAsync(
                new Uri("ms-appx:///Model/daschunds.onnx"));

            this.learningModel = await DacshundModel.CreateDaschundModel(file);
        }
        async Task<DeviceInformation> GetFirstBackPanelVideoCaptureAsync()
        {
            var devices = await DeviceInformation.FindAllAsync(
                DeviceClass.VideoCapture);

            var device = devices.FirstOrDefault(
                d => d.EnclosureLocation.Panel == Windows.Devices.Enumeration.Panel.Back);

            return (device);
        }
        async void OnLoaded(object sender, RoutedEventArgs e)
        {
            await this.LoadModelAsync();

            var device = await this.GetFirstBackPanelVideoCaptureAsync();

            if (device != null)
            {
                await this.CreateMediaCaptureAsync(device);
                await this.mediaCapture.StartPreviewAsync();

                await this.CreateMediaFrameReaderAsync();
                await this.frameReader.StartAsync();
            }
        }

        async Task CreateMediaFrameReaderAsync()
        {
            var frameSource = this.mediaCapture.FrameSources.Where(
                source => source.Value.Info.SourceKind == MediaFrameSourceKind.Color).First();

            this.frameReader =
                await this.mediaCapture.CreateFrameReaderAsync(frameSource.Value);

            this.frameReader.FrameArrived += OnFrameArrived;
        }

        async Task CreateMediaCaptureAsync(DeviceInformation device)
        {
            this.mediaCapture = new MediaCapture();

            await this.mediaCapture.InitializeAsync(
                new MediaCaptureInitializationSettings()
                {
                    VideoDeviceId = device.Id
                }
            );
            // Try and set auto focus but on the Surface Pro 3 I'm running on, this
            // won't work.
            if (this.mediaCapture.VideoDeviceController.FocusControl.Supported)
            {
                await this.mediaCapture.VideoDeviceController.FocusControl.SetPresetAsync(FocusPreset.AutoNormal);
            }
            else
            {
                // Nor this.
                this.mediaCapture.VideoDeviceController.Focus.TrySetAuto(true);
            }
            this.captureElement.Source = this.mediaCapture;
        }

        async void OnFrameArrived(MediaFrameReader sender, MediaFrameArrivedEventArgs args)
        {
            if (Interlocked.CompareExchange(ref this.processingFlag, 1, 0) == 0)
            {
                try
                {
                    using (var frame = sender.TryAcquireLatestFrame())
                    using (var videoFrame = frame.VideoMediaFrame?.GetVideoFrame())
                    {
                        if (videoFrame != null)
                        {
                            // From the description (both visible in Python and through the
                            // properties of the model that I can interrogate with code at
                            // runtime here) my image seems to be 227 by 227 which is an 
                            // odd size but I'm assuming that I should resize the frame here to 
                            // suit that. I'm also assuming that what I'm doing here is 
                            // expensive 

#if RESIZE
                            using (var resizedBitmap = await ResizeVideoFrame(videoFrame, IMAGE_SIZE, IMAGE_SIZE))
                            using (var resizedFrame = VideoFrame.CreateWithSoftwareBitmap(resizedBitmap))
                            {
                                this.inputData.data = resizedFrame;
#else       
                                this.inputData.data = videoFrame;
#endif // RESIZE

                                var evalOutput = await this.learningModel.EvaluateAsync(this.inputData);

                                await this.ProcessOutputAsync(evalOutput);

#if RESIZE
                            }
#endif // RESIZE
                        }
                    }
                }
                finally
                {
                    Interlocked.Exchange(ref this.processingFlag, 0);
                }
            }
        }
        string BuildOutputString(DacshundModelOutput evalOutput, string key)
        {
            var result = "no";

            if (evalOutput.loss[key] > 0.25f)
            {
                result = $"{evalOutput.loss[key]:N2}";
            }
            return (result);
        }
        async Task ProcessOutputAsync(DacshundModelOutput evalOutput)
        {
            string category = evalOutput.classLabel.FirstOrDefault() ?? "none";
            string dog = $"{BuildOutputString(evalOutput, "dog")}";
            string pony = $"{BuildOutputString(evalOutput, "pony")}";
            string dacshund = $"{BuildOutputString(evalOutput, "daschund")}";

            await this.Dispatcher.RunAsync(CoreDispatcherPriority.Normal,
                () =>
                {
                    this.Dog = dog;
                    this.Pony = pony;
                    this.Dacshund = dacshund;
                    this.Category = category;
                }
            );
        }

        /// <summary>
        /// This is horrible - I am trying to resize a VideoFrame and I haven't yet
        /// found a good way to do it so this function goes through a tonne of
        /// stuff to try and resize it but it's not pleasant at all.
        /// </summary>
        /// <param name="frame"></param>
        /// <param name="width"></param>
        /// <param name="height"></param>
        /// <returns></returns>
        async static Task<SoftwareBitmap> ResizeVideoFrame(VideoFrame frame, int width, int height)
        {
            SoftwareBitmap bitmapFromFrame = null;
            bool ownsFrame = false;

            if (frame.Direct3DSurface != null)
            {
                bitmapFromFrame = await SoftwareBitmap.CreateCopyFromSurfaceAsync(
                    frame.Direct3DSurface,
                    BitmapAlphaMode.Ignore);

                ownsFrame = true;
            }
            else if (frame.SoftwareBitmap != null)
            {
                bitmapFromFrame = frame.SoftwareBitmap;
            }

            // We now need it in a pixel format that an encoder is happy with
            var encoderBitmap = SoftwareBitmap.Convert(
                bitmapFromFrame, BitmapPixelFormat.Bgra8, BitmapAlphaMode.Premultiplied);

            if (ownsFrame)
            {
                bitmapFromFrame.Dispose();
            }

            // We now need an encoder - should we keep creating it every time?
            var memoryStream = new MemoryStream();

            var encoder = await BitmapEncoder.CreateAsync(
                BitmapEncoder.JpegEncoderId, memoryStream.AsRandomAccessStream());

            encoder.SetSoftwareBitmap(encoderBitmap);
            encoder.BitmapTransform.ScaledWidth = (uint)width;
            encoder.BitmapTransform.ScaledHeight = (uint)height;

            await encoder.FlushAsync();

            var decoder = await BitmapDecoder.CreateAsync(memoryStream.AsRandomAccessStream());

            var resizedBitmap = await decoder.GetSoftwareBitmapAsync(
                BitmapPixelFormat.Bgra8, BitmapAlphaMode.Premultiplied);

            memoryStream.Dispose();

            encoderBitmap.Dispose();

            return (resizedBitmap);
        }
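
        // Raises PropertyChanged so that any UI bindings pick up the new value.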
        void SetProperty<T>(ref T storage, T value, [CallerMemberName] string propertyName = null)
        {
            storage = value;
            this.PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(propertyName));
        }
        DacshundModelInput inputData;
        int processingFlag;
        MediaFrameReader frameReader;
        MediaCapture mediaCapture;
        DacshundModel learningModel;

        static readonly int IMAGE_SIZE = 227;
    }
}

In doing that, the main thing that I was unclear about was whether I needed to resize the VideoFrames to fit my model or whether I could leave them alone and have the code in between me and the model “do the right thing” with the VideoFrame.

Partly, that confusion comes from my model’s description seeming to say that it was expecting frames at a resolution of 227 x 227 in BGR format and that feels like a very odd resolution to me.
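
As an aside, the preview API does let you interrogate the loaded model for this at runtime. The sketch below is roughly how I’d go about dumping that information given a LearningModelPreview instance (the generated wrapper keeps its LearningModelPreview in a private field, so this would need a small tweak) and note that the ImageVariableDescriptorPreview property names here are an assumption on my part from poking around in the debugger rather than something I’ve nailed down;

#if ENABLE_WINMD_SUPPORT
// Sketch only - dumps what the model says it wants for inputs/outputs.
// Assumption: the Width/Height/BitmapPixelFormat property names on
// ImageVariableDescriptorPreview are as I remember them - verify in the
// debugger before relying on this.
void DumpModelDescription(LearningModelPreview model)
{
    foreach (var feature in model.Description.InputFeatures)
    {
        System.Diagnostics.Debug.WriteLine($"input: {feature.Name}");

        var imageFeature = feature as ImageVariableDescriptorPreview;

        if (imageFeature != null)
        {
            // This is where the 227 x 227 BGR expectation shows up for my model.
            System.Diagnostics.Debug.WriteLine(
                $"  expects {imageFeature.Width} x {imageFeature.Height} ({imageFeature.BitmapPixelFormat})");
        }
    }
    foreach (var feature in model.Description.OutputFeatures)
    {
        System.Diagnostics.Debug.WriteLine($"output: {feature.Name}");
    }
}
#endif // ENABLE_WINMD_SUPPORT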

Additionally, I found that trying to resize a VideoFrame seemed to be a bit of a painful task and I didn’t find a better way than going through a SoftwareBitmap with a BitmapEncoder, BitmapDecoder and a BitmapTransform.
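
One alternative that I haven’t actually tried here would be to lean on Win2D to do the scaling rather than that encode/decode round trip. Purely as a sketch (it assumes the Win2D.uwp NuGet package is referenced in the generated UWP project and that the incoming SoftwareBitmap has already been converted to BGRA8/premultiplied as in the code above) it might look something like;

#if ENABLE_WINMD_SUPPORT
// Sketch only - scales a BGRA8, premultiplied SoftwareBitmap using Win2D
// instead of the BitmapEncoder/BitmapDecoder route.
// Needs: using Microsoft.Graphics.Canvas; using Windows.Foundation;
// plus the existing Windows.Graphics.Imaging and
// System.Runtime.InteropServices.WindowsRuntime usings.
static SoftwareBitmap ResizeWithWin2D(SoftwareBitmap source, int width, int height)
{
    var device = CanvasDevice.GetSharedDevice();

    using (var canvasBitmap = CanvasBitmap.CreateFromSoftwareBitmap(device, source))
    using (var renderTarget = new CanvasRenderTarget(device, width, height, 96.0f))
    {
        // Drawing into a destination rectangle of the target's size does the scaling.
        using (var session = renderTarget.CreateDrawingSession())
        {
            session.DrawImage(canvasBitmap, new Rect(0, 0, width, height));
        }
        // Copy the scaled pixels back out into a new SoftwareBitmap.
        return SoftwareBitmap.CreateCopyFromBuffer(
            renderTarget.GetPixelBytes().AsBuffer(),
            BitmapPixelFormat.Bgra8,
            width,
            height,
            BitmapAlphaMode.Premultiplied);
    }
}
#endif // ENABLE_WINMD_SUPPORT

If that worked, it could slot in where ResizeVideoFrame is called today but whether it would actually be any cheaper is something I’d want to measure rather than assume.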

The code that I ended up with got fairly ugly and I was never quite sure whether I needed it or not so, for the moment, I conditionally compiled it into my little test app so that I can switch between two modes;

  • Pass the VideoFrame untouched to the underlying evaluation layer
  • Attempt to resize the VideoFrame to 227 x 227 before passing it to the underlying evaluation layer.

I’ve a feeling that it’s ok to leave the VideoFrame untouched but I’m only about 20% sure of that at the time of writing, and the follow-on piece here assumes that I’m running with the version of the code that does not resize.

Does It Work?

How does the app work out? I’m not yet sure 🙂 and there are a couple of things where I’m not certain.

  • I’m running on a Surface Pro 3 where the camera has a fixed focus and it doesn’t do a great job of focusing on my images (given that I’ve no UI to control the focus) and so it’s hard to tell at times how good an image the camera is getting. I’ve tried it with both the front and back cameras on that device but I don’t see too much difference.
  • I’m unsure of whether the way in which I’m passing the VideoFrame to the model is right or not.

But I did run the app and presented it with 3 pictures – one of a dachshund, one of an alsatian (which it should understand is a dog but not a dachshund) and one of a pony.

Here’s some examples showing the sort of output that the app displayed;

image (output for the dachshund picture)

I’m not sure about the category of ‘dog’ here but the app seems fairly confident that this is both a dog and a dachshund so that seems good to me.

Here’s another (the model has been trained on alsatian images to some extent);

image (output for the alsatian picture)

and so that seems like a good result. I then held up my phone, displaying an image of a pony, in front of the camera;

image (output for the pony picture)

and so that seems to work reasonably well. That is with the code which does not resize the image down to 227×227; I found that the code which did resize didn’t seem to behave the same way, so maybe my notion of resizing (or the actual code which does the resizing) isn’t right.

Wrapping Up

First impressions here are very good 🙂 in that I managed to get something working in a very short time.

Naturally, it’d be interesting to try and build a better understanding around the binding of parameters and I’d also be interested to try this out with a camera that was doing a better job of focusing.

It’d also be interesting to point the camera at real-world objects rather than 2D pictures of those objects and so perhaps I need to build a model that classifies something a little more ‘household’ than dogs and ponies, making it easier to test without going out into a field 🙂

I’d also like to try some of this out on other types of devices including HoloLens as/when that becomes possible.

Code

If you want the code that I put together here then it’s in this github repo;

https://github.com/mtaulty/WindowsMLExperiment

Keep in mind that this is just my first experiment and I’m muddling my way through; it looks like the code conditionally compiled out with the RESIZE constant can be ignored but, if I hear otherwise, I’ll update the post.

Lastly, you’ve probably noticed many different spellings of the word dachshund in the code and in the blog post – I should have stuck with poodles 🙂