A Simple glTF Viewer for HoloLens

NB: The usual blog disclaimer for this site applies to posts around HoloLens. I am not on the HoloLens team. I have no details on HoloLens other than what is on the public web and so what I post here is just from my own experience experimenting with pieces that are publicly available and you should always check out the official developer site for the product documentation.

A quick note – I’m still not sure whether I should bring this blog back to life having paused it but I had a number of things running around in my head that are easier if written down and so that’s what I’ve done Smile

Viewing 3D Files on a HoloLens

A few weeks ago, a colleague came to me with two 3D models packaged in files and said “I just want to show these two models to a customer on a HoloLens”.

I said to him;

“No problem, open up the files in 3D Viewer on the PC, have a look at them and then transfer them over to HoloLens and view them in 3D Viewer there”

Having passed on this great advice, I thought I’d better try it out myself and, like much of my best advice, it didn’t actually work Winking smile

Here’s why it doesn’t work. I won’t use the actual models in this blog post so let’s assume that it was this model from Remix3D;

image

Now, I can open that model in Paint3D or 3D Viewer, both of which are free and built-in on Windows 10 and I can get a view something like this one;

image

which tells me that this model is 68,000 polygons so it’s not a tiny model but it’s not a particularly big one either and I’d expect that it would display fine on a mobile device which might not be the case if it was 10x or 100x as big.

Now, knowing that there’s an application on my PC called “3D Viewer” and knowing that there’s one on my HoloLens called “3D Viewer” might lead me to believe that they are the same application with the same set of capabilities and so I might just expect to be able to move to the HoloLens, run the 3D Viewer application there and open the same model.

But I can’t.

3D Viewer on PC

If you run up the 3D Viewer on a PC then you get an app which runs in a Window and which displays a 3D model with a whole range of options including being able to control how the model is rendered, interacting with animations, changing the lighting and so on;

image

The application lets you easily load up model files from the file system or from the Remix3D site;

image

You can also use this application to “insert” the model into the real-world via a “Mixed Reality” mode as below;

image

I’d say that (for me) this is very much at the “Augmented Reality” end of the spectrum in that while the model here might look like it’s sitting on my monitor, I can actually place it in mid-air so I’m not sure that it’s really identifying planes for the model to sit on. I can pick up my laptop and wander around the model and that works to some extent although I find it fairly easy to confuse the app.

One other thing that I’d say in passing is that I have no knowledge around how this application offers this experience or how a developer would build a similar experience – I’m unaware of any platform APIs that help you build this type of thing for a PC using a regular webcam in this way.

3D Viewer on HoloLens

3D Viewer on HoloLens also runs in a window as you can see here;

image

and you can also open up files from the file system or from the Remix3D site or from a pre-selected list of “Holograms” which line up with the content that used to be available in the original “Holograms” app going all the way back to when the device was first made available.

The (understandable) difference here is that when you open a model, it is not displayed as a 3D object inside of the application’s Window as that would be a bit lame on a HoloLens device.

Instead, the model is added to the HoloLens shell as shown below;

image

This makes sense and it’s very cool but, on the other hand, it’s not really an immersive viewing application – it’s a 2D application which invokes the shell to display a 3D object.

As an aside, it’s easy to ask the Shell to display a 3D object using a URI scheme and I wrote about that here a while ago and I suspect (i.e. I don’t know) that this is what the 3D Viewer application is doing here;

Placing 3D Models in the Mixed Reality Home

The other aspect of this is that 3D models displayed by the Shell have requirements;

Create 3D models for use in the home

and so you can’t just display an arbitrary model here and I tend to find that most models that I try and use in this way don’t work.

For example, if we go back to the model of a Surface Book 2 that I displayed in 3D Viewer on my PC then I can easily copy that model across to my HoloLens using the built-in “Media Transfer Protocol” support which lets me just see the device’s storage in Explorer once I’ve connected it via USB and then open it up in 3D Viewer where I see;

image

and so I find that most general models, regardless of their polygon count, don’t open within the 3D Viewer on HoloLens – they tend to display this message instead and that’s understandable given that the application is trying to;

  • do the right thing by not having the user open up huge models that won’t then render well
  • integrate the models into the Shell experience which has requirements that presumably can’t just be ignored.

So, if you want a simple viewer which just displays an arbitrary model in an immersive setting then 3D Viewer isn’t quite so general purpose.

This left me stuck with my colleague who wanted something simple to display his models and so I made the classic mistake.

I said “I’ll write one for you” Winking smile

This Does Not Solve the Large/Complex Models Problem

I should stress that me setting off to write a simple, custom viewer is never going to solve the problem of displaying large, complex 3D models on a mobile device like a HoloLens and, typically, you need to think about some strategy for dealing with that type of complexity on a mobile device. There are guidelines around this type of thing here;

Performance recommendations for HoloLens apps

and there are tools/services out there to help with this type of thing.

My colleague originally provided me with a 40K polygon model and a 500K polygon model.

I left the 40K model alone and used 3DS Max to optimise the 500K poly model down to around 100K which rendered fine for me on HoloLens through the application that I ended up building.

It took a bit of experimentation in the different tools to find the right way to go about it as some tools failed to load the models, others produced results that didn’t look great, etc. but it didn’t take too long to decimate the larger one.

Building glTF Viewer Version 1.0

So, to help out with the promise I’d made to my colleague, I built a simple app. It’s in the Windows Store over here and the source for it is on Github over here.

It’s currently heading towards Version 2.0 when I merge the branches back together and get the Store submission done.

For version 1.0, what I wanted was something that would allow a user to;

  • open a 3D file in .GLB/.GLTF format from their HoloLens storage.
  • display the 3D model from it.
  • manipulate it by scaling, rotating and translating.
  • have as little UI as possible and drive any needed interactions through speech.

and that was pretty much all that I wanted – I wanted to keep it very simple and as part of that I decided to deliberately avoid;

  • anything to do with other 3D model file formats but was, instead, quite happy to assume that people would find conversion tools (e.g. Paint3D, 3D Builder, etc) that could generate single file (.GLB) or multi-file (.GLTF) model files for them to import.
  • any attempt to open up files from cloud locations via OneDrive etc.

With that in mind, I set about trying to build out a new Unity-based application and I made a couple of quick choices;

  • that I would use the Mixed Reality Toolkit for Unity and I chose to use the current version of the Toolkit rather than the vNext toolkit as that’s still “work in progress” although I plan to port at a later point.
    • this meant that I could follow guidance and use the LTS release of Unity – i.e. a 2017.4.* version which is meant to work nicely with the toolkit.
  • that I would use UnityGLTF as a way of reading GLTF files inside of Unity.
  • that I would use sub-modules in git as a way of bringing those two repos into my project as described by my friend Martin over here.

I also made a choice that I would use standard file dialogs for opening up files within my application. This might seem like an obvious choice but those dialogs only really work nicely once your HoloLens is running on the “Redstone 5” version of Windows 10 as documented here;

Mixed Reality Release Notes – Current Release Notes

and so I was limiting myself to only running on devices that are up-to-date but I don’t think that’s a big deal for HoloLens users.
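
Just to illustrate what I mean by “standard file dialogs” – here’s a rough sketch (not the app’s exact code) of the usual UWP FileOpenPicker usage, filtered down to the file types the viewer cares about;

using System.Threading.Tasks;
using Windows.Storage;
using Windows.Storage.Pickers;

static class ModelFilePicker
{
    // Shows the standard UWP file dialog filtered to glTF file types.
    // Returns null if the user cancels the dialog.
    public static async Task<StorageFile> PickModelFileAsync()
    {
        var picker = new FileOpenPicker()
        {
            SuggestedStartLocation = PickerLocationId.Objects3D,
            ViewMode = PickerViewMode.List
        };
        picker.FileTypeFilter.Add(".glb");
        picker.FileTypeFilter.Add(".gltf");

        return await picker.PickSingleFileAsync();
    }
}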

In terms of how the application is put together, it’s a fairly simple Unity application using only a couple of features from the Mixed Reality Toolkit beyond the base support for cameras, input etc.

Generally, beyond a few small snags with Unity when it came to generating the right set of assets for the Windows Store I got that application built pretty quickly & submitted it to the Store.

However, I did hit a few small challenges…

A Small Challenge with UnityGLTF

I did hit a bit of a snag because the Mixed Reality Toolkit makes some use of pieces from a specific version of UnityGLTF to provide functionality which loads the Windows Mixed Reality controller models when running on an immersive headset.

UnityGLTF (scripts and binaries) in the Mixed Reality Toolkit

I wanted to be able to bring all of UnityGLTF (a later version) into my project alongside the Mixed Reality Toolkit and so that caused problems because both scripts & binaries would be duplicated and Unity wasn’t very happy about that Smile

I wrote a little ‘setup’ script to remove the GLTF folder from the Mixed Reality Toolkit which was ok except it left me with a single script named MotionControllerVisualizer.cs which wouldn’t build because it had a dependency on UnityGLTF methods that were no longer part of the Unity GLTF code-base (i.e. I happened to have the piece of code which seemed to have an out-of-date dependency).

That was a little tricky for me to fix so I got rid of that script too and fixed up the scripts that took a dependency on it by adding my own, mock implementation of that class into my project knowing that nothing in my project was ever going to display a motion controller anyway.

It’s all a bit “hacky” but it got me to the point where I could combine the MRTK and UnityGLTF in one place and build out what I wanted.

A Small Challenge with Async/Await and CoRoutines

One other small challenge that I hit while putting together my version 1.0 application is the mixing of the C# async/await model with Unity’s CoRoutines.

I’ve hit this before and I fully understand where Unity has come from in terms of using CoRoutines but it still bites me in places and, specifically, it bit me a little here in that I had code which was using routines within the UnityGLTF which are CoRoutine based and I needed to get more information around;

  • when that code completed
  • what exceptions (if any) got thrown by that code

There are a lot of posts out there on the web around this area and, in my specific case, I had to write some extra code to glue together running a CoRoutine, catching exceptions from it and tying it into async/await. It wasn’t too challenging, it just felt like “extra work” that I’m sure won’t have to be done in later years as these two models get better aligned. Ironically, this situation was possibly more clear-cut when async/await weren’t really available to use inside of Unity’s scripts.
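
For what it’s worth, the general shape of that kind of glue looks something like the sketch below – this is illustrative rather than the code from my app, it assumes it’s running on a MonoBehaviour so that StartCoroutine is available, and this simple version doesn’t recurse into nested IEnumerators;

using System;
using System.Collections;
using System.Threading.Tasks;
using UnityEngine;

public class CoroutineRunner : MonoBehaviour
{
    // Runs an IEnumerator-based routine, surfacing completion and exceptions as a Task.
    public Task RunCoroutineAsTaskAsync(IEnumerator routine)
    {
        var completion = new TaskCompletionSource<bool>();
        this.StartCoroutine(this.Wrap(routine, completion));
        return completion.Task;
    }
    IEnumerator Wrap(IEnumerator routine, TaskCompletionSource<bool> completion)
    {
        while (true)
        {
            bool moved;
            try
            {
                // Step the inner routine ourselves so that exceptions can be caught -
                // a 'yield return' can't sit directly inside a try/catch.
                moved = routine.MoveNext();
            }
            catch (Exception ex)
            {
                completion.SetException(ex);
                yield break;
            }
            if (!moved)
            {
                break;
            }
            yield return routine.Current;
        }
        completion.SetResult(true);
    }
}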

Another Small Challenge with CoRoutines & Unity’s Threading Model

Another small challenge here is that the UnityGLTF code which loads a model needs to, naturally, create GameObjects and other UI constructs inside of Unity which aren’t thread-safe and have affinity to the UI thread. So, there’s no real opportunity to run this potentially expensive CoRoutine on some background thread but, rather, it hogs the UI thread a bit while it’s loading and creating GameObjects.

I don’t think that’s radically different from other UI frameworks but I did contemplate trying to abstract out the creation of the UI objects so as to defer it until some later point when it could all be done in one go but I haven’t attempted to do that and so, currently, while the GLTF loading is happening my UI is displaying a progress wheel which can miss a few updates Sad smile

Building glTF Viewer Version 2.0

Having produced my little Version 1.0 app and submitted it to the Store, the one thing that I really wanted to add was the support for a “shared holographic experience” such that multiple users could see the same model in the same physical place. It’s a common thing to want to do with HoloLens and it seems to be found more in large, complex, enterprise apps than in just simple, free tools from the Store and so I thought I would try and rectify that a little.

In doing so, I wanted to try and keep any network “infrastructure” as minimal as possible and so I went with the following assumptions.

  • that the devices that wanted to share a hologram were in the same space on the same network and that network would allow multicast packets.
  • sharing is assumed in the sense that the experience would automatically share holograms rather than the user having to take some extra steps.
  • that not all the devices would necessarily have the files for the models that are loaded on the other devices.
  • that there would be no server or cloud connectivity required.

The way in which I implemented this centres around a HoloLens running the glTF Viewer app acting as a basic web server which serves content out of its 3D Objects folder such that other devices can request that content and copy it into their own 3D Objects folder.
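
As a rough illustration of what I mean by a “basic web server” – the sketch below serves files out of a folder over HttpListener. It’s not the app’s actual code, the port/prefix is made up and, as I’ll come back to under Challenge 3, reading brokered files like those in the 3D Objects folder needs more care than a plain File.OpenRead;

using System.IO;
using System.Net;
using System.Threading.Tasks;

class SimpleFileServer
{
    readonly string rootFolder;
    readonly HttpListener listener;

    public SimpleFileServer(string rootFolder, int port)
    {
        this.rootFolder = rootFolder;
        this.listener = new HttpListener();
        this.listener.Prefixes.Add($"http://+:{port}/");
    }
    public async Task RunAsync()
    {
        this.listener.Start();

        while (true)
        {
            var context = await this.listener.GetContextAsync();

            // Map the request path onto a file under the folder being served.
            var relativePath = context.Request.Url.AbsolutePath.TrimStart('/');
            var filePath = Path.Combine(this.rootFolder, relativePath);

            // NB: for brokered locations (like 3D Objects) the file access itself needs
            // StorageFile rather than File.* on .NET Standard 2.0 - see Challenge 3 below.
            if (File.Exists(filePath))
            {
                using (var file = File.OpenRead(filePath))
                {
                    context.Response.ContentLength64 = file.Length;
                    await file.CopyToAsync(context.Response.OutputStream);
                }
            }
            else
            {
                context.Response.StatusCode = 404;
            }
            context.Response.Close();
        }
    }
}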

The app then operates as below to enable sharing;

  • When a model is opened on a device
    • The model is given a unique ID.
    • A list of all the files involved in the model is collected (as GLTF models can be packaged as many files) as the model is opened.
    • A file is written to the 3D Objects folder storing a relative URI for each of these files to be obtained remotely by another device.
    • A spatial anchor for the model is exported into another file stored in the 3D Objects folder.
    • A UDP message is multicast to announce that a new model (with an ID) is now available from a device (with an IP address).
    • The model is made so that it can be manipulated (scale, rotate, translate) and those manipulations (relative to the parent) are multi-cast over the network with the model identifier attached to them.
  • When a UDP message announcing a new model is received on a device
    • The device asks the user whether they want to access that model.
    • The device does web requests to the originating device asking for the URIs for all the files involved in that model.
    • The device downloads (if necessary) each model file to the same location in its 3D Objects folder.
    • The device downloads the spatial anchor file.
    • The device displays the model from its own local storage & attaches the spatial anchor to place it in the same position in the real world.
    • The model is made so that it cannot be manipulated but, instead, picks up any UDP multicasts with update transformations and applies them to the model (relative to its parent which is anchored).

and that’s pretty much it.
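
To give a flavour of the “announce” step, here’s a minimal sketch using the UWP DatagramSocket APIs – the multicast group, port and message layout here are all assumptions for illustration rather than the app’s real wire format;

using System;
using System.Threading.Tasks;
using Windows.Networking;
using Windows.Networking.Sockets;
using Windows.Storage.Streams;

static class ModelAnnouncer
{
    // These values are made up for the sketch.
    static readonly HostName MulticastGroup = new HostName("239.0.0.100");
    const string Port = "49152";

    public static async Task AnnounceNewModelAsync(Guid modelId)
    {
        var socket = new DatagramSocket();

        // Bind and join the multicast group so that other devices on the network can hear us.
        await socket.BindServiceNameAsync(Port);
        socket.JoinMulticastGroup(MulticastGroup);

        using (var outputStream = await socket.GetOutputStreamAsync(MulticastGroup, Port))
        using (var writer = new DataWriter(outputStream))
        {
            // Receivers then ask the announcing device (by its IP address) for the
            // list of files behind this model ID.
            writer.WriteGuid(modelId);
            await writer.StoreAsync();
        }
    }
}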

This is all predicated on the idea that I can have a HoloLens application which is acting as a web server and I had in mind that this should be fairly easy because UWP applications (from 16299+) now support .NET Standard 2.0 and HttpListener is part of .NET Standard 2.0 and so I could see no real challenge with using that type inside of my application as I’d written about here;

UWP and .NET Standard 2.0–Remembering the ‘Forgotten’ APIs 🙂

but there were a few challenges that I met with along the way.

Challenge Number 1 – Picking up .NET Standard 2.0

I should say that I’m long past the point of being worried about being seen to not understand something and am more at the point of realising that I don’t really understand anything  Smile

I absolutely did not understand the ramifications of wanting to modify my existing Unity project to start making use of HttpListener Smile

Fairly early on, I came to a conclusion that I wasn’t going to be able to use HttpListener inside of a Unity 2017.4.* project.

Generally, the way in which I’ve been developing in Unity for HoloLens runs something like this;

  • I am building for the UWP so that’s my platform.
  • I use the .NET scripting backend.
  • I write code in the editor and I hide quite a lot of code from the editor behind ENABLE_WINMD_SUPPORT conditional compilation (there’s a small example of this below) because the editor runs on Mono and it doesn’t understand the UWP API surface.
  • I press the build button in Unity to generate a C#/.NET project in Visual Studio.
  • I build that project and can then use it to deploy, debug my C#/UWP application and generate store packages and so on.

It’s fairly simple and, while it takes longer than just working in Visual Studio, you get used to it over time.
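
As a small example of what I mean by hiding code behind ENABLE_WINMD_SUPPORT (this is just an illustration rather than code from the app);

using UnityEngine;

public class StorageFolderLogger : MonoBehaviour
{
    void Start()
    {
#if ENABLE_WINMD_SUPPORT
        // Only compiled into the UWP player build where the Windows.Storage APIs exist
        // (and, for 3D Objects, where the app declares the relevant capability).
        var folder = Windows.Storage.KnownFolders.Objects3D;
        Debug.Log($"3D Objects folder is at {folder.Path}");
#else
        // The editor runs on Mono and never sees the UWP-only code above.
        Debug.Log("Running in the editor - no UWP storage APIs here.");
#endif
    }
}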

One thing that I haven’t really paid attention to as part of that process is that even if I select the very latest Windows SDK in Unity as below;

image

then the Visual Studio project that Unity generates doesn’t pick up the latest .NET packages but, instead, seems to downgrade my .NET version as below;

image

I’d struggled with this before (in this post under the “Package Downgrade Issue”) without really understanding it but I think I came to a better understanding of this as part of trying to get HttpListener into my project here.

In bringing in HttpListener, I hit build problems and I instantly assumed that I needed to upgrade Unity because Unity 2017.* does not offer .NET Standard 2.0 as an API Compatibility Level as below;

image

and I’d assumed that I’d need to move to a Unity 2018.* version in order to pick up .NET Standard 2.0 as I’d seen that Unity 2018.* had started to support .NET Standard 2.0.

Updated scripting runtime in Unity 2018.1: What does the future hold?

and so needing to pick up a Unity 2018.* version and switch in there to use .NET Standard 2.0 didn’t surprise me. I got version 2018.2.16f1, opened up my project in there, switched to .NET Standard 2.0 and that seemed like a fine thing to do;

image

but it left me with esoteric build failures as I hadn’t realised that Unity’s deprecation of the .NET Scripting Backend as per this post;

Deprecation of support for the .Net Scripting backend used by the Universal Windows Platform

had a specific impact in that it meant that new things which came along like SDK 16299 with its support for .NET Standard 2.0 didn’t get implemented in the .NET Scripting Backend for Unity.

They are only present in the IL2CPP backend and I presume that’s why my generated .NET projects have been downgrading the .NET package used.

So, if you want .NET Standard 2.0 then you need SDK 16299+ and that dictates Unity 2018.+ and that dictates moving to the IL2CPP backend rather than the .NET backend.

I verified this over here by asking Unity about it;

2018.2.16f1, UWP, .NET Scripting Backend, .NET Standard 2.0 Build Errors

and that confirms that the .NET Standard 2.0 APIs are usable from the editor and from the IL2CPP back-end but they aren’t going to work if you’re using .NET Scripting Backend.

I did try. I hid my .NET code in libraries and referenced them but, just as the helpful person on the forum told me – “that didn’t work”.

Challenge Number 2 – Building and Debugging with IL2CPP on UWP/HoloLens

Switching to the IL2CPP back-end really changed my workflow around Unity. Specifically, it emphasised that I need to spend as much time in the editor as possible because I find that the two phases of;

  • building inside of the Unity editor
  • building the C++ project generated by the Unity editor

are a much lengthier process than doing the same thing on the .NET backend and Unity has an article about trying to improve this;

Optimizing IL2CPP build times

but I didn’t really find that I could get my build times to come down much and I’d find that maybe a one-line change could take me into a 20m+ build cycle.

The other switch in my workflow was around debugging. There are a couple of options here. It’s possible to debug the generated C++ code and Unity has an article on it here;

Universal Windows Platform: Debugging on IL2CPP Scripting Backend

but I’d have to say that it’s pretty unproductive trying to find the right piece of code and then step your way through the generated C++. That said, you can do it, I’ve had some success with it and one aspect of it is “easy” in that you just open the project, point it at a HoloLens/emulator for deployment & then press F5 and it works.

The other approach is to debug the .NET code because Unity does have support for this as per this thread;

About IL2CPP Managed Debugger

and the details are given again in this article;

Universal Windows Platform: Debugging on IL2CPP Scripting Backend

although I would pay very close attention to the settings that control this as below;

image

and I’d also pay very close attention to the capabilities that your application must have in order to operate as a debuggee. I had to ask on the Unity Forums how to get this working;

Unity 2018.2.16f1, UWP, IL2CPP, HoloLens RS5 and Managed Debugging Problems

but I did get it to work pretty reliably on HoloLens in the end and I’d flag a few things that I found;

  • sometimes the debugger wouldn’t attach to my app & I’d have to restart the app. It would be listed as a target in Unity’s “Attach To” dialog in Visual Studio but attaching just did nothing.
  • that the debugger can be very slow – sometimes I’d wait a long time for breakpoints to become active.
  • that the debugger quite often seems to step into places where it can’t figure out the stack frame. Pressing F10 seemed to fix that.
  • that the debugger’s step-over/step-into sometimes didn’t seem to work.
  • that the debugger’s handling of async/await code could be a bit odd – the instruction pointer would jump around in Visual Studio as though it had got lost but the code seemed to be working.
  • that hovering over variables and putting them into the watch windows was quite hit-and-miss.
  • that evaluating arbitrary .NET code in the debugger doesn’t seem to work (I’m not really surprised).
  • breaking on exceptions isn’t a feature as far as I can tell – I think the debugger tells you so as you attach but I’m quite a fan of stopping on first-chance exceptions as a way of seeing what code is doing.

I think that Unity is working on all of this and I’ve found them to be great in responding on their forums and on Twitter, it’s very impressive.

In my workflow, I tended to use both the native debugger & the managed debugger to try and diagnose problems.

One other thing that I did find – I had some differences in behaviour between my app when I built it with “script debugging” and when I didn’t. It didn’t affect me too much but it did lower my overall level of confidence in the process.

Putting that to one side, I’d found that I could move my existing V1.0 project into Unity 2018.* and change the backend from .NET to IL2CPP and I could then make use of types like HttpListener and build and debug.

However, I found that the code stopped working Smile

Challenge 3 – File APIs Change with .NET Standard 2.0 on UWP

I hadn’t quite seen this one coming. There’s a piece of code within UnityGLTF which loads files;

FileLoader.cs

In my app, I open a file dialog, have the user select a file (which might result in loading 1 or many files depending on whether this is a single-file or multi-file model) and it runs through a variant of this FileLoader code.

That code uses File.Exists() and File.OpenRead() and, suddenly, I found that the code was no longer working for files which did exist and which my UWP app did have access to.

It’s important to note that the file in question would be a brokered file for the UWP app (i.e. one which it accesses via a broker to ensure it has the right permissions) rather than just, say, a file within the app’s own package or its own dedicated storage. In particular, my file would reside within the 3D Objects folder.

How could that break? It comes back to .NET Standard 2.0 because these types of File.* functions work differently for UWP brokered files depending on whether you are on SDK 16299+ with .NET Standard 2.0 or on an earlier SDK before .NET Standard 2.0 came along.

The thorny details of that are covered in this forum post;

File IO operations not working when broadFileSystemAccess capability declared

which gives some of the detail but, essentially, for my use case File.Exists and File.OpenRead were now causing me problems and so I had to replace some of that code which brings me back to…

Challenge 4 – Back to CoRoutines, Enumerators and Async

As I flagged earlier, mixing and matching an async model based around CoRoutines in Unity (which is mostly AFAIK about asynchronous rather than concurrent code) with one based around Tasks can be a bit of a challenge.

With the breaking change to File.OpenRead(), I had to revisit the FileLoader code and modify it such that it still presented an IEnumerator-based pattern to the rest of the UnityGLTF code while, internally, it needed to move from using the synchronous File.OpenRead() to the asynchronous StorageFile.OpenReadAsync().

It’s not code that I’m particularly proud of and wouldn’t like to highlight it here but it felt like one of those situations where I got boxed into a corner and had to make the best of what I had to work with Smile
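
Purely for illustration though, the rough shape of that kind of wrapper (with made-up names rather than the code that’s actually in the app) is something like;

using System.Collections;
using System.IO;
using System.Threading.Tasks;
using Windows.Storage;

class BrokeredFileLoader
{
    public Stream LoadedStream { get; private set; }

    // Presents an IEnumerator-based pattern to the calling (CoRoutine-driven) code while
    // the actual file access happens asynchronously underneath.
    public IEnumerator LoadStream(string filePath)
    {
        var task = this.OpenAsync(filePath);

        // Yield until the async work has finished so a CoRoutine can drive this.
        while (!task.IsCompleted)
        {
            yield return null;
        }
        if (task.IsFaulted)
        {
            throw task.Exception.InnerException;
        }
        this.LoadedStream = task.Result;
    }
    async Task<Stream> OpenAsync(string filePath)
    {
        // StorageFile works for brokered locations (e.g. 3D Objects) where File.OpenRead now fails.
        var file = await StorageFile.GetFileFromPathAsync(filePath);
        var raStream = await file.OpenReadAsync();
        return raStream.AsStreamForRead();
    }
}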

Challenge 5 – ProgressRings in the Mixed Reality Toolkit

I’m embarrassed to admit that I spent a lot longer trying to get a ProgressRing from the Mixed Reality Toolkit to work than I should have.

I’ve used it before, there’s an example over here;

Progress Example

but could I get it to show up? No.

In the end, I decided that there was something broken in the prefab that makes up the progress ring and I switched from using the Solver Radial View to using the Solver Orbital script to manage how the progress ring moves around in front of the user & that seemed to largely get rid of my problems.

Partially, this was a challenge because I hit it at the time when I was struggling to get used to my new mode of debugging and I just couldn’t get this ring to show up.

In the end, I solved it by just making a test scene and watching how that behaved in the editor at runtime before applying that back to my real scene which is quite often how I seem to solve these types of problems in Unity.

Challenge 6 – UDP Multicasting on Emulators and HoloLens

I chose to use UDP multicasting as a way for one device to notify others on the same network that it had a new model for them to potentially share.

This seemed like a reasonable choice but it can make it challenging to debug as I have a single HoloLens and have never been sure whether a HoloLens emulator can/can’t participate in UDP multicasting or whether there are any settings that can be applied to the virtual machine to make that work.

I know when I wrote this post that I’d failed to get multicasting working on the emulator and this time around I tried a few combinations before giving up and writing a test-harness for my PC to act as a ‘mock’ HoloLens from the point of view of being able to generate/record/playback messages it received from the real HoloLens.

I’ve noticed over time a number of forum posts asking whether a HoloLens can receive UDP traffic at all.

I can certainly verify that a UWP app on HoloLens can send/receive UDP multicast traffic but I’d flag that I have seen situations where my current device (running RS5) has got into a situation where UDP traffic seems to fail to be delivered into my application until I reboot the device. I’ve seen it very occasionally but more than once so I’d flag that this can happen on the current bits & might be worth bearing in mind for anyone trying to debug similar code on similar versions.

Closing Off

I learned quite a lot in putting this little test application together – enough to think it was worth opening up my blog and writing down some of the links so that I (or others) can find them again in the future.

If you’ve landed here via search or have read the whole thing ( ! ) then I hope you found something useful.

I’m not sure yet whether this one-off-post is the start of me increasing the frequency of posting here so don’t be too surprised if this blog goes quiet again for a while but do feel very free to reach out if I can help around these types of topics and, of course, feel equally free to point out where I’ve made mistakes & I’ll attempt to fix them Smile

Update – One Last Thing (Challenge 7), FileOpenPicker, Suspend/Resume and SpeechRecognizer

Finding a Suspend/Resume Problem with Speech

I’d closed off this blog post and published it to my blog and I’d shipped version 2.0 of my app to the Store when I came across an extra “challenge” in that I noticed that my voice commands seemed to be working only part of the time and, given that the app is driven by voice commands, that seemed like a bit of a problem.

It took me a little while to figure out what was going on because I took the app from the Store and installed it and opened up a model using the “open” command and all was fine but then I noticed that I couldn’t use the “open” command for a second time or the “reset” command for a first time.

Naturally, I dusted the code back off and rebuilt it in debug mode and tried it out and it worked fine.

So, I rebuilt in release mode and I got mixed results in finding that sometimes things worked and other times they didn’t and it took me a while to realise that it was the debugger which was making the difference. With the debugger attached, everything worked as I expected but when running outside of the debugger, I would find that the voice commands would only work until the FileOpenPicker had been on the screen for the first time. Once that dialog had been on the screen the voice commands no longer worked and that was true whether a file had been selected or whether the dialog had simply been cancelled.

So, what’s going on? Why would putting a file dialog onto the screen cause the application’s voice commands to break and only when the application was not running under a debugger?

The assumption that I made was that the application was suffering from a suspend/resume problem and that the opening of the file dialog was causing my application to suspend (and somehow break its voice commands) before choosing a file such that when my application resumed the voice commands were broken.

Why would my app suspend/resume just to display a file picker? I’d noticed previously that there is a file dialog process running on HoloLens so perhaps it’s fair to assume/guess that opening a file involves switching to another app altogether and, naturally, that might mean that my application suspends during that process.

I remember that this could also happen under the phone implementations and (if I remember correctly) the separate-process model on phones was the reason why the UWP ended up with AndContinue() style APIs in the early days when the phone and PC platforms were being unified.

Taking that assumption further – it’s well known that when you are debugging a UWP app in Visual Studio the “Process Lifecycle Management” (PLM) events are disabled by the debugger. That’s covered in the docs here and so I could understand why my app might be working in the debugger and not working outside of the debugger.

That said, I did find that my app still worked when I manually used the debugger’s capability to suspend/resume (via the toolbar) which was a bit of a surprise as I expected it to break but I was fairly convinced by now that my problem was due to suspend/resume.

So, it seems like I have a suspend/resume problem. What to do about it?

Resolving the Suspend/Resume Problem with Speech

My original code was using speech services provided by the Mixed Reality Toolkit’s SpeechInputSource.cs and SpeechInputHandler.cs utilities and I tried quite a few experiments around enabling/disabling these around suspend/resume events from the system but I didn’t find a recipe that made them work.

I took away my use of that part of the MRTK and started directly using SpeechRecognizer myself so that I had more control of the code & I kept that code as minimal as possible.

I still hit problems. My code was organised around spinning up a single SpeechRecognizer instance, keeping hold of it and repeatedly asking it via the RecognizeAsync() method to recognise voice commands.

I would find that this code would work fine until the process had suspended/resumed and then it would break. Specifically, the RecognizeAsync() code would return Status values of Success and Confidence values of Rejected.

So, it seemed that having a SpeechRecognizer kicking around across suspend/resume cycles wasn’t the best strategy and I moved to an implementation which takes the following approach;

  • instantiate SpeechRecognizer
  • add to its Constraints collection an instance of SpeechRecognitionListConstraint
  • compile the constraints via CompileConstraintsAsync
  • call RecognizeAsync making a note of the Text result if the API returns Success and confidence is Medium/High
  • Dispose of the SpeechRecognizer and repeat regardless of whether RecognizeAsync returns a relevant value or not

and the key point seemed to be to avoid keeping a SpeechRecognizer instance around in memory and repeatedly calling RecognizeAsync on it expecting that it would continue to work across suspend/resume cycles.
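
Sketched out, that loop ends up looking something like the code below – the command list and the handleCommand delegate are placeholders rather than the app’s real code;

using System;
using System.Threading.Tasks;
using Windows.Media.SpeechRecognition;

static class SpeechLoop
{
    // Placeholder command list for the sketch.
    static readonly string[] Commands = { "open", "reset", "quit" };

    public static async Task RunAsync(Action<string> handleCommand)
    {
        while (true)
        {
            using (var recognizer = new SpeechRecognizer())
            {
                recognizer.Constraints.Add(new SpeechRecognitionListConstraint(Commands));
                await recognizer.CompileConstraintsAsync();

                var result = await recognizer.RecognizeAsync();

                if ((result.Status == SpeechRecognitionResultStatus.Success) &&
                    ((result.Confidence == SpeechRecognitionConfidence.High) ||
                     (result.Confidence == SpeechRecognitionConfidence.Medium)))
                {
                    handleCommand(result.Text);
                }
            } // dispose and go around again rather than reusing the recognizer
        }
    }
}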

I tried that out, it seems to work & I shipped it off into Store as a V3.0.

I have to admit that it didn’t feel like a very scientific approach to getting something to work – it was more trial and error so if someone has more detail here I’d welcome it but, for the moment, it’s what I settled on.

One last point…

Debugging this Scenario

One interesting part of trying to diagnose this problem was that I found the Unity debugger to be quite helpful.

I found that I could do a “script debugging” build from within Unity and then run that up on my device. I could then use my first speech command to open/cancel the file picker dialog before attaching Unity’s script debugger to that running instance in order to take a look around the C# code and see how opening/cancelling the file dialog had impacted my code that was trying to handle speech.

In some fashion, I felt like I was then debugging the app (via Unity) without really debugging the app (via Visual Studio). It could be a false impression but, ultimately, I think I got it working via this route Smile

Sketchy Experiments with HoloLens, Facial Tracking, Research Mode, RGB Streams, Depth Streams.

NB: The usual blog disclaimer for this site applies to posts around HoloLens. I am not on the HoloLens team. I have no details on HoloLens other than what is on the public web and so what I post here is just from my own experience experimenting with pieces that are publicly available and you should always check out the official developer site for the product documentation.

For quite a while, I have wanted to put together a small sample that ran on HoloLens which did facial tracking but which also identified in X,Y,Z space where the tracked faces were located.

I should start by saying that there is an official sample which does exactly this and it’s over here on github;

Holographic face tracking sample

and so if you’re looking for a sample to learn from which actually works Winking smile then feel free to stop reading at this point.

I spent a little time looking at this sample and when I dug into it, I noticed that it takes a (reasonable) approach to estimating the Z co-ordinate of the face that has been detected in an RGB frame by assuming that a face at a certain distance (1m) will have a certain width and using that as a basis for working out how far away the face is.

That seems to work pretty well but I’d long since been drawn by the idea that the HoloLens could give me a more accurate measurement of how far away a face was and I wanted to experiment with that.

The rest of this post is my rough notes around some experiments that don’t quite get me to the point of making that work but I wrote things down as I went along thinking that the notes might be useful to me in the future and (maybe) to someone else.

I’d urge you to apply a large pinch of salt to what you see here because I’m very much experimenting and definitely don’t have things figured out fully.

That sample is great though because it brings together a number of the pieces that I’d expect to use in my own experiments such as;

  • It uses the media capture APIs to get access to a web camera and read frames from it – classes like MediaFrameSource and MediaFrameReader
  • It uses the CameraIntrinsics class in order to ‘unproject’ from the pixel X,Y co-ordinates of a web-camera-captured-image back into a co-ordinate system of the web camera itself
  • It uses the SpatialCoordinateSystem class to transform from a co-ordinate system of a web-camera back into the world co-ordinate system of the application itself
  • It uses the FaceTracker class in order to identify faces in images taken from the web-camera (there’s no need to call out to the cloud just to detect the bounding rectangles of a face)

The sample also takes an approach that is similar to the one that I wanted to follow in that it tries to do ‘real time’ processing on video frames taken from a stream coming off the camera rather than e.g. taking a single photo image or recording frames into a file or something in order to do some type of ‘one-shot’ or ‘offline’ processing.

In many ways, then, this sample is what I want but I have a few differences in what I want to do;

  • This sample estimates the Z coordinate of the face whereas I’d like to ask the device to give it to me as accurately as I can get.
  • This sample is written in C++ using DirectX whereas I wanted to do something in C# and Unity as C++ is becoming for me a language that I can better read than write as I do so little of it these days.

Getting to the Z-coordinate feels like it’s the main challenge here and I can think of at least a couple of ways of doing it. One might be to make use of the spatial mesh that the HoloLens makes available and use some kind of ‘ray cast’ from the camera along the user’s gaze vector to calculate the length of the ray that hits the mesh and use that as a depth value. For a long time, this sort of idea has been discussed in the HoloLens developer forums;

Access to raw depth data stream

but it initially felt to me like it might not be quite right for tracking people as they moved around in a room and so rather than going in that direction, I wanted to experiment with the new ‘research mode’ which came to HoloLens in the Redstone 4 release and which gives me direct access to a stream of depth images. It seemed like it might be ‘interesting’ to see what it’s like to directly use the depth frames from the camera in order to calculate a Z-coordinate for a pixel in an image taken from the web (RGB) camera.

I first encountered ‘Research Mode’ in this blog post;

Experimenting with Research Mode and Sensor Streams on HoloLens Redstone 4 Preview

and there are a tonne of caveats around using ‘Research Mode’ and so please read that post and the official docs and make sure that you understand the caveats before you do anything with it on your own devices.

Note also that if you want something more definitive (and correct!) around research mode then some new samples were published while I was experimenting around for this post and so make sure that you check those out here;

https://github.com/Microsoft/HoloLensForCV

With that said, what is it that I wanted to achieve in my sample?

  • Read video frames from the RGB camera at some frequency – this is achievable using the UWP media capture APIs.
  • Identify faces within those video frames – achievable using the UWP face detection APIs which will give me bounding boxes (X,Y, width, height) in pixel co-ordinates within the captured image (there’s a sketch of this after the list). I could perhaps simplify this into a single X,Y point per face by taking the centre of the bounding box.
  • Read depth frames from the depth camera at some frequency – this is achievable using the UWP media capture APIs on a device running in ‘research mode’.
  • Somehow correlate a depth frame with a video frame.
  • Use the X,Y co-ordinate of the detected face(s) in a video frame to index into a frame from the depth camera to determine the depth location at that point.
  • Transform the X,Y co-ordinate from the video frame and the Z co-ordinate from the depth frame back into world space in order to display something (e.g. a box or a 3D face) at the location that the face was detected.
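
For the face detection piece specifically, the sketch below shows the kind of UWP FaceTracker usage I have in mind – it’s illustrative only and it assumes the VideoFrame has already been converted into a pixel format that FaceTracker supports;

using System.Collections.Generic;
using System.Threading.Tasks;
using Windows.Foundation;
using Windows.Media;
using Windows.Media.FaceAnalysis;

class FaceCentreDetector
{
    FaceTracker faceTracker;

    public async Task<List<Point>> DetectFaceCentresAsync(VideoFrame videoFrame)
    {
        if (this.faceTracker == null)
        {
            this.faceTracker = await FaceTracker.CreateAsync();
        }
        // NB: FaceTracker only accepts certain pixel formats (e.g. Nv12) so the frame
        // may need converting before this call.
        var faces = await this.faceTracker.ProcessNextFrameAsync(videoFrame);

        var centres = new List<Point>();

        foreach (var face in faces)
        {
            // Reduce each bounding box to its centre point in pixel co-ordinates.
            centres.Add(
                new Point(
                    face.FaceBox.X + (face.FaceBox.Width / 2.0),
                    face.FaceBox.Y + (face.FaceBox.Height / 2.0)));
        }
        return centres;
    }
}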

This leaves me with a sort of 50:50 ratio between things that I know can be done and things that I haven’t done before and so I set about a few experiments to test a few things out.

Before doing that though, I spent a long time reading a couple of articles in the docs.

It’s worth saying that the LocatableCamera support in Unity is very close to what I want but, unfortunately, the video capture side of those APIs only records video to a file rather than giving me a stream of video frames which I can then process using something like the face detection APIs and so I don’t think it’s really of use to me.

Other people have noticed this here;

HoloLensCameraStream for Unity

although at the time of writing I haven’t dug too far into that project as I encountered it when I’d nearly completed this post.

With all that said, I switched on ‘research mode’ on my HoloLens and tried a few initial experiments to see how things looked…

Experiment 1 – Co-ordinating Frames from Multiple Sources

If I want to read video frames and depth frames at the same time then the UWP has the notion of a MultiSourceMediaFrameReader which can do just that for me firing an event when a frame is available from each of the requested media sources and saving me a bunch of work in trying to piece those frames together myself.

I was curious – could I get a frame from the RGB video camera and a frame from the depth camera at the same time pre-correlated by the system for me?

In looking a little deeper, I’m not sure that I can. As far as I can tell, the only way to get hold of a MultiSourceMediaFrameReader is to call MediaCapture.CreateMultiSourceFrameReaderAsync and that requires a MediaCapture which can be initialised for one and only one MediaFrameSourceGroup and as the docs say;

“The MediaFrameSourceGroup object represents a set of media frame sources that can be used simultaneously”

and so if I run this little piece of code on my HoloLens;

// Enumerate every source group on the device along with its sources and video profiles.
var allGroups = await MediaFrameSourceGroup.FindAllAsync();

foreach (var group in allGroups)
{
    Debug.WriteLine($"Group {group.DisplayName}");

    foreach (var source in group.SourceInfos)
    {
        Debug.WriteLine($"\tSource {source.MediaStreamType}, {source.SourceKind}, {source.Id}");

        foreach (var profile in source.VideoProfileMediaDescription)
        {
            Debug.WriteLine($"\t\tProfile {profile.Width}x{profile.Height}@{profile.FrameRate:N0}");
        }
    }
}

then the output I get is as below;

Group Sensor Streaming
     Source VideoRecord, Depth, Source#0@\\?\Root#SensorStreamingMiniDriver#0000#{e5323777-f976-4f5b-9b55-b94699c46e44}\{b27e3887-ad10-4a4e-bfb8-d6765add0e38}
         Profile 448×450@15
     Source VideoRecord, Infrared, Source#1@\\?\Root#SensorStreamingMiniDriver#0000#{e5323777-f976-4f5b-9b55-b94699c46e44}\{b27e3887-ad10-4a4e-bfb8-d6765add0e38}
         Profile 448×450@3
     Source VideoRecord, Depth, Source#2@\\?\Root#SensorStreamingMiniDriver#0000#{e5323777-f976-4f5b-9b55-b94699c46e44}\{b27e3887-ad10-4a4e-bfb8-d6765add0e38}
         Profile 448×450@15
     Source VideoRecord, Infrared, Source#3@\\?\Root#SensorStreamingMiniDriver#0000#{e5323777-f976-4f5b-9b55-b94699c46e44}\{b27e3887-ad10-4a4e-bfb8-d6765add0e38}
         Profile 448×450@3
     Source VideoRecord, Color, Source#4@\\?\Root#SensorStreamingMiniDriver#0000#{e5323777-f976-4f5b-9b55-b94699c46e44}\{b27e3887-ad10-4a4e-bfb8-d6765add0e38}
         Profile 160×480@30
     Source VideoRecord, Color, Source#5@\\?\Root#SensorStreamingMiniDriver#0000#{e5323777-f976-4f5b-9b55-b94699c46e44}\{b27e3887-ad10-4a4e-bfb8-d6765add0e38}
         Profile 160×480@30
     Source VideoRecord, Color, Source#6@\\?\Root#SensorStreamingMiniDriver#0000#{e5323777-f976-4f5b-9b55-b94699c46e44}\{b27e3887-ad10-4a4e-bfb8-d6765add0e38}
         Profile 160×480@30
     Source VideoRecord, Color, Source#7@\\?\Root#SensorStreamingMiniDriver#0000#{e5323777-f976-4f5b-9b55-b94699c46e44}\{b27e3887-ad10-4a4e-bfb8-d6765add0e38}
         Profile 160×480@30
Group MN34150
     Source VideoPreview, Color, Source#0@\\?\DISPLAY#INT22B8#4&27b432bd&0&UID139960#{e5323777-f976-4f5b-9b55-b94699c46e44}\{cdd6871a-56ca-4386-bae7-d24b564378a9}
         Profile 1280×720@30
         Profile 896×504@30
         Profile 1344×756@30
         Profile 1408×792@30
         Profile 1280×720@24
         Profile 896×504@24
         Profile 1344×756@24
         Profile 1408×792@24
         Profile 1280×720@20
         Profile 896×504@20
         Profile 1344×756@20
         Profile 1408×792@20
         Profile 1280×720@15
         Profile 896×504@15
         Profile 1344×756@15
         Profile 1408×792@15
         Profile 1280×720@5
         Profile 896×504@5
         Profile 1344×756@5
         Profile 1408×792@5
     Source VideoRecord, Color, Source#1@\\?\DISPLAY#INT22B8#4&27b432bd&0&UID139960#{e5323777-f976-4f5b-9b55-b94699c46e44}\{cdd6871a-56ca-4386-bae7-d24b564378a9}
         Profile 1280×720@30
         Profile 896×504@30
         Profile 1344×756@30
         Profile 1408×792@30
         Profile 1280×720@24
         Profile 896×504@24
         Profile 1344×756@24
         Profile 1408×792@24
         Profile 1280×720@20
         Profile 896×504@20
         Profile 1344×756@20
         Profile 1408×792@20
         Profile 1280×720@15
         Profile 896×504@15
         Profile 1344×756@15
         Profile 1408×792@15
         Profile 1280×720@5
         Profile 896×504@5
         Profile 1344×756@5
         Profile 1408×792@5
     Source Photo, Image, Source#2@\\?\DISPLAY#INT22B8#4&27b432bd&0&UID139960#{e5323777-f976-4f5b-9b55-b94699c46e44}\{cdd6871a-56ca-4386-bae7-d24b564378a9}
         Profile 1280×720@30
         Profile 1280×720@0
         Profile 896×504@30
         Profile 896×504@0
         Profile 1344×756@30
         Profile 1344×756@0
         Profile 1408×792@30
         Profile 1408×792@0
         Profile 2048×1152@30
         Profile 2048×1152@0

I’m not sure why some profiles seem to come up with a zero frame rate (I perhaps need to look at that) but I read this as essentially telling me that I have 2 MediaFrameSourceGroups here with my RGB stream in one and my depth streams in another, and so I don’t think that I can use a single MediaCapture and a single MultiSourceMediaFrameReader to map between them.

I think that leaves me with a couple of options;

  • I could try and use a multi source frame reader across Depth + InfraRed and see whether I can do facial detection on the InfraRed images?
  • I could avoid multi source frame readers, use separate readers and take some approach to trying to correlate depth and RGB images myself.

The other thing that surprised me here is that the frame rates of those depth streams (both reported at 15fps) don’t seem to line up with the Research Mode docs – I’ll come back to this.

This was a useful experiment and my inclination is to go with the second approach – have multiple frame readers and try to link up frames as best that I can.
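
To make that second approach a little more concrete, the sketch below shows one way I might pair frames up using their SystemRelativeTime timestamps – this is an assumption about how I’d go about it rather than something I’ve fully worked through (and it glosses over thread-safety);

using System;
using Windows.Media.Capture.Frames;

class FrameCorrelator
{
    MediaFrameReference lastDepthFrame;

    public void OnDepthFrameArrived(MediaFrameReader reader, MediaFrameArrivedEventArgs args)
    {
        var frame = reader.TryAcquireLatestFrame();
        if (frame != null)
        {
            // Keep the most recent depth frame around for pairing with the next RGB frame.
            this.lastDepthFrame?.Dispose();
            this.lastDepthFrame = frame;
        }
    }
    public void OnColorFrameArrived(MediaFrameReader reader, MediaFrameArrivedEventArgs args)
    {
        using (var colorFrame = reader.TryAcquireLatestFrame())
        {
            var depthFrame = this.lastDepthFrame;

            if ((colorFrame != null) && (depthFrame != null))
            {
                var gap = colorFrame.SystemRelativeTime - depthFrame.SystemRelativeTime;

                // Only treat the pair as 'the same moment' if the timestamps are close
                // (the 100ms threshold here is made up).
                if (gap.HasValue && (Math.Abs(gap.Value.TotalMilliseconds) < 100))
                {
                    // ...face detection on colorFrame, depth lookup on depthFrame...
                }
            }
        }
    }
}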

Experiment 2 – Camera Intrinsics, Coordinate Systems

When you receive frames from a media source, they are delivered in the shape of a MediaFrameReference which contains metadata such as timings, durations, formats and a CoordinateSystem and then the VideoMediaFrame itself.

That frame then provides access to (e.g.) the SoftwareBitmap (if it’s been requested by choosing a Cpu preference) and if the frame is a depth frame then it also offers up details on that depth data via the DepthMediaFrame property.

If I then add a little code into my example to try and create MediaCapture and MediaFrameSource instances for me as below;

async Task<(MediaCapture capture, MediaFrameSource source)> GetMediaCaptureForDescriptionAsync(
    MediaFrameSourceKind sourceKind,
    int width,
    int height,
    int frameRate)
{
    MediaCapture mediaCapture = null;
    MediaFrameSource frameSource = null;

    var allSources = await MediaFrameSourceGroup.FindAllAsync();

    // Ignore frame rate here on the description as both depth streams seem to tell me they are
    // 30fps whereas I don't think they are (from the docs) so I leave that to query later on.
    var sourceInfo =
        allSources.SelectMany(group => group.SourceInfos)
        .FirstOrDefault(
            si =>
                (si.MediaStreamType == MediaStreamType.VideoRecord) &&
                (si.SourceKind == sourceKind) &&
                (si.VideoProfileMediaDescription.Any(
                    desc =>
                        desc.Width == width &&
                        desc.Height == height &&
                        desc.FrameRate == frameRate)));

    if (sourceInfo != null)
    {
        var sourceGroup = sourceInfo.SourceGroup;

        mediaCapture = new MediaCapture();

        await mediaCapture.InitializeAsync(
            new MediaCaptureInitializationSettings()
            {
                // I want software bitmaps
                MemoryPreference = MediaCaptureMemoryPreference.Cpu,
                SourceGroup = sourceGroup,
                StreamingCaptureMode = StreamingCaptureMode.Video
            }
        );
        frameSource = mediaCapture.FrameSources[sourceInfo.Id];

        var selectedFormat = frameSource.SupportedFormats.First(
            format => format.VideoFormat.Width == width && format.VideoFormat.Height == height &&
            format.FrameRate.Numerator / format.FrameRate.Denominator == frameRate);

        await frameSource.SetFormatAsync(selectedFormat);
    }
    return (mediaCapture, frameSource);
}

then I can open up both an RGB frame reader and a depth frame reader and have a bit of a look at what’s present there…

var rgbMedia = await this.GetMediaCaptureForDescriptionAsync(
    MediaFrameSourceKind.Color, 1280, 720, 30);

var rgbReader = await rgbMedia.capture.CreateFrameReaderAsync(rgbMedia.source);

rgbReader.AcquisitionMode = MediaFrameReaderAcquisitionMode.Realtime;

var depthMedia = await this.GetMediaCaptureForDescriptionAsync(
    MediaFrameSourceKind.Depth, 448, 450, 15);

var depthReader = await depthMedia.capture.CreateFrameReaderAsync(depthMedia.source);

depthReader.AcquisitionMode = MediaFrameReaderAcquisitionMode.Realtime;

TypedEventHandler<MediaFrameReader, MediaFrameArrivedEventArgs> handler =
    (s, e) =>
    {
        using (var frame = s.TryAcquireLatestFrame())
        {
            if (frame != null)
            {
                Debug.WriteLine($"Frame of type {frame.SourceKind}");
                Debug.WriteLine($"Intrinsics present? {frame.VideoMediaFrame.CameraIntrinsics != null}");
                Debug.WriteLine($"Coordinate system present? {frame.CoordinateSystem != null}");
            }
        }
    };

rgbReader.FrameArrived += handler;
depthReader.FrameArrived += handler;

await rgbReader.StartAsync();
await depthReader.StartAsync();

// Wait forever then dispose...
await Task.Delay(-1);

rgbReader.Dispose();
depthReader.Dispose();

rgbMedia.capture.Dispose();
depthMedia.capture.Dispose();

and the output that I get is a little disappointing…

Frame of type Color
Intrinsics present? False
Coordinate system present? False
Frame of type Depth
Intrinsics present? False
Coordinate system present? False

so I don’t seem to get CameraIntrinsics or a CoordinateSystem on either of these frame types and I was thinking that I’d probably need both of these things in order to be able to transform from an X,Y pixel co-ordinate to world space.

I was especially hoping that the CameraIntrinsics might enable me to use this API on the DepthMediaFrame;

DepthMediaFrame.TryCreateCoordinateMapper

which sounds like it might be exactly what I need to transform points and I’ve seen this used in samples.

That lack of CameraIntrinsics seems to be picked up in this forum post;

CameraIntrinsics always null

and I wonder whether this might be a bug or maybe I’m missing some flag to switch it on but I haven’t figured that out at the time of writing.

I did also attempt to get the CameraIntrinsics by reaching into the MediaFrameSource and using the TryGetCameraIntrinsics method but I found that this seemed to return NULL for all the combinations of parameters that I passed to it.
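
For reference, the kind of call I mean there is sketched below – trying each supported format on the source (and, in my experiments, getting null back every time);

foreach (var format in rgbMedia.source.SupportedFormats)
{
    // In my experiments this returned null for every format I tried.
    var intrinsics = rgbMedia.source.TryGetCameraIntrinsics(format);

    Debug.WriteLine(
        $"{format.VideoFormat.Width}x{format.VideoFormat.Height}: " +
        $"intrinsics {(intrinsics == null ? "null" : "present")}");
}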

What to do? Well, that locatable camera article suggested that there may be more than one way to go about this, specifically if I have these 3 GUIDs;

        static readonly Guid MFSampleExtension_Spatial_CameraCoordinateSystem = new Guid("9D13C82F-2199-4E67-91CD-D1A4181F2534");
        static readonly Guid MFSampleExtension_Spatial_CameraProjectionTransform = new Guid("47F9FCB5-2A02-4F26-A477-792FDF95886A");
        static readonly Guid MFSampleExtension_Spatial_CameraViewTransform = new Guid("4E251FA4-830F-4770-859A-4B8D99AA809B");

then I can index into the properties that are part of the MediaFrameReference and if I change my innermost if statement to be;

if (frame != null)
{
    Debug.WriteLine($"Frame of type {frame.SourceKind}");

    SpatialCoordinateSystem coordinateSystem = null;
    byte[] viewTransform = null;
    byte[] projectionTransform = null;
    object value;

    if (frame.Properties.TryGetValue(MFSampleExtension_Spatial_CameraCoordinateSystem, out value))
    {
        coordinateSystem = value as SpatialCoordinateSystem;
    }
    if (frame.Properties.TryGetValue(MFSampleExtension_Spatial_CameraProjectionTransform, out value))
    {
        projectionTransform = value as byte[];
    }
    if (frame.Properties.TryGetValue(MFSampleExtension_Spatial_CameraViewTransform, out value))
    {
        viewTransform = value as byte[];
    }

    Debug.WriteLine($"Coordinate system present? {coordinateSystem != null}");
    Debug.WriteLine($"View transform present? {viewTransform != null}");
    Debug.WriteLine($"Projection transform present? {projectionTransform != null}");
}

then I get output that indicates that I can get hold of the SpatialCoordinateSystem and the View Transform and Projection Transform for the RGB camera. I get nothing back for the depth camera.

So, that’s been a useful experiment – it tells me that I might be able to transform from an X,Y pixel co-ordinate back to world space although I need to be able to translate those byte[] arrays back into matrices.

I’m not sure quite how to do that but I wrote;

        static Matrix4x4 ByteArrayToMatrix(byte[] bits)
        {
            Matrix4x4 matrix = Matrix4x4.Identity;

            var handle = GCHandle.Alloc(bits, GCHandleType.Pinned);
            matrix = Marshal.PtrToStructure<Matrix4x4>(handle.AddrOfPinnedObject());
            handle.Free();

            return (matrix);
        }

and maybe that will do it for me if my assumption about how those matrices have been packed as byte[] is right?
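
One cheap sanity check on that assumption is that the byte arrays should be exactly the size of a 4×4 matrix of floats (64 bytes), which is easy enough to assert;

    // If these properties really are packed 4x4 float matrices then they should be 64 bytes long.
    Debug.Assert(viewTransform == null ||
        viewTransform.Length == Marshal.SizeOf<Matrix4x4>());

    Debug.Assert(projectionTransform == null ||
        projectionTransform.Length == Marshal.SizeOf<Matrix4x4>());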

Experiment 3 – Getting Depth Values

I wondered what it looked like to get depth values from the depth sensor and so I reworked the code above a little to bring in the infamous IMemoryBufferByteAccess (meaning that I have to compile with unsafe code);

    [ComImport]
    [Guid("5B0D3235-4DBA-4D44-865E-8F1D0E4FD04D")]
    [InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
    unsafe interface IMemoryBufferByteAccess
    {
        void GetBuffer(out byte* buffer, out uint capacity);
    }

and then reworked my frame handling code so as to look as below;

            var depthMedia = await this.GetMediaCaptureForDescriptionAsync(
                MediaFrameSourceKind.Depth, 448, 450, 15);

            var depthReader = await depthMedia.capture.CreateFrameReaderAsync(depthMedia.source);

            depthReader.AcquisitionMode = MediaFrameReaderAcquisitionMode.Realtime;

            TypedEventHandler<MediaFrameReader, MediaFrameArrivedEventArgs> handler =
                (s,e) =>
                {
                    using (var frame = s.TryAcquireLatestFrame())
                    {
                        if (frame != null)
                        {
                            var centrePoint = new Point(
                                frame.Format.VideoFormat.Width / 2,
                                frame.Format.VideoFormat.Height / 2);

                            using (var bitmap = frame.VideoMediaFrame.SoftwareBitmap)
                            using (var buffer = bitmap.LockBuffer(BitmapBufferAccessMode.Read))
                            using (var reference = buffer.CreateReference())
                            {
                                var description = buffer.GetPlaneDescription(0);
                                var bytesPerPixel = description.Stride / description.Width;

                                Debug.Assert(bytesPerPixel == Marshal.SizeOf<UInt16>());

                                int offset =
                                    (description.StartIndex + description.Stride * (int)centrePoint.Y) +
                                    ((int)centrePoint.X * bytesPerPixel);

                                UInt16 depthValue = 0;

                                unsafe
                                {
                                    byte* pBits;
                                    uint size;
                                    var byteAccess = reference as IMemoryBufferByteAccess;
                                    byteAccess.GetBuffer(out pBits, out size);
                                    depthValue = *(UInt16*)(pBits + offset);
                                }
                                Debug.WriteLine($"Depth in centre is {depthValue}");
                            }
                        }
                    }
                };

            depthReader.FrameArrived += handler;

            await depthReader.StartAsync();

            // Wait forever then dispose...
            await Task.Delay(-1);

            depthReader.Dispose();

            depthMedia.capture.Dispose();

and so my hope is to trace out the depth value that is obtained from the centre point of the depth frame itself. I ran this and saw this type of output;

Max Depth in Metres 65.535m
Min Depth in Metres 0.001m
Depth Format is D16
Depth in centre is 4.09m
Max Depth in Metres 65.535m
Min Depth in Metres 0.001m
Depth Format is D16
Depth in centre is 4.09m
Max Depth in Metres 65.535m
Min Depth in Metres 0.001m
Depth Format is D16
Depth in centre is 0.43m
Max Depth in Metres 65.535m
Min Depth in Metres 0.001m
Depth Format is D16
Depth in centre is 0.577m
Max Depth in Metres 65.535m
Min Depth in Metres 0.001m
Depth Format is D16
Depth in centre is 0.579m
Max Depth in Metres 65.535m
Min Depth in Metres 0.001m
Depth Format is D16
Depth in centre is 0.583m

and the code seemed to be ‘working’ except that I noticed that when I pointed the device at more distant objects (perhaps > 0.7m away) it consistently came back with a value of 4.09m (4090, or 0xFFA) which felt like some kind of ‘out of range’ marker, regardless of the fact that the maximum depth is being reported as 65.535m (which seems a little unlikely! Winking smile).
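
For what it’s worth, the way that I’m interpreting those raw UInt16 values is to scale them by the DepthScaleInMeters that the frame reports (0.001 here, which is also where that unlikely 65.535m ‘maximum’ comes from – it’s just the largest UInt16 value scaled down) and to treat the ~4090 reading as ‘out of range’, along the lines of this sketch;

    // Sketch: scale the raw UInt16 by the frame's reported scale and treat ~4.09m as 'no value'.
    var scale = (float)frame.VideoMediaFrame.DepthMediaFrame.DepthFormat.DepthScaleInMeters;

    var depthInMetres = depthValue * scale;

    var looksValid = (depthInMetres > 0.0f) && (depthInMetres < 4.0f);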

I can only assume that this is the ‘near’ sensor and its ID seems to be;

“Source#0@\\\\?\\Root#SensorStreamingMiniDriver#0000#{e5323777-f976-4f5b-9b55-b94699c46e44}\\{b27e3887-ad10-4a4e-bfb8-d6765add0e38}”

I know that the HoloLens has 2 depth streams described in the documents here as;

“Two versions of the depth camera data – one for high-frequency (30 FPS) near-depth sensing, commonly used in hand tracking, and the other for lower-frequency (1 FPS) far-depth sensing, currently used by Spatial Mapping”

Now, my device seems to report two depth streams of the same dimensions (448 x 450) and of the same frame rate (15fps) so that doesn’t seem to line up with the docs.

Putting that to one side, my code had been written to simply select whichever sensor matching my search criteria came First() (in the LINQ sense) and to ignore any others.
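
For context, the selection inside my GetMediaCaptureForDescriptionAsync helper is essentially the query below (a trimmed sketch of code that appears in full later in this post) and so the ‘fix’ was simply to swap FirstOrDefault for LastOrDefault;

    // Sketch: pick a source info matching the requested kind/width/height/frame rate.
    // Swapping FirstOrDefault for LastOrDefault here is what selects the 'other' depth stream -
    // a nasty ordering assumption rather than a proper way of choosing between them.
    var sourceInfo =
        allSources.SelectMany(group => group.SourceInfos)
        .FirstOrDefault(
            si =>
                (si.MediaStreamType == MediaStreamType.VideoRecord) &&
                (si.SourceKind == sourceKind) &&
                (si.VideoProfileMediaDescription.Any(
                    desc =>
                        desc.Width == width &&
                        desc.Height == height &&
                        desc.FrameRate == frameRate)));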

I switched the code to select the Last() and saw values;

Max Depth in Metres 65.535m
Min Depth in Metres 0.001m
Depth Format is D16
Depth in centre is 1.562m
Max Depth in Metres 65.535m
Min Depth in Metres 0.001m
Depth Format is D16
Depth in centre is 4.09m

and so it now seemed that the device was returning 4.09m (4090) for surfaces that were nearer to it (perhaps < 0.7m away) while it was correctly reporting the more distant surfaces.

I can only assume that this device is the ‘far sensor’ and its ID seems to be;

“Source#2@\\\\?\\Root#SensorStreamingMiniDriver#0000#{e5323777-f976-4f5b-9b55-b94699c46e44}\\{b27e3887-ad10-4a4e-bfb8-d6765add0e38}”

I guess that if you wanted to work reliably, you might have to take streams from both depth cameras and use whichever gave you a reliable result but for my purposes I’m going to go with the ‘far’ sensor rather than the ‘near’ sensor.

One other thing that I’d flag here – I didn’t seem to see 15fps from the depth stream. The rate seemed more like 1fps, which ties up with the Research Mode docs for the ‘long range’ depth sensor, so maybe the API’s report of 15fps isn’t right here.

Experiment 5 – Tracking Faces

Ironically, detecting faces feels like ‘the easy part’ of what I’m trying to do here, purely because the UWP already has APIs which will track faces for me, so it’s not such a big deal to make use of them.

I can take my code which is already getting hold of video frames (with SoftwareBitmaps) and just try and feed them through a FaceDetector or FaceTracker and it will give me back lists of bounding boxes of the faces that it detects.

The only potential ‘fly in the ointment’ here is that the detection requires bitmaps in specific formats. There’s an API for querying which formats are supported, which means that I need to either;

  • Ensure that I ask the media capture APIs to hand me back bitmaps in one of the formats that is supported by the face detection APIs.

or

  • Accept that the media capture APIs might not be able to do that and so gracefully fallback and accept some other format which I then convert on a frame-by-frame basis to one of the ones supported by the face detection APIs.

The second option is the ‘right’ way to do things but it means writing a bit more code, so I left the conversion for ‘another day’ and instead modified my GetMediaCaptureForDescriptionAsync method (not repeated here) to take an additional, optional parameter which narrows down the search for a media source to the set of bitmap formats that I’m prepared to accept;

            var supportedFormats = FaceTracker.GetSupportedBitmapPixelFormats().Select(
                format => format.ToString().ToLower()).ToArray();

            var tracker = await FaceTracker.CreateAsync();

            // We are assuming (!) that we can get frames in a format compatible with the
            // FaceTracker.
            var rgbMedia = await this.GetMediaCaptureForDescriptionAsync(
                MediaFrameSourceKind.Color, 1280, 720, 30,
                supportedFormats);

            var rgbReader = await rgbMedia.capture.CreateFrameReaderAsync(rgbMedia.source);

            rgbReader.AcquisitionMode = MediaFrameReaderAcquisitionMode.Realtime;

            TypedEventHandler<MediaFrameReader, MediaFrameArrivedEventArgs> handler =
                async (s, e) =>
                {
                    using (var frame = s.TryAcquireLatestFrame())
                    {
                        if (frame != null)
                        {
                            using (var videoFrame = frame.VideoMediaFrame.GetVideoFrame())
                            {
                                var faces = await tracker.ProcessNextFrameAsync(videoFrame);

                                foreach (var face in faces)
                                {
                                    Debug.WriteLine($"Face found at {face.FaceBox.X}, {face.FaceBox.Y}");
                                }
                            }
                        }
                    }
                };

            rgbReader.FrameArrived += handler;

            await rgbReader.StartAsync();

            // Wait forever then dispose...
            await Task.Delay(-1);

            rgbReader.Dispose();

            rgbMedia.capture.Dispose();

and that seemed to work quite nicely – I’m getting video frames from the RGB camera and finding faces within them.

It’s worth saying that in doing this I came (once again) up against the fact that MediaFrameFormat.Subtype contains subtype names taken from this doc page, and matching those up to BitmapPixelFormat values feels like a very imprecise science – the docs even carry a warning around these subtypes;

“The string values returned by the MediaEncodingSubtypes properties may not use the same letter casing as AudioEncodingProperties.Subtype, VideoEncodingProperties.Subtype, ContainerEncodingProperties.Subtype, and ImageEncodingProperties.Subtype. For this reason, if you compare the values, you should use a case-insensitive comparison or use hardcoded strings that match the casing returned by the encoding properties.”
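
In other words, whenever I compare a MediaFrameFormat.Subtype against one of those known subtype names, it seems safest to do something like this (a small sketch, taking NV12 as an example);

    // Sketch: compare subtype strings case-insensitively rather than trusting the casing.
    // (MediaEncodingSubtypes lives in Windows.Media.MediaProperties.)
    bool isNv12 = string.Equals(
        format.Subtype,
        MediaEncodingSubtypes.Nv12,
        StringComparison.OrdinalIgnoreCase);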

Experiment 6 – Turning X,Y co-ordinates in Images into X,Y,Z co-ordinates in (Unity) World Space

This is the part where I get stuck Smile

I did mention before that I read this article about the locatable camera quite a few times because it seemed to be very relevant to what I’m trying to do;

Locatable Camera

and I especially focused on the section entitled;

Pixel to Application-specified Coordinate System

and its promise of being able to convert from pixel co-ordinates back to world co-ordinates using the camera projection matrix which (I think) I have available to me based on Experiment 2 above.

To experiment with this, I wondered whether I could take the 4 pixel points that represent the camera’s bounding box, project them back from the image into camera space and then into world space and see what they ‘looked like’ in world space by drawing them in Unity at some specified distance.

In doing that, there are perhaps a few things that I’d comment on which may be right/wrong.

  • Getting hold of the SpatialCoordinateSystem that Unity sets up for my holographic app seems to be a matter of calling WorldManager.GetNativeISpatialCoordinateSystemPtr and using Marshal.* methods to get a handle onto the underlying object although I’m unclear whether it’s ok to just hold on to this object indefinitely or not.
  • In transforming back from an X,Y image co-ordinate to an X,Y,Z co-ordinate in world space my approach (following the locatable camera article again, and summarised in the sketch just after this list) has been to;
    • Translate the X,Y coordinate from the 0-1280, 0-720 range into a –1 to 1, –1 to 1 range.
    • Unproject the vector using the projection transform at a unit distance
    • Multiply the unprojected vector by the inverse of the view transform
    • Multiply that value by the camera to world transform obtained asking the SpatialCoordinateSystem of the RGB frame to provide a transform to the SpatialCoordinateSystem that Unity has set up for the app
    • Multiply the Z co-ordinate by –1.0f as Unity uses a left-handed coordinate system while the holographic UWP APIs are right-handed.
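
Squashed into one expression, and using the helper names from the full script below, that chain for a single image point looks roughly like this;

    // Sketch of the whole pixel -> world chain (the full script below does this via LINQ Selects).
    // 'scaledPoint' is the output of ScalePointMinusOneToOne for the pixel in question.
    var worldPoint =
        wVector3.Transform(
            wVector3.Transform(
                UnProjectVector(
                    new wVector3((float)scaledPoint.X, (float)scaledPoint.Y, 1.0f),
                    projectionTransform),
                invertedViewTransform),
            cameraToWorldTransform.Value);

    // Flip Z at the end because Unity is left-handed.
    var unityPoint = new uVector3(worldPoint.X, worldPoint.Y, -1.0f * worldPoint.Z);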

I’m not at all sure that I have this right Smile and I was especially unsure around whether the SpatialCoordinateSystem of a frame would change as the device moved around and/or whether the view transform would change. I used the debugger to verify that the view transform definitely changes as the device moves and hence included the inverse of it in the process above.

My code for this experiment (factored into a Unity script) looked like this (pasted in its entirety);

using UnityEngine.XR.WSA;
using System;
using System.Linq;
using UnityEngine;

#if ENABLE_WINMD_SUPPORT
using Windows.Media.Capture;
using Windows.Media.Capture.Frames;
using Windows.Foundation;
using System.Threading.Tasks;
using Windows.Perception.Spatial;
using System.Runtime.InteropServices;
using uVector3 = UnityEngine.Vector3;
using wVector3 = System.Numerics.Vector3;
using wMatrix4x4 = System.Numerics.Matrix4x4;
#endif // ENABLE_WINMD_SUPPORT

public class Placeholder : MonoBehaviour
{
    // Unity line renderer to draw a box for me - note that I'm expecting this to have
    // Loop set to true so that it closes the box off.
    public LineRenderer lineRenderer;

    void Start()
    {
#if ENABLE_WINMD_SUPPORT

        this.OnLoaded();

#endif // ENABLE_WINMD_SUPPORT
    }

#if ENABLE_WINMD_SUPPORT
    async void OnLoaded()
    {
        var rgbMedia = await this.GetMediaCaptureForDescriptionAsync(
            MediaFrameSourceKind.Color, 1280, 720, 30);

        // These should be the corner points for the RGB image...
        var cornerPoints = new Point[]
        {
            new Point(0,0),
            new Point(1280, 0),
            new Point(1280, 720),
            new Point(0, 720)
        };

        var unityWorldCoordinateSystem =
            Marshal.GetObjectForIUnknown(WorldManager.GetNativeISpatialCoordinateSystemPtr()) as SpatialCoordinateSystem;
        
        var rgbFrameReader = await rgbMedia.Item1.CreateFrameReaderAsync(rgbMedia.Item2);
        
        rgbFrameReader.AcquisitionMode = MediaFrameReaderAcquisitionMode.Realtime;

        TypedEventHandler<MediaFrameReader, MediaFrameArrivedEventArgs> handler =
            (s, e) =>
            {
                using (var frame = s.TryAcquireLatestFrame())
                {
                    if (frame != null)
                    {
                        SpatialCoordinateSystem coordinateSystem = null;
                        wMatrix4x4 projectionTransform = wMatrix4x4.Identity;
                        wMatrix4x4 viewTransform = wMatrix4x4.Identity;
                        wMatrix4x4 invertedViewTransform = wMatrix4x4.Identity;

                        object value;

                        if (frame.Properties.TryGetValue(MFSampleExtension_Spatial_CameraCoordinateSystem, out value))
                        {
                            // I'm not sure that this coordinate system changes per-frame so I could maybe do this once?
                            coordinateSystem = value as SpatialCoordinateSystem;
                        }
                        if (frame.Properties.TryGetValue(MFSampleExtension_Spatial_CameraProjectionTransform, out value))
                        {
                            // I don't think that this transform changes per-frame so I could maybe do this once?
                            projectionTransform = ByteArrayToMatrix(value as byte[]);
                        }
                        if (frame.Properties.TryGetValue(MFSampleExtension_Spatial_CameraViewTransform, out value))
                        {
                            // I think this transform changes per frame.
                            viewTransform = ByteArrayToMatrix(value as byte[]);
                            wMatrix4x4.Invert(viewTransform, out invertedViewTransform);
                        }

                        var cameraToWorldTransform = coordinateSystem.TryGetTransformTo(unityWorldCoordinateSystem);

                        if (cameraToWorldTransform.HasValue)
                        {
                            var transformedPoints = cornerPoints
                                .Select(point => ScalePointMinusOneToOne(point, frame))
                                .Select(point => UnProjectVector(
                                    new wVector3((float)point.X, (float)point.Y, 1.0f), projectionTransform))
                                .Select(point => wVector3.Transform(point, invertedViewTransform))
                                .Select(point => wVector3.Transform(point, cameraToWorldTransform.Value))
                                .ToArray();

                            UnityEngine.WSA.Application.InvokeOnAppThread(
                                () =>
                                {
                                    this.lineRenderer.positionCount = transformedPoints.Length;

                                    // Unity has its Z axis +ve away from the camera, holographic goes the other way.
                                    this.lineRenderer.SetPositions(
                                        transformedPoints.Select(
                                            pt => new uVector3(pt.X, pt.Y, -1.0f * pt.Z)).ToArray());
                                },
                                false);
                        }
                    }
                }
            };

        rgbFrameReader.FrameArrived += handler;

        await rgbFrameReader.StartAsync();

        // Wait forever then dispose...just doing this to keep track of what needs disposing.
        await Task.Delay(-1);

        rgbFrameReader.FrameArrived -= handler;

        Marshal.ReleaseComObject(unityWorldCoordinateSystem);

        rgbFrameReader.Dispose();

        rgbMedia.Item1.Dispose();
    }
    /// <summary>
    /// This code taken fairly literally from this doc
    /// https://docs.microsoft.com/en-us/windows/mixed-reality/locatable-camera#pixel-to-application-specified-coordinate-system
    /// and hopefully without me breaking it as it's not too complex 🙂
    /// </summary>
    static Point ScalePointMinusOneToOne(Point point, MediaFrameReference frameRef)
    {
        var scaledPoint = new Point(
            (2.0f * (float)point.X / frameRef.Format.VideoFormat.Width) - 1.0f,
            (2.0f * (1.0f - (float)point.Y / frameRef.Format.VideoFormat.Height)) - 1.0f);

        return (scaledPoint);
    }
    /// <summary>
    /// This code taken fairly literally from this doc
    /// https://docs.microsoft.com/en-us/windows/mixed-reality/locatable-camera#pixel-to-application-specified-coordinate-system
    /// but if it's got messed up in the translation then that's definitely my fault 🙂
    /// </summary>
    static wVector3 UnProjectVector(wVector3 from, wMatrix4x4 cameraProjection)
    {
        var to = new wVector3(0, 0, 0);

        var axsX = new wVector3(cameraProjection.M11, cameraProjection.M12, cameraProjection.M13);

        var axsY = new wVector3(cameraProjection.M21, cameraProjection.M22, cameraProjection.M23);

        var axsZ = new wVector3(cameraProjection.M31, cameraProjection.M32, cameraProjection.M33);

        to.Z = from.Z / axsZ.Z;
        to.Y = (from.Y - (to.Z * axsY.Z)) / axsY.Y;
        to.X = (from.X - (to.Z * axsX.Z)) / axsX.X;

        return to;
    }
    // Used an explicit tuple here as I'm in C# 6.0
    async Task<Tuple<MediaCapture, MediaFrameSource>> GetMediaCaptureForDescriptionAsync(
        MediaFrameSourceKind sourceKind,
        int width,
        int height,
        int frameRate)
    {
        MediaCapture mediaCapture = null;
        MediaFrameSource frameSource = null;

        var allSources = await MediaFrameSourceGroup.FindAllAsync();

        var sourceInfo =
            allSources.SelectMany(group => group.SourceInfos)
            .FirstOrDefault(
                si =>
                    (si.MediaStreamType == MediaStreamType.VideoRecord) &&
                    (si.SourceKind == sourceKind) &&
                    (si.VideoProfileMediaDescription.Any(
                        desc =>
                            desc.Width == width &&
                            desc.Height == height &&
                            desc.FrameRate == frameRate)));

        if (sourceInfo != null)
        {
            var sourceGroup = sourceInfo.SourceGroup;

            mediaCapture = new MediaCapture();

            await mediaCapture.InitializeAsync(
               new MediaCaptureInitializationSettings()
               {
                   // I want software bitmaps
                   MemoryPreference = MediaCaptureMemoryPreference.Cpu,
                   SourceGroup = sourceGroup,
                   StreamingCaptureMode = StreamingCaptureMode.Video
               }
            );
            frameSource = mediaCapture.FrameSources[sourceInfo.Id];

            var selectedFormat = frameSource.SupportedFormats.First(
                format => format.VideoFormat.Width == width && format.VideoFormat.Height == height &&
                format.FrameRate.Numerator / format.FrameRate.Denominator == frameRate);

            await frameSource.SetFormatAsync(selectedFormat);
        }
        return (Tuple.Create(mediaCapture, frameSource));
    }
    static wMatrix4x4 ByteArrayToMatrix(byte[] bits)
    {
        var matrix = wMatrix4x4.Identity;

        var handle = GCHandle.Alloc(bits, GCHandleType.Pinned);
        matrix = Marshal.PtrToStructure<wMatrix4x4>(handle.AddrOfPinnedObject());
        handle.Free();

        return (matrix);
    }
    static readonly Guid MFSampleExtension_Spatial_CameraCoordinateSystem = new Guid("9D13C82F-2199-4E67-91CD-D1A4181F2534");
    static readonly Guid MFSampleExtension_Spatial_CameraProjectionTransform = new Guid("47F9FCB5-2A02-4F26-A477-792FDF95886A");
    static readonly Guid MFSampleExtension_Spatial_CameraViewTransform = new Guid("4E251FA4-830F-4770-859A-4B8D99AA809B");

#endif // ENABLE_WINMD_SUPPORT
}

and this seemed to work out ok in the sense that I could run this code on my HoloLens and see a painted red line demarcating what ‘felt’ like it might be the right bounds of the camera’s view, and that box appeared to do the right thing as I moved around and rotated the device, but I wouldn’t have placed (much) money on it being correct just yet Smile

Experiment 7 – Mapping Between RGB Co-ordinates and Depth Co-Ordinates

The last experiment that I wanted to try was to see if I could figure out how to map co-ordinates from the RGB image to the depth image.

On the one hand, this seems like it might be ‘obvious’ in the sense that if I have a pixel at some X,Y in an RGB image [0,0,1280,720] and if I have some depth image which is 448×450 then I can just come up with a point which is [X / 1280 * 448, Y / 720 * 450] and use that as the position in the depth image.
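
In code, that naive scaling is just this (it’s essentially the ScaleRgbPointToDepthPoint helper that sits, unused, in the script further below);

    // Sketch: scale an RGB pixel into the depth frame purely by the ratio of the frame sizes.
    // This assumes the two frames line up, which (as discussed below) doesn't seem to be true.
    static Point ScaleRgbPointToDepthPoint(Point rgbPoint)
    {
        return (new Point(
            rgbPoint.X / 1280.0 * 448.0,
            rgbPoint.Y / 720.0 * 450.0));
    }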

However, I don’t know whether the depth image is meant to line up with the RGB image in that way or whether I should use other techniques in trying to map depth image coordinates to/from RGB image coordinates.

While I was experimenting with this, an additional sample was published around working with ‘research mode’ and so I was able to refer to it;

https://github.com/Microsoft/HoloLensForCV

and, firstly, I found that I had to make a minor modification to its FrameRenderer.cpp code because, around line 348, it hard-codes the depth range to 0.2m–1.0m whereas I want ranges beyond 1.0m.

NB: I think that has now been addressed – see this issue.

With that modification made, I saw the output from the depth camera as below;

image

which seems to suggest that the depth values that I want aren’t present across the whole frame (448 x 450) of the depth image but look to be, instead, present in a circular area which you can see highlighted above.

That also seems to be the case for the “Long Throw ToF Reflectivity” stream. I can speculate that maybe those sensors focus (for power/performance reasons?) around the centre of the user’s gaze but that’s just speculation – I don’t see it written down anywhere at the time of writing.

Furthermore, that circular area does not seem to line up with the centre of the depth frame. For instance, in the image below;

sketch

my gaze is on the corner of the book-case marked with a green X, which looks to be fairly centrally located in the image captured by the RGB camera, but the active area of the depth frame seems biased towards the top of the frame and so I can’t simply assume that I can scale coordinates from the RGB frame to the depth frame and come away with reasonable depth values.

This made my original idea seem a lot less practical than it might have seemed when I first started writing this post because I’d assumed that every RGB camera pixel would have a natural corresponding depth camera pixel and I’m not sure whether that’s going to be the case.

So, perhaps the depth camera is better for working out depths around where the user’s gaze is positioned (which makes sense) and my facial example is then only realistically going to ‘work’ if the user is looking directly at a face.

Additionally, even if I assume that I want to measure the depth value at the centre of the RGB image [640,360] then I can’t assume that this maps to the co-ordinate [224, 225] in the depth image because the depth image seems to incorporate a vertical offset.
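
That’s why, in the script below, the point that I actually sample from the depth frame is a ‘magic’ one 50% across but only 25% down the frame, i.e. something like;

    // Sketch: sample the depth frame 50% across but only 25% down - by eye, that seems nearer
    // to the middle of the circular 'hot spot' than the geometric centre of the frame does.
    var samplePoint = new Point(
        depthFrame.Format.VideoFormat.Width * 0.5,    // MAGIC_DEPTH_FRAME_WIDTH_RATIO_CENTRE
        depthFrame.Format.VideoFormat.Height * 0.25); // MAGIC_DEPTH_FRAME_HEIGHT_RATIO_CENTRE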

Or…maybe I’m just missing quite a lot about how these streams can be tied together? Smile 

I wanted to see what did happen if I brought the pieces together that I had so far and so I tried to put together a Unity script which moves a GameObject (e.g. small sphere) to the centre point of any face that it detects in the RGB stream.

That script is below (it needs factoring out into classes as it’s mostly one large function at the moment);

//#define HUNT_DEPTH_PIXEL_GRID
#define USE_CENTRE_DEPTH_IMAGE
using UnityEngine.XR.WSA;
using System;
using System.Linq;
using UnityEngine;
using System.Threading;

#if ENABLE_WINMD_SUPPORT
using Windows.Media.Capture;
using Windows.Media.Capture.Frames;
using Windows.Foundation;
using System.Threading.Tasks;
using Windows.Perception.Spatial;
using System.Runtime.InteropServices;
using Windows.Media.FaceAnalysis;
using Windows.Graphics.Imaging;
using uVector3 = UnityEngine.Vector3;
using wVector3 = System.Numerics.Vector3;
using wVector4 = System.Numerics.Vector4;
using wMatrix4x4 = System.Numerics.Matrix4x4;

[ComImport]
[Guid("5B0D3235-4DBA-4D44-865E-8F1D0E4FD04D")]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
unsafe interface IMemoryBufferByteAccess
{
    void GetBuffer(out byte* buffer, out uint capacity);
}

#endif // ENABLE_WINMD_SUPPORT

public class Placeholder : MonoBehaviour
{
    // A Unity text mesh that I can print some diagnostics to.
    public TextMesh textMesh;

    // A Unity game object (small sphere e.g.) that I can use to mark the position of one face.
    public GameObject faceMarker;

    void Start()
    {
#if ENABLE_WINMD_SUPPORT

        // Not awaiting this...let it go.
        this.ProcessingLoopAsync();

#endif // ENABLE_WINMD_SUPPORT
    }

#if ENABLE_WINMD_SUPPORT
    /// <summary>
    /// This is just one big lump of code right now which should be factored out into some kind of
    /// 'frame reader' class which can then be subclassed for depth frame and video frame but
    /// it was handy to have it like this while I experimented with it - the intention was
    /// to tidy it up if I could get it doing more or less what I wanted 🙂
    /// </summary>
    async Task ProcessingLoopAsync()
    {
        var depthMediaCapture = await this.GetMediaCaptureForDescriptionAsync(
            MediaFrameSourceKind.Depth, 448, 450, 15);

        var depthFrameReader = await depthMediaCapture.Item1.CreateFrameReaderAsync(depthMediaCapture.Item2);

        depthFrameReader.AcquisitionMode = MediaFrameReaderAcquisitionMode.Realtime;

        MediaFrameReference lastDepthFrame = null;

        long depthFrameCount = 0;
        float centrePointDepthInMetres = 0.0f;

        // Expecting this to run at 1fps although the API (seems to) reports that it runs at 15fps
        TypedEventHandler<MediaFrameReader, MediaFrameArrivedEventArgs> depthFrameHandler =
            (sender, args) =>
            {
                using (var depthFrame = sender.TryAcquireLatestFrame())
                {
                    if ((depthFrame != null) && (depthFrame != lastDepthFrame))
                    {
                        lastDepthFrame = depthFrame;

                        Interlocked.Increment(ref depthFrameCount);

                        // Always try to grab the depth value although, clearly, this is subject
                        // to a bunch of race conditions etc. as other thread access it.
                        centrePointDepthInMetres =
                            GetDepthValueAtCoordinate(depthFrame,
                                (int)(depthFrame.Format.VideoFormat.Width * MAGIC_DEPTH_FRAME_WIDTH_RATIO_CENTRE),
                                (int)(depthFrame.Format.VideoFormat.Height * MAGIC_DEPTH_FRAME_HEIGHT_RATIO_CENTRE)) ?? 0.0f;

                    }
                }
            };

        long rgbProcessedCount = 0;
        long facesPresentCount = 0;
        long rgbDroppedCount = 0;

        MediaFrameReference lastRgbFrame = null;

        var faceBitmapFormats = FaceTracker.GetSupportedBitmapPixelFormats().Select(
            format => format.ToString().ToLower()).ToArray();

        var faceTracker = await FaceTracker.CreateAsync();

        var rgbMediaCapture = await this.GetMediaCaptureForDescriptionAsync(
            MediaFrameSourceKind.Color, 1280, 720, 30, faceBitmapFormats);

        var rgbFrameReader = await rgbMediaCapture.Item1.CreateFrameReaderAsync(rgbMediaCapture.Item2);

        rgbFrameReader.AcquisitionMode = MediaFrameReaderAcquisitionMode.Realtime;

        int busyProcessingRgbFrame = 0;

        var unityWorldCoordinateSystem =
            Marshal.GetObjectForIUnknown(WorldManager.GetNativeISpatialCoordinateSystemPtr()) as SpatialCoordinateSystem;
        
        // Expecting this to run at 30fps.
        TypedEventHandler<MediaFrameReader, MediaFrameArrivedEventArgs> rgbFrameHandler =
           (sender, args) =>
           {
               // Only proceed if we're not already 'busy' - i.e. still processing the previous frame.
               if (Interlocked.CompareExchange(ref busyProcessingRgbFrame, 1, 0) == 0)
               {
                   Task.Run(
                       async () =>
                       {
                           using (var rgbFrame = rgbFrameReader.TryAcquireLatestFrame())
                           {
                               if ((rgbFrame != null) && (rgbFrame != lastRgbFrame))
                               {
                                   ++rgbProcessedCount;

                                   lastRgbFrame = rgbFrame;
                                   var facePosition = uVector3.zero;

                                   using (var videoFrame = rgbFrame.VideoMediaFrame.GetVideoFrame())
                                   {
                                       var faces = await faceTracker.ProcessNextFrameAsync(videoFrame);
                                       var firstFace = faces.FirstOrDefault();

                                       if (firstFace != null)
                                       {
                                           ++facesPresentCount;

                                           // Take the first face and the centre point of that face to try
                                           // and simplify things for my limited brain.
                                           var faceCentrePointInImageCoords =
                                              new Point(
                                                  firstFace.FaceBox.X + (firstFace.FaceBox.Width / 2.0f),
                                                  firstFace.FaceBox.Y + (firstFace.FaceBox.Height / 2.0f));

                                           wMatrix4x4 projectionTransform = wMatrix4x4.Identity;
                                           wMatrix4x4 viewTransform = wMatrix4x4.Identity;
                                           wMatrix4x4 invertedViewTransform = wMatrix4x4.Identity;

                                           var rgbCoordinateSystem = GetRgbFrameProjectionAndCoordinateSystemDetails(
                                               rgbFrame, out projectionTransform, out invertedViewTransform);

                                           // Scale the RGB point (1280x720)
                                           var faceCentrePointUnitScaleRGB = ScalePointMinusOneToOne(faceCentrePointInImageCoords, rgbFrame);

                                           // Unproject the RGB point back at unit depth as per the locatable camera
                                           // document.
                                           var unprojectedFaceCentrePointRGB = UnProjectVector(
                                                  new wVector3(
                                                      (float)faceCentrePointUnitScaleRGB.X,
                                                      (float)faceCentrePointUnitScaleRGB.Y,
                                                      1.0f),
                                                  projectionTransform);

                                           // Transform this back by the inverted view matrix in order to put this into
                                           // the RGB camera coordinate system
                                           var faceCentrePointCameraCoordsRGB =
                                                  wVector3.Transform(unprojectedFaceCentrePointRGB, invertedViewTransform);

                                           // Get the transform from the camera coordinate system to the Unity world
                                           // coordinate system, could probably cache this?
                                           var cameraRGBToWorldTransform =
                                                  rgbCoordinateSystem.TryGetTransformTo(unityWorldCoordinateSystem);

                                           if (cameraRGBToWorldTransform.HasValue)
                                           {
                                               // Transform to world coordinates
                                               var faceCentrePointWorldCoords = wVector4.Transform(
                                                      new wVector4(
                                                          faceCentrePointCameraCoordsRGB.X,
                                                          faceCentrePointCameraCoordsRGB.Y,
                                                          faceCentrePointCameraCoordsRGB.Z, 1),
                                                      cameraRGBToWorldTransform.Value);

                                               // Where's the camera in world coordinates?
                                               var cameraOriginWorldCoords = wVector4.Transform(
                                                      new wVector4(0, 0, 0, 1),
                                                      cameraRGBToWorldTransform.Value);

                                               // Multiply Z by -1 for Unity
                                               var cameraPoint = new uVector3(
                                                    cameraOriginWorldCoords.X,
                                                    cameraOriginWorldCoords.Y,
                                                    -1.0f * cameraOriginWorldCoords.Z);

                                               // Multiply Z by -1 for Unity
                                               var facePoint = new uVector3(
                                                      faceCentrePointWorldCoords.X,
                                                      faceCentrePointWorldCoords.Y,
                                                      -1.0f * faceCentrePointWorldCoords.Z);

                                               facePosition = 
                                                   cameraPoint + 
                                                   (facePoint - cameraPoint).normalized * centrePointDepthInMetres;
                                           }
                                       }
                                   }
                                   if (facePosition != uVector3.zero)
                                   {
                                       UnityEngine.WSA.Application.InvokeOnAppThread(
                                           () =>
                                           {
                                               this.faceMarker.transform.position = facePosition;
                                           },
                                           false
                                        );
                                   }
                               }
                           }
                           Interlocked.Exchange(ref busyProcessingRgbFrame, 0);
                       }
                   );
               }
               else
               {
                   Interlocked.Increment(ref rgbDroppedCount);
               }
               // NB: this is a bit naughty as I am accessing these counters across a few threads so
               // accuracy might suffer here.
               UnityEngine.WSA.Application.InvokeOnAppThread(
                   () =>
                   {
                       this.textMesh.text =
                           $"{depthFrameCount} depth,{rgbProcessedCount} rgb done, {rgbDroppedCount} rgb drop," +
                           $"{facesPresentCount} faces, ({centrePointDepthInMetres:N2})";
                   },
                   false);
           };

        depthFrameReader.FrameArrived += depthFrameHandler;
        rgbFrameReader.FrameArrived += rgbFrameHandler;

        await depthFrameReader.StartAsync();
        await rgbFrameReader.StartAsync();

        // Wait forever then dispose...just doing this to keep track of what needs disposing.
        await Task.Delay(-1);

        depthFrameReader.FrameArrived -= depthFrameHandler;
        rgbFrameReader.FrameArrived -= rgbFrameHandler;

        rgbFrameReader.Dispose();
        depthFrameReader.Dispose();

        rgbMediaCapture.Item1.Dispose();
        depthMediaCapture.Item1.Dispose();

        Marshal.ReleaseComObject(unityWorldCoordinateSystem);
    }


    static SpatialCoordinateSystem GetRgbFrameProjectionAndCoordinateSystemDetails(
        MediaFrameReference rgbFrame,
        out wMatrix4x4 projectionTransform,
        out wMatrix4x4 invertedViewTransform)
    {
        SpatialCoordinateSystem rgbCoordinateSystem = null;
        wMatrix4x4 viewTransform = wMatrix4x4.Identity;
        projectionTransform = wMatrix4x4.Identity;
        invertedViewTransform = wMatrix4x4.Identity;

        object value;

        if (rgbFrame.Properties.TryGetValue(MFSampleExtension_Spatial_CameraCoordinateSystem, out value))
        {
            // I'm not sure that this coordinate system changes per-frame so I could maybe do this once?
            rgbCoordinateSystem = value as SpatialCoordinateSystem;
        }
        if (rgbFrame.Properties.TryGetValue(MFSampleExtension_Spatial_CameraProjectionTransform, out value))
        {
            // I don't think that this transform changes per-frame so I could maybe do this once?
            projectionTransform = ByteArrayToMatrix(value as byte[]);
        }
        if (rgbFrame.Properties.TryGetValue(MFSampleExtension_Spatial_CameraViewTransform, out value))
        {
            // I think this transform changes per frame.
            viewTransform = ByteArrayToMatrix(value as byte[]);
            wMatrix4x4.Invert(viewTransform, out invertedViewTransform);
        }
        return (rgbCoordinateSystem);
    }
    /// <summary>
    /// Not using this right now as I don't *know* how to scale an RGB point to a depth point
    /// given that the depth frame seems to have a central 'hot spot' that's circular.
    /// </summary>
    static Point ScaleRgbPointToDepthPoint(Point rgbPoint, MediaFrameReference rgbFrame,
        MediaFrameReference depthFrame)
    {
        return (new Point(
            rgbPoint.X / rgbFrame.Format.VideoFormat.Width * depthFrame.Format.VideoFormat.Width,
            rgbPoint.Y / rgbFrame.Format.VideoFormat.Height * depthFrame.Format.VideoFormat.Height));
    }
    /// <summary>
    /// This code taken fairly literally from this doc
    /// https://docs.microsoft.com/en-us/windows/mixed-reality/locatable-camera#pixel-to-application-specified-coordinate-system
    /// and hopefully without me breaking it too badly.
    /// </summary>
    static Point ScalePointMinusOneToOne(Point point, MediaFrameReference frameRef)
    {
        var scaledPoint = new Point(
            (2.0f * (float)point.X / frameRef.Format.VideoFormat.Width) - 1.0f,
            (2.0f * (1.0f - (float)point.Y / frameRef.Format.VideoFormat.Height)) - 1.0f);

        return (scaledPoint);
    }
    /// <summary>
    /// This code taken fairly literally from this doc
    /// https://docs.microsoft.com/en-us/windows/mixed-reality/locatable-camera#pixel-to-application-specified-coordinate-system
    /// but if it's got messed up in the translation then that's definitely my fault 🙂
    /// </summary>
    static wVector3 UnProjectVector(wVector3 from, wMatrix4x4 cameraProjection)
    {
        var to = new wVector3(0, 0, 0);

        var axsX = new wVector3(cameraProjection.M11, cameraProjection.M12, cameraProjection.M13);

        var axsY = new wVector3(cameraProjection.M21, cameraProjection.M22, cameraProjection.M23);

        var axsZ = new wVector3(cameraProjection.M31, cameraProjection.M32, cameraProjection.M33);

        to.Z = from.Z / axsZ.Z;
        to.Y = (from.Y - (to.Z * axsY.Z)) / axsY.Y;
        to.X = (from.X - (to.Z * axsX.Z)) / axsX.X;

        return to;
    }
    unsafe static float? GetDepthValueAtCoordinate(MediaFrameReference frame, int x, int y)
    {
        float? depthValue = null;

        var bitmap = frame.VideoMediaFrame.SoftwareBitmap;

        using (var buffer = bitmap.LockBuffer(BitmapBufferAccessMode.Read))
        using (var reference = buffer.CreateReference())
        {
            var description = buffer.GetPlaneDescription(0);

            byte* pBits;
            uint size;
            var byteAccess = reference as IMemoryBufferByteAccess;

            byteAccess.GetBuffer(out pBits, out size);

            // Try the pixel value itself and see if we get anything there.
            depthValue = GetDepthValueFromBufferAtXY(
                pBits, x, y, description, (float)frame.VideoMediaFrame.DepthMediaFrame.DepthFormat.DepthScaleInMeters);

#if HUNT_DEPTH_PIXEL_GRID
            if (depthValue == null)
            {
                // If we don't have a value, look for one in the surrounding space (the sub-function copes
                // with us using bad values of x,y).
                var minDistance = double.MaxValue;

                for (int i = 0 - DEPTH_SEARCH_GRID_SIZE; i < DEPTH_SEARCH_GRID_SIZE; i++)
                {
                    for (int j = 0 - DEPTH_SEARCH_GRID_SIZE; j < DEPTH_SEARCH_GRID_SIZE; j++)
                    {
                        var newX = x + i;
                        var newY = y + j;

                        var testValue = GetDepthValueFromBufferAtXY(
                            pBits,
                            newX,
                            newY,
                            description,
                            (float)frame.VideoMediaFrame.DepthMediaFrame.DepthFormat.DepthScaleInMeters);

                        if (testValue != null)
                        {
                            var distance =
                                Math.Sqrt(Math.Pow(newX - x, 2.0) + Math.Pow(newY - y, 2.0));

                            if (distance < minDistance)
                            {
                                depthValue = testValue;
                                minDistance = distance;
                            }
                        }
                    }
                }
            }
#endif // HUNT_DEPTH_PIXEL_GRID
        }
        return (depthValue);
    }
    unsafe static float? GetDepthValueFromBufferAtXY(byte* pBits, int x, int y, BitmapPlaneDescription desc,
        float scaleInMeters)
    {
        float? depthValue = null;

        var bytesPerPixel = desc.Stride / desc.Width;
        Debug.Assert(bytesPerPixel == Marshal.SizeOf<UInt16>());

        int offset = (desc.StartIndex + desc.Stride * y) + (x * bytesPerPixel);

        if ((offset > 0) && (offset < ((long)pBits + (desc.Height * desc.Stride))))
        {
            depthValue = *(UInt16*)(pBits + offset) * scaleInMeters;

            if (!IsValidDepthDistance((float)depthValue))
            {
                depthValue = null;
            }
        }
        return (depthValue);
    }
    static bool IsValidDepthDistance(float depthDistance)
    {
        // Discard values outside of (0.5m, 4.0m] - it seems like ~4.09m (4090) comes back from
        // the sensor when it hasn't really got a value, and very near readings don't seem
        // reliable on this (long range) stream.
        return ((depthDistance > 0.5f) && (depthDistance <= 4.0f));
    }
    // Used an explicit tuple here as I'm in C# 6.0
    async Task<Tuple<MediaCapture, MediaFrameSource>> GetMediaCaptureForDescriptionAsync(
        MediaFrameSourceKind sourceKind,
        int width,
        int height,
        int frameRate,
        string[] bitmapFormats = null)
    {
        MediaCapture mediaCapture = null;
        MediaFrameSource frameSource = null;

        var allSources = await MediaFrameSourceGroup.FindAllAsync();

        // Ignore frame rate here on the description as both depth streams seem to tell me they are
        // 30fps whereas I don't think they are (from the docs) so I leave that to query later on.
        // NB: LastOrDefault here is a NASTY, NASTY hack - just my way of getting hold of the 
        // *LAST* depth stream rather than the *FIRST* because I'm assuming that the *LAST*
        // one is the longer distance stream rather than the short distance stream.
        // I should fix this and find a better way of choosing the right depth stream rather
        // than relying on some ordering that's not likely to always work!
        var sourceInfo =
            allSources.SelectMany(group => group.SourceInfos)
            .LastOrDefault(
                si =>
                    (si.MediaStreamType == MediaStreamType.VideoRecord) &&
                    (si.SourceKind == sourceKind) &&
                    (si.VideoProfileMediaDescription.Any(
                        desc =>
                            desc.Width == width &&
                            desc.Height == height &&
                            desc.FrameRate == frameRate)));

        if (sourceInfo != null)
        {
            var sourceGroup = sourceInfo.SourceGroup;

            mediaCapture = new MediaCapture();

            await mediaCapture.InitializeAsync(
               new MediaCaptureInitializationSettings()
               {
                    // I want software bitmaps
                    MemoryPreference = MediaCaptureMemoryPreference.Cpu,
                   SourceGroup = sourceGroup,
                   StreamingCaptureMode = StreamingCaptureMode.Video
               }
            );
            frameSource = mediaCapture.FrameSources[sourceInfo.Id];

            var selectedFormat = frameSource.SupportedFormats.First(
                format =>
                    format.VideoFormat.Width == width && format.VideoFormat.Height == height &&
                    format.FrameRate.Numerator / format.FrameRate.Denominator == frameRate &&
                    ((bitmapFormats == null) || (bitmapFormats.Contains(format.Subtype.ToLower()))));

            await frameSource.SetFormatAsync(selectedFormat);
        }
        return (Tuple.Create(mediaCapture, frameSource));
    }
    static wMatrix4x4 ByteArrayToMatrix(byte[] bits)
    {
        var matrix = wMatrix4x4.Identity;

        var handle = GCHandle.Alloc(bits, GCHandleType.Pinned);
        matrix = Marshal.PtrToStructure<wMatrix4x4>(handle.AddrOfPinnedObject());
        handle.Free();

        return (matrix);
    }
#if HUNT_DEPTH_PIXEL_GRID

    static readonly int DEPTH_SEARCH_GRID_SIZE = 32;

#endif // HUNT_DEPTH_PIXEL_GRID

    static readonly float MAGIC_DEPTH_FRAME_HEIGHT_RATIO_CENTRE = 0.25f;
    static readonly float MAGIC_DEPTH_FRAME_WIDTH_RATIO_CENTRE = 0.5f;
    static readonly Guid MFSampleExtension_Spatial_CameraCoordinateSystem = new Guid("9D13C82F-2199-4E67-91CD-D1A4181F2534");
    static readonly Guid MFSampleExtension_Spatial_CameraProjectionTransform = new Guid("47F9FCB5-2A02-4F26-A477-792FDF95886A");
    static readonly Guid MFSampleExtension_Spatial_CameraViewTransform = new Guid("4E251FA4-830F-4770-859A-4B8D99AA809B");

#endif // ENABLE_WINMD_SUPPORT
}

and it produces a sort of bouncing ball which (ideally) hovers around faces, as shown in the screen capture below, which makes it look rather better at finding faces than it actually is Winking smile

Sketch

with some on-screen diagnostics trying to show how many;

  • depth frames we have seen
  • RGB frames we have seen
  • RGB frames we have ignored because we were still processing the previous frame
  • frames we have seen which contained faces

along with the current depth value obtained from the ‘centre point’ of the camera which you’ll notice in the code is hard-coded to be a point 25% down the frame and 50% across – that’s just a ‘best guess’ right now rather than anything ‘scientific’.

Wrapping Up the Experiments for Now

I clearly need to spend some more time experimenting here as I haven’t quite got to the result that I wanted, but I learned quite a lot along the way even if my results might be a little flawed.

Through this post, I’ve been questioning my initial assumption that using the depth frames for estimating the Z-coordinate of a face was a good route to take.

Maybe that’s not right? Given that I get a long-range depth frame at 1fps and given that the depth data seems to be concentrated in one area of that frame, perhaps it doesn’t make sense to try to use the depth frame in this way to identify the Z-coordinate of a face (or other object in space). Maybe it’s better to go via the regular route of using the spatial mesh which the device builds so well after all?

I need to try a few more things out Smile

Code?

I haven’t published separate pieces of code for all of the experiments above but the Unity project that I have which brought some of them together in the last experiment is in this repo;

https://github.com/mtaulty/FacialDepthExperiments

Note that if you take the code and build it from Unity then you’ll need to mark the C# project assembly as allowing unsafe code before you’ll be able to get Visual Studio to build it – I can’t find a setting in Unity that seems to allow unsafe code in the separate C# project assembly rather than the main executable itself.

Note also that you will need to manually edit the .appxmanifest file to add the restricted permission for perceptionSensorsExperimental as I wrote up in a previous post because there is no way (as far as I know) to set this in either the Unity or Visual Studio editors.

Lastly, apply a pinch of salt – I’m just experimenting here Smile

Experimenting with Research Mode and Sensor Streams on HoloLens Redstone 4 Preview (Part 2)

This is a follow-on from this previous post;

Experimenting with Research Mode and Sensor Streams on HoloLens Redstone 4 Preview

and so please read that post if you want to get the context and, importantly, for the various caveats and links that I had in that post about working with ‘research mode’ in the Redstone 4 Preview on HoloLens.

I updated the code from the previous post to provide what I think is a slightly better experience: I removed the attempt to display multiple streams from the device at the same time and, instead, switched to a model where the app on the device has a notion of the ‘current stream’ that it is sending over the network to the receiving desktop app.

In that desktop app, I can then show the initial stream from the device and allow the user to cycle through the available streams as per the screenshots below. The streams are still not being sent to the desktop at their actual frame rate but, as before, on a timer-based interval which is hard-wired into the HoloLens app for the moment.

Making these changes meant altering the code such that it no longer selects one depth and one infrared stream but, instead, attempts to read from all depth, infrared and colour streams. When the desktop app connects, it receives the descriptions of these streams and it then has buttons to notify the remote app to switch to the next/previous stream in its list.

Here’s how that looks across the 8 different streams that I am getting back from the device.

This first one would appear to be an environment tracking stream which is looking more or less ‘straight ahead’ although the image would appear to be rotated 90 degrees anti-clockwise;

1

This second stream would again appear to be environment tracking taking in a scene that’s to the left of my gaze and again rotated 90 degrees anti-clockwise;

2

This next stream is a depth view, looking forward although it can be hard to see much in there without movement to help out.

I’m not sure that I’m building the description of this stream correctly because my code says 15fps whereas the documentation seems to suggest that depth streams run at either 1fps or 30fps, so perhaps I have a bug here. This depth stream feels like it has a wider aperture and so perhaps it is the stream which the docs describe as;

“one for high-frequency (30 fps) near-depth sensing, commonly used in hand tracking”

but that’s only a guess based on what I can visually see in this stream;

3

and the next stream that I get is an infrared stream at 3 fps with what feels like a narrow aperture;

4

with the follow-on stream being depth again at what feels like a narrow aperture;

5

and then I have an environment view to the right side of my gaze rotated 90 degrees anti-clockwise;

6

and another environment view which feels more or less ‘straight ahead’, rotated 90 degrees anti-clockwise;

7

and lastly an infrared view at 3 fps with what feels like a wider aperture;

8

This code feels a bit more ‘usable’ than what I had at the end of the previous blog post and I’ve tried to make it a little more resilient such that should one end of the connection drop, the other app should pause and be capable of reconnecting when its peer returns.

The code for this is committed to master in the same repo as I had in the previous post;

https://github.com/mtaulty/ExperimentalSensorApps

Feel free to take that, experiment with it yourself and so on but keep in mind that it’s a fairly rough experiment rather than some polished sample.