Sketchy Experiments with HoloLens, Facial Tracking, Research Mode, RGB Streams, Depth Streams.

NB: The usual blog disclaimer for this site applies to posts around HoloLens. I am not on the HoloLens team. I have no details on HoloLens other than what is on the public web and so what I post here is just from my own experience experimenting with pieces that are publicly available and you should always check out the official developer site for the product documentation.

For quite a while, I have wanted to put together a small sample that ran on HoloLens which did facial tracking but which also identified in X,Y,Z space where the tracked faces were located.

I should start by saying that there is an official sample which does exactly this and it’s over here on github;

Holographic face tracking sample

and so if you’re looking for a sample to learn from which actually works Winking smile then feel free to stop reading at this point.

I spent a little time looking at this sample and, when I dug into it, I noticed that it takes a (reasonable) approach to estimating the Z co-ordinate of a face that has been detected in an RGB frame – it assumes that a face at a certain distance (1m) will have a certain width and uses that as a basis for working out how far away the face is.

That seems to work pretty well but I’d long been drawn to the idea that the HoloLens could give me a more accurate measurement of how far away a face was and I wanted to experiment with that.

The rest of this post is my rough notes around some experiments that don’t quite get me to the point of making that work but I wrote things down as I went along thinking that the notes might be useful to me in the future and (maybe) to someone else.

I’d urge you to apply a large pinch of salt to what you see here because I’m very much experimenting and definitely don’t have things figured out fully.

That sample is great though because it brings together a number of the pieces that I’d expect to use in my own experiments such as;

  • It uses the media capture APIs to get access to a web camera and read frames from it – classes like MediaFrameSource and MediaFrameReader
  • It uses the CameraIntrinsics class in order to ‘unproject’ from the pixel X,Y co-ordinates of a web-camera-captured-image back into a co-ordinate system of the web camera itself
  • It uses the SpatialCoordinateSystem class to transform from a co-ordinate system of a web-camera back into the world co-ordinate system of the application itself (there’s a rough sketch of this and the previous piece just after this list)
  • It uses the FaceTracker class in order to identify faces in images taken from the web-camera (there’s no need to call out to the cloud just to detect the bounding rectangles of a face)
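
To give a flavour of those middle two pieces (the intrinsics-based ‘unproject’ and the co-ordinate system transform), a rough sketch of the idea is below – this is my reading of it rather than code lifted from the sample, and cameraIntrinsics, cameraCoordinateSystem and appCoordinateSystem are placeholders for objects that would come from a captured frame and from the app itself;

// Rough sketch, not code from the sample. CameraIntrinsics comes from
// Windows.Media.Devices.Core, SpatialCoordinateSystem from Windows.Perception.Spatial
// and Vector2/Vector3/Matrix4x4 from System.Numerics.
static Vector3? PixelToWorld(
    CameraIntrinsics cameraIntrinsics,
    SpatialCoordinateSystem cameraCoordinateSystem,
    SpatialCoordinateSystem appCoordinateSystem,
    Point pixel)
{
    // Pixel -> a point at unit depth in front of the camera; the camera looks down
    // -Z in its own (right-handed) coordinate system.
    Vector2 unprojected = cameraIntrinsics.UnprojectAtUnitDepth(pixel);

    var cameraSpacePoint = new Vector3(unprojected.X, unprojected.Y, -1.0f);

    // Camera coordinate system -> the application's own coordinate system.
    Matrix4x4? cameraToWorld = cameraCoordinateSystem.TryGetTransformTo(appCoordinateSystem);

    return cameraToWorld.HasValue ?
        Vector3.Transform(cameraSpacePoint, cameraToWorld.Value) : (Vector3?)null;
}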

The sample also takes an approach that is similar to the one that I wanted to follow in that it tries to do ‘real time’ processing on video frames taken from a stream coming off the camera rather than, e.g., taking a single photo or recording frames into a file in order to do some type of ‘one-shot’ or ‘offline’ processing.

In many ways, then, this sample is what I want but I have a few differences in what I want to do;

  • This sample estimates the Z coordinate of the face whereas I’d like the device to give me as accurate a value as it can.
  • This sample is written in C++ using DirectX whereas I wanted to do something in C# and Unity, since C++ has become a language that I can read much better than I can write as I do so little of it these days.

Getting to the Z-coordinate feels like the main challenge here and I can think of at least a couple of ways of doing it. One might be to make use of the spatial mesh that the HoloLens makes available and do some kind of ‘ray cast’ from the camera along the user’s gaze vector, using the length of the ray where it hits the mesh as a depth value. This sort of idea has been discussed in the HoloLens developer forums for a long time;

Access to raw depth data stream

but it initially felt to me like it might not be quite right for tracking people as they moved around in a room and so rather than going in that direction, I wanted to experiment with the new ‘research mode’ which came to HoloLens in the Redstone 4 release and which gives me direct access to a stream of depth images. It seemed like it might be ‘interesting’ to see what it’s like to directly use the depth frames from the camera in order to calculate a Z-coordinate for a pixel in an image taken from the web (RGB) camera.
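
(As an aside, the gaze ray-cast idea that I’m setting aside here would look something like the sketch below in Unity. It assumes that the spatial mapping mesh has colliders and that they live on layer 31 – that layer number is purely my assumption and not something taken from the forum thread above.)

// Rough sketch of the 'ray cast along the gaze vector against the spatial mesh' idea.
// Assumes the spatial mapping mesh has colliders on layer 31 - that's an assumption.
using UnityEngine;

public class GazeDepthSketch : MonoBehaviour
{
    const int assumedSpatialMappingLayer = 31;

    void Update()
    {
        RaycastHit hit;

        if (Physics.Raycast(
            Camera.main.transform.position,
            Camera.main.transform.forward,
            out hit,
            10.0f,
            1 << assumedSpatialMappingLayer))
        {
            // hit.distance is then a depth estimate along the gaze and hit.point is a
            // world space position on the spatial mesh.
            Debug.Log($"Gaze hits the spatial mesh {hit.distance:N2}m away");
        }
    }
}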

I first encountered ‘Research Mode’ in this blog post;

Experimenting with Research Mode and Sensor Streams on HoloLens Redstone 4 Preview

and there are a tonne of caveats around using ‘Research Mode’ and so please read that post and the official docs and make sure that you understand the caveats before you do anything with it on your own devices.

Note also that if you want something more definitive (and correct!) around research mode then some new samples were published while I was experimenting around for this post and so make sure that you check those out here;

https://github.com/Microsoft/HoloLensForCV

With that said, what is it that I wanted to achieve in my sample?

  • Read video frames from the RGB camera at some frequency – this is achievable using the UWP media capture APIs.
  • Identify faces within those video frames – achievable using the UWP face detection APIs which will give me bounding boxes (X,Y, width, height) in pixel co-ordinates within the captured image. I could perhaps simplify this into a single X,Y point per face by taking the centre of the bounding box.
  • Read depth frames from the depth camera at some frequency – this is achievable using the UWP media capture APIs on a device running in ‘research mode’.
  • Somehow correlate a depth frame with a video frame.
  • Use the X,Y co-ordinate of the detected face(s) in a video frame to index into a frame from the depth camera to determine the depth at that point.
  • Transform the X,Y co-ordinate from the video frame and the Z co-ordinate from the depth frame back into world space in order to display something (e.g. a box or a 3D face) at the location that the face was detected.

This leaves me with a sort of 50:50 ratio between things that I know can be done and things that I haven’t done before and so I set about a few experiments to test a few things out.

Before doing that, though, I spent a long time reading a couple of articles in the docs (including the one on the locatable camera that I’ll come back to below).

It’s worth saying that the LocatableCamera support in Unity is very close to what I want but, unfortunately, the video capture side of those APIs only records video to a file rather than giving me a stream of video frames which I can then process using something like the face detection APIs, and so I don’t think it’s really of use to me.

Other people have noticed this here;

HoloLensCameraStream for Unity

although at the time of writing I haven’t dug too far into that project as I encountered it when I’d nearly completed this post.

With all that said, I switched on ‘research mode’ on my HoloLens and tried a few initial experiments to see how things looked…

Experiment 1 – Co-ordinating Frames from Multiple Sources

If I want to read video frames and depth frames at the same time then the UWP has the notion of a MultiSourceMediaFrameReader which can do just that for me firing an event when a frame is available from each of the requested media sources and saving me a bunch of work in trying to piece those frames together myself.
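
In terms of shape, I’d expect using that reader to look something like the sketch below – mediaCapture, colorSource and depthSource here are hypothetical variables, and (as I found out below) this only really applies when the sources live in the same source group;

// Sketch only - colorSource and depthSource are hypothetical MediaFrameSource instances
// obtained from a MediaCapture ('mediaCapture') initialised for their (single) source group.
var multiReader = await mediaCapture.CreateMultiSourceFrameReaderAsync(
    new[] { colorSource, depthSource });

multiReader.FrameArrived += (s, e) =>
{
    using (var frameSet = s.TryAcquireLatestFrame())
    {
        // One correlated MediaFrameReference per source, looked up by source id.
        var colorFrame = frameSet?.TryGetFrameReferenceBySourceId(colorSource.Info.Id);
        var depthFrame = frameSet?.TryGetFrameReferenceBySourceId(depthSource.Info.Id);
    }
};

await multiReader.StartAsync();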

I was curious – could I get a frame from the RGB video camera and a frame from the depth camera at the same time pre-correlated by the system for me?

In looking a little deeper, I’m not sure that I can. As far as I can tell, the only way to get hold of a MultiSourceMediaFrameReader is to call MediaCapture.CreateMultiSourceFrameReaderAsync and that requires a MediaCapture which can be initialised for one and only one MediaFrameSourceGroup and, as the docs say;

“The MediaFrameSourceGroup object represents a set of media frame sources that can be used simultaneously”

and so if I run this little piece of code on my HoloLens;

var colourGroup = await MediaFrameSourceGroup.FindAllAsync();

            foreach (var group in colourGroup)
            {
                Debug.WriteLine($"Group {group.DisplayName}");
                foreach (var source in group.SourceInfos)
                {
                    Debug.WriteLine($"\tSource {source.MediaStreamType}, {source.SourceKind}, {source.Id}");

                    foreach (var profile in source.VideoProfileMediaDescription)
                    {
                        Debug.WriteLine($"\t\tProfile {profile.Width}x{profile.Height}@{profile.FrameRate:N0}");
                    }
                }
            }

then the output I get is as below;

Group Sensor Streaming
     Source VideoRecord, Depth, Source#0@\\?\Root#SensorStreamingMiniDriver#0000#{e5323777-f976-4f5b-9b55-b94699c46e44}\{b27e3887-ad10-4a4e-bfb8-d6765add0e38}
         Profile 448×450@15
     Source VideoRecord, Infrared, Source#1@\\?\Root#SensorStreamingMiniDriver#0000#{e5323777-f976-4f5b-9b55-b94699c46e44}\{b27e3887-ad10-4a4e-bfb8-d6765add0e38}
         Profile 448×450@3
     Source VideoRecord, Depth, Source#2@\\?\Root#SensorStreamingMiniDriver#0000#{e5323777-f976-4f5b-9b55-b94699c46e44}\{b27e3887-ad10-4a4e-bfb8-d6765add0e38}
         Profile 448×450@15
     Source VideoRecord, Infrared, Source#3@\\?\Root#SensorStreamingMiniDriver#0000#{e5323777-f976-4f5b-9b55-b94699c46e44}\{b27e3887-ad10-4a4e-bfb8-d6765add0e38}
         Profile 448×450@3
     Source VideoRecord, Color, Source#4@\\?\Root#SensorStreamingMiniDriver#0000#{e5323777-f976-4f5b-9b55-b94699c46e44}\{b27e3887-ad10-4a4e-bfb8-d6765add0e38}
         Profile 160×480@30
     Source VideoRecord, Color, Source#5@\\?\Root#SensorStreamingMiniDriver#0000#{e5323777-f976-4f5b-9b55-b94699c46e44}\{b27e3887-ad10-4a4e-bfb8-d6765add0e38}
         Profile 160×480@30
     Source VideoRecord, Color, Source#6@\\?\Root#SensorStreamingMiniDriver#0000#{e5323777-f976-4f5b-9b55-b94699c46e44}\{b27e3887-ad10-4a4e-bfb8-d6765add0e38}
         Profile 160×480@30
     Source VideoRecord, Color, Source#7@\\?\Root#SensorStreamingMiniDriver#0000#{e5323777-f976-4f5b-9b55-b94699c46e44}\{b27e3887-ad10-4a4e-bfb8-d6765add0e38}
         Profile 160×480@30
Group MN34150
     Source VideoPreview, Color, Source#0@\\?\DISPLAY#INT22B8#4&27b432bd&0&UID139960#{e5323777-f976-4f5b-9b55-b94699c46e44}\{cdd6871a-56ca-4386-bae7-d24b564378a9}
         Profile 1280×720@30
         Profile 896×504@30
         Profile 1344×756@30
         Profile 1408×792@30
         Profile 1280×720@24
         Profile 896×504@24
         Profile 1344×756@24
         Profile 1408×792@24
         Profile 1280×720@20
         Profile 896×504@20
         Profile 1344×756@20
         Profile 1408×792@20
         Profile 1280×720@15
         Profile 896×504@15
         Profile 1344×756@15
         Profile 1408×792@15
         Profile 1280×720@5
         Profile 896×504@5
         Profile 1344×756@5
         Profile 1408×792@5
     Source VideoRecord, Color, Source#1@\\?\DISPLAY#INT22B8#4&27b432bd&0&UID139960#{e5323777-f976-4f5b-9b55-b94699c46e44}\{cdd6871a-56ca-4386-bae7-d24b564378a9}
         Profile 1280×720@30
         Profile 896×504@30
         Profile 1344×756@30
         Profile 1408×792@30
         Profile 1280×720@24
         Profile 896×504@24
         Profile 1344×756@24
         Profile 1408×792@24
         Profile 1280×720@20
         Profile 896×504@20
         Profile 1344×756@20
         Profile 1408×792@20
         Profile 1280×720@15
         Profile 896×504@15
         Profile 1344×756@15
         Profile 1408×792@15
         Profile 1280×720@5
         Profile 896×504@5
         Profile 1344×756@5
         Profile 1408×792@5
     Source Photo, Image, Source#2@\\?\DISPLAY#INT22B8#4&27b432bd&0&UID139960#{e5323777-f976-4f5b-9b55-b94699c46e44}\{cdd6871a-56ca-4386-bae7-d24b564378a9}
         Profile 1280×720@30
         Profile 1280×720@0
         Profile 896×504@30
         Profile 896×504@0
         Profile 1344×756@30
         Profile 1344×756@0
         Profile 1408×792@30
         Profile 1408×792@0
         Profile 2048×1152@30
         Profile 2048×1152@0

I’m not sure why some profiles seem to come up with a zero frame-rate (I perhaps need to look into that) but I read this as essentially telling me that I have 2 MediaFrameSourceGroups here, with my RGB stream in one and my depth streams in the other, and so I don’t think that I can use a single MediaCapture and a single MultiSourceMediaFrameReader to map between them.

I think that leaves me with a couple of options;

  • I could try and use a multi source frame reader across Depth + InfraRed and see whether I can do facial detection on the InfraRed images?
  • I could avoid multi source frame readers, use separate readers and take some approach to trying to correlate depth and RGB images myself.

The other thing that surprised me here is that the frame rates of those depth streams (both reported at 15fps) don’t seem to line up with the Research Mode docs – I’ll come back to this.

This was a useful experiment and my inclination is to go with the second approach – have multiple frame readers and try to link up frames as best I can.

Experiment 2 – Camera Intrinsics, Coordinate Systems

When you receive frames from a media source, they are delivered in the shape of a MediaFrameReference which contains metadata such as timings, durations, formats and a CoordinateSystem and then the VideoMediaFrame itself.

That frame then provides access to (e.g.) the SoftwareBitmap (if it’s been requested by choosing a Cpu preference) and if the frame is a depth frame then it also offers up details on that depth data via the DepthMediaFrame property.

If I then add a little code into my example to try and create MediaCapture and MediaFrameSource instances for me as below;

 async Task<(MediaCapture capture, MediaFrameSource source)> GetMediaCaptureForDescriptionAsync(
            MediaFrameSourceKind sourceKind,
            int width,
            int height,
            int frameRate)
        {
            MediaCapture mediaCapture = null;
            MediaFrameSource frameSource = null;

            var allSources = await MediaFrameSourceGroup.FindAllAsync();

            // Note: the two depth streams report 15fps here even though the research mode docs
            // suggest they run at 30fps (short throw) and 1fps (long throw) - I just match
            // against whatever the API reports.
            var sourceInfo =
                allSources.SelectMany(group => group.SourceInfos)
                .FirstOrDefault(
                    si =>
                        (si.MediaStreamType == MediaStreamType.VideoRecord) &&
                        (si.SourceKind == sourceKind) &&
                        (si.VideoProfileMediaDescription.Any(
                            desc =>
                                desc.Width == width &&
                                desc.Height == height &&
                                desc.FrameRate == frameRate)));

            if (sourceInfo != null)
            {
                var sourceGroup = sourceInfo.SourceGroup;

                mediaCapture = new MediaCapture();

                await mediaCapture.InitializeAsync(
                   new MediaCaptureInitializationSettings()
                   {
                       // I want software bitmaps
                       MemoryPreference = MediaCaptureMemoryPreference.Cpu,
                       SourceGroup = sourceGroup,
                       StreamingCaptureMode = StreamingCaptureMode.Video
                   }
                );
                frameSource = mediaCapture.FrameSources[sourceInfo.Id];

                var selectedFormat = frameSource.SupportedFormats.First(
                    format => format.VideoFormat.Width == width && format.VideoFormat.Height == height &&
                    format.FrameRate.Numerator / format.FrameRate.Denominator == frameRate);

                await frameSource.SetFormatAsync(selectedFormat);
            }
            return (mediaCapture, frameSource);
        }

then I can open up both an RGB frame reader and a depth frame reader and have a bit of a look at what’s present there…

var rgbMedia = await this.GetMediaCaptureForDescriptionAsync(
                MediaFrameSourceKind.Color, 1280, 720, 30);

            var rgbReader = await rgbMedia.capture.CreateFrameReaderAsync(rgbMedia.source);

            rgbReader.AcquisitionMode = MediaFrameReaderAcquisitionMode.Realtime;

            var depthMedia = await this.GetMediaCaptureForDescriptionAsync(
                MediaFrameSourceKind.Depth, 448, 450, 15);

            var depthReader = await depthMedia.capture.CreateFrameReaderAsync(depthMedia.source);

            depthReader.AcquisitionMode = MediaFrameReaderAcquisitionMode.Realtime;

            TypedEventHandler<MediaFrameReader, MediaFrameArrivedEventArgs> handler =
                (s,e) =>
                {
                    using (var frame = s.TryAcquireLatestFrame())
                    {
                        if (frame != null)
                        {
                            Debug.WriteLine($"Frame of type {frame.SourceKind}");
                            Debug.WriteLine($"Intrinsics present? {frame.VideoMediaFrame.CameraIntrinsics != null}");
                            Debug.WriteLine($"Coordinate system present? {frame.CoordinateSystem != null}");
                        }
                    }
                };

            rgbReader.FrameArrived += handler;
            depthReader.FrameArrived += handler;

            await rgbReader.StartAsync();
            await depthReader.StartAsync();

            // Wait forever then dispose...
            await Task.Delay(-1);

            rgbReader.Dispose();
            depthReader.Dispose();

            rgbMedia.capture.Dispose();
            depthMedia.capture.Dispose();

and the output that I get is a little disappointing…

Frame of type Color
Intrinsics present? False
Coordinate system present? False
Frame of type Depth
Intrinsics present? False
Coordinate system present? False

so I don’t seem to get CameraIntrinsics or a CoordinateSystem on either of these frame types and I was thinking that I’d probably need both of these things in order to be able to transform from an X,Y pixel co-ordinate to world space.

I was especially hoping that the CameraIntrinsics might enable me to use this API on the DepthMediaFrame;

DepthMediaFrame.TryCreateCoordinateMapper

which sounds like it might be exactly what I need to transform points and I’ve seen this used in samples.
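
For reference, my (untested) understanding from those samples is that usage would look something like the sketch below, where rgbIntrinsics, rgbCoordinateSystem and targetCoordinateSystem are placeholders for the RGB camera’s intrinsics, its coordinate system and the coordinate system that I want results in – none of which I can currently get hold of;

// Untested sketch based on how I've seen this used in samples - rgbIntrinsics,
// rgbCoordinateSystem and targetCoordinateSystem are placeholders.
var depthMediaFrame = depthFrame.VideoMediaFrame.DepthMediaFrame;

var mapper = depthMediaFrame.TryCreateCoordinateMapper(rgbIntrinsics, rgbCoordinateSystem);

if (mapper != null)
{
    // Unproject a pixel from the RGB image (using the correlated depth data) into
    // the target coordinate system.
    var worldPoint = mapper.UnprojectPoint(new Point(640, 360), targetCoordinateSystem);
}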

That lack of CameraIntrinsics seems to be picked up in this forum post;

CameraIntrinsics always null

and I wonder whether this might be a bug or maybe I’m missing some flag to switch it on but I haven’t figured that out at the time of writing.

I did also attempt to get the CameraIntrinsics by reaching into the MediaFrameSource and using the TryGetCameraIntrinsics method but I found that this seemed to return null for all the combinations of parameters that I passed to it.

What to do? Well, that locatable camera article suggested that there may be more than one way to go about this, specifically if I have these 3 GUIDs;

        static readonly Guid MFSampleExtension_Spatial_CameraCoordinateSystem = new Guid("9D13C82F-2199-4E67-91CD-D1A4181F2534");
        static readonly Guid MFSampleExtension_Spatial_CameraProjectionTransform = new Guid("47F9FCB5-2A02-4F26-A477-792FDF95886A");
        static readonly Guid MFSampleExtension_Spatial_CameraViewTransform = new Guid("4E251FA4-830F-4770-859A-4B8D99AA809B");

then I can index into the properties that are part of the MediaFrameReference and if I change my innermost if statement to be;

if (frame != null)
{
    Debug.WriteLine($"Frame of type {frame.SourceKind}");

    SpatialCoordinateSystem coordinateSystem = null;
    byte[] viewTransform = null;
    byte[] projectionTransform = null;
    object value;

    if (frame.Properties.TryGetValue(MFSampleExtension_Spatial_CameraCoordinateSystem, out value))
    {
        coordinateSystem = value as SpatialCoordinateSystem;
    }
    if (frame.Properties.TryGetValue(MFSampleExtension_Spatial_CameraProjectionTransform, out value))
    {
        projectionTransform = value as byte[];
    }
    if (frame.Properties.TryGetValue(MFSampleExtension_Spatial_CameraViewTransform, out value))
    {
        viewTransform = value as byte[];
    }

    Debug.WriteLine($"Coordinate system present? {coordinateSystem != null}");
    Debug.WriteLine($"View transform present? {viewTransform != null}");
    Debug.WriteLine($"Projection transform present? {projectionTransform != null}");
}

then I get output that indicates that I can get hold of the SpatialCoordinateSystem and the View Transform and Projection Transform for the RGB camera. I get nothing back for the depth camera.

So, that’s been a useful experiment – it tells me that I might be able to transform from an X,Y pixel co-ordinate back to world space although I need to be able to translate those byte[] arrays back into matrices.

I’m not sure quite how to do that but I wrote;

        static Matrix4x4 ByteArrayToMatrix(byte[] bits)
        {
            Matrix4x4 matrix = Matrix4x4.Identity;

            var handle = GCHandle.Alloc(bits, GCHandleType.Pinned);
            matrix = Marshal.PtrToStructure<Matrix4x4>(handle.AddrOfPinnedObject());
            handle.Free();

            return (matrix);
        }

and maybe that will do it for me if my assumption about how those matrices have been packed as byte[] is right?
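
If my assumption about the packing is right then those blobs should be exactly 64 bytes (16 floats), which is at least a cheap thing to check;

// Cheap sanity check on the packing assumption - a Matrix4x4 is 16 floats (64 bytes)
// so the transform blob should be exactly that size before handing it to ByteArrayToMatrix.
if (frame.Properties.TryGetValue(MFSampleExtension_Spatial_CameraProjectionTransform, out value))
{
    var bits = value as byte[];

    Debug.Assert((bits != null) && (bits.Length == Marshal.SizeOf<Matrix4x4>()));

    var projectionTransform = ByteArrayToMatrix(bits);
}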

Experiment 3 – Getting Depth Values

I wondered what it looked like to get depth values from the depth sensor and so I reworked the code above a little to bring in the infamous IMemoryBufferByteAccess (meaning that I have to compile with unsafe code);

    [ComImport]
    [Guid("5B0D3235-4DBA-4D44-865E-8F1D0E4FD04D")]
    [InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
    unsafe interface IMemoryBufferByteAccess
    {
        void GetBuffer(out byte* buffer, out uint capacity);
    }

and then reworked my frame handling code so as to look as below;

 var depthMedia = await this.GetMediaCaptureForDescriptionAsync(
                MediaFrameSourceKind.Depth, 448, 450, 15);

            var depthReader = await depthMedia.capture.CreateFrameReaderAsync(depthMedia.source);

            depthReader.AcquisitionMode = MediaFrameReaderAcquisitionMode.Realtime;

            TypedEventHandler<MediaFrameReader, MediaFrameArrivedEventArgs> handler =
                (s,e) =>
                {
                    using (var frame = s.TryAcquireLatestFrame())
                    {
                        if (frame != null)
                        {
                            var centrePoint = new Point(
                                frame.Format.VideoFormat.Width / 2,
                                frame.Format.VideoFormat.Height / 2);

                            using (var bitmap = frame.VideoMediaFrame.SoftwareBitmap)
                            using (var buffer = bitmap.LockBuffer(BitmapBufferAccessMode.Read))
                            using (var reference = buffer.CreateReference())
                            {
                                var description = buffer.GetPlaneDescription(0);
                                var bytesPerPixel = description.Stride / description.Width;

                                Debug.Assert(bytesPerPixel == Marshal.SizeOf<UInt16>());

                                int offset =
                                    (description.StartIndex + description.Stride * (int)centrePoint.Y) +
                                    ((int)centrePoint.X * bytesPerPixel);

                                UInt16 depthValue = 0;

                                unsafe
                                {
                                    byte* pBits;
                                    uint size;
                                    var byteAccess = reference as IMemoryBufferByteAccess;
                                    byteAccess.GetBuffer(out pBits, out size);
                                    depthValue = *(UInt16*)(pBits + offset);
                                }
                                Debug.WriteLine($"Depth in centre is {depthValue}");
                            }
                        }
                    }
                };

            depthReader.FrameArrived += handler;

            await depthReader.StartAsync();

            // Wait forever then dispose...
            await Task.Delay(-1);

            depthReader.Dispose();

            depthMedia.capture.Dispose();

and so my hope is to trace out the depth value that is obtained from the centre point of the depth frame itself.
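
The handler above only traces the raw UInt16 value; for the output below I also scaled it into metres and traced the depth format and the min/max reliable depths via the frame’s DepthMediaFrame – something along these lines (a sketch rather than the exact code I had at the time);

// Extra diagnostics - scale the raw D16 value into metres using the scale reported by
// the depth frame itself.
var depthMediaFrame = frame.VideoMediaFrame.DepthMediaFrame;
var scale = depthMediaFrame.DepthFormat.DepthScaleInMeters;

Debug.WriteLine($"Max Depth in Metres {depthMediaFrame.MaxReliableDepth * scale}m");
Debug.WriteLine($"Min Depth in Metres {depthMediaFrame.MinReliableDepth * scale}m");
Debug.WriteLine($"Depth Format is {frame.Format.Subtype}");
Debug.WriteLine($"Depth in centre is {depthValue * scale}m");

I ran this and saw this type of output;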

Max Depth in Metres 65.535m
Min Depth in Metres 0.001m
Depth Format is D16
Depth in centre is 4.09m
Max Depth in Metres 65.535m
Min Depth in Metres 0.001m
Depth Format is D16
Depth in centre is 4.09m
Max Depth in Metres 65.535m
Min Depth in Metres 0.001m
Depth Format is D16
Depth in centre is 0.43m
Max Depth in Metres 65.535m
Min Depth in Metres 0.001m
Depth Format is D16
Depth in centre is 0.577m
Max Depth in Metres 65.535m
Min Depth in Metres 0.001m
Depth Format is D16
Depth in centre is 0.579m
Max Depth in Metres 65.535m
Min Depth in Metres 0.001m
Depth Format is D16
Depth in centre is 0.583m

and the code seemed to be ‘working’, except that I noticed that when I pointed the device at more distant objects (perhaps > 0.7m away) it consistently came back with a value of 4.09m (4090, or 0xFFA) which felt like some kind of ‘out of range’ marker, regardless of the fact that the maximum reliable depth is being reported as 65m (which seems a little unlikely! Winking smile).

I can only assume that this is the ‘near’ sensor and its ID seems to be;

“Source#0@\\\\?\\Root#SensorStreamingMiniDriver#0000#{e5323777-f976-4f5b-9b55-b94699c46e44}\\{b27e3887-ad10-4a4e-bfb8-d6765add0e38}”

I know that the HoloLens has 2 depth streams described in the documents here as;

“Two versions of the depth camera data – one for high-frequency (30 FPS) near-depth sensing, commonly used in hand tracking, and the other for lower-frequency (1 FPS) far-depth sensing, currently used by Spatial Mapping”

Now, my device seems to report two depth streams of the same dimensions (448 x 450) and of the same frame rate (15fps) so that doesn’t seem to line up with the docs.

Putting that to one side, my code had been written to simply select whichever sensor matching my search criteria came First() (in the LINQ sense) and to ignore any others.

I switched the code to select the Last() and saw values;

Max Depth in Metres 65.535m
Min Depth in Metres 0.001m
Depth Format is D16
Depth in centre is 1.562m
Max Depth in Metres 65.535m
Min Depth in Metres 0.001m
Depth Format is D16
Depth in centre is 4.09m

and so it now seemed that the device was returning 4.09m (4090) for surfaces that were nearer to it (perhaps < 0.7m away) while it was correctly reporting the more distant surfaces.

I can only assume that this device is the ‘far sensor’ and its ID seems to be;

“Source#2@\\\\?\\Root#SensorStreamingMiniDriver#0000#{e5323777-f976-4f5b-9b55-b94699c46e44}\\{b27e3887-ad10-4a4e-bfb8-d6765add0e38}”

I guess that if you wanted this to work reliably, you might have to take streams from both depth cameras and use whichever gave you a reliable result but, for my purposes, I’m going to go with the ‘far’ sensor rather than the ‘near’ sensor.
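
A sketch of that ‘use whichever sensor gives a sensible answer’ idea might be as simple as preferring whichever raw reading isn’t the 4090 value – assuming (and it is an assumption on my part) that 4090 really is an out-of-range sentinel on both streams;

// Sketch - assumes 4090 (0xFFA) is an 'out of range' sentinel on both depth streams
// and that I have raw readings from the near and far streams for the same point.
const ushort outOfRangeSentinel = 4090;

static ushort? PickReliableDepth(ushort nearValue, ushort farValue)
{
    if (farValue != outOfRangeSentinel)
    {
        return farValue;
    }
    if (nearValue != outOfRangeSentinel)
    {
        return nearValue;
    }
    // Neither stream had a usable reading at this point.
    return null;
}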

One other thing that I’d flag here – I didn’t seem to see 15fps from the depth stream; the rate seemed more like 1fps, which ties up with the Research Mode docs for the ‘long range’ depth sensor, so maybe the API reporting 15fps isn’t right here.
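
A rough way to sanity-check that is to difference the SystemRelativeTime values between consecutive frames inside the FrameArrived handler – a sketch, assuming that SystemRelativeTime is populated on these frames;

// Sketch - measure the effective rate by differencing SystemRelativeTime between
// consecutive depth frames. 'lastTimestamp' needs to live outside the handler so
// that it persists across frames.
TimeSpan? lastTimestamp = null;

// ...then, inside the FrameArrived handler once a frame has been acquired...
if (lastTimestamp.HasValue && frame.SystemRelativeTime.HasValue)
{
    var delta = frame.SystemRelativeTime.Value - lastTimestamp.Value;

    Debug.WriteLine($"Approx {1.0 / delta.TotalSeconds:N1} fps between depth frames");
}
lastTimestamp = frame.SystemRelativeTime;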

Experiment 5 – Tracking Faces

Ironically, the detection of faces feels like “the easy part” of what I’m trying to do here, purely because the UWP already has APIs which track faces for me and so it’s not such a big deal to make use of them.

I can take my code which is already getting hold of video frames (with SoftwareBitmaps) and just try and feed them through a FaceDetector or FaceTracker and it will give me back lists of bounding boxes of the faces that it detects.

The only potential ‘fly in the ointment’ here is that the detection requires bitmaps in specific formats and, while there’s an API for querying which formats are supported, that means that I need to either;

  • Ensure that I ask the media capture APIs to hand me back bitmaps in one of the formats that is supported by the face detection APIs.

or

  • Accept that the media capture APIs might not be able to do that and so gracefully fall back and accept some other format which I then convert on a frame-by-frame basis to one of the ones supported by the face detection APIs (sketched just after this list).
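
That second fallback would probably amount to a per-frame SoftwareBitmap.Convert before handing the frame to the tracker – a sketch of it is below, assuming that Gray8 is one of the formats reported by GetSupportedBitmapPixelFormats on the device and that the conversion from the camera’s native format is actually supported;

// Sketch of the per-frame conversion fallback (which I didn't implement) - assumes
// Gray8 is in the list returned by FaceTracker.GetSupportedBitmapPixelFormats and
// that SoftwareBitmap.Convert can convert from the camera's native format.
using (var converted = SoftwareBitmap.Convert(
    frame.VideoMediaFrame.SoftwareBitmap, BitmapPixelFormat.Gray8))
using (var videoFrame = VideoFrame.CreateWithSoftwareBitmap(converted))
{
    var faces = await tracker.ProcessNextFrameAsync(videoFrame);
}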

The second option is the ‘right’ way to do things but it means writing a bit more code, so I left the conversion for ‘another day’ and modified my GetMediaCaptureForDescriptionAsync method (not repeated here) to take an additional (optional) parameter which lets me narrow down my search for a media source by specifying the set of bitmap formats that I’m prepared to accept;

            var supportedFormats = FaceTracker.GetSupportedBitmapPixelFormats().Select(
                format => format.ToString().ToLower()).ToArray();

            var tracker = await FaceTracker.CreateAsync();

            // We are assuming (!) that we can get frames in a format compatible with the
            // FaceTracker.
            var rgbMedia = await this.GetMediaCaptureForDescriptionAsync(
                MediaFrameSourceKind.Color, 1280, 720, 30,
                supportedFormats);

            var rgbReader = await rgbMedia.capture.CreateFrameReaderAsync(rgbMedia.source);

            rgbReader.AcquisitionMode = MediaFrameReaderAcquisitionMode.Realtime;

            TypedEventHandler<MediaFrameReader, MediaFrameArrivedEventArgs> handler =
                async (s, e) =>
                {
                    using (var frame = s.TryAcquireLatestFrame())
                    {
                        if (frame != null)
                        {
                            using (var videoFrame = frame.VideoMediaFrame.GetVideoFrame())
                            {
                                var faces = await tracker.ProcessNextFrameAsync(videoFrame);

                                foreach (var face in faces)
                                {
                                    Debug.WriteLine($"Face found at {face.FaceBox.X}, {face.FaceBox.Y}");
                                }
                            }
                        }
                    }
                };

            rgbReader.FrameArrived += handler;

            await rgbReader.StartAsync();

            // Wait forever then dispose...
            await Task.Delay(-1);

            rgbReader.Dispose();

            rgbMedia.capture.Dispose();

and that seemed to work quite nicely – I’m getting video frames from the RGB camera and finding faces within them.

It’s worth saying that, in doing this, I came (once again) across the fact that MediaFrameFormat.Subtype contains names for the subtypes taken from this doc page, and matching those up to BitmapPixelFormat values feels like a very imprecise science – the docs even have a warning around these subtypes;

“The string values returned by the MediaEncodingSubtypes properties may not use the same letter casing as AudioEncodingProperties.Subtype, VideoEncodingProperties.Subtype, ContainerEncodingProperties.Subtype, and ImageEncodingProperties.Subtype. For this reason, if you compare the values, you should use a case-insensitive comparison or use hardcoded strings that match the casing returned by the encoding properties.”

Experiment 6 – Turning X,Y co-ordinates in Images into X,Y,Z co-ordinates in (Unity) World Space

This is the part where I get stuck Smile

I did mention before that I read this article about the locatable camera quite a few times because it seemed to be very relevant to what I’m trying to do;

Locatable Camera

and I especially focused on the section entitled;

Pixel to Application-specified Coordinate System

and its promise of being able to convert from pixel co-ordinates back to world co-ordinates using the camera projection matrix which (I think) I have available to me based on Experiment 2 above.

To experiment with this, I wondered whether I could take the 4 pixel points that represent the corners of the RGB image, project them back from the image into camera space and then into world space, and see what they ‘looked like’ by drawing them in Unity at some specified distance.

In doing that, there are perhaps a few things that I’d comment on which may be right/wrong.

  • Getting hold of the SpatialCoordinateSystem that Unity sets up for my holographic app seems to be a matter of calling WorldManager.GetNativeISpatialCoordinateSystemPtr and using Marshal.* methods to get a handle onto the underlying object although I’m unclear whether it’s ok to just hold on to this object indefinitely or not.
  • In transforming back from an X,Y image co-ordinate to a X,Y,Z co-ordinate in world space my approach (following the locatable camera article again) has been to;
    • Translate the X,Y coordinate from the 0-1280, 0-720 range into a –1 to 1, –1 to 1 range.
    • Unproject the vector using the projection transform at a unit distance
    • Multiply the unprojected vector by the inverse of the view transform
    • Multiply that value by the camera to world transform obtained asking the SpatialCoordinateSystem of the RGB frame to provide a transform to the SpatialCoordinateSystem that Unity has set up for the app
    • Multiply the Z co-ordinate by –1.0f as Unity uses a left-handed coordinate system and the holographic UWP APIs are right-handed.

I’m not at all sure that I have this right Smile and I was especially unsure around whether the SpatialCoordinateSystem of a frame would change as the device moved around and/or whether the view transform would change. I used the debugger to verify that the view transform definitely changes as the device moves and hence included the inverse of it in the process above.

My code for this experiment (factored into a Unity script) looked like this (pasted in its entirety);

using UnityEngine.XR.WSA;
using System;
using System.Linq;
using UnityEngine;

#if ENABLE_WINMD_SUPPORT
using Windows.Media.Capture;
using Windows.Media.Capture.Frames;
using Windows.Foundation;
using System.Threading.Tasks;
using Windows.Perception.Spatial;
using System.Runtime.InteropServices;
using uVector3 = UnityEngine.Vector3;
using wVector3 = System.Numerics.Vector3;
using wMatrix4x4 = System.Numerics.Matrix4x4;
#endif // ENABLE_WINMD_SUPPORT

public class Placeholder : MonoBehaviour
{
    // Unity line renderer to draw a box for me - note that I'm expecting this to have
    // Loop set to true so that it closes the box off.
    public LineRenderer lineRenderer;

    void Start()
    {
#if ENABLE_WINMD_SUPPORT

        this.OnLoaded();

#endif // ENABLE_WINMD_SUPPORT
    }

#if ENABLE_WINMD_SUPPORT
    async void OnLoaded()
    {
        var rgbMedia = await this.GetMediaCaptureForDescriptionAsync(
            MediaFrameSourceKind.Color, 1280, 720, 30);

        // These should be the corner points for the RGB image...
        var cornerPoints = new Point[]
        {
            new Point(0,0),
            new Point(1280, 0),
            new Point(1280, 720),
            new Point(0, 720)
        };

        var unityWorldCoordinateSystem =
            Marshal.GetObjectForIUnknown(WorldManager.GetNativeISpatialCoordinateSystemPtr()) as SpatialCoordinateSystem;
        
        var rgbFrameReader = await rgbMedia.Item1.CreateFrameReaderAsync(rgbMedia.Item2);
        
        rgbFrameReader.AcquisitionMode = MediaFrameReaderAcquisitionMode.Realtime;

        TypedEventHandler<MediaFrameReader, MediaFrameArrivedEventArgs> handler =
            (s, e) =>
            {
                using (var frame = s.TryAcquireLatestFrame())
                {
                    if (frame != null)
                    {
                        SpatialCoordinateSystem coordinateSystem = null;
                        wMatrix4x4 projectionTransform = wMatrix4x4.Identity;
                        wMatrix4x4 viewTransform = wMatrix4x4.Identity;
                        wMatrix4x4 invertedViewTransform = wMatrix4x4.Identity;

                        object value;

                        if (frame.Properties.TryGetValue(MFSampleExtension_Spatial_CameraCoordinateSystem, out value))
                        {
                            // I'm not sure that this coordinate system changes per-frame so I could maybe do this once?
                            coordinateSystem = value as SpatialCoordinateSystem;
                        }
                        if (frame.Properties.TryGetValue(MFSampleExtension_Spatial_CameraProjectionTransform, out value))
                        {
                            // I don't think that this transform changes per-frame so I could maybe do this once?
                            projectionTransform = ByteArrayToMatrix(value as byte[]);
                        }
                        if (frame.Properties.TryGetValue(MFSampleExtension_Spatial_CameraViewTransform, out value))
                        {
                            // I think this transform changes per frame.
                            viewTransform = ByteArrayToMatrix(value as byte[]);
                            wMatrix4x4.Invert(viewTransform, out invertedViewTransform);
                        }

                        var cameraToWorldTransform = coordinateSystem.TryGetTransformTo(unityWorldCoordinateSystem);

                        if (cameraToWorldTransform.HasValue)
                        {
                            var transformedPoints = cornerPoints
                                .Select(point => ScalePointMinusOneToOne(point, frame))
                                .Select(point => UnProjectVector(
                                    new wVector3((float)point.X, (float)point.Y, 1.0f), projectionTransform))
                                .Select(point => wVector3.Transform(point, invertedViewTransform))
                                .Select(point => wVector3.Transform(point, cameraToWorldTransform.Value))
                                .ToArray();

                            UnityEngine.WSA.Application.InvokeOnAppThread(
                                () =>
                                {
                                    this.lineRenderer.positionCount = transformedPoints.Length;

                                    // Unity has its Z axis +ve away from the camera, holographic goes the other way.
                                    this.lineRenderer.SetPositions(
                                        transformedPoints.Select(
                                            pt => new uVector3(pt.X, pt.Y, -1.0f * pt.Z)).ToArray());
                                },
                                false);
                        }
                    }
                }
            };

        rgbFrameReader.FrameArrived += handler;

        await rgbFrameReader.StartAsync();

        // Wait forever then dispose...just doing this to keep track of what needs disposing.
        await Task.Delay(-1);

        rgbFrameReader.FrameArrived -= handler;

        Marshal.ReleaseComObject(unityWorldCoordinateSystem);

        rgbFrameReader.Dispose();

        rgbMedia.Item1.Dispose();
    }
    /// <summary>
    /// This code taken fairly literally from this doc
    /// https://docs.microsoft.com/en-us/windows/mixed-reality/locatable-camera#pixel-to-application-specified-coordinate-system
    /// and hopefully without me breaking it as it's not too complex 🙂
    /// </summary>
    static Point ScalePointMinusOneToOne(Point point, MediaFrameReference frameRef)
    {
        var scaledPoint = new Point(
            (2.0f * (float)point.X / frameRef.Format.VideoFormat.Width) - 1.0f,
            (2.0f * (1.0f - (float)point.Y / frameRef.Format.VideoFormat.Height)) - 1.0f);

        return (scaledPoint);
    }
    /// <summary>
    /// This code taken fairly literally from this doc
    /// https://docs.microsoft.com/en-us/windows/mixed-reality/locatable-camera#pixel-to-application-specified-coordinate-system
    /// but if it's got messed up in the translation then that's definitely my fault 🙂
    /// </summary>
    static wVector3 UnProjectVector(wVector3 from, wMatrix4x4 cameraProjection)
    {
        var to = new wVector3(0, 0, 0);

        var axsX = new wVector3(cameraProjection.M11, cameraProjection.M12, cameraProjection.M13);

        var axsY = new wVector3(cameraProjection.M21, cameraProjection.M22, cameraProjection.M23);

        var axsZ = new wVector3(cameraProjection.M31, cameraProjection.M32, cameraProjection.M33);

        to.Z = from.Z / axsZ.Z;
        to.Y = (from.Y - (to.Z * axsY.Z)) / axsY.Y;
        to.X = (from.X - (to.Z * axsX.Z)) / axsX.X;

        return to;
    }
    // Used an explicit tuple here as I'm in C# 6.0
    async Task<Tuple<MediaCapture, MediaFrameSource>> GetMediaCaptureForDescriptionAsync(
        MediaFrameSourceKind sourceKind,
        int width,
        int height,
        int frameRate)
    {
        MediaCapture mediaCapture = null;
        MediaFrameSource frameSource = null;

        var allSources = await MediaFrameSourceGroup.FindAllAsync();

        var sourceInfo =
            allSources.SelectMany(group => group.SourceInfos)
            .FirstOrDefault(
                si =>
                    (si.MediaStreamType == MediaStreamType.VideoRecord) &&
                    (si.SourceKind == sourceKind) &&
                    (si.VideoProfileMediaDescription.Any(
                        desc =>
                            desc.Width == width &&
                            desc.Height == height &&
                            desc.FrameRate == frameRate)));

        if (sourceInfo != null)
        {
            var sourceGroup = sourceInfo.SourceGroup;

            mediaCapture = new MediaCapture();

            await mediaCapture.InitializeAsync(
               new MediaCaptureInitializationSettings()
               {
                   // I want software bitmaps
                   MemoryPreference = MediaCaptureMemoryPreference.Cpu,
                   SourceGroup = sourceGroup,
                   StreamingCaptureMode = StreamingCaptureMode.Video
               }
            );
            frameSource = mediaCapture.FrameSources[sourceInfo.Id];

            var selectedFormat = frameSource.SupportedFormats.First(
                format => format.VideoFormat.Width == width && format.VideoFormat.Height == height &&
                format.FrameRate.Numerator / format.FrameRate.Denominator == frameRate);

            await frameSource.SetFormatAsync(selectedFormat);
        }
        return (Tuple.Create(mediaCapture, frameSource));
    }
    static wMatrix4x4 ByteArrayToMatrix(byte[] bits)
    {
        var matrix = wMatrix4x4.Identity;

        var handle = GCHandle.Alloc(bits, GCHandleType.Pinned);
        matrix = Marshal.PtrToStructure<wMatrix4x4>(handle.AddrOfPinnedObject());
        handle.Free();

        return (matrix);
    }
    static readonly Guid MFSampleExtension_Spatial_CameraCoordinateSystem = new Guid("9D13C82F-2199-4E67-91CD-D1A4181F2534");
    static readonly Guid MFSampleExtension_Spatial_CameraProjectionTransform = new Guid("47F9FCB5-2A02-4F26-A477-792FDF95886A");
    static readonly Guid MFSampleExtension_Spatial_CameraViewTransform = new Guid("4E251FA4-830F-4770-859A-4B8D99AA809B");

#endif // ENABLE_WINMD_SUPPORT
}

and this seemed to work out ok in the sense that I could run this code on my HoloLens and see a painted red line demarcating what ‘felt’ like it might be the right bounds of the camera’s view, and that box appeared to do the right thing as I moved around and rotated the device etc., but I wouldn’t have placed (much) money on it being correct just yet Smile

Experiment 7 – Mapping Between RGB Co-ordinates and Depth Co-ordinates

The last experiment that I wanted to try was to see if I could figure out how to map co-ordinates from the RGB image to the depth image.

On the one hand, this seems like it might be ‘obvious’ in the sense that if I have a pixel at some X,Y in an RGB image [0, 0, 1280, 720] and I have some depth image which is 448×450 then I can just come up with a point which is [X / 1280 * 448, Y / 720 * 450] and use that as the position in the depth image.

However, I don’t know whether the depth image is meant to line up with the RGB image in that way or whether I should use other techniques in trying to map depth image coordinates to/from RGB image coordinates.
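
Just to make that ‘obvious’ mapping concrete (and the rest of this experiment suggests it isn’t actually valid), it would be nothing more than a scale in each dimension;

// The naive scaling from an RGB pixel co-ordinate (1280x720) to a depth pixel
// co-ordinate (448x450) - probably not valid, as discussed below.
static Point NaiveRgbToDepthCoordinate(Point rgbPoint)
{
    return new Point(
        rgbPoint.X / 1280.0 * 448.0,
        rgbPoint.Y / 720.0 * 450.0);
}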

While I was experimenting with this, an additional sample was published around working with ‘research mode’ and so I was able to refer to it;

https://github.com/Microsoft/HoloLensForCV

and, firstly, I found that I had to make a minor modification to its FrameRenderer.cpp code because, around line 348, it hard-codes the depth range to 0.2m–1.0m whereas I want ranges beyond 1.0m.

NB: I think that has now been addressed – see this issue.

With that modification made, I saw the output from the depth camera as below;

[Image: output from the depth camera, showing depth values present only within a circular area of the frame]

which seems to suggest that the depth values that I want aren’t present across the whole frame (448 x 450) of the depth image but look to be, instead, present in a circular area which you can see highlighted above.

That also seems to be the case for the “Long Throw ToF Reflectivity” stream and I can speculate that maybe those sensors focus (for power/performance reasons?) around the centre of the user’s gaze but that’s just speculation – I don’t see it written down anywhere at the time of writing.

Furthermore, that circular area does not seem to line up with the centre of the depth frame. For instance, in the image below;

[Image: sketch overlaying the RGB frame and the depth frame, with my gaze point on the corner of a book-case marked with a green X]

my gaze is on the corner of the book-case marked with a green X, which looks to be fairly centrally located in the image captured by the RGB camera, but the circular area of the depth frame seems biased towards the top of the frame and so I can’t simply assume that I can scale coordinates from the RGB frame to the depth frame and come away with reasonable depth values.

This made my original idea seem a lot less practical than it might have seemed when I first started writing this post because I’d assumed that every RGB camera pixel would have a natural corresponding depth camera pixel and I’m not sure whether that’s going to be the case.

So, perhaps the depth camera is better for working out depths around where the user’s gaze is positioned (which makes sense) and my facial example is then only realistically going to ‘work’ if the user is looking directly at a face.

Additionally, even if I only want to measure the depth value at the centre of the RGB image [640, 360], I can’t assume that this maps to the co-ordinate [224, 225] in the depth image because the depth image seems to incorporate a vertical offset.

Or…maybe I’m just missing quite a lot about how these streams can be tied together? Smile 

I wanted to see what did happen if I brought the pieces together that I had so far and so I tried to put together a Unity script which moves a GameObject (e.g. small sphere) to the centre point of any face that it detects in the RGB stream.

That script is below (it needs factoring out into classes as it’s mostly one large function at the moment);

//#define HUNT_DEPTH_PIXEL_GRID
#define USE_CENTRE_DEPTH_IMAGE
using UnityEngine.XR.WSA;
using System;
using System.Linq;
using UnityEngine;
using System.Threading;

#if ENABLE_WINMD_SUPPORT
using Windows.Media.Capture;
using Windows.Media.Capture.Frames;
using Windows.Foundation;
using System.Threading.Tasks;
using Windows.Perception.Spatial;
using System.Runtime.InteropServices;
using Windows.Media.FaceAnalysis;
using Windows.Graphics.Imaging;
using uVector3 = UnityEngine.Vector3;
using wVector3 = System.Numerics.Vector3;
using wVector4 = System.Numerics.Vector4;
using wMatrix4x4 = System.Numerics.Matrix4x4;

[ComImport]
[Guid("5B0D3235-4DBA-4D44-865E-8F1D0E4FD04D")]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
unsafe interface IMemoryBufferByteAccess
{
    void GetBuffer(out byte* buffer, out uint capacity);
}

#endif // ENABLE_WINMD_SUPPORT

public class Placeholder : MonoBehaviour
{
    // A Unity text mesh that I can print some diagnostics to.
    public TextMesh textMesh;

    // A Unity game object (small sphere e.g.) that I can use to mark the position of one face.
    public GameObject faceMarker;

    void Start()
    {
#if ENABLE_WINMD_SUPPORT

        // Not awaiting this...let it go.
        this.ProcessingLoopAsync();

#endif // ENABLE_WINMD_SUPPORT
    }

#if ENABLE_WINMD_SUPPORT
    /// <summary>
    /// This is just one big lump of code right now which should be factored out into some kind of
    /// 'frame reader' class which can then be subclassed for depth frame and video frame but
    /// it was handy to have it like this while I experimented with it - the intention was
    /// to tidy it up if I could get it doing more or less what I wanted 🙂
    /// </summary>
    async Task ProcessingLoopAsync()
    {
        var depthMediaCapture = await this.GetMediaCaptureForDescriptionAsync(
            MediaFrameSourceKind.Depth, 448, 450, 15);

        var depthFrameReader = await depthMediaCapture.Item1.CreateFrameReaderAsync(depthMediaCapture.Item2);

        depthFrameReader.AcquisitionMode = MediaFrameReaderAcquisitionMode.Realtime;

        MediaFrameReference lastDepthFrame = null;

        long depthFrameCount = 0;
        float centrePointDepthInMetres = 0.0f;

        // Expecting this to run at 1fps although the API (seems to) reports that it runs at 15fps
        TypedEventHandler<MediaFrameReader, MediaFrameArrivedEventArgs> depthFrameHandler =
            (sender, args) =>
            {
                using (var depthFrame = sender.TryAcquireLatestFrame())
                {
                    if ((depthFrame != null) && (depthFrame != lastDepthFrame))
                    {
                        lastDepthFrame = depthFrame;

                        Interlocked.Increment(ref depthFrameCount);

                        // Always try to grab the depth value although, clearly, this is subject
                        // to a bunch of race conditions etc. as other threads access it.
                        centrePointDepthInMetres =
                            GetDepthValueAtCoordinate(depthFrame,
                                (int)(depthFrame.Format.VideoFormat.Width * MAGIC_DEPTH_FRAME_WIDTH_RATIO_CENTRE),
                                (int)(depthFrame.Format.VideoFormat.Height * MAGIC_DEPTH_FRAME_HEIGHT_RATIO_CENTRE)) ?? 0.0f;

                    }
                }
            };

        long rgbProcessedCount = 0;
        long facesPresentCount = 0;
        long rgbDroppedCount = 0;

        MediaFrameReference lastRgbFrame = null;

        var faceBitmapFormats = FaceTracker.GetSupportedBitmapPixelFormats().Select(
            format => format.ToString().ToLower()).ToArray();

        var faceTracker = await FaceTracker.CreateAsync();

        var rgbMediaCapture = await this.GetMediaCaptureForDescriptionAsync(
            MediaFrameSourceKind.Color, 1280, 720, 30, faceBitmapFormats);

        var rgbFrameReader = await rgbMediaCapture.Item1.CreateFrameReaderAsync(rgbMediaCapture.Item2);

        rgbFrameReader.AcquisitionMode = MediaFrameReaderAcquisitionMode.Realtime;

        int busyProcessingRgbFrame = 0;

        var unityWorldCoordinateSystem =
            Marshal.GetObjectForIUnknown(WorldManager.GetNativeISpatialCoordinateSystemPtr()) as SpatialCoordinateSystem;
        
        // Expecting this to run at 30fps.
        TypedEventHandler<MediaFrameReader, MediaFrameArrivedEventArgs> rgbFrameHandler =
           (sender, args) =>
           {
               // Only proceed if we're not already 'busy' - i.e. we're not still processing a previous RGB frame.
               if (Interlocked.CompareExchange(ref busyProcessingRgbFrame, 1, 0) == 0)
               {
                   Task.Run(
                       async () =>
                       {
                           using (var rgbFrame = rgbFrameReader.TryAcquireLatestFrame())
                           {
                               if ((rgbFrame != null) && (rgbFrame != lastRgbFrame))
                               {
                                   ++rgbProcessedCount;

                                   lastRgbFrame = rgbFrame;
                                   var facePosition = uVector3.zero;

                                   using (var videoFrame = rgbFrame.VideoMediaFrame.GetVideoFrame())
                                   {
                                       var faces = await faceTracker.ProcessNextFrameAsync(videoFrame);
                                       var firstFace = faces.FirstOrDefault();

                                       if (firstFace != null)
                                       {
                                           ++facesPresentCount;

                                           // Take the first face and the centre point of that face to try
                                           // and simplify things for my limited brain.
                                           var faceCentrePointInImageCoords =
                                              new Point(
                                                  firstFace.FaceBox.X + (firstFace.FaceBox.Width / 2.0f),
                                                  firstFace.FaceBox.Y + (firstFace.FaceBox.Height / 2.0f));

                                           wMatrix4x4 projectionTransform = wMatrix4x4.Identity;
                                           wMatrix4x4 viewTransform = wMatrix4x4.Identity;
                                           wMatrix4x4 invertedViewTransform = wMatrix4x4.Identity;

                                           var rgbCoordinateSystem = GetRgbFrameProjectionAndCoordinateSystemDetails(
                                               rgbFrame, out projectionTransform, out invertedViewTransform);

                                           // Scale the RGB point (1280x720)
                                           var faceCentrePointUnitScaleRGB = ScalePointMinusOneToOne(faceCentrePointInImageCoords, rgbFrame);

                                           // Unproject the RGB point back at unit depth as per the locatable camera
                                           // document.
                                           var unprojectedFaceCentrePointRGB = UnProjectVector(
                                                  new wVector3(
                                                      (float)faceCentrePointUnitScaleRGB.X,
                                                      (float)faceCentrePointUnitScaleRGB.Y,
                                                      1.0f),
                                                  projectionTransform);

                                           // Transform this back by the inverted view matrix in order to put this into
                                           // the RGB camera coordinate system
                                           var faceCentrePointCameraCoordsRGB =
                                                  wVector3.Transform(unprojectedFaceCentrePointRGB, invertedViewTransform);

                                           // Get the transform from the camera coordinate system to the Unity world
                                           // coordinate system, could probably cache this?
                                           var cameraRGBToWorldTransform =
                                                  rgbCoordinateSystem.TryGetTransformTo(unityWorldCoordinateSystem);

                                           if (cameraRGBToWorldTransform.HasValue)
                                           {
                                               // Transform to world coordinates
                                               var faceCentrePointWorldCoords = wVector4.Transform(
                                                      new wVector4(
                                                          faceCentrePointCameraCoordsRGB.X,
                                                          faceCentrePointCameraCoordsRGB.Y,
                                                          faceCentrePointCameraCoordsRGB.Z, 1),
                                                      cameraRGBToWorldTransform.Value);

                                               // Where's the camera in world coordinates?
                                               var cameraOriginWorldCoords = wVector4.Transform(
                                                      new wVector4(0, 0, 0, 1),
                                                      cameraRGBToWorldTransform.Value);

                                               // Multiply Z by -1 for Unity
                                               var cameraPoint = new uVector3(
                                                    cameraOriginWorldCoords.X,
                                                    cameraOriginWorldCoords.Y,
                                                    -1.0f * cameraOriginWorldCoords.Z);

                                               // Multiply Z by -1 for Unity
                                               var facePoint = new uVector3(
                                                      faceCentrePointWorldCoords.X,
                                                      faceCentrePointWorldCoords.Y,
                                                      -1.0f * faceCentrePointWorldCoords.Z);

                                               facePosition = 
                                                   cameraPoint + 
                                                   (facePoint - cameraPoint).normalized * centrePointDepthInMetres;
                                           }
                                       }
                                   }
                                   if (facePosition != uVector3.zero)
                                   {
                                       UnityEngine.WSA.Application.InvokeOnAppThread(
                                           () =>
                                           {
                                               this.faceMarker.transform.position = facePosition;
                                           },
                                           false
                                        );
                                   }
                               }
                           }
                           Interlocked.Exchange(ref busyProcessingRgbFrame, 0);
                       }
                   );
               }
               else
               {
                   Interlocked.Increment(ref rgbDroppedCount);
               }
               // NB: this is a bit naughty as I am accessing these counters across a few threads so
               // accuracy might suffer here.
               UnityEngine.WSA.Application.InvokeOnAppThread(
                   () =>
                   {
                       this.textMesh.text =
                           $"{depthFrameCount} depth,{rgbProcessedCount} rgb done, {rgbDroppedCount} rgb drop," +
                           $"{facesPresentCount} faces, ({centrePointDepthInMetres:N2})";
                   },
                   false);
           };

        depthFrameReader.FrameArrived += depthFrameHandler;
        rgbFrameReader.FrameArrived += rgbFrameHandler;

        await depthFrameReader.StartAsync();
        await rgbFrameReader.StartAsync();

        // Wait forever then dispose...just doing this to keep track of what needs disposing.
        await Task.Delay(-1);

        depthFrameReader.FrameArrived -= depthFrameHandler;
        rgbFrameReader.FrameArrived -= rgbFrameHandler;

        rgbFrameReader.Dispose();
        depthFrameReader.Dispose();

        rgbMediaCapture.Item1.Dispose();
        depthMediaCapture.Item1.Dispose();

        Marshal.ReleaseComObject(unityWorldCoordinateSystem);
    }


    static SpatialCoordinateSystem GetRgbFrameProjectionAndCoordinateSystemDetails(
        MediaFrameReference rgbFrame,
        out wMatrix4x4 projectionTransform,
        out wMatrix4x4 invertedViewTransform)
    {
        SpatialCoordinateSystem rgbCoordinateSystem = null;
        wMatrix4x4 viewTransform = wMatrix4x4.Identity;
        projectionTransform = wMatrix4x4.Identity;
        invertedViewTransform = wMatrix4x4.Identity;

        object value;

        if (rgbFrame.Properties.TryGetValue(MFSampleExtension_Spatial_CameraCoordinateSystem, out value))
        {
            // I'm not sure that this coordinate system changes per-frame so I could maybe do this once?
            rgbCoordinateSystem = value as SpatialCoordinateSystem;
        }
        if (rgbFrame.Properties.TryGetValue(MFSampleExtension_Spatial_CameraProjectionTransform, out value))
        {
            // I don't think that this transform changes per-frame so I could maybe do this once?
            projectionTransform = ByteArrayToMatrix(value as byte[]);
        }
        if (rgbFrame.Properties.TryGetValue(MFSampleExtension_Spatial_CameraViewTransform, out value))
        {
            // I think this transform changes per frame.
            viewTransform = ByteArrayToMatrix(value as byte[]);
            wMatrix4x4.Invert(viewTransform, out invertedViewTransform);
        }
        return (rgbCoordinateSystem);
    }
    /// <summary>
    /// Not using this right now as I don't *know* how to scale an RGB point to a depth point
    /// given that the depth frame seems to have a central 'hot spot' that's circular.
    /// </summary>
    static Point ScaleRgbPointToDepthPoint(Point rgbPoint, MediaFrameReference rgbFrame,
        MediaFrameReference depthFrame)
    {
        return (new Point(
            rgbPoint.X / rgbFrame.Format.VideoFormat.Width * depthFrame.Format.VideoFormat.Width,
            rgbPoint.Y / rgbFrame.Format.VideoFormat.Height * depthFrame.Format.VideoFormat.Height));
    }
    /// <summary>
    /// This code taken fairly literally from this doc
    /// https://docs.microsoft.com/en-us/windows/mixed-reality/locatable-camera#pixel-to-application-specified-coordinate-system
    /// and hopefully without me breaking it too badly.
    /// </summary>
    static Point ScalePointMinusOneToOne(Point point, MediaFrameReference frameRef)
    {
        var scaledPoint = new Point(
            (2.0f * (float)point.X / frameRef.Format.VideoFormat.Width) - 1.0f,
            (2.0f * (1.0f - (float)point.Y / frameRef.Format.VideoFormat.Height)) - 1.0f);

        return (scaledPoint);
    }
    /// <summary>
    /// This code taken fairly literally from this doc
    /// https://docs.microsoft.com/en-us/windows/mixed-reality/locatable-camera#pixel-to-application-specified-coordinate-system
    /// but if it's got messed up in the translation then that's definitely my fault 🙂
    /// </summary>
    static wVector3 UnProjectVector(wVector3 from, wMatrix4x4 cameraProjection)
    {
        var to = new wVector3(0, 0, 0);

        var axsX = new wVector3(cameraProjection.M11, cameraProjection.M12, cameraProjection.M13);

        var axsY = new wVector3(cameraProjection.M21, cameraProjection.M22, cameraProjection.M23);

        var axsZ = new wVector3(cameraProjection.M31, cameraProjection.M32, cameraProjection.M33);

        to.Z = from.Z / axsZ.Z;
        to.Y = (from.Y - (to.Z * axsY.Z)) / axsY.Y;
        to.X = (from.X - (to.Z * axsX.Z)) / axsX.X;

        return to;
    }
    unsafe static float? GetDepthValueAtCoordinate(MediaFrameReference frame, int x, int y)
    {
        float? depthValue = null;

        var bitmap = frame.VideoMediaFrame.SoftwareBitmap;

        using (var buffer = bitmap.LockBuffer(BitmapBufferAccessMode.Read))
        using (var reference = buffer.CreateReference())
        {
            var description = buffer.GetPlaneDescription(0);

            byte* pBits;
            uint size;
            var byteAccess = reference as IMemoryBufferByteAccess;

            byteAccess.GetBuffer(out pBits, out size);

            // Try the pixel value itself and see if we get anything there.
            depthValue = GetDepthValueFromBufferAtXY(
                pBits, x, y, description, (float)frame.VideoMediaFrame.DepthMediaFrame.DepthFormat.DepthScaleInMeters);

#if HUNT_DEPTH_PIXEL_GRID
            if (depthValue == null)
            {
                // If we don't have a value, look for one in the surrounding space (the sub-function copes
                // with us using bad values of x,y).
                var minDistance = double.MaxValue;

                for (int i = 0 - DEPTH_SEARCH_GRID_SIZE; i < DEPTH_SEARCH_GRID_SIZE; i++)
                {
                    for (int j = 0 - DEPTH_SEARCH_GRID_SIZE; j < DEPTH_SEARCH_GRID_SIZE; j++)
                    {
                        var newX = x + i;
                        var newY = y + j;

                        var testValue = GetDepthValueFromBufferAtXY(
                            pBits,
                            newX,
                            newY,
                            description,
                            (float)frame.VideoMediaFrame.DepthMediaFrame.DepthFormat.DepthScaleInMeters);

                        if (testValue != null)
                        {
                            var distance =
                                Math.Sqrt(Math.Pow(newX - x, 2.0) + Math.Pow(newY - y, 2.0));

                            if (distance < minDistance)
                            {
                                depthValue = testValue;
                                minDistance = distance;
                            }
                        }
                    }
                }
            }
#endif // HUNT_DEPTH_PIXEL_GRID
        }
        return (depthValue);
    }
    unsafe static float? GetDepthValueFromBufferAtXY(byte* pBits, int x, int y, BitmapPlaneDescription desc,
        float scaleInMeters)
    {
        float? depthValue = null;

        var bytesPerPixel = desc.Stride / desc.Width;
        Debug.Assert(bytesPerPixel == Marshal.SizeOf<UInt16>());

        int offset = (desc.StartIndex + desc.Stride * y) + (x * bytesPerPixel);

        // NB: bounds-check the offset against the size of the buffer (height * stride).
        if ((offset >= 0) && (offset < (desc.Height * desc.Stride)))
        {
            depthValue = *(UInt16*)(pBits + offset) * scaleInMeters;

            if (!IsValidDepthDistance((float)depthValue))
            {
                depthValue = null;
            }
        }
        return (depthValue);
    }
    static bool IsValidDepthDistance(float depthDistance)
    {
        // Discard depth values outside of the 0.5m - 4.0m range; values above 4m
        // (4.09?) seem to come back from the sensor when it hasn't really got a value.
        return ((depthDistance > 0.5f) && (depthDistance <= 4.0f));
    }
    // Used an explicit tuple here as I'm in C# 6.0
    async Task<Tuple<MediaCapture, MediaFrameSource>> GetMediaCaptureForDescriptionAsync(
        MediaFrameSourceKind sourceKind,
        int width,
        int height,
        int frameRate,
        string[] bitmapFormats = null)
    {
        MediaCapture mediaCapture = null;
        MediaFrameSource frameSource = null;

        var allSources = await MediaFrameSourceGroup.FindAllAsync();

        // Ignore frame rate here on the description as both depth streams seem to tell me they are
        // 30fps whereas I don't think they are (from the docs) so I leave that to query later on.
        // NB: LastOrDefault here is a NASTY, NASTY hack - just my way of getting hold of the 
        // *LAST* depth stream rather than the *FIRST* because I'm assuming that the *LAST*
        // one is the longer distance stream rather than the short distance stream.
        // I should fix this and find a better way of choosing the right depth stream rather
        // than relying on some ordering that's not likely to always work!
        var sourceInfo =
            allSources.SelectMany(group => group.SourceInfos)
            .LastOrDefault(
                si =>
                    (si.MediaStreamType == MediaStreamType.VideoRecord) &&
                    (si.SourceKind == sourceKind) &&
                    (si.VideoProfileMediaDescription.Any(
                        desc =>
                            desc.Width == width &&
                            desc.Height == height &&
                            desc.FrameRate == frameRate)));

        if (sourceInfo != null)
        {
            var sourceGroup = sourceInfo.SourceGroup;

            mediaCapture = new MediaCapture();

            await mediaCapture.InitializeAsync(
               new MediaCaptureInitializationSettings()
               {
                   // I want software bitmaps
                   MemoryPreference = MediaCaptureMemoryPreference.Cpu,
                   SourceGroup = sourceGroup,
                   StreamingCaptureMode = StreamingCaptureMode.Video
               }
            );
            frameSource = mediaCapture.FrameSources[sourceInfo.Id];

            var selectedFormat = frameSource.SupportedFormats.First(
                format =>
                    format.VideoFormat.Width == width && format.VideoFormat.Height == height &&
                    format.FrameRate.Numerator / format.FrameRate.Denominator == frameRate &&
                    ((bitmapFormats == null) || (bitmapFormats.Contains(format.Subtype.ToLower()))));

            await frameSource.SetFormatAsync(selectedFormat);
        }
        return (Tuple.Create(mediaCapture, frameSource));
    }
    static wMatrix4x4 ByteArrayToMatrix(byte[] bits)
    {
        var matrix = wMatrix4x4.Identity;

        var handle = GCHandle.Alloc(bits, GCHandleType.Pinned);
        matrix = Marshal.PtrToStructure<wMatrix4x4>(handle.AddrOfPinnedObject());
        handle.Free();

        return (matrix);
    }
#if HUNT_DEPTH_PIXEL_GRID

    static readonly int DEPTH_SEARCH_GRID_SIZE = 32;

#endif // HUNT_DEPTH_PIXEL_GRID

    static readonly float MAGIC_DEPTH_FRAME_HEIGHT_RATIO_CENTRE = 0.25f;
    static readonly float MAGIC_DEPTH_FRAME_WIDTH_RATIO_CENTRE = 0.5f;
    static readonly Guid MFSampleExtension_Spatial_CameraCoordinateSystem = new Guid("9D13C82F-2199-4E67-91CD-D1A4181F2534");
    static readonly Guid MFSampleExtension_Spatial_CameraProjectionTransform = new Guid("47F9FCB5-2A02-4F26-A477-792FDF95886A");
    static readonly Guid MFSampleExtension_Spatial_CameraViewTransform = new Guid("4E251FA4-830F-4770-859A-4B8D99AA809B");

#endif // ENABLE_WINMD_SUPPORT
}

and it produces a sort of bouncing ball which (ideally) hovers around faces, as shown in the screen capture below – which makes it look rather better at finding faces than it actually is Winking smile

Sketch

with some on-screen diagnostics trying to show how many;

  • depth frames we have seen
  • RGB frames we have seen
  • RGB frames we have ignored because we were still processing the previous frame
  • frames we have seen which contained faces

along with the current depth value obtained from the ‘centre point’ of the camera which you’ll notice in the code is hard-coded to be a point 25% down the frame and 50% across – that’s just a ‘best guess’ right now rather than anything ‘scientific’.
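
For reference, here’s a rough sketch of how that centre-point sampling looks in code – it assumes a depthFrame variable of type MediaFrameReference inside the depth frame handler (the full handler isn’t reproduced above) and uses the ‘magic’ ratio constants and the GetDepthValueAtCoordinate function from the listing;

// Sample the depth frame at 50% across, 25% down - the 'best guess' centre point.
var depthX = (int)(depthFrame.Format.VideoFormat.Width * MAGIC_DEPTH_FRAME_WIDTH_RATIO_CENTRE);
var depthY = (int)(depthFrame.Format.VideoFormat.Height * MAGIC_DEPTH_FRAME_HEIGHT_RATIO_CENTRE);

var depthValue = GetDepthValueAtCoordinate(depthFrame, depthX, depthY);

if (depthValue.HasValue)
{
    centrePointDepthInMetres = depthValue.Value;
}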

Wrapping Up the Experiments for Now

I clearly need to spend some more time experimenting here as I haven’t quite got to the result that I wanted, but I learned quite a lot along the way even if my results might be a little flawed.

Through this post, I’ve been questioning my initial assumption that using the depth frames for estimating the Z-coordinate of a face was a good route to take.

Maybe that’s not right? Given that I get a long-range depth frame at 1fps and given that the depth data seems to be concentrated in one area of that frame, perhaps it doesn’t make sense to try to use the depth frame in this way to identify the Z-coordinate of a face (or other object in space). Maybe it’s better to go via the regular route of using the spatial mesh which the device builds so well after all?

I need to try a few more things out Smile

Code?

I haven’t published separate pieces of code for all of the experiments above but the Unity project that I have which brought some of them together in the last experiment is in this repo;

https://github.com/mtaulty/FacialDepthExperiments

Note that if you take the code and build it from Unity then you’ll need to mark the C# project assembly as allowing unsafe code before you’ll be able to get Visual Studio to build it – I can’t find a setting in Unity that seems to allow unsafe code in the separate C# project assembly rather than the main executable itself.
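
For reference, switching that on in Visual Studio amounts to the generated C# project file picking up a property along these lines – just a sketch of the relevant property rather than a complete project file;

<PropertyGroup>
  <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
</PropertyGroup>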

Note also that you will need to manually edit the .appxmanifest file to add the restricted perceptionSensorsExperimental capability, as I wrote up in a previous post, because there is no way (as far as I know) to set this in either the Unity or Visual Studio editors.
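
That manifest edit looks something like the snippet below – a minimal sketch rather than a complete manifest;

<!-- On the Package element -->
xmlns:rescap="http://schemas.microsoft.com/appx/manifest/foundation/windows10/restrictedcapabilities"

<!-- Inside the Capabilities element -->
<rescap:Capability Name="perceptionSensorsExperimental" />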

Lastly, apply a pinch of salt – I’m just experimenting here Smile

Rough Notes on UWP and webRTC (Part 4–Adding some Unity and a little HoloLens)

Following up on my previous post, I wanted to take the very basic test code that I’d got working ‘reasonably’ on UWP on my desktop PC and see if I could move it to HoloLens running inside of a Unity application.

The intention would be to preserve the very limited functionality that I have which goes something like;

  • The app runs up, is given the details of the signalling service (from the PeerCC sample) to connect to and it then connects to it
  • The app finds a peer on the signalling service and tries to get a two-way audio/video call going with that peer displaying local/remote video and capturing local audio while playing remote audio.

That’s what I currently have in the signalling branch here and the previous blog post was about abstracting some of that out such that I could use it in a different environment like Unity.

Now it’s time to see if that works out…

Getting Some Early Bits to Work With

In order to even think about this I needed to try and pick up a version of UWP webRTC that works in that environment and which has some extra pieces to help me out. As far as I know, at the time of writing, that involves picking up bits that are mentioned in this particular issue over on github by the UWP webRTC team;

Expose “raw” MediaCapture Object #11

and there are instructions in that post around how to get hold of some bits;

Instructions for Getting Bits

and so I followed those instructions and built the code from that branch of that repo.

From there, I’ve been working with my colleague Pete to put together some of those pieces with the pieces that I already had from the previous blog posts.

First, a quick look around the bits that the repo gives us…

Exploring the WebRtcUnity PeerCC Sample Solution

As is often the case, this process looks like it is going to involve standing on the shoulders of some other giants because there’s already code in the UWP webRTC repo that I pointed to above which shows how to put this type of app together.

The code in question is surfaced through this solution in the repo;

image

Inside of that solution, there’s a project which builds out the equivalent of the original XAML+MediaElement PeerCC sample but in a modified way which doesn’t have to use MediaElement to render, and that shift in the code is represented by its additional Unity dependencies;

image

This confused me for a little while – I was wondering why this XAML-based application suddenly had a big dependency on Unity. Then I realised that, to show that media can be rendered by Unity, the original sample code has been modified such that (dependent on the conditional compilation constant UNITY) this app can render media streams either;

  1. Using MediaElement as it did previously
  2. Using Unity rendering pieces which are then hosted inside of a SwapChainPanel inside of the XAML UI.

Now, I’ve failed to get this sample to run on my machine which I think is down to the versions of Unity that I’m running and so I had to go through a process of picking through the code a little ‘cold’ but in so far as I can see what goes on here is that there are a couple of subprojects involved in making this work…

The Org.WebRtc.Uwp Project

This project was already present in the original XAML-based solution and in my mind this is involved with wrapping some C++/CX code around the webrtc.lib library in order to bring types into a UWP environment. I haven’t done a delta to try and see how much/little is different in this branch of this project over the original sample so there may be differences.

image

The MediaEngineUWP and WebRtcScheme Projects

Then there’s 2 projects within the Unity sample’s MediaEngine folder which I don’t think were present in the original purely XAML-based PeerCC sample;

image

The MediaEngineUWP and WebRtcScheme projects build out DLLs which seem to take on a couple of roles. I’m more than willing to admit that I don’t have this all worked out in my head at the time of writing, but I think they are about bridging between the Unity platform, the Windows Media platform and webRTC, and I think they do this by;

  • The existing work in the Org.WebRtc.Uwp project which integrates webRTC pieces into the Windows UWP media pipeline. I think this is done by adding a webRTC VideoSinkInterface which then surfaces the webRTC pieces as the UWP IMediaSource and IMediaStreamSource types.
  • The MediaEngineUWP.dll having an export UnityPluginLoad function which grabs an IUnityGraphics and offers a number of other exports that can be called via PInvoke from Unity to set up the textures for local/remote video rendering of video frames in Unity by code inside of this DLL.
    • There’s a class in this project named MediaEnginePlayer which is instanced per video stream and which seems to do the work of grabbing frames from the incoming Windows media pipeline and transferring them into Unity textures.
    • The same class looks to use the IMFMediaEngineNotify callback interface to be notified of state changes for the media stream and responds by playing/stopping etc.

The wiring together of this MediaEnginePlayer into the media pipeline is a little opaque to me but I think that it follows what is documented here and under the topic Source Resolver here. This seems to involve the code associating a URL (of form webrtc:GUID) with each IMediaStream and having an activatable class which the media pipeline then invokes with the URL to be linked up to the right instance of the player.

That may be a ‘much less than perfect’ description of what goes on in these projects as I haven’t stepped through all of that code.

What I think it does mean, though, is that the code inside of the WebRtcScheme project requires the .appxmanifest of an app that consumes it to include a section that looks like;

 <Extensions>
    <Extension Category="windows.activatableClass.inProcessServer">
      <InProcessServer>
        <Path>WebRtcScheme.dll</Path>
        <ActivatableClass ActivatableClassId="WebRtcScheme.SchemeHandler" ThreadingModel="both" />
      </InProcessServer>
    </Extension>
  </Extensions>

I don’t know of a way of setting this up inside of a Unity project so I ended up just letting Unity build the Visual Studio solution and then I manually hack the manifest to include this section

Exploring the Video Control Solution

I looked into another project within that github repo which is a Unity project contained within this folder;

image

There’s a Unity scene which has a (UI) Canvas and a couple of Unity Raw Image objects which can be used to render to;

image

and a Control script which is set up to PInvoke into the MediaEngineUWP to pass the pieces from the Unity environment into the DLL. That script looks like this;

using System;
using System.Runtime.InteropServices;
using UnityEngine;
using UnityEngine.UI;

#if !UNITY_EDITOR
using Org.WebRtc;
using Windows.Media.Core;
#endif

public class ControlScript : MonoBehaviour
{
    public uint LocalTextureWidth = 160;
    public uint LocalTextureHeight = 120;
    public uint RemoteTextureWidth = 640;
    public uint RemoteTextureHeight = 480;
    
    public RawImage LocalVideoImage;
    public RawImage RemoteVideoImage;

    void Awake()
    {
    }

    void Start()
    {
    }

    private void OnInitialized()
    {
    }

    private void OnEnable()
    {
    }

    private void OnDisable()
    {
    }

    void Update()
    {
    }

    public void CreateLocalMediaStreamSource(object track, string type, string id)
    {
        Plugin.CreateLocalMediaPlayback();
        IntPtr nativeTex = IntPtr.Zero;
        Plugin.GetLocalPrimaryTexture(LocalTextureWidth, LocalTextureHeight, out nativeTex);
        var primaryPlaybackTexture = Texture2D.CreateExternalTexture((int)LocalTextureWidth, (int)LocalTextureHeight, TextureFormat.BGRA32, false, false, nativeTex);
        LocalVideoImage.texture = primaryPlaybackTexture;
#if !UNITY_EDITOR
        MediaVideoTrack videoTrack = (MediaVideoTrack)track;
        var source = Media.CreateMedia().CreateMediaStreamSource(videoTrack, type, id);
        Plugin.LoadLocalMediaStreamSource((MediaStreamSource)source);
        Plugin.LocalPlay();
#endif
    }

    public void DestroyLocalMediaStreamSource()
    {
        LocalVideoImage.texture = null;
        Plugin.ReleaseLocalMediaPlayback();
    }

    public void CreateRemoteMediaStreamSource(object track, string type, string id)
    {
        Plugin.CreateRemoteMediaPlayback();
        IntPtr nativeTex = IntPtr.Zero;
        Plugin.GetRemotePrimaryTexture(RemoteTextureWidth, RemoteTextureHeight, out nativeTex);
        var primaryPlaybackTexture = Texture2D.CreateExternalTexture((int)RemoteTextureWidth, (int)RemoteTextureHeight, TextureFormat.BGRA32, false, false, nativeTex);
        RemoteVideoImage.texture = primaryPlaybackTexture;
#if !UNITY_EDITOR
        MediaVideoTrack videoTrack = (MediaVideoTrack)track;
        var source = Media.CreateMedia().CreateMediaStreamSource(videoTrack, type, id);
        Plugin.LoadRemoteMediaStreamSource((MediaStreamSource)source);
        Plugin.RemotePlay();
#endif
    }

    public void DestroyRemoteMediaStreamSource()
    {
        RemoteVideoImage.texture = null;
        Plugin.ReleaseRemoteMediaPlayback();
    }

    private static class Plugin
    {
        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "CreateLocalMediaPlayback")]
        internal static extern void CreateLocalMediaPlayback();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "CreateRemoteMediaPlayback")]
        internal static extern void CreateRemoteMediaPlayback();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "ReleaseLocalMediaPlayback")]
        internal static extern void ReleaseLocalMediaPlayback();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "ReleaseRemoteMediaPlayback")]
        internal static extern void ReleaseRemoteMediaPlayback();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "GetLocalPrimaryTexture")]
        internal static extern void GetLocalPrimaryTexture(UInt32 width, UInt32 height, out System.IntPtr playbackTexture);

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "GetRemotePrimaryTexture")]
        internal static extern void GetRemotePrimaryTexture(UInt32 width, UInt32 height, out System.IntPtr playbackTexture);

#if !UNITY_EDITOR
        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "LoadLocalMediaStreamSource")]
        internal static extern void LoadLocalMediaStreamSource(MediaStreamSource IMediaSourceHandler);

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "LoadRemoteMediaStreamSource")]
        internal static extern void LoadRemoteMediaStreamSource(MediaStreamSource IMediaSourceHandler);
#endif

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "LocalPlay")]
        internal static extern void LocalPlay();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "RemotePlay")]
        internal static extern void RemotePlay();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "LocalPause")]
        internal static extern void LocalPause();

        [DllImport("MediaEngineUWP", CallingConvention = CallingConvention.StdCall, EntryPoint = "RemotePause")]
        internal static extern void RemotePause();
    }
}

and so it’s essentially giving me the pieces that I need to wire up local/remote media streams coming from webRTC into the pieces that can render them in Unity.

It feels like these projects provide the pieces needed to plug together with my basic library project in order to rebuild the app that I had in the previous blog post and have it run inside of a 3D Unity app rather than a 2D XAML app…

Plugging Together the Pieces

Pete put together a regular Unity project targeting UWP for HoloLens and in the scene at the moment we have only 2 quads that we try to render the local and remote video to.

image

and then there’s an empty GameObject named Control with a script on it configured as below;

image

and you can see that this configuration is being used to do a couple of things;

  • Set up the properties that my conversation library code from the previous blog post needed to try and start a conversation over webRTC
    • The signalling server IP address, port number, whether to initiate a conversation or not and, if so, whether there’s a particular peer name to initiate that conversation with.
  • Set up some properties that will facilitate rendering of the video into the materials texturing the 2 quads in the scene.
    • Widths, heights to use.
    • The GameObjects that we want to render our video streams to.

Pete re-worked the original sample code to render to a texture of a material applied to a quad rather than the original rendering to a 2D RawImage.

Now, it’s fairly easy to then add my conversation library into this Unity project so that we can make use of that code. We simply drop it into the Assets of the project and configure up the appropriate build settings for Unity;

image

and also drop in the MediaEngineUWP, Org.WebRtc.dll and WebRtcScheme.dlls;

image

and the job then becomes one of adapting the code that I wrote in the previous blog post to suit the Unity environment which means being able to implement the IMediaManager interface that I came up with for Unity rather than for XAML.
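
As a reminder of the shape of that IMediaManager interface – this is reconstructed from the implementation that follows rather than copied from my library project, so treat it as an approximation;

public interface IMediaManager
{
    Media Media { get; }
    MediaStream UserMedia { get; }
    MediaVideoTrack RemoteVideoTrack { get; set; }

    Task CreateAsync(bool audioEnabled = true, bool videoEnabled = true);
    Task AddLocalStreamAsync(MediaStream stream);
    Task AddRemoteStreamAsync(MediaStream stream);
    void RemoveLocalStream();
    void RemoveRemoteStream();
    void Shutdown();
}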

How to go about that? Firstly, we took those PInvoke signatures from the VideoControlSample and put them into a separate static class named Plugin.

Secondly, we implemented that IMediaManager interface on top of the pieces that originated in the sample;

#if ENABLE_WINMD_SUPPORT

using ConversationLibrary.Interfaces;
using ConversationLibrary.Utility;
using Org.WebRtc;
using System;
using System.Linq;
using System.Threading.Tasks;
using UnityEngine;
using UnityEngine.WSA;
using Windows.Media.Core;

public class MediaManager : IMediaManager
{
    // This constructor will be used by the cheap IoC container
    public MediaManager()
    {
        this.textureDetails = CheapContainer.Resolve<ITextureDetailsProvider>();
    }
    // The idea is that this constructor would be used by a real IoC container.
    public MediaManager(ITextureDetailsProvider textureDetails)
    {
        this.textureDetails = textureDetails;
    }
    public Media Media => this.media;

    public MediaStream UserMedia => this.userMedia;

    public MediaVideoTrack RemoteVideoTrack { get => remoteVideoTrack; set => remoteVideoTrack = value; }

    public async Task AddLocalStreamAsync(MediaStream stream)
    {
        var track = stream?.GetVideoTracks()?.FirstOrDefault();

        if (track != null)
        {
            // TODO: stop hardcoding I420?.
            this.InvokeOnUnityMainThread(
                () => this.CreateLocalMediaStreamSource(track, LOCAL_VIDEO_FRAME_FORMAT, "SELF"));
        }
    }

    public async Task AddRemoteStreamAsync(MediaStream stream)
    {
        var track = stream?.GetVideoTracks()?.FirstOrDefault();

        if (track != null)
        {
            // TODO: stop hardcoding I420?.
            this.InvokeOnUnityMainThread(
                () => this.CreateRemoteMediaStreamSource(track, REMOTE_VIDEO_FRAME_FORMAT, "PEER"));
        }
    }
    void InvokeOnUnityMainThread(AppCallbackItem callback)
    {
        UnityEngine.WSA.Application.InvokeOnAppThread(callback,false);
    }
    void InvokeOnUnityUIThread(AppCallbackItem callback)
    {
        UnityEngine.WSA.Application.InvokeOnUIThread(callback, false);
    }
    public async Task CreateAsync(bool audioEnabled = true, bool videoEnabled = true)
    {
        this.media = Media.CreateMedia();

        // TODO: for the moment, turning audio off as I get an access violation in
        // some piece of code that'll take some debugging.
        RTCMediaStreamConstraints constraints = new RTCMediaStreamConstraints()
        {
            // TODO: switch audio back on, fix the crash.
            audioEnabled = false,
            videoEnabled = true
        };
        this.userMedia = await media.GetUserMedia(constraints);
    }

    public void RemoveLocalStream()
    {
        // TODO: is this ever getting called?
        this.InvokeOnUnityMainThread(
            () => this.DestroyLocalMediaStreamSource());
    }

    public void RemoveRemoteStream()
    {
        this.DestroyRemoteMediaStreamSource();
    }

    public void Shutdown()
    {
        if (this.media != null)
        {
            if (this.localVideoTrack != null)
            {
                this.localVideoTrack.Dispose();
                this.localVideoTrack = null;
            }
            if (this.RemoteVideoTrack != null)
            {
                this.RemoteVideoTrack.Dispose();
                this.RemoteVideoTrack = null;
            }
            this.userMedia = null;
            this.media.Dispose();
            this.media = null;
        }
    }
    void CreateLocalMediaStreamSource(object track, string type, string id)
    {
        Plugin.CreateLocalMediaPlayback();
        IntPtr playbackTexture = IntPtr.Zero;
        Plugin.GetLocalPrimaryTexture(
            this.textureDetails.Details.LocalTextureWidth, 
            this.textureDetails.Details.LocalTextureHeight, 
            out playbackTexture);

        this.textureDetails.Details.LocalTexture.GetComponent<Renderer>().sharedMaterial.mainTexture = 
            (Texture)Texture2D.CreateExternalTexture(
                (int)this.textureDetails.Details.LocalTextureWidth, 
                (int)this.textureDetails.Details.LocalTextureHeight, 
                (TextureFormat)14, false, false, playbackTexture);

#if ENABLE_WINMD_SUPPORT
        Plugin.LoadLocalMediaStreamSource(
            (MediaStreamSource)Org.WebRtc.Media.CreateMedia().CreateMediaStreamSource((MediaVideoTrack)track, type, id));
#endif
        Plugin.LocalPlay();
    }

    void DestroyLocalMediaStreamSource()
    {
        this.textureDetails.Details.LocalTexture.GetComponent<Renderer>().sharedMaterial.mainTexture = null;
        Plugin.ReleaseLocalMediaPlayback();
    }

    void CreateRemoteMediaStreamSource(object track, string type, string id)
    {
        Plugin.CreateRemoteMediaPlayback();

        IntPtr playbackTexture = IntPtr.Zero;

        Plugin.GetRemotePrimaryTexture(
            this.textureDetails.Details.RemoteTextureWidth, 
            this.textureDetails.Details.RemoteTextureHeight, 
            out playbackTexture);

        // NB: creating textures and calling GetComponent<> has thread affinity for Unity
        // in so far as I can tell.
        var texture = (Texture)Texture2D.CreateExternalTexture(
           (int)this.textureDetails.Details.RemoteTextureWidth,
           (int)this.textureDetails.Details.RemoteTextureHeight,
           (TextureFormat)14, false, false, playbackTexture);

        this.textureDetails.Details.RemoteTexture.GetComponent<Renderer>().sharedMaterial.mainTexture = texture;

#if ENABLE_WINMD_SUPPORT
        Plugin.LoadRemoteMediaStreamSource(
            (MediaStreamSource)Org.WebRtc.Media.CreateMedia().CreateMediaStreamSource((MediaVideoTrack)track, type, id));
#endif
        Plugin.RemotePlay();
    }

    void DestroyRemoteMediaStreamSource()
    {
        this.textureDetails.Details.RemoteTexture.GetComponent<Renderer>().sharedMaterial.mainTexture = null;
        Plugin.ReleaseRemoteMediaPlayback();
    }
    Media media;
    MediaStream userMedia;
    MediaVideoTrack remoteVideoTrack;
    MediaVideoTrack localVideoTrack;
    ITextureDetailsProvider textureDetails;

    // TODO: temporary hard coding...
    static readonly string LOCAL_VIDEO_FRAME_FORMAT = "I420";
    static readonly string REMOTE_VIDEO_FRAME_FORMAT = "H264";
}
#endif

Naturally, this is very “rough” code right now and there’s some hard-coding going on in there but it didn’t take too much effort to plug these pieces under that interface that I’d brought across from my original, minimal XAML-based project.

So…with all of that said…

Does It Work?

Sort of Smile Firstly, you might notice in the code above that audio is hard-coded to be switched off because we currently have a crash if we switch audio on – it’s some release of some smart pointer in the webRTC pieces that we haven’t yet tracked down.

Minus audio, it’s possible to run the Unity app here on HoloLens and have it connect via the sample-provided signalling service to the original XAML-based PeerCC sample running (e.g.) on my Surface Book, with video streams flowing and visible in both directions.

Here’s a screenshot of that “in action” from the point of view of the desktop app receiving video stream from HoloLens;

image

and that screenshot displays 4 things;

  • Bottom right is the local PC’s video stream off its webcam – me wearing a HoloLens.
  • Upper left 75% is the remote stream coming from the webcam on the HoloLens including its holographic content which currently includes;
    • Upper left mid section is the remote video stream from the PC replayed on the HoloLens.
    • Upper right mid section is the local HoloLens video stream replayed on the HoloLens which looked to disappear when I was taking this screenshot.

You might see some numbers in there that suggest 30fps but I think that was a temporary thing – at the time of writing the performance is fairly poor and we’ve not yet had a look at what’s going on there; this ‘play’ sample needs some more investigation.

Where’s the Code?

If you’re interested in following these experiments along as we go forward then the code is in a different location to the previous repo as it’s over here on Pete’s github account;

https://github.com/peted70/web-rtc-test

Feel free to give feedback but, of course, apply the massive caveat that this is very rough experimentation at the moment – there’s a long way to go Smile

Experiments with Shared Holograms and Azure Blob Storage/UDP Multicasting (Part 7)

A follow-up to my previous post around experiments with shared holograms using Azure blob storage and UDP multicasting techniques.

At the end of the previous post, I said that I might return and make a slightly better ‘test scene’ for the Unity project; this post is my write-up of my attempt to do that.

What’s in the New Test Scene?

I found a model of a house on Remix3D.com;

image

and I made the test scene about visualising that model in a consistent place on multiple devices with the ability to rotate, scale and move it such that the multiple devices keep a consistent view.

What I built is pretty simple and the essential steps involved in the scene are;

  • The app runs and waits for the underlying library to tell it whether there are already other devices on the same network or not. During this period, it displays a ‘waiting screen’ for up to 5 seconds if it doesn’t receive notification that there are other devices on the network.

20180110_130146_HoloLens

  • If the app determines that no-other devices are on the network then it pops up a model of a house gaze-locked to the device so that the user can potentially move it around and say ‘done’ to place it.

20180110_125124_HoloLens

  • Once positioned, the app replaces the model displayed by using the APIs detailed in the previous posts to create a shared hologram which is exactly the same as the house and in the same position etc. At this point, its creation will be multicast around the network and the blob representing its world anchor will be uploaded to Azure.
  • If the app determines that there are other devices on the network at start-up time then it will inform the user of this;

20180110_125554_HoloLens

  • and it will stop the user from positioning the model while waiting to bring the position data (world anchor) from Azure. The same thing should happen in the race condition where multiple users start the app at the same time and then one of them becomes the first to actually position the model.

20180110_125733_HoloLens

  • Once the model has been positioned on the local device (in whichever way) it enters into a mode which allows for voice commands to be used to enter ‘rotate’, ‘scale’ and ‘move’ modes to move it around;

20180110_125155_HoloLens

  • those transformations are then multicast to other devices on the network such that they all display the same model of a house in the same place.

and that’s pretty much it Smile

How’s the Test Scene Structured?

I already had a test scene within the Unity project that I’d published to github and so I just altered it rather than starting from scratch.

It’s very simple – the scene starts with the main camera parenting both a text object (to give a very poor Heads-Up-Display) and the model of the house (to give a very poor gaze-locked positioning system) as below;

image

there is then one object called ScriptHolder which has an instance of the Shared Hologram Controller component (and its dependency) that I discussed in the previous posts;

image

I’ve ommitted the details of my own Azure configuration so that would need to be filled in to specify the storage details and I’ve also told the script that I want to synchronise transforms on a fairly high frequency which, realistically, I think I could drop down a little.

Beyond that, I also have a script here called Main Script which contains the logic for the scene – the positive part being that there’s not too much of it;

using SharedHolograms;
using System;
using System.Linq;
using UnityEngine;
using UnityEngine.Windows.Speech;

public class MainScript : MonoBehaviour, ICreateGameObjects
{
    // Text to display output messages on
    public TextMesh StatusDisplayTextMesh;

    // GameObject to use as a marker to position the model (i.e. the house)
    public GameObject PositionalModel;

    // Implementation of ICreateGameObject - because we are not creating a Unity primitive
    // I've implemented this here and 'plugged it in' but our creation is very simple in
    // that we duplicate the object that we're using as the PositionalModel (i.e. the
    // house in my version).
    public void CreateGameObject(string gameObjectSpecifier, Action<GameObject> callback)
    {
        // Right now, we know how to create one type of thing and we do it in the most
        // obvious way but we could do it any which way we like and even get some other
        // componentry to do it for us.
        if (gameObjectSpecifier == "house")
        {
            var gameObject = GameObject.Instantiate(this.PositionalModel);
            gameObject.SetActive(true);
            callback(gameObject);
        }
        else
        {
            // Sorry, only know about "house" right now.
            callback(null);
        }
    }
    void Start()
    {
        // Set up our keyword handling. Originally, I imagined more than one keyword but
        // we ended up just with "Done" here.
        var keywords = new[]
        {
            new { Keyword = "done", Handler = (Action)this.OnDoneKeyword }
        };
        this.keywordRecognizer = new KeywordRecognizer(keywords.Select(k => k.Keyword).ToArray());

        this.keywordRecognizer.OnPhraseRecognized += (e) =>
        {
            var understood = false;

            if ((e.confidence == ConfidenceLevel.High) ||
                (e.confidence == ConfidenceLevel.Medium))
            {
                var handler = keywords.FirstOrDefault(k => k.Keyword == e.text.ToLower());

                if (handler != null)
                {
                    handler.Handler();
                    understood = true;
                }
            }
            if (!understood)
            {
                this.SetStatusDisplayText("I might have missed what you said...");
            }
        };
        // We need to know when various things happen with the shared holograms controller.
        SharedHologramsController.Instance.SceneReady += OnSceneReady;
        SharedHologramsController.Instance.Creator.BusyStatusChanged += OnBusyStatusChanged;
        SharedHologramsController.Instance.Creator.HologramCreatedRemotely += OnRemoteHologramCreated;
        SharedHologramsController.Instance.Creator.GameObjectCreator = this;

        // Wait to see whether we should make the positional model active or not.
        this.PositionalModel.SetActive(false);
        this.SetStatusDisplayText("waiting...");
    }
    void OnDoneKeyword()
    {
        if (!this.busy)
        {
            this.keywordRecognizer.Stop();

            this.SetStatusDisplayText("working, please wait...");

            if (this.PositionalModel.activeInHierarchy)
            {
                // Get rid of the placeholder.
                this.PositionalModel.SetActive(false);

                // Create the shared hologram in the same place as the placeholder.
                SharedHologramsController.Instance.Creator.Create(
                    "house",
                    this.PositionalModel.transform.position,
                    this.PositionalModel.transform.forward,
                    Vector3.one,
                    gameObject =>
                    {
                        this.SetStatusDisplayText("object created and shared");
                        this.houseGameObject = gameObject;
                        this.AddManipulations();
                    }
                );
            }
        }
    }
    void OnBusyStatusChanged(object sender, BusyStatusChangedEventArgs e)
    {
        this.busy = e.Busy;

        if (e.Busy)
        {
            this.SetStatusDisplayText("working, please wait...");
        }
    }
    void OnSceneReady(object sender, SceneReadyEventArgs e)
    {
        // Are there other devices around or are we starting alone?
        if (e.Status == SceneReadyStatus.OtherDevicesInScene)
        {
            this.SetStatusDisplayText("detected other devices, requesting sync...");
        }
        else
        {
            this.SetStatusDisplayText("detected no other devices...");

            // We need this user to position the model so switch it on
            this.PositionalModel.SetActive(true);
            this.SetStatusDisplayText("walk to position the house then say 'done'");

            // Wait for the 'done' keyword.
            this.keywordRecognizer.Start();
        }
    }
    void OnRemoteHologramCreated(object sender, HologramEventArgs e)
    {
        // Someone has beaten this user to positioning the model
        // turn off the model.
        this.PositionalModel.SetActive(false);

        this.SetStatusDisplayText("sync'd...");

        // Stop waiting for the 'done' keyword (if we are)
        this.keywordRecognizer.Stop();

        this.houseGameObject = GameObject.Find(e.ObjectId.ToString());

        // Make sure we can manipulate what the other user has placed.
        this.AddManipulations();
    }
    void AddManipulations()
    {
        this.SetStatusDisplayText("say 'move', 'rotate' or 'scale'");

        // The Manipulations script contains a keyword recognizer for 'move', 'rotate', 'scale'
        // and some basic logic to wire those to hand manipulations
        this.houseGameObject.AddComponent<Manipulations>();
    }
    void SetStatusDisplayText(string text)
    {
        if (this.StatusDisplayTextMesh != null)
        {
            this.StatusDisplayTextMesh.text = text;
        }
    }
    KeywordRecognizer keywordRecognizer;
    GameObject houseGameObject;
    bool busy;
}

If someone (anyone! please! please! Winking smile) had been following the previous set of blog posts closely, they might have noticed that in order to write that code I had to change my existing code to at least;

  • Fire an event when the device joins the network such that code can be notified of whether the messaging layer has seen other devices on the network or not.
  • Fire events when other devices on the network create/delete holograms causing them to be imported and created by the local device.
  • Fire an event as/when the underlying code is ‘busy’ doing some downloading or uploading or similar.

Having tried to implement this scene, it was immediately obvious that these pieces were needed but it hadn’t been obvious enough beforehand for me to implement them, so that was a useful output of writing this test scene.
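
For reference, the rough shape of those notifications as they’re consumed in the scene above looks something like the below – this is reconstructed from the usage in MainScript rather than lifted from the library code, so the exact names and types may differ slightly;

using System;

public enum SceneReadyStatus
{
    OtherDevicesInScene,
    // The name of the 'alone on the network' value is a guess here.
    NoOtherDevicesInScene
}
public class SceneReadyEventArgs : EventArgs
{
    public SceneReadyStatus Status { get; set; }
}
public class BusyStatusChangedEventArgs : EventArgs
{
    public bool Busy { get; set; }
}
public class HologramEventArgs : EventArgs
{
    // MainScript calls ToString() on this to find the created GameObject by name.
    public Guid ObjectId { get; set; }
}

// SceneReady is exposed by SharedHologramsController; BusyStatusChanged and
// HologramCreatedRemotely by its Creator, roughly as;
//
//   public event EventHandler<SceneReadyEventArgs> SceneReady;
//   public event EventHandler<BusyStatusChangedEventArgs> BusyStatusChanged;
//   public event EventHandler<HologramEventArgs> HologramCreatedRemotely;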

The other thing that’s used in the scene is a MonoBehaviour named Manipulations. This is a version of a script that I’ve used in a few places in the past and it’s a very cheap and cheerful way to provide rotate/scale/move behaviour on a focused object in response to voice commands and hand manipulations.

I placed this script and the other script that is specific to the test scene in the ‘Scene Specific’ folder;

image

and the Manipulations script has a dependency on the 3 materials in the Resources folder that it uses for drawing different coloured boxed around an object while it is being rotated/scaled/moved;

image

and that’s pretty much it.

One thing that I’d note is that when I’d used this Manipulations scripts before it was always in projects that were making use of the Mixed Reality Toolkit for Unity and, consequently, I had written the code to depend on some items of the toolkit – specifically around the IManipulationHandler interface and the IInputClickHandler interface.

I don’t currently have any use of the toolkit in this test project and it felt like massive overkill to add it just to enable this one script and so I reworked the script to move it away from having a dependency on the toolkit and I was very pleased to find that this was only a small piece of work – i.e. the toolkit had mostly done a bit of wrapping on the raw Unity APIs and so it wasn’t difficult to unpick that dependency here.

Wrapping Up

I don’t intend to write any more posts in this mini-series around using Azure blob storage and UDP multicasting to enable shared holograms, I think I’ve perhaps gone far enough Smile

The code is all up on github should anyone want to explore it, try it, take some pieces for their own means.

I’m always open to feedback so feel free to do that if you want to drop me a line and be aware that I’ve only tested this code in a limited way as I wrote it all on a single HoloLens device using the (supplied) test programs to simulate responses from a second device but I’m ‘reasonably’ happy that it’s doing sensible things.