NB: The usual blog disclaimer for this site applies to posts around HoloLens. I am not on the HoloLens team. I have no details on HoloLens or Azure Mixed Reality other than what is on the public web and so what I post here is just from my own experience experimenting with pieces that are publicly available and you should always check out the official developer site for the product documentation.
This is purely experimental but after writing yesterday’s post I was thinking about how this notion of scene understanding can potentially do all kinds of things for your application and so I wanted to experiment a little more.
Returning to ‘Spatial Understanding’
As an example – one of the earliest pieces of ‘magic’ that I saw on the original HoloLens device was in the application ‘Fragments’ where there are scenes that make use of the furniture within a room. For example, in the scene below the application ‘magically’ seats a character on a real-world sofa or chair;

Now, a few points about this;
- The ‘magic’ part here is that the user of the application does not have to tell the application that there is a sofa or anything like that. The application works it out on its own.
- ‘Working this out’ means operating on the spatial mesh from the HoloLens to determine ‘surfaces’ or ‘planes’.
- That sort of functionality (‘spatial understanding‘) was open sourced in a library and many developers used it in their applications.
But, most importantly, the library in question ran in software on the device & required a ‘scanning’ phase where you first showed it the space that you were going to work in. Any time you see this library in use, you’ll see a ‘setup phase’ to bootstrap the library’s algorithms.
So, while the device had an innate ability to understand the space and create a mesh from it, the ability to work at the higher level abstraction of ‘surfaces’, ‘planes’, ‘walls’ etc. was left to the application developer and this library was often used to do the heavy lifting.
From the user’s point of view, this scan can take a minute or two of walking around a space, with the app giving enough direction to ensure that enough of the space has been captured for the application to operate.
But, nonetheless, that ‘magic’ moment is very much there when you realise that the device has figured out your space to the extent that it can put holograms onto your walls, floor and maybe table & chairs 🙂
Scene Understanding
Going back to the ‘Scene Understanding SDK’ that I experimented with in yesterday’s post, things are different.
As the docs clearly state, there’s a ‘scene understanding runtime’ already on the device;

and there is a cost to invoking this and having it try to turn the spatial mesh into ‘scene objects’;
The process of converting the raw sensor data into a Scene is a potentially expensive operation that could take seconds for medium spaces (~10x10m) to minutes for very large spaces (~50x50m) and therefore it is not something that is being computed by the device without application request
A fundamental difference to me is the statement that;
On the left hand side is a diagram of the mixed reality runtime which is always on and running in its own process. This runtime is responsible for performing device tracking, surface reconstruction, and other operations that Scene Understanding uses to understand and reason about the world around you
So, there’s a runtime which is always running and which has access to the sensor data and so there is no need to have the user bootstrap these scene understanding algorithms by artificially asking them to map out the room as they would have done with the older ‘spatial understanding’ mechanism.
Thus, this understanding of space in terms of ‘surfaces’, ‘planes’, ‘walls’, ‘platforms’ becomes an innate ability of the device & the developer consumes the data rather than having to rely on their own or 3rd party algorithms.
Having said that, it’s worth considering that the device can’t reason about things that it hasn’t ever seen – e.g. if you take a device and turn it on for the first time in a room where all surfaces are 10m away then the device isn’t going to be able to do much about finding walls and tables until it has taken a look around.
That’s an extreme case though – for regular use, you’d imagine that where the user has entered a room, put on a device and then got to the point of running an app it’s likely that the device has already seen a reasonable portion of that room.
The Largest Table…
I wanted to try out a little of this ‘magic’ for myself and the simple scenario that I conjured up was the idea that the device could initially place a piece of content for the user in the centre of the largest table in the room.
I’ve seen lots of Mixed Reality applications where a piece of holographic content (e.g. interactive map, architectural model, car engine model, etc) is being viewed in a business setting like a meeting room and the first step in that application is to;
Place the content!
and it would be ‘nice’ if the application could perhaps make a highly educated guess by automatically placing the content on the big, obvious meeting room table that’s in the centre of the room;

Naturally, an application would get additional bonus points for allowing the user to re-position the content if they didn’t want it in the middle of the table and an application could get double-bonus-points for anchoring the content for stability and super-double-bonus-points if that anchor was persisted such that the application could remember where the content was placed for next time around (perhaps x-device using Azure Spatial Anchors).
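As an aside on those ‘bonus points’, a bare-bones sketch of anchoring and persisting placed content on the device might look something like the code below using Unity’s (legacy) WorldAnchor and WorldAnchorStore types. This isn’t in my project – the class and the anchor id are just made up for illustration and you’d only want to anchor once the user was happy with the placement (and you’d remove the anchor again before letting them move it);
using UnityEngine;
using UnityEngine.XR.WSA;
using UnityEngine.XR.WSA.Persistence;

// Hypothetical sketch - anchor a placed hologram and persist that anchor so that
// it can be reloaded on the next run. The class and ANCHOR_ID are made up here.
public class SimpleAnchorPersistence : MonoBehaviour
{
    const string ANCHOR_ID = "placedContent";

    void Start()
    {
        WorldAnchorStore.GetAsync(store =>
        {
            // Try to restore a previously persisted anchor first...
            var anchor = store.Load(ANCHOR_ID, this.gameObject);

            if (anchor == null)
            {
                // ...otherwise anchor the object where it currently is and save that.
                anchor = this.gameObject.AddComponent<WorldAnchor>();
                store.Save(ANCHOR_ID, anchor);
            }
        });
    }
}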
So, what would it look like to sketch out the basics of placing a hologram in the centre of the largest table of the room? I went back to my project from yesterday to add another scene and try it out.
Adding Another Scene
I made a quick branch, added another scene to my project and renamed the original scene so that anyone who looked at these 2 blog posts in the future might have a clue what I was doing;

I’m not worried about multiple scenes here as I’m only ever going to build one scene at a time, so I then moved to that second scene, added in the MixedRealityToolkit and fed it the DefaultHoloLens2ConfigurationProfile for now;

I wanted some ‘content’ so I went off and found a 3D model of an office from the soon-to-be-closing-down remix3D, brought it into Paint3D, saved without a canvas as FBX and imported it into Unity where I scaled it, applied the legacy material option & then made a prefab out of it at the origin as below;

I then applied the toolkit scripts ManipulationHandler, NearInteractionGrabbable, BoundingBox to my prefab instance;

and gave that a quick try out in the editor to ensure that I could translate and rotate the office block (I could). I didn’t try to tune any settings here; this is all defaults.

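As a side note, those same components could be added from code rather than in the designer. The (hypothetical) snippet below should be roughly equivalent to what I clicked together in the editor, using the MRTK’s default settings and assuming the object has a collider for the near interaction pieces to work against;
using Microsoft.MixedReality.Toolkit.Input;
using Microsoft.MixedReality.Toolkit.UI;
using UnityEngine;

// Hypothetical alternative to configuring the prefab in the editor - add the MRTK
// manipulation pieces to an object at runtime with their default settings.
public static class ManipulationSetup
{
    public static void MakeManipulable(GameObject gameObject)
    {
        // Near 'grab' interactions and the bounding box need a collider on the object.
        if (gameObject.GetComponent<Collider>() == null)
        {
            gameObject.AddComponent<BoxCollider>();
        }
        gameObject.AddComponent<NearInteractionGrabbable>();
        gameObject.AddComponent<ManipulationHandler>();
        gameObject.AddComponent<BoundingBox>();
    }
}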
Ok, I’ve got a scene with an office block positioned at the origin and I can then move that office block. How about positioning it at the middle of the largest table in the room by default?
I have no idea how to do that…but I don’t see why that should get in the way of experimenting…
Adding a Simple ‘Largest Platform’ Behaviour
I cooked up a quick MonoBehaviour and added it to my office block object as below;

Naturally, this could offer a tonne of configurable parameters rather than just being ‘largest platform’ but I’m just experimenting here. That behaviour looks like this;
using Microsoft.MixedReality.Toolkit.Utilities;
using UnityEngine;

public class LargePlatformPositioningBehaviour : MonoBehaviour
{
    // Naturally, this could have parameters for things like;
    // 1) the type of object to look for (wall, platform, etc)
    // 2) the search radius
    // 3) UI to display while positioning
    // 4) UI to display if positioning can't be done
    // 5) the minimum size of the object to look for
    // etc. etc.
    // All I've done so far is to assume no UI and 'position on the largest platform'
    // and that implementation is sketchy.
    async void Update()
    {
        if (!this.positionAttempted)
        {
            this.positionAttempted = true;

#if ENABLE_WINMD_SUPPORT
            var canCompute = await SceneUnderstandingHelper.CanComputeAsync();

            if (canCompute)
            {
                var parent = await SceneUnderstandingHelper.ParentGameObjectOnLargestPlatformAsync(this.gameObject);

                // Not yet sure whether I should be checking the orientation of the platform
                // that we have found here and then rotating based upon it but, for the moment
                // I'm going to say that this model should face the user and should not be rotated
                // around x,z so that it (hopefully) sits flat on the platform in question.
                var lookPos = CameraCache.Main.transform.position;
                lookPos.y = this.gameObject.transform.position.y;
                this.gameObject.transform.LookAt(lookPos);
            }
#endif
        }
    }
    bool positionAttempted = false;
}
and that then relies on a SceneUnderstandingHelper class which I wrote as below to bring together a few of the fragments of code that I had in the previous blog post;
using NumericsConversion;
using System;
using System.Threading.Tasks;
using UnityEngine;

#if ENABLE_WINMD_SUPPORT
using Microsoft.MixedReality.SceneUnderstanding;

internal static class SceneUnderstandingHelper
{
    internal async static Task<bool> CanComputeAsync()
    {
        if (!canCompute.HasValue)
        {
            canCompute = SceneObserver.IsSupported();

            if ((bool)canCompute)
            {
                var access = await SceneObserver.RequestAccessAsync();
                canCompute = access == SceneObserverAccessStatus.Allowed;
            }
        }
        return ((bool)canCompute);
    }
    internal async static Task<GameObject> ParentGameObjectOnLargestPlatformAsync(GameObject gameObject,
        float searchRadius = 3.0f)
    {
        GameObject parent = null;

        var querySettings = new SceneQuerySettings()
        {
            EnableWorldMesh = false,
            EnableSceneObjectQuads = true,
            EnableSceneObjectMeshes = false,
            EnableOnlyObservedSceneObjects = false
        };

        var scene = await SceneObserver.ComputeAsync(querySettings, searchRadius);

        if (scene != null)
        {
            // Note - we are taking the position of the 'largest' (by area) scene object
            // of type platform here by looking at the quads that make it up.
            // We might need to, instead, query those quads & find their centre position
            // via the FindCentermostPlacement() method and then somehow coalesce those
            // positions to come up with a position instead. i.e. not sure this is at all
            // the 'right' thing to do in terms of coming up with a position.
            var largestPlatform = scene.LargestSceneObjectOfType(SceneObjectKind.Platform);

            if (largestPlatform != null)
            {
                // Where would this platform be in Unity space?
                var unityTransform = scene.GetUnityTransform();

                if (unityTransform.HasValue)
                {
                    parent = new GameObject();
                    parent.transform.SetPositionAndRotation(
                        unityTransform.Value.GetColumn(3), unityTransform.Value.rotation);

                    gameObject.transform.SetParent(parent.transform, false);
                    gameObject.transform.localPosition = largestPlatform.Position.ToUnity();
                    gameObject.transform.localRotation = largestPlatform.Orientation.ToUnity();
                }
            }
        }
        return (parent);
    }
    static bool? canCompute = null;
}
#endif
and, in turn, that relies on a few extension methods that I added to some of the Scene Understanding SDK classes;
using System.Collections.Generic;
using UnityEngine;
using System.Linq;
using System.Runtime.InteropServices;
using NumericsConversion;

#if ENABLE_WINMD_SUPPORT
using Microsoft.MixedReality.SceneUnderstanding;
using Windows.Perception.Spatial.Preview;
using Windows.Perception.Spatial;
using UnityEngine.XR.WSA;

internal static class SceneUnderstandingExtensions
{
    internal static IEnumerable<SceneObject> SceneObjectsOfType(this Scene scene, SceneObjectKind kind)
    {
        return (scene.SceneObjects.Where(so => so.Kind == kind));
    }
    internal static float Area(this SceneQuad sceneQuad)
    {
        return (sceneQuad.Extents.X * sceneQuad.Extents.Y);
    }
    internal static float QuadArea(this SceneObject sceneObject)
    {
        return (sceneObject.Quads.Sum(q => q.Area()));
    }
    internal static SceneObject LargestSceneObjectOfType(this Scene scene, SceneObjectKind kind)
    {
        var objectsOfKind = scene.SceneObjectsOfType(kind);

        // MaxBy is what I want really.
        // See https://stackoverflow.com/questions/1101841/how-to-perform-max-on-a-property-of-all-objects-in-a-collection-and-return-th
        return (objectsOfKind.OrderByDescending(s => s.QuadArea()).FirstOrDefault());
    }
    internal static Matrix4x4? GetUnityTransform(this Scene scene)
    {
        Matrix4x4? transform = null;

        var sceneCoordSystem = SpatialGraphInteropPreview.CreateCoordinateSystemForNode(scene.OriginSpatialGraphNodeId);

        var unityCoordSystem =
            (SpatialCoordinateSystem)System.Runtime.InteropServices.Marshal.GetObjectForIUnknown(
                WorldManager.GetNativeISpatialCoordinateSystemPtr());

        var unityTransform = sceneCoordSystem.TryGetTransformTo(unityCoordSystem);

        if (unityTransform.HasValue)
        {
            transform = unityTransform.Value.ToUnity();
        }

        // TODO: Am I supposed to Release() this or not?
        Marshal.ReleaseComObject(unityCoordSystem);

        return (transform);
    }
}
#endif
and that’s pretty much it.
Trying it out in a Small Space
If I try this on a device then, as you’d expect, what initially happens is that the application runs up and loads the model of the office at the origin (0,0,0) – i.e. right on top of my head which is never pleasant 🙂 but then it quickly jumps over to place itself on the largest platform in my room.
Here’s a screenshot from how that looks in my small home office;

It’s kind of a small piece of ‘magic’ 🙂 but it’s far quicker to have the device position this content here in a ‘sensible place’ and then make minor adjustments than have the content just appear (e.g.) 2m in front of my head and then have to adjust it from there.
It still needs a lot of consideration though – e.g. the app will position the model here on this desk even if I’m looking in the opposite direction so there’d need to be some kind of wayfinding UX to tell me where the model had just gone (“it’s behind you!”) and, equally, UX to handle the scenario where the model can’t be found.
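I haven’t built any of that UX but the basic ‘is it behind you?’ check is simple enough to sketch. Something like the (entirely hypothetical) behaviour below could be used to decide whether to show some kind of directional indicator pointing at the content;
using Microsoft.MixedReality.Toolkit.Utilities;
using UnityEngine;

// Hypothetical sketch - decide whether the placed content is outside of the user's
// rough field of view so that some 'it's behind you!' indicator could be shown.
public class BehindYouIndicator : MonoBehaviour
{
    // The content that was auto-placed - set in the editor or from code.
    public Transform targetContent;

    // Anything more than ~75 degrees off the gaze direction counts as 'not visible'.
    const float visibleAngle = 75.0f;

    void Update()
    {
        if (this.targetContent != null)
        {
            var cameraTransform = CameraCache.Main.transform;
            var toContent = this.targetContent.position - cameraTransform.position;
            var angle = Vector3.Angle(cameraTransform.forward, toContent);

            var contentIsVisible = angle < visibleAngle;

            // Placeholder for showing/hiding and rotating a real indicator.
            Debug.Log(contentIsVisible ? "Content is roughly in view" : "Content is out of view");
        }
    }
}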
It’s perhaps also worth noting that my code is currently running some very basic (and probably wrong) tests to find the ‘largest platform’ and then assuming that it’s a table. It might not be, and perhaps some more tests (e.g. size, height, etc.) could be added to try to determine whether it’s likely a table that we’re choosing.
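As a sketch of what that might look like (not code from my project, with thresholds plucked out of thin air), the extension below re-uses the SceneObjectsOfType() and QuadArea() helpers from earlier to filter platforms down to those at a plausible table height and size before picking the largest. It also assumes that Y is ‘up’ in the scene’s coordinate space;
#if ENABLE_WINMD_SUPPORT
using System.Linq;
using Microsoft.MixedReality.SceneUnderstanding;

internal static class TableGuessingExtensions
{
    // Entirely made-up thresholds - a 'table' is assumed to be a platform that sits
    // somewhere between knee and chest height above the floor and is at least half
    // a square metre in area. Also assumes Y is 'up' in the scene's space.
    const float minHeightAboveFloor = 0.4f;
    const float maxHeightAboveFloor = 1.2f;
    const float minArea = 0.5f;

    internal static SceneObject LargestLikelyTable(this Scene scene)
    {
        // Use the lowest floor object we can find as a baseline for 'height'.
        var floor = scene.SceneObjectsOfType(SceneObjectKind.Floor)
            .OrderBy(f => f.Position.Y)
            .FirstOrDefault();

        var candidates = scene.SceneObjectsOfType(SceneObjectKind.Platform)
            .Where(p => p.QuadArea() >= minArea);

        if (floor != null)
        {
            candidates = candidates.Where(
                p => (p.Position.Y - floor.Position.Y) >= minHeightAboveFloor &&
                     (p.Position.Y - floor.Position.Y) <= maxHeightAboveFloor);
        }
        return (candidates.OrderByDescending(p => p.QuadArea()).FirstOrDefault());
    }
}
#endif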
A Form of Spatial Anchor?
At the moment, my positioning logic involves trying to find the ‘largest’ SceneObject of type ‘platform’ within a radius (hard-coded to 3.0m).
In doing that, I try to determine ‘largest’ by summing the areas of the SceneQuad objects that make up each SceneObject of type ‘platform’ and then choosing the biggest.
Once I’ve found that SceneObject I take its position and that’s where I place my hologram, ignoring the orientation that the SceneQuad objects within the SceneObject report. That might not be the right thing to do but I don’t take too much account of the orientation of the SceneObject itself either because I have code which treats it as a flat table-top and rotates the hologram to face the user with no rotation around X, Z.
But, if I simply took that transform returned by the SceneObject, could I use it as a form of anchor that would be stable across application invocations?
Purely from experimentation, that does seem to be the case and it could mean that I could write code which stores hologram positions relative to the position of that ‘anchor’ and have the application be able to restore those holograms just like a native spatial anchor or an Azure Spatial Anchor.
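To make ‘relative to the position of that anchor’ a little more concrete, the sketch below (again, not code from my project) captures a hologram’s pose in the platform transform’s local space and re-applies it against whatever transform is computed for that platform on a later run;
using UnityEngine;

// Hypothetical sketch - store and restore a hologram's pose relative to the
// transform of the 'largest platform' so that it can survive across runs.
public static class PlatformRelativePose
{
    // Capture the hologram's pose in the platform's local space.
    public static Matrix4x4 Capture(Transform platform, Transform hologram)
    {
        return (platform.worldToLocalMatrix * hologram.localToWorldMatrix);
    }

    // Re-apply a previously captured relative pose against the platform transform
    // computed on a later run (or after the table has moved).
    public static void Apply(Matrix4x4 relativePose, Transform platform, Transform hologram)
    {
        var world = platform.localToWorldMatrix * relativePose;

        hologram.SetPositionAndRotation(
            world.GetColumn(3), world.rotation);
    }
}
That captured matrix could then be serialised into the app’s local storage and re-applied the next time the ‘largest platform’ has been located.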
What would be interesting to me in that case would be to see what happens if I moved the table in question because I’d kind of expect the device to still see it as the ‘largest platform’ and so be able to find it again & potentially still restore the holograms relative to the new position of the table.
Whether that’s likely useful as an experience for a user is, of course, another question 🙂
The other question I’d have is whether this form of anchor could be stable across different devices – i.e. if 2 devices look for the ‘largest platform in the room’ and then swap its transform between them, can they use that to establish a common coordinate system & offer a shared holographic experience based on it? I’ve not even experimented with this, it’s just a thought right now and I’ve no code to test it out in any way.
Speaking of ‘code’…
Code
As per the previous project, the code for this is all here on github and it’s all experimental so please apply a pinch of salt to what you see and, of course, always refer to the official documentation for the definitive view on these things.