For a while, I’ve had my eye on this documentation page which explains the ‘Scene Understanding SDK’ for HoloLens 2.
While a HoloLens device builds a rich understanding of the space around it as a 3D mesh which the developer can query, sometimes you’d prefer a higher-level abstraction.
For example, I’ve seen many applications where the first step is to locate and place some content (map, building, car, etc.) onto a floor, a table or a wall. Having access to the 3D mesh is certainly essential to this type of functionality but there’s likely additional processing to be done on that mesh before an application can reason about ‘surfaces’ rather than just polygons.
This is where ‘Scene Understanding’ comes in, unlocking more capabilities of the device so that you can reason in terms of ‘scene objects’ which, at the time of writing, include (as the docs say!) objects of type;
Background, Wall, Floor, Ceiling, Platform, World and Unknown
Interestingly, working at this level doesn’t lose you the lower-level detail in that it’s possible to use the APIs in the SDK to dig below the ‘scene objects’ into the quads/meshes that they are composed of.
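To give a flavour of that, here’s a rough sketch of my own (i.e. not code from this post’s project, so treat it with caution) of how the raw data inside one of those SceneMesh objects might be copied out into a Unity Mesh. Note that the query settings featuring later in this post would need EnableSceneObjectMeshes set to true for any meshes to come back at all;

#if ENABLE_WINMD_SUPPORT
using Microsoft.MixedReality.SceneUnderstanding;
using UnityEngine;

static class SceneMeshExtensions
{
    // Sketch only: copy a Scene Understanding SceneMesh into a Unity Mesh.
    public static Mesh ToUnityMesh(this SceneMesh sceneMesh)
    {
        // Copy out the raw vertex positions & triangle indices.
        var positions = new System.Numerics.Vector3[sceneMesh.VertexCount];
        var indices = new uint[sceneMesh.TriangleIndexCount];
        sceneMesh.GetVertexPositions(positions);
        sceneMesh.GetTriangleIndices(indices);

        var mesh = new Mesh();

        // Spatial meshes can easily exceed the default 16-bit index limit.
        mesh.indexFormat = UnityEngine.Rendering.IndexFormat.UInt32;

        // The SDK's data is right-handed, Unity's is left-handed, so flip Z
        // and reverse the winding order - something to verify on a device.
        mesh.vertices = System.Array.ConvertAll(
            positions, p => new Vector3(p.X, p.Y, -p.Z));

        var triangles = System.Array.ConvertAll(indices, i => (int)i);
        System.Array.Reverse(triangles);
        mesh.triangles = triangles;

        mesh.RecalculateNormals();
        return mesh;
    }
}
#endif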
Having read the documentation a couple of times, I wanted to make a simple Unity application to try things out and so I brought the Mixed Reality Toolkit for Unity into a blank Unity project.
In that project, I simply set up the Toolkit as per the docs and then cloned a few of the input system profiles so as to set up some custom speech commands which correspond to a subset of the ‘scene object’ types as you can see below;

I added a ‘marker’ prefab to show the X,Y,Z axes and dropped one at the origin of my scene so that it will sit right on the user’s head when they run the app (not usually the best idea);

I also added a ‘parent’ or Root object that I can use to parent any other content that gets created in the scene;

This object becomes important because I will align its coordinate system with that of the ‘scene’ returned by the ‘Scene Understanding’ APIs such that objects positioned & oriented under this parent can take their placement directly from the Scene Understanding APIs (the Update method in the script listed below performs this alignment every frame).
I made sure that my application had the capabilities to use the microphone and spatial perception and then I added a GameObject with a couple of scripts attached to it.
The first of those is just the standard toolkit Speech Input Handler which, as you can see below, routes the 4 speech commands that I have onto handlers in my script named MyScript;
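As a side note, if you prefer wiring things up in code rather than in the editor, recent versions of MRTK will also route keywords to any script which implements IMixedRealitySpeechHandler and registers itself globally. A minimal sketch of that approach (the SpeechRouter name is just my own invention) would be;

using Microsoft.MixedReality.Toolkit;
using Microsoft.MixedReality.Toolkit.Input;
using UnityEngine;

// Sketch only: a code-based alternative to the Speech Input Handler component.
public class SpeechRouter : MonoBehaviour, IMixedRealitySpeechHandler
{
    void OnEnable()
    {
        // Register globally so that keywords arrive regardless of focus.
        CoreServices.InputSystem?.RegisterHandler<IMixedRealitySpeechHandler>(this);
    }
    void OnDisable()
    {
        CoreServices.InputSystem?.UnregisterHandler<IMixedRealitySpeechHandler>(this);
    }
    public void OnSpeechKeywordRecognized(SpeechEventData eventData)
    {
        // The keywords themselves still need defining in the speech commands profile.
        Debug.Log($"Heard '{eventData.Command.Keyword}'");
    }
}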

The second script is MyScript itself, which provides handlers for those 4 calls. It handles them by calling into the ‘Scene Understanding’ APIs to find ‘scene objects’ of the kind in question (e.g. walls, platforms) and then it tries to display;
- An instance of the ‘marker’ prefab at the position and orientation of the ‘scene object’
- An instance of a (flattened) cube at the position and size of each of the quads that makes up that scene object.
My script then needs access to this parent object, the prefab to instantiate and also a material to texture the cubes with, and so these are parameters to the script as below;

That’s pretty much it – I built this out and ran it in my small home office and tried out saying the ‘platform’ voice command and the results were really nice 😉 as below;

and, similarly, trying out the keyword ‘ceiling’ seemed to work reasonably too;

The script that’s sitting behind this is very experimental, largely just packaging up the pieces from the documentation, and it’s listed below. It seemed to work quite well quite quickly, so it’s more than possible that I’m benefiting from some kind of ‘false positive’ and perhaps it works better in this one location than it does in others but, still, it’s a starting point 🙂
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using NumericsConversion;
using UnityEngine;
#if ENABLE_WINMD_SUPPORT
using Microsoft.MixedReality.SceneUnderstanding;
using Windows.Perception.Spatial;
using Windows.Perception.Spatial.Preview;
using UnityEngine.XR.WSA;
#endif

public class MyScript : MonoBehaviour
{
    public GameObject parentObject;
    public GameObject markerPrefab;
    public Material quadMaterial;

    void Update()
    {
#if ENABLE_WINMD_SUPPORT
        // TODO: Doing all of this every update right now based on what I saw in the doc
        // here https://docs.microsoft.com/en-us/windows/mixed-reality/scene-understanding-sdk#dealing-with-transforms
        // but that might be overkill.
        // Additionally, wondering to what extent I should be releasing these COM objects
        // as I've been lazy to date.
        // Hence - apply a pinch of salt to this...
        if (this.lastScene != null)
        {
            // Ask for the transform from the scene's coordinate system to Unity's
            // and keep the parent object aligned with it.
            var node = this.lastScene.OriginSpatialGraphNodeId;
            var sceneCoordSystem = SpatialGraphInteropPreview.CreateCoordinateSystemForNode(node);
            var unityCoordSystem =
                (SpatialCoordinateSystem)System.Runtime.InteropServices.Marshal.GetObjectForIUnknown(
                    WorldManager.GetNativeISpatialCoordinateSystemPtr());

            var transform = sceneCoordSystem.TryGetTransformTo(unityCoordSystem);

            if (transform.HasValue)
            {
                var sceneToWorldUnity = transform.Value.ToUnity();

                this.parentObject.transform.SetPositionAndRotation(
                    sceneToWorldUnity.GetColumn(3), sceneToWorldUnity.rotation);
            }
        }
#endif
    }

    // These 4 methods are wired to voice commands via MRTK...
    public async void OnWalls()
    {
#if ENABLE_WINMD_SUPPORT
        await this.ComputeAsync(SceneObjectKind.Wall);
#endif
    }
    public async void OnFloor()
    {
#if ENABLE_WINMD_SUPPORT
        await this.ComputeAsync(SceneObjectKind.Floor);
#endif
    }
    public async void OnCeiling()
    {
#if ENABLE_WINMD_SUPPORT
        await this.ComputeAsync(SceneObjectKind.Ceiling);
#endif
    }
    public async void OnPlatform()
    {
#if ENABLE_WINMD_SUPPORT
        await this.ComputeAsync(SceneObjectKind.Platform);
#endif
    }

    void ClearChildren()
    {
        // Destroy is deferred to end-of-frame so ordering doesn't matter here,
        // even though the quads are parented to the markers.
        foreach (var child in this.markers)
        {
            Destroy(child);
        }
        foreach (var child in this.quads)
        {
            Destroy(child);
        }
        this.markers.Clear();
        this.quads.Clear();
    }

#if ENABLE_WINMD_SUPPORT
    async Task InitialiseAsync()
    {
        // One-off check that scene understanding is supported on this device
        // and that we have been granted access to spatial data.
        if (!this.initialised)
        {
            if (SceneObserver.IsSupported())
            {
                var access = await SceneObserver.RequestAccessAsync();

                if (access == SceneObserverAccessStatus.Allowed)
                {
                    this.initialised = true;
                }
            }
        }
    }
    async Task ComputeAsync(SceneObjectKind sceneObjectKind)
    {
        this.ClearChildren();

        await this.InitialiseAsync();

        if (this.initialised)
        {
            // Ask for quads only - no scene object meshes and no world mesh.
            var querySettings = new SceneQuerySettings()
            {
                EnableWorldMesh = false,
                EnableSceneObjectQuads = true,
                EnableSceneObjectMeshes = false,
                EnableOnlyObservedSceneObjects = false
            };

            this.lastScene = await SceneObserver.ComputeAsync(querySettings, searchRadius);

            if (this.lastScene != null)
            {
                foreach (var sceneObject in this.lastScene.SceneObjects)
                {
                    if (sceneObject.Kind == sceneObjectKind)
                    {
                        // Drop a marker at the scene object's position & orientation,
                        // parented off the root so that the transform applied in
                        // Update() positions it correctly in the Unity world.
                        var marker = GameObject.Instantiate(this.markerPrefab);
                        marker.transform.SetParent(this.parentObject.transform);
                        marker.transform.localPosition = sceneObject.Position.ToUnity();
                        marker.transform.localRotation = sceneObject.Orientation.ToUnity();
                        this.markers.Add(marker);

                        foreach (var sceneQuad in sceneObject.Quads)
                        {
                            // Display each quad as a flattened cube sized to the quad's extents.
                            var quad = GameObject.CreatePrimitive(PrimitiveType.Cube);
                            quad.transform.SetParent(marker.transform, false);
                            quad.transform.localScale = new Vector3(
                                sceneQuad.Extents.X, sceneQuad.Extents.Y, 0.025f);
                            quad.GetComponent<Renderer>().material = this.quadMaterial;

                            // Track the quad so that ClearChildren() can tidy it up.
                            this.quads.Add(quad);
                        }
                    }
                }
            }
        }
    }

    Scene lastScene;
#endif

    // State initialised inline (no constructor - MonoBehaviours shouldn't define one).
    List<GameObject> markers = new List<GameObject>();
    List<GameObject> quads = new List<GameObject>();
    bool initialised;

    static readonly float searchRadius = 5.0f;
}
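An obvious next step would be to do something smarter with those quads than covering them in flattened cubes. The SDK documentation describes a FindCentermostPlacement method on SceneQuad which suggests the ‘best’ point on a quad for content with a given footprint, so a hedged sketch of a helper that could be added to MyScript might be (TryPlaceOnQuad is my own naming and the coordinate-space handling is an assumption to verify);

#if ENABLE_WINMD_SUPPORT
    // Sketch only: try to position 'content' (assumed to be parented off the
    // transform representing the quad's scene object) at the centermost point
    // on the quad which can fit a footprint of 'extents' metres.
    static bool TryPlaceOnQuad(
        SceneQuad sceneQuad, GameObject content, System.Numerics.Vector2 extents)
    {
        System.Numerics.Vector2 placement;

        if (!sceneQuad.FindCentermostPlacement(extents, out placement))
        {
            return false;
        }
        // 'placement' is a 2D point in the quad's own coordinate space - quite
        // how it maps onto the parent transform needs verifying on a device.
        content.transform.localPosition = new Vector3(placement.X, placement.Y, 0.0f);
        return true;
    }
#endif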
I’ve also shared the whole project up here on github if you want to have a play with it yourself. I’m pleased to have got started with this and am looking forward to doing some more exploration of what’s possible here.