Following up on my previous post, this is where I return to ‘Project Oxford’ and its capability to do speaker verification and identification.
You can read the documentation on this here and you can try it out on the web.
I learned late in this process that, while there doesn’t seem to be a client SDK package as such for the Windows 10/UWP developer that surfaces these REST APIs into your environment, there are some .NET samples;
and if I’d realised that up front then I could have saved myself quite a lot of effort. That said, those .NET samples seem to use NAudio in order to generate an audio stream whereas I wanted to use UWP APIs to do that, and that’s where most of my effort went in trying out the ‘Oxford’ speaker recognition APIs.
In my own terms, verification is where you have an idea whose voice you are listening to and you want ‘Oxford’ to confirm it. Identification, on the other hand, is where you want ‘Oxford’ to take the voice and tell you who it belongs to.
I’ve only really dug into verification so far but I think that in both cases you work with ‘Profiles’ and for verification the process runs something like this;
- You make a REST call to ask the system for a set of supported verification phrases. You choose one to present to your user.
- You make a REST call to create a new profile (for the user)
- You need to repeat at least 3 times what ‘Oxford’ calls ‘enrolment’ for a profile;
  - Capture the user saying the SAME verification phrase in a WAV container containing PCM audio at 16KHz, 16-bit, mono.
  - Make a REST call sending the speech to the service and check that the service is happy to use it for enrolment.
- Once the user is fully ‘enrolled’, you can now verify the user’s voice by capturing them speaking one of the phrases again and then making a REST call with that speech against that profile and the system says “Yes” or “No” and provides a confidence factor on that response.
That’s about it when reduced down to a few sentences. I think the service currently supports 1000 profiles so, effectively, 1000 users.
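Sketched in code, that sequence boils down to a handful of HTTP calls against the service. The snippet below is a minimal illustration rather than the code from my app – the endpoint URLs, the JSON property names (like verificationProfileId) and the content type on the audio POSTs are my reading of the ‘Oxford’ documentation at the time of writing, so check them against the docs rather than taking my word for it;

// Needs System.Net.Http, System.Net.Http.Headers, System.Text and Windows.Data.Json.
static readonly HttpClient client = new HttpClient();
static readonly string baseUrl = "https://api.projectoxford.ai/spid/v1.0";

async Task VerificationFlowAsync(string apiKey, byte[] enrolmentWav, byte[] verificationWav)
{
  client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", apiKey);

  // 1. Get the list of supported verification phrases (JSON) and pick one to show the user.
  var phrasesJson = await client.GetStringAsync($"{baseUrl}/verificationPhrases?locale=en-US");

  // 2. Create a new verification profile for the user; the response carries the profile id.
  var createResponse = await client.PostAsync(
    $"{baseUrl}/verificationProfiles",
    new StringContent("{\"locale\":\"en-US\"}", Encoding.UTF8, "application/json"));

  var profileId = JsonObject.Parse(await createResponse.Content.ReadAsStringAsync())
    .GetNamedString("verificationProfileId");

  // 3. Enrolment - done at least 3 times, posting the WAV/PCM bytes of the user
  //    speaking the SAME phrase each time.
  var enrolResponse = await client.PostAsync(
    $"{baseUrl}/verificationProfiles/{profileId}/enroll",
    MakeAudioContent(enrolmentWav));

  // 4. Verification - post another recording of the phrase against the profile; the JSON
  //    response carries an accept/reject result plus a confidence value.
  var verifyJson = await (await client.PostAsync(
    $"{baseUrl}/verify?verificationProfileId={profileId}",
    MakeAudioContent(verificationWav))).Content.ReadAsStringAsync();
}

ByteArrayContent MakeAudioContent(byte[] wavBytes)
{
  var content = new ByteArrayContent(wavBytes);
  content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
  return content;
}

In a real app the enrolment call in step 3 would naturally sit in a loop, repeating until the service reports the profile as fully enrolled.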
I wrote a little Windows 10 UWP app to try out the verification service and, not having discovered that .NET/WPF sample at the time, I wrote a bunch of classes that I could have more or less taken straight from that sample code if I’d known about it before I started;
I didn’t write mountains of code but it’s perhaps too big to go into a blow-by-blow account of here and most of it is just REST calls and JSON de-serialization.
I did, though, spend quite a long time on it and for one reason only…
Struggles with Posting WAV Files containing PCM Audio
I used the UWP AudioGraph APIs to create a byte stream containing a WAV container wrapping a PCM audio stream sampled at 16KHz in mono with 16 bits per sample.
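As a rough guide, the sort of graph involved looks something like the sketch below. It isn’t the exact code from my app – it’s a minimal sketch that records into a StorageFile for a fixed few seconds (the CaptureWavAsync name and the 5 second duration are made up for illustration) and it skips the checks on the Status values that real code needs;

// Needs System, System.Threading.Tasks, Windows.Media.Audio, Windows.Media.Capture,
// Windows.Media.MediaProperties, Windows.Media.Render and Windows.Storage.
async Task CaptureWavAsync(StorageFile outputFile)
{
  // Build an audio graph in the 'speech' render category.
  var settings = new AudioGraphSettings(AudioRenderCategory.Speech);

  var graphResult = await AudioGraph.CreateAsync(settings);
  var graph = graphResult.Graph;

  // Microphone in...
  var inputResult = await graph.CreateDeviceInputNodeAsync(MediaCategory.Speech);

  // ...and a WAV container with 16KHz, mono, 16-bit PCM out.
  var wavProfile = MediaEncodingProfile.CreateWav(AudioEncodingQuality.Auto);
  wavProfile.Audio = AudioEncodingProperties.CreatePcm(16000, 1, 16);

  var outputResult = await graph.CreateFileOutputNodeAsync(outputFile, wavProfile);

  inputResult.DeviceInputNode.AddOutgoingConnection(outputResult.FileOutputNode);

  // Record while the user speaks the verification phrase.
  graph.Start();
  await Task.Delay(TimeSpan.FromSeconds(5));
  graph.Stop();

  await outputResult.FileOutputNode.FinalizeAsync();
}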
That’s what ‘Oxford’ said that it wanted and that’s what I tried to give it. Many times over.
Every time I submitted one of these byte streams to the ‘Oxford’ speaker recognition enrolment APIs I would get some kind of ‘PCM is required’ error.
After much head scratching, I found myself going back and reading a resource that I’ve used before on WAV file formats (like when I wrote this old post).
I found that if I opened up a WAV file from c:\windows\media with the binary editor in Visual Studio then I would see a header something like;
whereas the WAV byte stream that was coming out of the Windows 10 UWP AudioGraph APIs looked like this;
Clearly, the characters ‘JUNK’ stood out for me here and, as far as I can tell, this second WAV file contains a ‘JUNK chunk’ of 36 bytes. As far as I know, this is valid and is certainly mentioned here, and it doesn’t seem to cause Groove Music or Windows Media Player any troubles.
However… it did seem to cause troubles for the ‘Project Oxford’ API. I found that when my stream contained the ‘JUNK’ chunk I got errors from the API, whereas once I removed that chunk (and patched up the file length, which is also stored in the stream) things seemed to work much better.
Consequently, my code currently has this slightly hacky method which hacks the byte stream that comes out of the AudioGraph APIs before it gets sent to ‘Oxford’;
byte[] HackOxfordWavPcmStream(IInputStream inputStream, out int offset)
{
  var netStream = inputStream.AsStreamForRead();
  var bits = new byte[netStream.Length];
  netStream.Read(bits, 0, bits.Length);

  // original file length
  var pcmFileLength = BitConverter.ToInt32(bits, 4);

  // take away 36 bytes for the JUNK chunk
  pcmFileLength -= 36;

  // now copy 12 bytes from the start of the array to 36 bytes further on
  for (int i = 0; i < 12; i++)
  {
    bits[i + 36] = bits[i];
  }

  // now put the modified file length into bytes 40-43
  var newLengthBits = BitConverter.GetBytes(pcmFileLength);
  newLengthBits.CopyTo(bits, 40);

  // the bits that we want are now 36 onwards in this array
  offset = 36;

  return (bits);
}
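For completeness, the code which posts to the service then sends just the patched-up portion of that array, something like the fragment below, where recordedStream, httpClient and enrolmentUri stand in for whatever you already have to hand;

// recordedStream is the IInputStream captured via AudioGraph, httpClient already has the
// subscription key header applied and enrolmentUri is the 'Oxford' enrolment endpoint.
int offset;
var bits = HackOxfordWavPcmStream(recordedStream, out offset);

// Send only the bytes from 'offset' onwards, i.e. skipping the original 36-byte lead-in.
var content = new ByteArrayContent(bits, offset, bits.Length - offset);
content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");

var response = await httpClient.PostAsync(enrolmentUri, content);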
Once I had that code in place, I seemed to be able to submit audio to ‘Oxford’s speaker verification endpoints for both enrolment and verification with no hassle whatsoever.
I’m not sure whether I could somehow push the AudioGraph APIs to not emit this JUNK chunk but, for now, I’ve gone with the hack after wasting quite a few hours trying to figure it out.
In Action
Putting it together, I have the most basic UI, which runs like this – the demo below is a little bit ‘dry’ as I’ve not got a community of users to try it out with as I write this post, but maybe you can try it out yourself and see how it works for you with a set of voices;
Code
If you want the code, it’s here for download but there’s a #error in a file called Keys.cs where you would need to supply your own API key for Oxford’s speaker recognition API.
Identification?
With the code as it is, I don’t think it’d be too hard to add on calls to the identification service and try that out as well. I might get there in a follow-on post…
Beyond?
Voice verification/identification here is exciting stuff. You could imagine building a (web connected) system that used Oxford’s facial and voice verification to add 2- or 3-factor authentication to something like a secure swipe-card door access system.
At the time of writing, facial recognition is available on device (Intel’s RealSense SDK for desktop apps) and is based on depth camera images (like those used in ‘Windows Hello’).
For voice recognition, though, I’m not aware of an on-device service, whether for desktop or UWP, so it’s exciting to see this up at ‘Oxford’ in preview (free) for developers to experiment with.
Give it a whirl