Update – ‘Project Oxford’ Speaker Recognition APIs PCM Audio Fix

Just a quick update – if you took any of my code from either of these blog posts;

then (as per the details in the first post) I made a hack to alter the PCM audio stream being submitted to ‘Project Oxford’, as I found that the REST API would reject my stream of audio.

That server-side bug now seems to have been addressed, so the code for those 2 samples just needs tweaking to take out the hack in question.
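For reference, my understanding at the time was that the service wanted 16 kHz, 16-bit, mono PCM audio in a standard WAV container – check the current documentation to be sure. As a quick illustrative sketch (in Python here, rather than the C# of the actual samples) of wrapping raw PCM in a well-formed WAV header, which is essentially the sort of thing my hack was fiddling with;

```python
import io
import wave

def pcm_to_wav(pcm_bytes: bytes, sample_rate: int = 16000) -> bytes:
    """Wrap raw 16-bit mono PCM samples in a standard WAV (RIFF) container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    return buffer.getvalue()

# One second of silence as 16-bit mono PCM (16,000 two-byte samples)
wav_bytes = pcm_to_wav(b"\x00\x00" * 16000)
```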

Speech Recognition and Identification with Windows 10/UWP and ‘Project Oxford’

This post is a direct follow-on from this earlier post which I wrote just 2 weeks ago about how ‘Project Oxford’ has preview APIs that can be used to verify that a piece of recorded speech belongs to a particular (pre-registered) speaker.

Not long after that post, I noticed a few news articles about one of the major UK high-street banks starting to adopt voice identification technology as part of their customer login process. I’ve linked below to one of the articles that I spotted on the BBC website;


Just to be clear, I’ve no idea what sort of technology is in use by HSBC, but it was very topical for me. I’d just recently been experimenting with those preview ‘Oxford’ APIs, trying out my own version of a similar thing, and so it sparked my interest.

With that in mind, I put together the little follow-on sample below, which is meant to show;

    • The idea of registering for voice verification by repeating a phrase 3 times over and submitting it to the cloud so that a voice profile can be built.
    • The idea of then verifying a voice sample with the cloud in order to log in to a system.
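Just as an illustrative sketch of those two steps (in Python – the base URL and endpoint paths here are my assumptions about the shape of the REST API rather than something taken from the samples), the round trips boil down to three calls;

```python
# Illustrative only: the base URL and paths are assumptions about the
# (then-preview) Speaker Recognition REST API, not taken from the samples.
BASE_URL = "https://api.projectoxford.ai/spid/v1.0"

def create_profile_request(locale: str = "en-US"):
    """Request shape for creating a new verification profile (returns a GUID)."""
    return ("POST", f"{BASE_URL}/verificationProfiles", {"locale": locale})

def enroll_request(profile_id: str):
    """Request shape for submitting one of the 3 phrase recordings for enrolment."""
    return ("POST", f"{BASE_URL}/verificationProfiles/{profile_id}/enroll", None)

def verify_request(profile_id: str):
    """Request shape for testing a fresh recording against the enrolled profile."""
    return ("POST", f"{BASE_URL}/verify?verificationProfileId={profile_id}", None)
```

Each call would carry the API key in a header and the WAV audio as the request body; the verify call comes back with a ‘yes/no’ style result.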

This demo code uses the ‘Project Oxford’ preview APIs to do the speech verification, but it also uses the Windows 10 UWP speech APIs on the client side to handle the simple commands required to navigate through the process.

Clearly, the navigation here is a little clunky, but you hopefully get the idea of how it might work. Naturally, more thought would need to be given to how a mechanism like this might avoid spoofing with recorded voices and so on, but this is just some simple demo code rather than an actual banking system 🙂

The demo code is here for download. It won’t compile until you feed it a key for the ‘Project Oxford’ APIs, and you’ll probably notice that it’s quite simple, basic code.

I’d flag that the association between the user’s account number and the GUID that identifies their online voice profile is stored locally in the app’s storage whereas, of course, in the real world you’d store this in the cloud somewhere.
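As a sketch of that mapping (a hypothetical Python stand-in for the app’s local storage – the names here are mine, not from the sample);

```python
class ProfileStore:
    """Maps a user's account number to the GUID of their online voice profile.
    The sample keeps this in local app storage; a real system would keep it
    server-side so the user could log in from any device."""

    def __init__(self):
        self._profiles = {}

    def register(self, account_number: str, profile_guid: str) -> None:
        self._profiles[account_number] = profile_guid

    def lookup(self, account_number: str):
        """Return the profile GUID for an account, or None if unregistered."""
        return self._profiles.get(account_number)

store = ProfileStore()
store.register("12345678", "00000000-0000-0000-0000-000000000000")
```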

“Project Oxford” – Speaker Identification from a Windows 10/UWP App

Following up on this earlier post, I wanted to get a feel for what the speaker identification part of the “Project Oxford” speaker recognition APIs looks like, having toyed with verification in that previous post.

It’s interesting to see the difference between the capability of the two areas of functionality and how it shapes the APIs that the service offers.

For verification, a profile is built by capturing the user repeating one of a set of supported phrases 3 times over and submitting the captured audio. These are short phrases. Once the profile is built, the user can be prompted to submit a phrase that can be tested against the profile for a ‘yes/no’ match.

Identification is a different beast. The enrolment phase involves building a profile by capturing the user talking for 60 seconds and submitting that audio to the service for analysis. It’s worth saying that the 60 seconds doesn’t all have to be captured in one go, but the minimum duration for any one submission is 20 seconds.
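So the client-side bookkeeping for enrolment amounts to something like the sketch below (Python, using the 20/60-second figures above – note that the real ‘enough speech yet?’ decision comes back from the service, not from local arithmetic);

```python
MIN_SUBMISSION_SECONDS = 20   # minimum length of any one enrolment recording
TARGET_SECONDS = 60           # total speech the service needs for a profile

def enrolment_complete(submission_seconds):
    """Given the durations of the recordings submitted so far, decide whether
    enough speech has been captured. This mirrors the enrolment loop locally;
    in reality the service reports whether the profile still needs more speech."""
    if any(s < MIN_SUBMISSION_SECONDS for s in submission_seconds):
        raise ValueError("each enrolment recording must be at least 20 seconds")
    return sum(submission_seconds) >= TARGET_SECONDS
```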

The service then processes that speech and provides a ‘call me back’ style endpoint which the client must poll later to gather the results. It’s possible that the result of processing will be a request for more speech to analyse in order to complete the profile, so there’s a possibility of looping to build the profile.
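That ‘call me back’ pattern can be sketched as a generic polling loop (Python; the status names and the fetch function are my assumptions – in reality the client would be doing an HTTP GET against the operation URL that the service hands back);

```python
import time

def poll_operation(fetch_status, interval_seconds=1.0, max_attempts=30):
    """Poll a 'call me back' style operation until it completes.
    fetch_status is any callable returning the operation's status payload,
    e.g. a function wrapping an HTTP GET on the operation URL. The status
    strings here are assumptions about the service's reported states."""
    for attempt in range(max_attempts):
        result = fetch_status()
        if result["status"] in ("succeeded", "failed"):
            return result
        if attempt < max_attempts - 1:
            time.sleep(interval_seconds)
    raise TimeoutError("operation did not complete in time")
```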

Once the profile is built, identification is achieved by submitting another 60 seconds of the user speaking along with (at the time of writing) a list of up to 10 profiles to check against.

So, while it’s possible to build up to 1000 profiles at the service, identification only runs against 10 of them at a time right now.
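Which means that identifying against a larger population would need the profile list batching into groups of 10 across successive calls – a quick sketch (Python; purely illustrative);

```python
MAX_PROFILES_PER_CALL = 10  # the API's per-call limit at the time of writing

def batch_profile_ids(profile_ids):
    """Split a list of identification profile GUIDs into the comma-separated
    batches that successive identify calls would take."""
    return [
        ",".join(profile_ids[i:i + MAX_PROFILES_PER_CALL])
        for i in range(0, len(profile_ids), MAX_PROFILES_PER_CALL)
    ]

# e.g. 23 profiles -> 3 identify calls (10 + 10 + 3 profiles)
batches = batch_profile_ids([f"id{n}" for n in range(23)])
```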

Again, this submission results in a ‘call me back’ URL which the client can return to later for results.

Clearly, identification is a much harder problem to solve than verification, and that’s reflected in the APIs here, although I suspect that, over time, the amount of speech required and the number of profiles that can be checked in one call will change.

In terms of actually calling the APIs, it would be worth referring back to my previous post, as it talked about where to find the official (non-UWP) samples and has links across to the “Oxford” documentation. What I’m doing here is adapting my previous code to work with the identification APIs rather than the verification ones.

In doing that, I made my little test app speech-centric rather than mouse/keyboard-centric, and it ended up working as shown in the video below (NB: this video has 2+ minutes of me reading from a script on the web – feel free to jump around to skip those bits 🙂);

In most of my tests, I found that I had to submit more than 1 batch of speech as part of the enrolment phase, but I got a little lucky with the example that I recorded here – enrolment happened in one go, which surprised me.

Clearly, I’d need to gather a slightly larger user community than 1 person to get a better test on this, but it seems to be working reasonably here.

I’ve posted the code for this here for download – it’s fairly rough-and-ready: there’s precious little error handling in there, and it’s more of a code-behind sample without much structure.

As before, if you want to build this yourself you’ll need an API key for the “Oxford” API, and you’ll need it in order to get the file named keys.cs to compile.