Following up on this earlier post, where I toyed with speaker verification, I wanted to get a feel for what the speaker identification part of the “Project Oxford” speaker recognition APIs looks like.
It’s interesting to see the difference between the capability of the two areas of functionality and how it shapes the APIs that the service offers.
For verification, a profile is built by capturing the user repeating one of a set of supported short phrases three times and submitting the captured audio. Once the profile is built, the user can be prompted to submit a phrase that is tested against the profile for a ‘yes/no’ match.
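As a quick recap of that yes/no step, a minimal sketch of how a client might interpret the verification response. This is a Python illustration rather than code from my (C#) app, and the field names and values here are my reading of the “Oxford” docs of the time, so treat them as assumptions:

```python
def is_verified(verify_response):
    # Interpret the JSON body from the verification endpoint as a yes/no.
    # The "result" and "confidence" field names/values are assumptions
    # from the "Oxford" documentation, not lifted from my actual code.
    return (verify_response.get("result") == "Accept"
            and verify_response.get("confidence") in ("Normal", "High"))
```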
Identification is a different beast. The enrolment phase involves building a profile by capturing the user talking for 60 seconds and submitting that audio to the service for analysis. It’s worth saying that the 60 seconds doesn’t all have to be captured at once, but the minimum duration for a submission is 20 seconds.
The service then processes that speech and returns a ‘call me back’ style URL which the client must poll later to gather the results. It’s possible that the outcome will be a request for more speech to analyse before the profile is complete, so the client may need to loop through enrolment more than once.
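That enrolment loop might look something like the sketch below. Again this is Python rather than my C#/UWP code, and the base URL, endpoint paths, header names and JSON field names (such as `enrollmentStatus` and `remainingEnrollmentSpeechTime`) are my best reading of the “Oxford” documentation at the time of writing, so they may well differ:

```python
import json
import urllib.request

# Assumed region/path and placeholder key - as in my keys.cs file.
BASE = "https://westus.api.cognitive.microsoft.com/spid/v1.0"
KEY = "YOUR_OXFORD_API_KEY"

def post(url, body, content_type="application/json"):
    # Minimal helper around urllib for the POSTs the service expects.
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Ocp-Apim-Subscription-Key", KEY)
    req.add_header("Content-Type", content_type)
    return urllib.request.urlopen(req)

def create_profile():
    # Create an identification profile; the service hands back its id.
    resp = post(f"{BASE}/identificationProfiles",
                json.dumps({"locale": "en-US"}).encode())
    return json.loads(resp.read())["identificationProfileId"]

def enroll(profile_id, wav_bytes):
    # Submit a chunk of speech; the response carries an Operation-Location
    # header - the 'call me back' URL the client must poll for the result.
    resp = post(f"{BASE}/identificationProfiles/{profile_id}/enroll",
                wav_bytes, content_type="application/octet-stream")
    return resp.headers["Operation-Location"]

def needs_more_speech(operation_result):
    # The polled operation reports an enrolment status plus how much more
    # speech the service still wants before the profile is complete.
    result = operation_result.get("processingResult", {})
    return (result.get("enrollmentStatus") != "Enrolled"
            or result.get("remainingEnrollmentSpeechTime", 0) > 0)
```

The client would call `enroll`, poll the returned URL, and keep capturing and submitting speech while `needs_more_speech` says the profile isn’t done.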
Once the profile is built, identification is achieved by submitting another 60 seconds of the user speaking along with (at the time of writing) a list of up to 10 profiles to check against.
So, while it’s possible to build up to 1000 profiles at the service, identification only runs against 10 of them at a time right now.
Again, this submission results in a ‘call me back’ URL which the client can return to later for results.
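A hedged sketch of what that identification call might look like, building on the same assumed base URL and key as before. The query parameter name and the 10-profile cap are as I read the “Oxford” docs at the time of writing:

```python
import urllib.parse
import urllib.request

# Assumed region/path and placeholder key - as in my keys.cs file.
BASE = "https://westus.api.cognitive.microsoft.com/spid/v1.0"
KEY = "YOUR_OXFORD_API_KEY"

def identify_url(profile_ids):
    # The service takes a comma-separated list of candidate profiles,
    # capped (at the time of writing) at 10 per call.
    if not 1 <= len(profile_ids) <= 10:
        raise ValueError("identification runs against 1-10 profiles per call")
    query = urllib.parse.urlencode(
        {"identificationProfileIds": ",".join(profile_ids)})
    return f"{BASE}/identify?{query}"

def identify(profile_ids, wav_bytes):
    # POST the ~60 seconds of speech; the response carries an
    # Operation-Location header to poll later for the matched profile.
    req = urllib.request.Request(identify_url(profile_ids),
                                 data=wav_bytes, method="POST")
    req.add_header("Ocp-Apim-Subscription-Key", KEY)
    req.add_header("Content-Type", "application/octet-stream")
    return urllib.request.urlopen(req).headers["Operation-Location"]
```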
Clearly, identification is a much harder problem to solve than verification, and that’s reflected in the APIs here, although I suspect that, over time, both the amount of speech required and the number of profiles that can be checked in one call will change.
In terms of actually calling the APIs, it’s worth referring back to my previous post: it covers where to find the official (non-UWP) samples and links across to the “Oxford” documentation. What I’m doing here is adapting that previous code to work with the identification APIs rather than the verification ones.
In doing that, I made my little test app speech-centric rather than mouse/keyboard-centric, and it ended up working as shown in the video below (NB: the video has 2+ minutes of me reading from a script on the web, so feel free to jump around to skip those bits):
In most of my tests, I found that I had to submit more than one batch of speech during the enrolment phase, but I got a little lucky with the example I recorded here: enrolment happened in one go, which surprised me.
Clearly, I’d need to gather a user community somewhat larger than one person to test this properly, but it seems to be working reasonably here.
I’ve posted the code for this here for download. It’s fairly rough and ready: there’s precious little error handling, and it’s more of a code-behind sample without much structure.
As before, if you want to build this out yourself you’ll need an API key for the “Oxford” service, and you’ll need to plug it into the file named keys.cs to get things to compile.