Since my last article on voice recognition in Cupcake, two things have happened: the Haykuro images came out, and Google released the 1.5 SDK. That means it’s time to test out voice recognition on a real phone. I’ve compiled the voice recognition example code that comes with the SDK into an installable app. You can install it if you have the Cupcake over-the-air update, the 1.5 ADP, or a Haykuro image.
To install, look for Voice Recognition in the Market (Applications > Demo) or scan the QR code at the bottom of the post. Once it’s installed, you just push the button, speak into your phone, and Google’s best guess appears on the screen. Very simple, but it answers some pressing questions.
It looks like voice recognition will take this form for all applications: the application starts the voice recognizer, which prompts the user, displays the volume widget, shows the waveform of the recording, and then returns the results to the application. It’s not perfectly seamless, but it gets the job done in a fairly small amount of time.
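Under the hood, this flow goes through the `RecognizerIntent` API that ships with the 1.5 SDK. A minimal sketch of kicking off the recognizer from an Activity (the request code and prompt string here are my own placeholders):

```java
import android.app.Activity;
import android.content.Intent;
import android.speech.RecognizerIntent;

public class VoiceDemoActivity extends Activity {
    private static final int VOICE_REQUEST_CODE = 1234; // arbitrary

    // Fires up the built-in recognizer UI: the prompt, the volume
    // widget, and the waveform display, all handled by the system.
    private void startVoiceRecognition() {
        Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
        intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
        intent.putExtra(RecognizerIntent.EXTRA_PROMPT, "Speak now");
        startActivityForResult(intent, VOICE_REQUEST_CODE);
    }
}
```

The system handles the whole recording UI; your app just waits for the result to come back.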
Note that the processing is not CPU-intensive on the phone; the audio is sent to Google’s servers to be turned into text (thanks to Tim H for pointing this out on the last article). You’ll need a good internet connection for this. I found EDGE (the non-3G data connection) to be a bit unreliable: about half of my requests came back with a connection error. That will certainly vary, but the worst case is not good, especially if you’re trying to use this for already-frustrating operations like speaking commands to your GPS navigation system. Over WiFi the recognizer works much better, and the results come back fast.
I like the fact that you get good feedback from the volume widget and the waveform as to how you should be speaking. Since it doesn’t adapt itself to your speaking patterns, you’ll have to accommodate it. Also, if you don’t have an American accent, you might have a tough time getting good results. But, it does work, and Google says it’s getting better with practice. And since Google is one of the few companies on Earth with billions of dollars and access to millions of people’s voice searches, I’d say they’ve got as good a chance as anyone of getting this right.
Internally, the voice recognition interface is pretty simple. Developers get their choice of language models, either free-form or web-search based, and they receive a list of possible results instead of just one. That lets them layer their own language model on top: if several responses come back, the application can choose the one best suited to its purpose.
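The receiving side is just an `onActivityResult` callback. A sketch of pulling out the list of guesses (the request code is my own placeholder, matching whatever you passed to `startActivityForResult`):

```java
import android.app.Activity;
import android.content.Intent;
import android.speech.RecognizerIntent;
import java.util.ArrayList;

public class VoiceDemoActivity extends Activity {
    private static final int VOICE_REQUEST_CODE = 1234; // arbitrary

    // The recognizer returns an ordered list of guesses, best first.
    @Override
    protected void onActivityResult(int requestCode, int resultCode, Intent data) {
        if (requestCode == VOICE_REQUEST_CODE && resultCode == RESULT_OK) {
            ArrayList<String> matches =
                    data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
            // matches.get(0) is Google's best guess; the application can
            // scan the alternatives for one that fits its own grammar.
        }
        super.onActivityResult(requestCode, resultCode, data);
    }
}
```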
Suppose the application wants a simple “yes” or “no.” It could accept anything starting with a “Y” as yes and anything starting with an “N” as no: “Yup,” “Yeah,” “You betcha”… as well as “yurt,” “yam,” and “yaw,” depending on who’s speaking and how Google interprets it. You’re almost guaranteed a good match. In fact, the smaller the search space, the greater the likelihood of a good match, so a yes/no prompt could even be simplified to a silence-versus-response scheme that would work in any language. (In my tests, “yes” worked pretty well, but “no” consistently came up as “snow.”)
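The first-letter trick above is easy to sketch as a plain helper. This is a hypothetical matcher of my own, not from the SDK sample; it walks the N-best list in order, so even if the top guess is “snow,” a “no” further down the list still wins:

```java
import java.util.List;

public class YesNoMatcher {
    // Walks the recognizer's guesses in order (best first) and returns
    // "yes" for the first Y-word, "no" for the first N-word, or
    // "unknown" if nothing matches and the user should be re-prompted.
    public static String interpret(List<String> guesses) {
        for (String guess : guesses) {
            String g = guess.trim().toLowerCase();
            if (g.startsWith("y")) return "yes";
            if (g.startsWith("n")) return "no";
        }
        return "unknown";
    }
}
```

So `interpret` of `["snow", "no"]` skips past the mis-heard “snow” and lands on “no,” which is exactly why getting a list of results instead of a single guess matters.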
The language model is set to free-form in the sample, so results should differ slightly from regular voice search.
Good luck! If you have funny, interesting, insightful results, please post them here.
Developed By: Alex Byrnes