Voice Recognition: Will Its Time Ever Come?
Every now and then some analyst or journalist gets excited about voice recognition and declares that “this is a technology whose time has come”. Yet each time, the technology fails to become pervasive. I was tempted to say something of the kind when Nokia mobile phone handsets started to get sophisticated, and I would have been wrong.
I knew long ago that voice recognition itself is simply a matter of processor power. A researcher at IBM told me that. In fact he told me that IBM had the technology (the algorithm and the software) for years and was just waiting for processing power to catch up. The evolution of voice recognition followed these steps:
- First, a limited vocabulary (spoken distinctly) was recognizable.
- Next, a wider vocabulary, but not if you ran words together or had a “difficult” accent.
- Next, accents were handled through better training.
- Finally, you could even get away with running words together.
Some of this was achieved by tweaking the algorithm, but most of it came simply from faster processors. In terms of actual recognition, it parallels handwriting recognition, which was pretty much cracked by the Palm Pilot. Admittedly that was almost cheating: it forced people to write the way the device wanted them to, rather than how they usually write.
I’ve bought voice recognition technology three times, each time a version of IBM’s ViaVoice, and each time I’ve tried it and dumped it. I like the idea of being able to dictate into a word processor, and I’d be happy to dictate my blog, but there’s a mismatch between the human speaking process and the computer interface. Until someone solves that problem, the use of voice recognition technology will be limited to those who have to use it.
So what’s the problem?
Well, there’s a minor problem: it’s a little antisocial in some environments to talk to a device. That’s becoming less of a problem as people become inured to other people speaking into mobile phones in just about every context you can imagine: in coffee shops and supermarkets and on trains. I like Louis Black’s comment on this: “If God had wanted you to have loud one-way conversations in public, he’d have made you into a crazy person”. Nevertheless, we’ve become used to it, so talking to PCs in office environments is going to be acceptable in time.
But there’s also a major problem: voice interfaces are by necessity linear. They are, in fact, command-line interfaces, just like a Unix shell.
I say something, then you say something.
The linearity is the problem. You must have noticed how frustrating it is to deal with some automated voice systems, because at every turn they have to offer you several choices, plus the choice of hearing it all over again in case you’ve forgotten some of them. If you offer more than five choices, the user has normally forgotten choice number one by the time choice five is being described. (Our brains don’t cache information very well.)
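To make the linearity concrete, here is a minimal Python sketch of a spoken menu. It is an illustration only: the function names, the wording of the prompts, the "press 9 to repeat" key and the speaking rate of 2.5 words per second are all assumptions, not taken from any real voice system. It shows that the wait before a caller has heard a given option grows with every option read out before it.

```python
def spoken_menu(choices, repeat_key="9"):
    """Return the prompts in the order a caller hears them (hypothetical wording)."""
    prompts = [
        f"For {label}, press {number}."
        for number, label in enumerate(choices, start=1)
    ]
    # Every serial menu also needs a way to hear the whole list again.
    prompts.append(f"To hear these options again, press {repeat_key}.")
    return prompts


def seconds_until_heard(prompts, index, words_per_second=2.5):
    """Seconds elapsed once the prompt at `index` has been fully spoken.

    The speaking rate is an assumed figure, used only for illustration.
    """
    words = sum(len(prompt.split()) for prompt in prompts[: index + 1])
    return words / words_per_second


if __name__ == "__main__":
    menu = spoken_menu(["billing", "support", "sales", "returns", "careers", "press"])
    for i, prompt in enumerate(menu):
        print(f"{seconds_until_heard(menu, i):5.1f}s  {prompt}")
```

On a screen, all of these options would be visible at once; spoken aloud, the caller has to hold the early ones in memory while waiting for the later ones.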
A two-dimensional (or even three-dimensional) interface is vastly richer than the one-dimensional interface of voice can ever be, and it rarely expects you to cache information. So the natural thing to do with voice is to slot it into such an interface and use it where it makes sense (in dictation scenarios, for example), rather than force it into scenarios it was never meant for. You also need to co-ordinate the whole sound channel so that, if you are listening to music and want to use voice, the music fades away automatically.
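The automatic fading of music described above is often called "ducking". The following Python sketch shows one simple way it could work, under stated assumptions: the music volume is a gain between 0.0 and 1.0, a flag says whether the user is speaking on each tick, and the gain moves a fixed step per tick toward its target. The names `duck_gain` and `apply_ducking`, the ducked level of 0.2 and the step of 0.1 are all hypothetical.

```python
def duck_gain(voice_active, current_gain, ducked_gain=0.2, step=0.1):
    """Move the music gain one step toward its target level.

    The target is `ducked_gain` while the user is speaking, else full volume.
    """
    target = ducked_gain if voice_active else 1.0
    if current_gain < target:
        return min(target, current_gain + step)  # fade back in
    return max(target, current_gain - step)      # fade out (or hold)


def apply_ducking(voice_flags, start_gain=1.0):
    """Return the music gain after each tick, given per-tick voice activity."""
    gains = []
    gain = start_gain
    for active in voice_flags:
        gain = duck_gain(active, gain)
        gains.append(round(gain, 2))  # rounded for display only
    return gains


if __name__ == "__main__":
    # Ten ticks of speech followed by ten ticks of silence.
    print(apply_ducking([True] * 10 + [False] * 10))
```

Ramping the gain gradually, rather than switching it at once, is a design choice here: an abrupt cut in the music would itself be a distraction.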
The voice interface is actually just a subset of an interface, unless all you have is a microphone and an earpiece (which, incidentally, is all you can ever have if you’re blind).
As things stand, some things can be done to improve the use of voice within the user interface, but it is possible that voice recognition has gone almost as far as it ever will. We can improve the single dimension that voice gives us, but I don’t see any way for it ever to become two-dimensional.