Thursday, November 8, 2007

Audio will drive the mobile web? Maybe.

The other week, Barry Welford proposed that voice recognition is the killer enabling technology behind a future explosion of the mobile web. I am of two minds about audio for the web.

For output, I love it. I used the Sprint voice web heavily in the past (much of it has since gone out of date), even though it was not that well designed in many respects and was pure audio. I've been waiting for years for the technology to allow simultaneous audio and data streams; now I am waiting for someone to take advantage of it. Audio with the visual web should work really well. It can tie directly into portal theory, with the read-back version being a summary (due to speed), then slightly more detail in a glanceable view, and more info if you scroll or click. It's probably less useful for long articles, but the principle could be inverted so the visual component supports the audio stream, as I have mentioned before.

Not sure what I think of visual voicemail systems. They are nudging this way, but none thrills me totally as yet.

Input-wise, I have grave concerns. Star Trek et al. use voice to engage the viewer in the otherwise very individual-centric behavior of interacting with a computer. But this is exactly why it does not work, in my experience. Recognition quality is not a concern of mine. As long ago as the early '90s, I was using OS/2 almost exclusively, and its voice input worked perfectly for interaction, and well enough for word recognition to be actively usable for typing. I'll even ignore the speed (it's a LOT slower to talk than to point or type, low-literacy folks aside).

The problem is the required isolation. Consider the use of an IVR today: how often has it been confused because someone next to you was talking? How often have you had to go away, or wait to make such a call, because your mom keeps wondering who you are talking to? I cannot think of a way around this, so voice input just seems like an insurmountably niche product to me.

I hear, never having been there, that Japanese commuter trains are very quiet, despite practically everyone clicking away on their mobiles. Can you imagine them all talking to their phones instead?


BWelford said...

Well we may have to agree to differ. However I think with a dedicated user of a mobile device, the training process can be significantly improved through time so that there is better identification of the true signal within all the 'noise'. Since technology moves so fast, I don't think this is a serious problem. Anyway thanks for starting this dialogue.

shoobe01 said...

Oh, I am all about disagreement and discussion.

I'm often rambling, so to clarify my point: I'm really not so terribly worried about the technology (you are right -- if it doesn't work, it soon will) as I am about the social and usefulness aspects.
1) It's much slower than typing. For control and cueing, there's generally going to be a translation or mapping to create verbal commands. Useful for certain hands-off situations, and for lower-literacy populations, but hard to believe it will become mainstream.
2) A voice input system will demand focus, if not technologically, then psychologically. A well-functioning one will tend to isolate the user even more, in one way or another. That worries me on several levels; shouldn't we be trying to make mobiles work for the user, and within their current social behaviors?

Now, if there were a way to integrate audio I/O for a group/social setting, that might add a whole new interesting dimension.