
Why voice interfaces remain challenging

Voice interfaces such as Amazon Echo, Google Home and Apple’s Siri are quickly becoming ubiquitous in smart home IoT setups: “Alexa, turn on the bedroom light.” It’s hardly a stretch to imagine that they’ll find their way into IoT architectures designed for industrial monitoring and control as well. As a recent issue of The Economist noted, “Using it is just like casting a spell: say a few words into the air, and a nearby device can grant your wish.”

Voice recognition vs. natural language processing

Most people find that Echo-like devices with multiple microphones, noise-canceling technology and a broadband internet connection have become quite good at recognizing spoken words. They’re not perfect, but they get things mostly right even in a noisy environment. This is a big change from the days when, even after a system had been trained to understand a particular speaker, accuracy remained low enough that voice interfaces usually weren’t very useful.

However, voice control also shares a drawback with spell casting: get the words a little bit wrong and the results can be unexpected. That’s because the machine-learning (and associated computer science) advances that underpin voice recognition improvements haven’t yet led to equally broad improvements in natural language processing, i.e., understanding the intent and meaning of a sequence of words.

While the field has advanced considerably — consider IBM’s Watson playing Jeopardy — anyone who has tried to use voice control for anything complicated has experienced the confusion that often results from going off script.
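
As a rough illustration of what going off script means, the following Python sketch assumes the spoken words have already been transcribed perfectly and simply matches them against a small fixed script. The patterns, the intent names and the match_command function are hypothetical and are not taken from any real assistant.

```python
import re

# Hypothetical command script: each pattern is one fixed phrasing the
# system accepts. Patterns and intent names are illustrative assumptions.
SCRIPT = [
    (re.compile(r"^turn (on|off) the (\w+) light$"), "set_light"),
    (re.compile(r"^(open|close) the (\w+) door$"), "set_door"),
]


def match_command(transcript):
    """Return (intent, slots) for a scripted phrase, or None if off script."""
    # Normalize case and drop trailing punctuation before matching.
    text = transcript.lower().strip().rstrip(".!?")
    for pattern, intent in SCRIPT:
        found = pattern.match(text)
        if found:
            return intent, found.groups()
    return None


on_script = match_command("Turn on the bedroom light.")
off_script = match_command("It's a bit dark in the bedroom, can you fix that?")
```

Both utterances are transcribed correctly in this sketch, but only the first fits the script. The second returns None even though a person would understand the request immediately, which is the gap between recognizing words and understanding intent.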


The need to stick close to a script is limiting, in part, because voice isn’t a very discoverable interface. It’s a lot more like a command line with a blinking cursor than a graphical user interface with menu items and other on-screen prompts that explicitly lead you through permissible actions. Indeed, voice is arguably worse than a command line; with a command line, you can at least glance through help screens quickly. When we speak, we depend on succinct and immediate feedback if a word isn’t understood or some part of a request doesn’t make sense. Computers are generally much worse than people at deriving meaning from context and at parsing freeform speech that doesn’t conform to a particular format or script.

Toward conversational systems

None of the above is particularly problematic if you’re primarily using voice control for repetitive simple commands. Perhaps you’re using voice to obtain a brief summary of standard operational data or for hands-free control of doors or other systems in a building.

However, one of the hopes for voice is that it will evolve into virtual assistants that can handle a wide range of tasks for both businesses and individuals. Imagine a virtual travel agent, or a diagnostic system that uses a combination of sensors and an online knowledge base to carry out an extended verbal back-and-forth with a repair technician. Applications like these will require systems to understand context and be capable of carrying on a natural dialog.

Moving beyond contrived demo cases, virtual assistants need to be quite sophisticated to be truly useful. Everyone has experienced the frustrations of menu-driven phone trees (“Press 8 to…”). Virtual assistants that frequently misunderstand or require you to constantly reword requests aren’t very useful.

A human assistant or agent of even middling skill and experience can usually act on some general instructions for, say, booking a trip. They may not make all the final decisions, but they know enough about typical traveler behavior to present some reasonable options and ask reasonable follow-up questions such as, “Do you need a rental car?” They know or can find out where to go for information and they can interact with a wide range of incompatible booking systems. In other words, the assistant has a reasonable understanding of the context surrounding a trip to another city and can improve understanding through conversation. Furthermore, a personalized assistant can learn more about preferences and context as time goes on.

In 2016, Jim Glass, a senior research scientist at MIT, told Technology Review that “Speech has reached a tipping point in our society. In my experience, when people can talk to a device rather than via a remote control, they want to do that.” Voice recognition has improved to the point that it’s a fairly reliable way to give simple commands and get feedback in return.

However, although we see glimpses of how voice could be used in complex, interactive tasks, it’s hard to argue that it’s there today. As is the case across nearly the entire span of artificial intelligence and machine learning, advances in certain domains can create the illusion that the broader problem is closer to solved than it is. Voice recognition has become quite good given appropriate hardware. Conversational systems that understand the context of the physical world, however, are not nearly so advanced.

All IoT Agenda network contributors are responsible for the content and accuracy of their posts. Opinions are of the writers and do not necessarily convey the thoughts of IoT Agenda.