A Crash Course in Voice Technology

April 5th
5 min read

Remember Her? The 2013 film where Joaquin Phoenix falls in love with his digital assistant, an empathetic talking computer voiced by Scarlett Johansson? When Her came out in theatres, it was generally perceived as your classic science fiction flick—but fast forward to 2019, and you can pretty much cross the ‘fiction’ out of that label. That’s because over the past few years, voice technology has developed so profoundly that we can now talk to computers in a way that seemed, well, impossible until recently. Indeed, some studies predict that by 2021, there’ll be more digital assistants than people on the planet—and that the better these assistants get at understanding us, the more emotionally attached we’ll grow to them in return.


Raising Your Voice

In a way, the appeal of voice technology is a no-brainer: asking a digital assistant to find you an affordable Thai restaurant nearby is inherently much less clunky than pulling out your phone or computer, navigating to a search engine, searching for a Thai restaurant and then comparing prices yourself. As one journalist so aptly put it, ‘using a keyboard and mouse to manipulate a computer after successfully using voice feels about the same as using a command-line interface on an old UNIX machine after using a graphical interface.’ Following that logic, it’s no surprise that appetites for digital assistants are already hearty around the globe. Take the US, where purchases of Amazon Echo—a speaker controlled by your voice and Alexa, Amazon’s digital assistant—have shattered all sales estimates (reportedly over 20 million units sold); or China, where Baidu’s DuerOS—a conversational, natural-language human-computer platform—is integrated into over 100 brands of home appliances, from refrigerators to TVs.

But apart from simply being convenient, voice technology could significantly alleviate the struggles of the 285 million visually impaired people around the world, as well as those with motor impairments: by providing an alternative to typing on a keyboard, it can benefit people with dyslexia, or anyone who finds typing difficult or impossible. Better yet, voice technology could help interpret the world in real time for its users: just imagine walking down the street and having a digital assistant describe the colour of the leaves, the level of sunlight or the shape of a building as you go along.

Alexa, why won’t you speak up?

Indeed, the potential of voice technology is exciting for both users and the pioneering companies behind it—but so far, the emphasis still lies on the word ‘potential’. Although the public seems increasingly eager to invite digital assistants into their lives—particularly in China and India, where 55 percent of people use them—these handy talking bots are still far from perfect.

Some of the flaws seem quite funny and harmless, like when Alexa emits evil cackles out of turn and unnerves her users, or when Siri tells terrible jokes that completely miss the mark on cultural context. But some are altogether more concerning and ethically dubious: a 2016 study found that smartphone assistants don’t know how to react when users report serious issues like physical threats to their safety or depression severe enough to pose a risk of harm. Another study discovered that most digital assistants are complacent in the face of sexual harassment and gender-based insults from users. (Considering that most digital assistants sound like women, this raises all sorts of thorny questions about the gendered stereotypes voice technology quietly exacerbates.) Fundamentally, these flaws come down to how voice technology works—to the process by which digital assistants learn and speak.

Learning the Language

First of all, digital assistants operate thanks to a dance between speech recognition and natural language processing. Speech recognition software analyses the sounds you make, filtering what you say and digitising your words into a format it can ‘read’, thereby recognising your words. Its accuracy has improved substantially: between 2013 and 2015 alone, Google’s word accuracy rate rose from 80 percent to an impressive 90 percent—and speech recognition error rates have since reached human parity at around 5 percent. But identifying words is one thing and understanding them is another—which is where natural language processing (NLP) comes in.
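As an aside, the ‘word accuracy’ and ‘error rate’ figures above come from a standard metric: word error rate (WER), the word-level edit distance between a reference transcript and the recogniser’s output, divided by the reference length. A minimal sketch in Python (the example phrases are invented for illustration):

```python
# Minimal word-error-rate (WER) computation: the Levenshtein distance over
# words between a reference transcript and a recogniser's hypothesis,
# divided by the reference length. A "90 percent word accuracy" figure
# corresponds roughly to a 10 percent WER.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances between prefixes.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

# One word out of five misrecognised:
print(word_error_rate("turn on the kitchen lights",
                      "turn on the chicken lights"))  # → 0.2
```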

NLP uses algorithms, together with the examples the software has previously been fed, to work out what you’re saying and extract the information you’re looking for. The more examples of speech and sentences the software sees, the better it gets at analysing vocal input in precise detail. But since NLP makes machines learn from examples they’ve seen before, the scope of context they’re exposed to is inherently limited. As a result, our digital assistants can talk to us, no doubt—but they aren’t that good at having a conversation.
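The limits of example-driven learning are easy to see in miniature. The sketch below is a deliberately toy, hypothetical intent matcher (no real assistant works this crudely): it scores an utterance against a handful of training examples by word overlap, so phrasings close to the training data match well, while unfamiliar ones can match the wrong intent for superficial reasons.

```python
# Toy illustration of example-driven language understanding (hypothetical,
# not any real assistant's implementation): match an utterance to the
# intent whose training examples share the most words with it.

TRAINING_EXAMPLES = {
    "weather": ["what is the temperature outside", "is it cold today"],
    "food": ["find me a thai restaurant nearby", "where can i eat cheaply"],
}

def match_intent(utterance: str) -> str:
    words = set(utterance.lower().split())
    best_intent, best_score = "unknown", 0
    for intent, examples in TRAINING_EXAMPLES.items():
        score = max(len(words & set(ex.split())) for ex in examples)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent

print(match_intent("what is the temperature outside"))  # → weather
print(match_intent("tell me a joke"))                   # → food (!)
```

The second call misfires because the only overlap is filler words (‘me’, ‘a’) shared with a food example—exactly the kind of shallow, out-of-context matching that makes assistants brittle outside the examples they were trained on.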

Let’s take Siri as an example. Say you were to ask Siri the temperature outside: your phone or smart speaker would first try to work out whether it could handle the command without additional information from Apple’s servers. Finding the current temperature is a relatively simple task that iOS can already do for you, so Siri would likely answer correctly straight away. However, if you were to phrase the question a bit differently—e.g. ‘Siri, is it warm enough outside for me to wear a skirt without tights?’—your inquiry would become more complex and more contextual. At that point, Siri would package up your request and send it to Apple’s servers, where an algorithm would match the words and tone of your request to the command it thinks you asked for. And if you used new slang, spoke a bit quickly or slurred your words, the algorithm could get even more confused—and Siri would come back with a rather odd answer.
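That local-versus-server routing can be sketched as follows. This is a hypothetical simplification—Apple doesn’t publish Siri’s internals, and the command phrases and `send_to_server` stand-in are invented—but it captures the flow described above: a small set of known requests is answered on-device, and anything unrecognised is packaged up for a remote service.

```python
# Hypothetical sketch of an assistant's request routing (not Siri's actual
# internals): handle simple, known commands locally; defer everything else
# to a remote natural-language service.

LOCAL_COMMANDS = {
    "current temperature": lambda: "It's 18 degrees outside.",
    "set a timer": lambda: "Timer set.",
}

def send_to_server(text: str) -> str:
    # Stand-in for the round trip to the provider's servers, where a
    # larger model would try to match the request to an intent.
    return f"(sent to server for interpretation: {text!r})"

def handle_request(utterance: str) -> str:
    text = utterance.lower()
    for phrase, action in LOCAL_COMMANDS.items():
        if phrase in text:
            return action()      # simple, known task: answer on-device
    return send_to_server(text)  # complex or contextual: defer upstream

print(handle_request("What's the current temperature?"))
print(handle_request("Is it warm enough to wear a skirt without tights?"))
```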

From Small Talk to Real Talk

As a result, digital assistants are helpful but rather inconsistent, which means that people today aren’t completely sold on integrating them into their daily lives: a study carried out by JWT Intelligence found that 29 percent of non-voice users say they simply don’t see the point, and 69 percent of people who already use voice wish they could speak to it as they do to a human. So it only makes sense that those who have invested the most in voice technology—namely Google, Amazon, Apple, Microsoft, Samsung and Baidu, whose digital assistant DuerOS has accumulated more conversational skills than any other interface—are hard at work to change that. For example, Amazon has held a competition in which the brightest computer science students on the planet were tasked with transforming Alexa from a digital assistant that can answer simple questions into a companion that can hold a conversation like a friend. (For the record, the winning team built a chatbot that kept users engaged for almost twenty minutes with pop culture references, jokes and rapid-fire responses.)

The Ultimate Logo is Invisible

And it’s no coincidence that the planet’s biggest tech giants are the ones moving voice technology forward. At a basic level, they have the financial and human resources to push research in this field further than most; but alongside that, these brands are actively competing to build the go-to voice platform that most people will use. The payoff is that voice users become increasingly loyal to the brand, and more consistently exposed to its services and products. For example, if you tell Alexa you need more soap rather than choosing a soap brand on a screen or in a shop, Amazon makes the decision for you. In other words: Amazon gets to take a much more intimate role in your shopping habits, and also gets to decide which of its soap suppliers to promote and which not to, presumably based on how much those suppliers are willing to pay.

But perhaps more importantly, these tech giants recognise that voice technology is the future of brands from an emotional standpoint. Almost half (43 percent) of regular voice technology users globally say they love their voice assistant so much that they wish it were a real person—and, bizarrely, almost 30 percent of global voice technology users admit to having had a sexual fantasy about their digital assistant. These statistics suggest that voice technology’s potential to integrate intimately into our lives is staggering—which could give brands unprecedented opportunities to become an emotional and vital part of people’s lives. In light of that, a few questions become important.

What should the voice sound like? What kind of personality should it have? What accent should it speak with? Should it be male or female? Will it speak with other voice assistants, and if so, why? These are some of the questions driving Baidu’s Deep Voice project, which has contributed to DuerOS’ success: it focuses on teaching machines to generate speech from text that sounds more human-like, and has already demonstrated that a single system can learn to reproduce thousands of speaker identities with less than half an hour of training data each.

Eager to Eavesdrop?

Of course, privacy concerns are rampant: 45 percent of potential voice users globally say they would be encouraged to use voice if there were guarantees around personal data security, and 50 percent of people worry about companies listening in on the conversations they have with their voice assistant. With that in mind, the big movers behind voice technology are obliged to be relatively transparent about their privacy policies: we know, for example, that all your interactions with Alexa are stored to your Amazon account, whereas Apple keeps your interactions with Siri almost entirely separate from your Apple ID. It remains to be seen whether those policies are enough to persuade the majority of the population to accept voice technology as the natural next step in how we use technology day to day.

To sum up, voice technology is a field full of potential to transform how we interact with technology—but also full of polarised opinions. Will voice technology help us interact more with each other by pulling us away from constantly staring at screens, as 53 percent of people believe—or will it cause us all to ‘soon be lying about like dying slugs’, as one particularly sceptical journalist puts it? Will independent companies be able to carve out space to build successful voice technology platforms, or will the tech giants that already dominate our lives leverage voice technology to become even more pervasive and all-encompassing? No one has the answers yet, but one thing is for sure: talking computers are no longer the stuff of science fiction. They’re here. They’re here to stay. And they’re about to affect our lives in more tangible ways than we could have ever imagined.