At issue: The speech-driven reality presented in the
movie Her offers a realistic view — or not — of the
foreseeable future (such as 2025).
For: Lora Aroyo
I bet that the speech-driven reality presented in the
movie Her is a realistic view of the foreseeable future.
Here are some arguments in support of this position.
We already have mature hardware technology that
can provide a reasonable user experience with a
speech-controlled UI in the form of wireless devices
with an integrated assistant (for example, Google
Assistant, Siri, Alexa). Also, the relevant software
technology evolves quickly in terms of accuracy and
coverage of speech processing (for example, background noise and strong accents no longer present
substantial obstacles to speech understanding).
Speech is a convenient interaction medium
because it doesn’t require the user to hold or tap anything, and so speech interfaces allow for parallelization with other activities such as cooking, driving,
washing dishes, or painting. So far, convenience has
been a major factor in the adoption of initially imperfect systems, and this early use inevitably drives further improvement. As the number of early adopters
grows, we more quickly reach the tipping point for
large-scale adoption. In the interim, those initial
users provide data demonstrating both that this is a continuously growing market and that there is additional value in investing in the technology for this market, in this case various types of interactions with virtual assistants.
Wide-scale usage is also going to drive innovation
in ML research in terms of finding ways to optimize
and generalize the process of creating new interactions, rather than implementing all of them from
scratch. Additionally, as more people start using
speech interfaces, there will be an increased incentive
for researchers to advance new areas of research, confronting barriers in terms of both hardware and software (for example, blocking sound from other people
talking in the same space).
Google’s search engine broke through the dense
search market of the time with its simplicity, that is,
by being focused on a single utility: search. It was
easy, unambiguous, and convenient for people to use.
This simplicity encouraged steady growth in the number of users, which also meant a steady increase in the amount of usage data, and all that data, in turn, eventually made the results much, much better.
We have all the paving stones now for the road to
simple and convenient speech-driven interaction.
And, yes, it is already quite noisy and annoying when
people talk on their phones in public spaces. But
there is no stopping it, so we may as well adapt. We
talk on our phones or into our headsets, while walk-
ing and while traveling, on the sidewalks and in the
streets, in buses, in trains, and in other public places.
We do it because it’s convenient: you don’t have to
take anything out of your bag, you don’t have to hold
anything in your hands — you just talk. As for adapt-
ing, many people are already walking on the street or
working at their desk with noise cancellation head-
sets. This current behavior makes it even easier to
adopt speech interaction with assistants in public spaces.
There are some counterexamples. For example, it
would be annoying if everybody at home talked to
their devices and didn’t communicate with one
another. Just as with texting and social media we gained the ability to communicate instantly, at any time, but also developed new social norms for when and how to use that ability (or at least some of us have), I believe that, in the same way, we will develop new social norms for where and how to use increasingly pervasive speech interfaces.
Against: Chris Welty
In the movie Her, there is a scene in which the main
character is coming home from work and walking
across a large outdoor plaza. He is talking to his AI
assistant (and lover) through some kind of wireless
headset device in his ear. Across the plaza, there are
many people who also appear to be leaving work and
who are also talking on their headsets, presumably to
their computers. My reaction to this scene was that it
depicted an improbable future. Initially, my primary
reason was simply that I thought the social pressure
would prevent this scenario from becoming widespread, just as the social pressure not to talk to someone on your mobile phone in public prevents most of
us from doing it and leaves us feeling annoyed when others do.
I expressed my skepticism to Lora Aroyo, who, on
the contrary, found this to be a nearly certain future.
We started an informal bet at that time, for and
against the prediction that assistant voice interfaces
would soon become mainstream — with “against”
meaning that most people would continue instead to
use keyboards, mice, and touchscreens to interact
with machines. At the instigation of the AI Bookies
column, we went through the process of formalizing the bet.
As with many bets and predictions, the need to avoid an open-ended condition drove us to specify a time limit — the year 2025.
The first obstacle we encountered was turning "mainstream" into something measurable. What objective criteria would we use? As I began to think about my motivations and to frame the problem in more concrete, provable/disprovable terms, I found more rigorous and serious reasons why I believe speech will never become a mainstream interface.
While speech is a convenient medium for humans
to use when interacting with each other, we use it