free-form and are not confined to a predefined set of
domains, intents, or entities. The intent or domain
may be unclear and the slots may not be well
defined. For example, consider the following utterance: “Last night, I went to Justin Bieber’s show. He
was great, but the crowd was not. Do you think I
should go next time?” The goal of the utterance is
not well defined. The utterance consists of multiple
intents: information delivery, opinion sharing, and
opinion request. Furthermore, there are multiple
slots in the same utterance. Traditional NLU systems
do not work well for natural conversations. The following NLU components were developed during the
Alexa Prize competition to address these challenges.
To connect an Alexa user with a socialbot, we first
needed to identify whether the user’s intent was to
have a conversation with Alexa. We introduced a
“conversation intent” within the Alexa NLU model
to recognize a range of utterances such as “let’s chat,”
“let’s talk,” “let’s chat about <topic>,” and so forth,
using a combination of grammars and statistical
models. We further expanded the experience to other natural forms of conversational initiators such as
“what are we going to talk about,” “can we discuss
politics,” “do you want to have a conversation about
the Mars Mission,” and so on.
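A grammar-based matcher of this kind can be sketched in a few lines. The patterns below are illustrative stand-ins, not the production grammars, and the statistical half of the recognizer is omitted:

```python
import re

# Hypothetical patterns approximating the grammar-based portion of the
# "conversation intent" recognizer described in the text. The production
# system combined such grammars with statistical models.
CONVERSATION_PATTERNS = [
    r"^let'?s (chat|talk)( about (?P<topic>.+))?$",
    r"^what are we going to talk about$",
    r"^can we discuss (?P<topic2>.+)$",
    r"^do you want to have a conversation( about (?P<topic3>.+))?$",
]

def detect_conversation_intent(utterance: str):
    """Return (is_conversation, topic-or-None) for an utterance."""
    text = utterance.lower().strip().rstrip("?.!")
    for pattern in CONVERSATION_PATTERNS:
        match = re.match(pattern, text)
        if match:
            # Pull out whichever optional topic group matched, if any.
            topic = next((v for v in match.groupdict().values() if v), None)
            return True, topic
    return False, None
```

For example, "Can we discuss politics?" would be recognized as a conversation intent with the topic "politics", while an ordinary request such as a weather query would fall through to the standard Alexa NLU.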
In the production system, if an utterance from an
Alexa user is identified as a conversation intent, then
one of the Alexa Prize socialbots is invoked and the
user interacts with that socialbot until the user says
stop. Following the detection of conversational
intent, the entire conversation is controlled by the
socialbots. Teams used a combination of Alexa Skills
Kit NLU along with their own NLU approaches, as
will be described.
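The hand-off just described can be summarized as a small control loop. The function and callback names here are illustrative, not the actual Alexa production interfaces:

```python
# Minimal sketch of the dispatch flow: once an utterance is recognized
# as a conversation intent, a socialbot takes over the dialogue until
# the user says "stop". Names are illustrative placeholders.
def run_session(utterances, is_conversation_intent, socialbot_respond):
    responses = []
    in_socialbot = False
    for utterance in utterances:
        if not in_socialbot:
            if is_conversation_intent(utterance):
                in_socialbot = True
                responses.append(socialbot_respond(utterance))
            else:
                responses.append("[handled by standard Alexa NLU]")
        elif utterance.strip().lower() == "stop":
            in_socialbot = False
            responses.append("[socialbot session ended]")
        else:
            # All turns between invocation and "stop" go to the socialbot.
            responses.append(socialbot_respond(utterance))
    return responses
```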
NLU Techniques Adopted
by the Participating Teams
After a user has initiated a conversation, the socialbot
requires an NLU system to identify semantic and syntactic elements from the user utterance, including
user intent (such as opinion, chit-chat, knowledge,
and others), entities and topics (for example, the
entity “Mars Mission” and the topic “space travel”
from the utterance “what do you think about the
Mars Mission”), user sentiment, as well as sentence
structure and parse information. A certain level of
understanding is needed to generate responses that
align well with user intent and expectation and to
maximize user satisfaction. Although conversational
utterances do not generally follow intent-slot structure, teams developed several workarounds to address this problem.
NLU is difficult because of the inherent complexities of human language, such as anaphora,
elision, ambiguity, and uncertainty, which require
contextual inference in order to extract the necessary
information to formulate a coherent response. These
problems are magnified in conversational AI since it
is an open domain problem where a conversation
can be on any topic or entity and the content of the
dialogue can also change rapidly. Some specific techniques used by teams are described in the following paragraphs.
Named Entity Recognition (NER)
Identifying and extracting entities (names, organizations, locations) from user utterances. Teams used
various libraries such as StanfordCoreNLP (Manning
et al. 2014), spaCy, 1 and Alexa’s ASK NLU to perform
this task. NER is helpful for retrieving relevant information for response generation, as well as for tracking conversational context over multiple turns.
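The teams relied on off-the-shelf NER from the libraries above; as a self-contained illustration of the task itself, the following sketch uses a small hand-built gazetteer in place of those systems. The entries and type labels are invented for the example:

```python
# Toy gazetteer-based NER, standing in for StanfordCoreNLP, spaCy, or
# ASK NLU. Surface forms and types here are illustrative only.
GAZETTEER = {
    "justin bieber": "PERSON",
    "mars mission": "EVENT",
    "seattle seahawks": "ORGANIZATION",
}

def extract_entities(utterance: str):
    """Return (surface form, entity type) pairs found in the utterance."""
    text = utterance.lower()
    found = []
    for surface, entity_type in GAZETTEER.items():
        if surface in text:
            found.append((surface, entity_type))
    return found
```

A real recognizer generalizes beyond a fixed list, but the output shape is the same: typed entity spans that downstream modules can use for retrieval and context tracking.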
Intent Detection
Intents represent the goal of a user for a given utterance, and the dialogue system needs to detect it to
act and respond appropriately to that utterance.
Some of the teams built rules for intent detection or
trained models in a supervised fashion by collecting
the data from Amazon Mechanical Turk or by using
open source datasets, such as Reddit comments, with
a set of intent classes. Others utilized Alexa’s ASK
NLU engine for intent detection.
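The rule-based variant some teams built can be illustrated with a keyword-cue classifier. The intent labels and cue phrases below are examples, not any team's actual inventory:

```python
# Illustrative rule-based intent detector. Each intent is triggered by
# a set of cue phrases; rules are checked in order, with a fall-back.
INTENT_RULES = [
    ("opinion_request", ("what do you think", "do you like", "your opinion")),
    ("knowledge", ("who is", "what is", "tell me about")),
    ("opinion", ("i think", "i like", "i love", "i hate")),
]

def detect_intent(utterance: str) -> str:
    text = utterance.lower()
    for intent, cues in INTENT_RULES:
        if any(cue in text for cue in cues):
            return intent
    return "chit_chat"  # fall-back intent for unmatched utterances
```

Teams that trained supervised models instead replaced this lookup with a classifier learned from labeled data (Mechanical Turk annotations or open source corpora such as Reddit comments).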
Anaphora and Coreference Resolution
Finding expressions that refer to the same entity in
past or current utterances. Anaphora resolution is
important for downstream tasks such as question
answering and information extraction in multiturn
dialogue systems. Most of the teams used StanfordCoreNLP’s Coreference Resolution System (Manning
et al. 2014) to perform this task.
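Most teams used StanfordCoreNLP for this; as a deliberately simplified stand-in, the sketch below resolves third-person pronouns to the most recently mentioned entity across turns. Real coreference systems handle far more phenomena (number, gender, nominal mentions):

```python
import re

# Simplified pronoun resolution: substitute he/she/it/they with the
# last entity mentioned in an earlier turn. A stand-in for a full
# coreference system such as StanfordCoreNLP's.
def resolve_pronouns(turns, entities):
    """`entities` maps a turn index to the entity mentioned in that turn."""
    resolved = []
    last_entity = None
    for i, turn in enumerate(turns):
        if last_entity:
            turn = re.sub(r"\b(he|she|it|they)\b", last_entity, turn,
                          flags=re.IGNORECASE)
        if i in entities:
            last_entity = entities[i]
        resolved.append(turn)
    return resolved
```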
Sentence Completion
Some teams expanded user utterances with contextual information. For example, “Yes” can be transformed to “Yes, I like Michael Jackson” when uttered
in the context of a question about the singer, or “I
like Michael Jackson” can be extended to “I like
Michael Jackson, singer and musician” in a conversation where this entity needs disambiguation.
Teams wrote customized wrappers for performing
sentence completion, which also involves querying
knowledge bases to obtain more information about
the entities, as described in the entity-linking section.
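The "Yes" expansion described above can be sketched with a single rewrite rule over the preceding system question. This is a simplified illustration of the teams' custom wrappers, not their actual implementation:

```python
# Context-based sentence completion: rewrite a bare affirmation using
# the preceding question, e.g. "Do you like Michael Jackson?" + "Yes"
# -> "Yes, I like Michael Jackson". One illustrative rule only.
def complete_sentence(user_utterance: str, previous_question: str) -> str:
    text = user_utterance.strip().lower().rstrip(".!")
    if text in ("yes", "yeah", "yep"):
        question = previous_question.strip().rstrip("?")
        if question.lower().startswith("do you "):
            # Flip the second-person question into a first-person statement.
            return "Yes, I " + question[len("do you "):]
    return user_utterance
```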
Topic and Domain Detection
Classifying the topic (for example, Seattle Seahawks)
or domain (such as sports) from a user utterance.
Teams used various datasets to train topic detection
models, including news datasets, Twitter, and Reddit
comments. Some teams also collected data from
Amazon Mechanical Turk to train these models.
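A trivial keyword-overlap version of such a classifier is sketched below; in practice the teams trained statistical models on the datasets above. The topic labels and keyword sets are invented for the example:

```python
# Toy keyword-based topic detector: pick the topic whose keyword set
# overlaps most with the utterance. Keywords here are illustrative.
TOPIC_KEYWORDS = {
    "sports": {"seahawks", "game", "score", "team"},
    "space travel": {"mars", "nasa", "rocket", "mission"},
    "music": {"singer", "song", "album", "concert"},
}

def detect_topic(utterance: str) -> str:
    words = set(utterance.lower().replace("?", "").split())
    best_topic, best_overlap = "general", 0
    for topic, keywords in TOPIC_KEYWORDS.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best_topic, best_overlap = topic, overlap
    return best_topic
```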
Entity Linking
Identifying information about an entity. Teams generally used publicly available knowledge bases such
as Evi, 2 FreeBase (Bollacker et al. 2008), and Wikidata. 3 Some teams also used these knowledge bases to
identify related entities.
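At its core, the lookup amounts to mapping a surface mention to a knowledge base record and reading off attributes and related entities. The in-memory "KB" below is a hypothetical stand-in for stores such as Evi, Freebase, or Wikidata, and the record fields are invented:

```python
# Minimal entity-linking sketch: resolve a mention against an in-memory
# knowledge base. Identifiers and fields are hypothetical placeholders.
KNOWLEDGE_BASE = {
    "mars mission": {
        "id": "Q-mars-mission",  # hypothetical identifier
        "description": "crewed exploration of Mars",
        "related": ["mars", "nasa"],
    },
}

def link_entity(mention: str):
    """Return the KB record for a mention, or None if unknown."""
    return KNOWLEDGE_BASE.get(mention.lower().strip())
```

A production linker must additionally disambiguate among candidate records (for example, choosing between a singer and an athlete with the same name), typically using the surrounding dialogue context.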