encoder-decoder (VHRED) (Serban et al. 2016) model, in addition to other neural network models such
as skip-thought (Kiros et al. 2015) to produce hybrid
retrieval-generative candidate responses. Some
teams, such as Pixie (Adewale et al. 2017), used a
two-level, long short-term memory (LSTM) model
(Hochreiter and Schmidhuber 1997) for retrieval.
Eigen (Guss et al. 2017) and RubyStar, on the other hand, used dynamic memory networks
(Sukhbaatar et al. 2015) and character-level recurrent neural networks (RNN) (Sutskever, Vinyals, and
Le 2014) for generating responses. Alquist used a
sequence-to-sequence model (Sutskever, Vinyals,
and Le 2014) specifically for their chit-chat module.
While these teams deployed the generative models
in production, other teams also experimented with
generative and hybrid approaches offline.
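Although implementations differed across teams, most of these generative modules share the same encoder-decoder shape. The sketch below, written in PyTorch purely for illustration (the class name, layer sizes, and training setup are assumptions of this sketch, not any team's actual code), shows a minimal sequence-to-sequence chit-chat generator of the kind cited above (Sutskever, Vinyals, and Le 2014).

```python
import torch
import torch.nn as nn

class Seq2SeqChitChat(nn.Module):
    """Minimal encoder-decoder sketch for generating chit-chat responses.
    All names and sizes are illustrative placeholders."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, utterance_ids, response_ids):
        # Encode the user utterance; its final LSTM state seeds the decoder.
        _, state = self.encoder(self.embed(utterance_ids))
        # Teacher-forced decoding over the shifted reference response.
        dec_out, _ = self.decoder(self.embed(response_ids), state)
        return self.out(dec_out)  # (batch, response_len, vocab_size) logits
```

Trained with cross-entropy on utterance-response pairs, such a model decodes a response token by token (for example, with beam search) at serving time; hybrid systems feed both retrieved and generated candidates of this kind into a downstream ranker.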
Ranking and Selection Techniques
Open-domain social conversations do not always
have a specific goal or target, and the response space
can be unbounded. There may be multiple valid
responses for a given utterance. As such, identifying
the response that will lead to the highest customer
satisfaction and help drive the conversation forward
is a challenging problem. Socialbots need mechanisms to rank possible responses and select the
response that is most likely to achieve the goal of a
coherent and engaging conversation in that particular dialogue context. Alexa Prize teams attempted to
solve this problem with either rule-based or model-based strategies.
For teams that experimented with rule-based
rankers, a ranker module chose a response from the
candidate responses obtained from submodules
(such as topical modules or intent modules) based on
handcrafted logic. For model-based strategies, teams utilized
either a supervised or reinforcement learning
approach, trained on user ratings (Alexa Prize data)
or on predefined large-scale dialogue datasets such as
Yahoo! Answers, Reddit comments, Washington Post
Live comments, and OpenSubtitles. The ranker was
trained to provide higher scores to correct responses
(for example, follow-up comments on Reddit are
considered correct responses) than to incorrect or incoherent responses obtained by negative sampling.
Alan (Papaioannou et al. 2017), for example, trained
a ranker module on Alexa Prize ratings data and combined that with a separate ranker function that used
hand-engineered features. Teams using a reinforcement learning approach developed frameworks
where the agent was a ranker, the actions were the
candidate responses obtained from submodules, and
the agent was trying to maximize the trade-off
between selecting a response to satisfy the customer
immediately and selecting one that takes into
account some long-term reward. MILABot, for example, used this approach and trained a reinforcement
learning ranker function on conversation ratings.
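To make the model-based approach concrete, the sketch below shows one common way to train such a ranker with negative sampling: a scorer over (dialogue context, candidate response) pairs is pushed to rank the observed follow-up above a randomly sampled response. It is written in PyTorch purely for illustration; the encoders, feature sizes, and margin loss are assumptions of this sketch, not the architecture used by any particular team.

```python
import torch
import torch.nn as nn

class ResponseRanker(nn.Module):
    """Illustrative model-based ranker over pre-encoded utterances.
    context_vec and candidate_vec are fixed-size sentence encodings;
    how they are produced is left open, and sizes are placeholders."""
    def __init__(self, encode_dim=512, hidden_dim=256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * encode_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, context_vec, candidate_vec):
        # Score how well the candidate follows the dialogue context.
        return self.scorer(torch.cat([context_vec, candidate_vec], dim=-1))

def margin_ranking_loss(ranker, context, positive, negative, margin=1.0):
    """The observed response (e.g., the actual Reddit follow-up) should
    outscore a randomly sampled negative by at least the margin."""
    pos_score = ranker(context, positive)
    neg_score = ranker(context, negative)
    return torch.clamp(margin - pos_score + neg_score, min=0.0).mean()
```

At serving time, the same scorer is applied to every candidate produced by the submodules and the highest-scoring response is selected; a reinforcement learning variant instead treats that selection as an action and optimizes a reward signal (for example, predicted conversation ratings) rather than a per-turn margin loss.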
The aforementioned components form the core of
socialbot dialogue systems. In addition, we developed the following components to support the competition.
Conversational Topic Tracker
To understand the intent of a user, it is important to
identify the topic of the given utterance and corresponding keywords. Alexa Prize data is highly topical
because of the nature of the social conversations.
Alexa users interacted with socialbots on hundreds of
thousands of topics in various domains such as
sports, politics, entertainment, fashion, and technology. This is a unique dataset collected from millions
of human–conversational agent interactions. We
identified the need for a conversational topic tracker
for various purposes such as conversation evaluation
(for example, coherence, depth, breadth, diversity),
sentiment analysis, entity extraction, profanity
detection, and response generation.
To detect conversation topics in an utterance, we
adopted deep average networks (DAN) and trained a
topic classifier on interaction data categorized into
multiple topics. We proposed a novel extension by
adding topic-word attention to formulate an attention-based DAN (ADAN) (Guo et al. 2017) that allows
the system to jointly capture topic keywords in an
utterance and perform topic classification. We fine-tuned the model on the data collected during the
course of the competition. The model achieved an accuracy of 82.4 percent on 26 topical classes (sports, politics, movies_TV, and so on). Furthermore, the topic model was also able to extract
keywords corresponding to each topic. We used the
conversational topic tracker to evaluate the socialbots on various metrics such as conversational
breadth, conversational depth, and topical and
domain coverage (Venkatesh et al. 2017; Guo et al.
2017). We will explore additional details on evaluation later in this article.
Inappropriate and Sensitive
Content Detection
One of the most challenging aspects of delivering a
positive experience to end users in socialbot interactions is obtaining high-quality conversational data.
The datasets most commonly used to train dialogue
models are sourced from internet forums (for example, Reddit, Twitter) or movie subtitle databases (for
example, OpenSubtitles, Cornell Movie-Dialogs Corpus). These sources are all conversational in structure
in that they can be transformed into utterance-response pairs. However, the tone and content of
these datasets are often inappropriate for interactions
between users and conversational agents, particularly when individual utterance-response pairs are taken out of context. In order to effectively use dialogue
models based on these or other dynamic data
sources, an efficient mechanism to identify and filter