Table 1. Correlation of the Regression Model with User Ratings.

Algorithm    RMSE     Spearman    Pearson r
Random       2.211    0.052       0.017
HLSTM        1.392    0.232       0.235
GBDT         1.34     0.352       0.351
Domain coverage: Computed through an entropy analysis of conversations against the five Alexa Prize socialbot domains (Sports, Politics, Entertainment, Fashion, Technology). Performance was targeted at high entropy while minimizing the standard deviation of the entropy across domains: high entropy ensures that the socialbot is talking about a variety of topics, while a low standard deviation gives us confidence that the metric is applied equally across domains. (A sketch of all three metrics appears after these definitions.)
Topical diversity: Obtained from the size of the topical vocabulary for each socialbot. A larger topical vocabulary within each domain implies greater topical affinity.
Conversational depth: We used the topical model to identify the domain of each individual utterance. Conversational depth for a socialbot was calculated as the average number of consecutive turns on the same topical domain, where a single turn corresponds to a user utterance and the corresponding bot response within a conversation. Conversational depth evaluates the socialbot's ability to hold multiturn conversations on specific topics within the five domains.
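As a concrete illustration, here is a minimal sketch of how these three metrics could be computed from per-utterance domain labels. The data layout, whitespace tokenization, and function names are illustrative assumptions, not the actual Alexa Prize implementation.

# Minimal sketch of the three conversation-level metrics defined above.
# Per-turn domain labels and whitespace tokenization are assumptions.
import math
from collections import Counter
from itertools import groupby

DOMAINS = ["Sports", "Politics", "Entertainment", "Fashion", "Technology"]

def domain_coverage_entropy(turn_domains):
    # Entropy (bits) of the distribution of turns over the five domains;
    # higher entropy means the socialbot covers a broader mix of topics.
    counts = Counter(d for d in turn_domains if d in DOMAINS)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def topical_vocabulary_size(utterances):
    # Topical diversity: size of the vocabulary used by the socialbot.
    return len({tok.lower() for u in utterances for tok in u.split()})

def conversational_depth(turn_domains):
    # Average length of runs of consecutive turns on the same domain.
    runs = [len(list(group)) for _, group in groupby(turn_domains)]
    return sum(runs) / len(runs)

# Toy conversation: one domain label per (user utterance, bot response) turn.
turns = ["Sports", "Sports", "Sports", "Technology", "Technology", "Politics"]
print(domain_coverage_entropy(turns))  # ~1.46 bits over three domains
print(conversational_depth(turns))     # (3 + 2 + 1) / 3 = 2.0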
Selecting Alexa Prize Finalists
The Alexa Prize competition was structured to allow
users to participate in the selection of finalists. Two
finalists were selected purely on the basis of user ratings averaged over all the conversations with those
socialbots. At the end of each conversation, users were asked to rate how coherent and engaging the conversation was.
In addition, one finalist was selected by Amazon based on internal evaluation of the coherence and engagement of conversations by over one thousand Amazon employees who volunteered as Alexa Prize judges, on analysis of conversational metrics computed over the semifinals period, and on scientific review of the teams' technical papers by senior Alexa scientists. The quality of all the socialbots was also analyzed based on the metrics mentioned above. We observed that a majority of those metrics correlate well with overall user ratings, ratings from frequent users, and ratings from Alexa Prize judges, with correlation coefficients greater than 0.75. A simple combination of the metrics correlated strongly with Alexa user ratings (0.66), suggesting that the "wisdom of crowds" (Surowiecki 2004) is a reasonable approach to evaluating conversational agents when conducted at scale in a natural setting. The average rating across all socialbots was 20 percent lower for the judges' pool than for the general public.
Teams also evaluated the quality of their socialbots
and made necessary improvements during the competition by leveraging the ratings and feedback from
users. Alexa users had millions of interactions and
over 100,000 hours of conversations with socialbots
throughout the duration of the competition.
Automatic Evaluation of Conversational Agents
If we are able to build a model that can predict the
rating of an Alexa Prize conversation with reasonable
accuracy, then it is possible to remove humans from
the loop for evaluating non-task-oriented dialogues.
To automate the evaluation process, we performed a preliminary analysis of 60,000 conversations and their ratings and trained a model to predict user ratings. With a model trained using gradient-boosted decision trees (GBDT), we observed Spearman and Pearson correlations of 0.352 and 0.351, respectively (table 1), with significantly low p-values. Although the results for GBDT are significantly better than both random selection over five classes and a model trained using a hierarchical LSTM, this study needs to be extended to millions of Alexa Prize interactions. Furthermore, some of the evaluation metrics (coherence, topical depth, topical breadth, domain coverage) obtained at the conversation level can also be used as features. With a significantly larger number of conversations combined with topical features, we hypothesize that the model would perform much better than the results obtained in the preliminary analysis shown in table 1.
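As a rough illustration of this setup, the sketch below trains scikit-learn's GradientBoostingRegressor on synthetic placeholder features and scores it with RMSE and the SciPy correlation tests. The feature set and data here are assumptions, since the actual Alexa Prize features and conversations are not public.

# Minimal sketch of the rating-prediction setup: a GBDT regressor
# evaluated with RMSE, Spearman, and Pearson correlation, as in table 1.
# The features and ratings are synthetic placeholders.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# One row per conversation: hypothetical conversation-level features
# (e.g., number of turns, conversational depth, domain-coverage entropy)
# and a 1-5 user rating.
X = rng.normal(size=(60_000, 3))
y = np.clip(3.0 + X @ [0.5, 0.3, 0.2] + rng.normal(scale=1.0, size=60_000), 1.0, 5.0)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingRegressor(random_state=0)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

rmse = mean_squared_error(y_te, pred) ** 0.5
rho, rho_p = spearmanr(y_te, pred)
r, r_p = pearsonr(y_te, pred)
print(f"RMSE={rmse:.3f}  Spearman={rho:.3f} (p={rho_p:.2e})  Pearson={r:.3f} (p={r_p:.2e})")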
Given the subjectivity in ratings, we found interuser agreement to be quite low in our analysis of the ratings, as expected. Users may have their own