different types of inappropriate and offensive speech.
We identified several (potentially overlapping) classes of inappropriate responses: (1) profanity, (2) sexual responses, (3) racially offensive responses, (4) hate speech, (5) insulting responses, and (6) violent responses (inducements to violent acts or threatening responses). We explored keyword- and pattern-matching strategies, but these strategies are subject
to poor precision (with a broad list) or poor recall
(with a carefully curated list), as inappropriate
responses may not necessarily contain profane or
other blacklisted words. We tested a variety of support vector machines and Bayesian classifiers trained on n-gram features using labeled ground truth data. The best results were for profanity (greater than 97 percent precision at 90 percent recall), racially offensive responses (96 percent at 70 percent recall), and insulting responses (93 percent at 40 percent recall). More
research is needed to develop effective offensive
speech filters. In addition to dataset cleansing, an
offensive speech classifier is also needed for online
filtering of candidate socialbot responses prior to outputting them to ASK for text-to-speech conversion.
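As a concrete illustration of this pipeline, the sketch below trains a Bayesian classifier on word n-gram features and reads off precision at a target recall, mirroring the operating points quoted above. It assumes scikit-learn and a toy hand-labeled dataset; the competition's actual training data, feature choices, and model configurations are not reproduced here.

```python
# Minimal sketch of an n-gram offensive-speech classifier, assuming
# scikit-learn. The texts and labels are toy stand-ins for the labeled
# ground truth data described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_recall_curve

texts = [
    "you are a worthless idiot",        # insulting
    "i am going to hurt you",           # violent
    "tell me about the weather today",  # acceptable
    "what movies are playing tonight",  # acceptable
    "shut up, nobody likes you",        # insulting
    "who won the game last night",      # acceptable
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = inappropriate, 0 = acceptable

# Word unigram and bigram features.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

clf = MultinomialNB().fit(X, labels)

# Pick an operating point: the score threshold achieving a target
# recall, then report the precision at that recall (cf. the "greater
# than 97 percent precision at 90 percent recall" figure for profanity).
# A real evaluation would use held-out data, not the training set.
scores = clf.predict_proba(X)[:, 1]
precision, recall, _ = precision_recall_curve(labels, scores)
target_recall = 0.9
best_p, best_r = max(
    (p, r) for p, r in zip(precision, recall) if r >= target_recall
)
print(f"precision {best_p:.2f} at recall {best_r:.2f}")
```

The same thresholded classifier can then serve double duty: offline for dataset cleansing and online for filtering candidate responses before they reach ASK.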
Addressing Problems in Evaluating Conversational Agents
Social conversations are inherently open ended. For
example, if a user asks the question “What do you
think of Barack Obama?,” there can be thousands of
distinct, valid, and reasonable responses. That is, the
response space is unbounded for open-domain con-
versations. This makes training and evaluating social,
non-task-oriented, conversational agents extremely
challenging. A task-oriented dialogue system is easier to evaluate because success can be measured by task completion, which is not possible for open-ended systems. As with human-to-human
dialogues, an interlocutor’s satisfaction with a social-
bot could be related to how engaging, coherent, and
enjoyable the conversation was. The subjectivity
associated with evaluating conversations is a key ele-
ment underlying the challenge of building non-goal-
oriented dialogue systems.
This problem has been studied heavily, but no widely agreed-upon metric has emerged. A well-designed evalua-
tion metric for conversational agents that addresses
the above concerns would be useful to researchers in
this field. There is significant previous work on eval-
uating goal-oriented dialogue systems; two notable earlier examples are the TRAINS system and PARADISE (Walker et al. 1997). Both involve subjective measures that require a human in
the loop. Due to the expensive nature of human-
based evaluation procedures, researchers have been
using automatic machine translation (MT) metrics,
such as BLEU, or text summarization metrics, such as
ROUGE, to evaluate systems. But as shown by Liu et al. (2017), these metrics do not correlate well with human judgments of response quality.
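To make the mismatch concrete, the sketch below (assuming NLTK's BLEU implementation and two made-up utterances) shows how a perfectly reasonable answer to the Barack Obama question can score near zero simply because it shares almost no n-grams with the single reference response.

```python
# Illustrative only: a valid open-domain response scores near zero
# under BLEU when it happens not to overlap with the one reference.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "i think he was a great president".split()
candidate = "his healthcare policy changed many lives".split()  # also valid

smooth = SmoothingFunction().method1  # avoid zero scores on no overlap
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {score:.4f}")  # near zero despite being a reasonable answer
```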
The Turing Test (Turing 1950) is a well-known test
that can potentially be used for dialogue evaluation.
However, we do not believe that the Turing Test is a
suitable mechanism to evaluate socialbots for the following reasons:
Incomparable elements: Given the amount of knowledge an AI has at its disposal, it is not reasonable to expect a human and an AI to generate similar responses. A conversational agent may interact differently from a human but may still be a good conversationalist.
Incentive to produce plausible but low-information content
responses: If the primary metric is just generation of
plausible human-readable responses, it is easy to opt
out of the more challenging areas of response generation and dialogue management. It is important to be
able to source interesting and relevant content while
generating plausible responses.
Misaligned objectives: The goal of the judge should be
to evaluate the conversational experience, not to
attempt to get the AI to reveal itself.
To address these issues, we propose a comprehensive, multimetric evaluation strategy designed to reduce subjectivity by incorporating metrics that correlate well with human judgment. The proposed metrics provide a granular analysis of the conversational agents that is not captured in human ratings. We show that these metrics can serve as a reasonable proxy for human judgment, and we provide a mechanism to unify them for selecting the top-performing agents (sketched after the metric definitions below); this mechanism has also been applied throughout the Alexa Prize competition. The following objective metrics (Guo et al. 2017; Venkatesh et al. 2017) have been used for evaluating conversational agents. They also align with the goals of a socialbot, that is, the ability to converse coherently and engagingly about popular topics and news events.
Conversational user experience (CUX): Different users have different expectations of the socialbots, so their experiences may vary widely given the subjectivity involved in open-domain dialogue systems. To address this, we used average ratings from frequent users as the CUX metric. Through multiple interactions, frequent users have established expectations and evaluate a socialbot in comparison to others.
Coherence: We annotated hundreds of thousands of
randomly selected interactions for incorrect, irrelevant, or inappropriate responses. With the annotations, we calculated the response error rate (RER) for
each socialbot, using that figure to measure coherence.
Engagement: We evaluated engagement through the performance of conversations identified as aligned with socialbot goals, measured using duration, number of turns, and ratings obtained from engagement evaluators (a set of Alexa users who were asked to evaluate socialbots based on engagement).
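A minimal sketch of how such metrics could be unified into a single ranking score appears below. The min-max normalization, the equal weights, and all per-socialbot numbers are illustrative assumptions rather than the competition's published formula; see Venkatesh et al. (2017) for the actual methodology.

```python
# Hedged sketch: combine CUX, coherence (1 - RER), and engagement into
# one score per socialbot. Weights and raw numbers are made up.

def response_error_rate(n_erroneous: int, n_total: int) -> float:
    """Share of annotated responses that were incorrect, irrelevant,
    or inappropriate. Lower is better, so coherence = 1 - RER."""
    return n_erroneous / n_total

def minmax(values):
    """Scale a list of metric values to [0, 1] for comparability."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.5 for v in values]

# Per-bot raw metrics: (mean CUX rating on a 1-5 scale, erroneous
# responses, annotated responses, mean conversation duration in s).
bots = {
    "bot_a": (3.8, 120, 1000, 420),
    "bot_b": (3.2, 250, 1000, 610),
    "bot_c": (4.1, 80, 1000, 380),
}

cux = minmax([m[0] for m in bots.values()])
coherence = minmax(
    [1.0 - response_error_rate(m[1], m[2]) for m in bots.values()]
)
engagement = minmax([m[3] for m in bots.values()])

weights = (1 / 3, 1 / 3, 1 / 3)  # equal weights, purely for illustration
scores = {
    name: weights[0] * c + weights[1] * h + weights[2] * e
    for name, c, h, e in zip(bots, cux, coherence, engagement)
}

for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.3f}")
```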