Figure 1. The Dialog State Tracking Problem.
The left column shows the actual dialog system output and user input. The second column shows two SLU n-best hypotheses and their
scores. The third column shows the label (correct output) for the user’s goal. The fourth column shows example tracker output, and the
fifth column indicates correctness.
[Figure 1 body omitted: example dialog turns (e.g., S: "Which part of town?" / U: "The north uh area"; later, U: "Do you have any others like that, maybe in the south part of town?"), SLU n-best hypotheses with scores (e.g., 0.2 inform(food=north_african), 0.7 reqalts(area=south)), goal labels (e.g., area=north, area=south), example tracker output, and correctness marks.]
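To make the tracking task concrete, the following sketch shows one very simple tracker operating on SLU n-best lists like those in figure 1: it accumulates SLU confidence for each slot-value pair across turns and reports the top-scoring value for each slot. The (score, slot, value) input format, the function name, and the numbers are illustrative assumptions, not the challenge's actual data format or baseline.

    from collections import defaultdict

    def track(turns):
        """Toy tracker: for each slot, report the value with the highest
        confidence accumulated over all SLU hypotheses seen so far.

        turns: one SLU n-best list per turn; each hypothesis is a
        (score, slot, value) triple, e.g. (0.7, "area", "south").
        """
        mass = defaultdict(float)    # (slot, value) -> accumulated confidence
        states = []                  # tracker output after each turn
        for nbest in turns:
            for score, slot, value in nbest:
                mass[(slot, value)] += score
            best = {}                # slot -> (confidence, value)
            for (slot, value), m in mass.items():
                if slot not in best or m > best[slot][0]:
                    best[slot] = (m, value)
            states.append({slot: v for slot, (_, v) in best.items()})
        return states

    # Two turns loosely mirroring figure 1; the 0.2 food hypothesis is a
    # misrecognition of "the north uh area".
    turns = [
        [(0.6, "area", "north"), (0.2, "food", "north_african")],
        [(0.7, "area", "south")],
    ]
    print(track(turns))
    # [{'area': 'north', 'food': 'north_african'},
    #  {'area': 'south', 'food': 'north_african'}]

Note that this naive accumulator happily retains the misrecognized food=north_african hypothesis; suppressing exactly this kind of SLU error is what a good tracker is meant to do.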
The dialog state tracking challenge studies this problem as a corpus-based task. When the challenge
starts, labeled human-computer dialogs are released
to teams, with scripts for running a baseline system
and evaluation. Several months later, a test set of
unlabeled dialogs is released. Participants run their
trackers, and a week later they return tracker output
to the organizers for scoring. After scoring, results
and test set labels are made public.
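As a rough illustration of what the scoring step involves, turn-level goal accuracy can be computed by comparing tracker output against the released labels, as in the sketch below. The dict-per-turn format and the single accuracy number are simplifying assumptions; the real challenge scripts report a much larger set of metrics.

    def goal_accuracy(tracker_output, labels):
        """Fraction of turns where the tracked goal exactly matches the label.

        tracker_output and labels are parallel lists of slot -> value dicts,
        one per turn. A deliberately simplified stand-in for the challenge's
        scoring scripts.
        """
        assert len(tracker_output) == len(labels)
        hits = sum(hyp == ref for hyp, ref in zip(tracker_output, labels))
        return hits / len(labels)

    labels = [{"area": "north"}, {"area": "south"}]
    hyps   = [{"area": "north"}, {"area": "north"}]   # misses the goal change
    print(goal_accuracy(hyps, labels))                # 0.5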
The corpus-based design was chosen because it
allows different trackers to be evaluated on the same
data, and because a corpus-based task has a much
lower barrier to entry for research groups than building an end-to-end dialog system. However, when a tracker is deployed, it will inevitably alter the performance of the dialog system it is part of, relative to any previously collected dialogs. In order to simulate
this mismatch at training time and at run time, and
to penalize overfitting to known conditions, dialogs
in the test set are conducted using a different dialog
manager, not found in the training data.
The first DSTC used 15,000 dialogs between real
Pittsburgh bus passengers and a variety of dialog systems, provided by the Dialog Research Center at
Carnegie Mellon University (Black et al. 2010). The
second and third DSTCs used a total of 5,510 dialogs, provided by the Cambridge University Dialogue Systems Group (Jurcicek, Thomson, and Young 2011), in which paid Amazon Mechanical Turk workers were asked to call a tourist information dialog system and find restaurants that matched particular constraints.
Each DSTC added new dimensions of study. In the
first DSTC, the user’s goal was almost always fixed
throughout the dialog. In the second DSTC, the
user’s goal changed in about 40 percent of dialogs.
And the third DSTC further tested the ability of trackers to generalize to new domains by including entity
types in the test data that were not included in the
training data — for example, the training data
included only restaurants, but the test data also
included bars and coffee shops.
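Changing goals are precisely what breaks a purely accumulating tracker like the sketch following figure 1: evidence for an old value such as area=north can continue to outweigh a genuine later switch to area=south. One simple way to illustrate a remedy is to decay past confidence each turn so that recent SLU evidence can win out; the variant below does this, with a decay rate that is an arbitrary assumption rather than a value from any DSTC entry.

    from collections import defaultdict

    def track_with_decay(turns, decay=0.5):
        """Variant of the accumulating tracker above: stored confidence is
        multiplied by `decay` each turn, so recent SLU evidence can
        override an old goal after the user changes their mind.
        """
        mass = defaultdict(float)
        states = []
        for nbest in turns:
            for key in mass:         # fade old evidence before adding new
                mass[key] *= decay
            for score, slot, value in nbest:
                mass[(slot, value)] += score
            best = {}
            for (slot, value), m in mass.items():
                if slot not in best or m > best[slot][0]:
                    best[slot] = (m, value)
            states.append({slot: v for slot, (_, v) in best.items()})
        return states

Because both sketches treat slot values as opaque strings, they also hint at the value-independence that helps a tracker generalize to entity types unseen in training, as in the third DSTC.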
In this relatively new research area, there does not exist a single, generally agreed-upon evaluation metric; therefore, each DSTC reported a bank of metrics,