sourced from the advisory board and participants.
This resulted in approximately 10 different metrics,
including accuracy, receiver operating characteristic
(ROC) measurements, probability calibration, and so
on. Each metric was measured on various subtasks
(such as accuracy of a particular component of the
user’s goal), and at different time resolutions (for
example, every dialog turn, just at the end, and so
on). Every combination of these variables was measured and reported, resulting in more than 1000 measurements for each entry. The measurements themselves form a part of the research contribution: after
the first DSTC, a correlation analysis was done to
determine a small set of roughly orthogonal metrics,
which were then reported as featured metrics in
DSTC2 and DSTC3, focusing teams’ efforts. These
featured metrics were accuracy, probability quality
(Brier score), and a measure of discrimination computed from an ROC curve.
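Of the featured metrics, the Brier score is the least standard, so a minimal sketch may help: it is the mean squared error between the tracker's per-turn probability distribution over candidate hypotheses and the one-hot correct answer. The function name and example numbers below are illustrative, not the official DSTC scoring scripts:

```python
def brier_score(probs, labels):
    """Brier score over a dialog: mean over turns of the squared
    error between the tracker's distribution and the one-hot
    correct hypothesis. 0 is perfect; lower is better."""
    return sum(
        sum((p - y) ** 2 for p, y in zip(turn_probs, turn_labels))
        for turn_probs, turn_labels in zip(probs, labels)
    ) / len(probs)

# Two dialog turns, three candidate goal hypotheses each;
# the tracker puts most mass on the correct hypothesis both times.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
labels = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
print(brier_score(probs, labels))  # ≈ 0.10
```

A well-calibrated tracker that is often correct drives this toward 0, which is why it complements raw accuracy as a featured metric.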
Each DSTC has been organized by an ad hoc committee, including members of the group providing
the dialog data.
Participation and Results
About nine teams have participated in each DSTC,
with global representation of the top research centers
for spoken dialog systems. Participants have mostly
been academic institutions, with a minority of corporate research labs. Results have been presented at special sessions: DSTC1 at the annual Special Interest
Group on Discourse and Dialogue (SIGdial) conference in 2013 (Williams et al. 2013); DSTC2 at SIGdial in June 2014 (Henderson, Thomson, and Williams
2014); and DSTC3 at the IEEE Spoken Language Technologies (SLT) Workshop in December 2014.
Papers describing DSTC entries have broken new
ground in dialog state tracking; the best-performing
entries have been based on conditional random fields
(Lee and Eskenazi 2013), recurrent neural networks
(Henderson, Thomson, and Young 2014), and web-style ranking (Williams 2014). At present, dialog state
trackers are able to reliably exceed the performance
of a carefully tuned hand-crafted tracker — for example, in DSTC2, the best trackers achieved approximately 78 percent accuracy versus the baseline’s 72
percent. This is impressive considering the maximum
performance possible with the provided SLU is 85
percent, due to speech recognition errors.
Prior to the DSTC series, most work on dialog state
tracking was based on generative models; however,
the most successful DSTC entries have been discriminatively trained models, and these are now the dominant approach. Thus the DSTC series has had a clear
impact on the field.
All of the DSTC data will remain available for download, including labels, output from all entries, and
the raw tracker output. 1, 2 We encourage researchers
to use this data for research into dialog state tracking
or for other novel uses. In addition, a special issue of
the journal Dialogue and Discourse will feature work
on the DSTC data, and we anticipate publication in
2015. In future challenges, it would be interesting to
study aspects of dialog state beyond the user’s goal —
for example, the user’s attitude and expectation. It
would also be interesting to consider turn-taking and
state tracking of incremental dialogs, where updates
are made as each word is recognized. Finally,
researchers with dialog data available who would be
interested in organizing a future DSTC are encouraged to contact the authors.
Acknowledgments
For DSTC1, we thank the Dialog Research Center at
Carnegie Mellon University for providing data, and
Microsoft and Honda Research Institute for sponsorship. For DSTC2 and DSTC3, we thank Cambridge
University’s dialog systems group for providing data.
We also thank our advisory committee, including
Daniel Boies, Paul Crook, Maxine Eskenazi, Milica
Gasic, Dilek Hakkani-Tur, Helen Hastie, Kee-Eung
Kim, Ian Lane, Sungjin Lee, Oliver Lemon, Teruhisa
Misu, Olivier Pietquin, Joelle Pineau, Brian Strope,
David Traum, Steve Young, and Luke Zettlemoyer.
Thanks also to Nigel Ward for helpful comments.
References
Black, A.; Burger, S.; Langner, B.; Parent, G.; and Eskenazi, M. 2010. Spoken Dialog Challenge 2010. In Proceedings of the 2010 IEEE Spoken Language Technology Workshop. Piscataway, NJ: Institute of Electrical and Electronics Engineers.
Henderson, M.; Thomson, B.; and Williams, J. D. 2014. The Second Dialog State Tracking Challenge. In Proceedings of the 15th Annual SIGdial Meeting on Discourse and Dialogue. Stroudsburg, PA: Association for Computational Linguistics.
Henderson, M.; Thomson, B.; and Young, S. 2014. Word-Based Dialog State Tracking with Recurrent Neural Networks. In Proceedings of the 15th Annual SIGdial Meeting on Discourse and Dialogue. Stroudsburg, PA: Association for Computational Linguistics.
Jurcicek, F.; Thomson, B.; and Young, S. 2011. Natural Actor and Belief Critic: Reinforcement Algorithm for Learning Parameters of Dialogue Systems Modelled as POMDPs. ACM Transactions on Speech and Language Processing 7(3).
Lee, S., and Eskenazi, M. 2013. Recipe for Building Robust Spoken Dialog State Trackers: Dialog State Tracking Challenge System Description. In Proceedings of the 14th Annual SIGdial Meeting on Discourse and Dialogue. Stroudsburg, PA: Association for Computational Linguistics.