easily figure out the referent of it in sentence 1
because it is commonsense knowledge that when one
takes an object out of a container, the object’s weight
remains the same, but the container weighs less. The
human can figure out the referent of it in sentence 2
because it is commonsense knowledge that nearly all
objects are handier when they are out of, rather than
in, a bulky container like a backpack. This simple
example draws on commonsense concepts such as
weight, containment, and convenience that intelligent people typically use during their daily lives.
Sentences 1 and 2 are nearly identical except for a
pair of special words or phrases; it is the choice of the
special word or phrase (in this case, lighter / handy)
that changes the referent of the pronoun. All
Winograd schemas have this property: this ensures
that one cannot exploit properties of the structure of
a particular sentence to guess at a pronoun’s referent
in the absence of commonsense knowledge.
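The structure described above, two near-identical sentences whose special word flips the pronoun's referent, can be sketched as a small data structure. This is an illustrative sketch only: the class name, field names, and the example sentence text are assumptions reconstructed from the description (an object taken out of a backpack), not the challenge's official data format.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class WinogradSchema:
    """Hypothetical container for one schema (names are assumptions)."""
    template: str                     # sentence with a {word} slot for the special word
    pronoun: str                      # the ambiguous pronoun
    candidates: tuple[str, str]       # the two possible referents
    referent_by_word: dict[str, str]  # special word -> correct referent

    def halves(self):
        """Yield (sentence, correct referent) for each special word."""
        for word, referent in self.referent_by_word.items():
            yield self.template.format(word=word), referent

    def is_well_formed(self) -> bool:
        """Check that the two special words select two different candidates."""
        referents = set(self.referent_by_word.values())
        return len(referents) == 2 and referents <= set(self.candidates)


# Assumed wording of the example; only the special words come from the text.
schema = WinogradSchema(
    template="John took the water bottle out of the backpack "
             "so that it would be {word}.",
    pronoun="it",
    candidates=("the water bottle", "the backpack"),
    referent_by_word={
        "lighter": "the backpack",      # the container weighs less
        "handy": "the water bottle",    # the object is handier outside
    },
)

halves = list(schema.halves())
```

Because the two halves differ only in the special word, a system cannot pick the referent from sentence structure alone; `is_well_formed` captures that property as a simple check.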
The Winograd Schema Challenge Competition
consists of two tests. The first test consists of pronoun
disambiguation problems, most of which have been
collected from naturally occurring text in fiction or
nonfiction, but for which a companion schema and
associated special word or phrase are not necessarily
known. An example (from Sylvester and the Magic Pebble)
is as follows:
[3] The donkey wished a wart on its hind leg would disappear,
and it did. [“It” refers to “wart,” rather than
“donkey” or “leg”.]
The second test contains randomly chosen halves
of Winograd Schemas. A system takes the second test
only if it does sufficiently well on the first test. If a
system can pass both tests with a mark of at least 90
percent and no more than 5 percent worse than human
performance, it is eligible to win the challenge prize
of $25,000. The competition has been divided into
two rounds because it is more difficult to create Winograd
schemas manually than to collect pronoun disambiguation problems.
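The prize condition can be written as a short check. The sketch below assumes one reading of the rule: on each test the system must score at least 90 percent and fall at most 5 percentage points below the human score. The function name, parameter names, and the use of percentage points are assumptions, not the organizers' scoring code.

```python
def eligible_for_prize(system_scores, human_scores,
                       min_score=90.0, max_gap=5.0):
    """Return True if every test meets both thresholds.

    system_scores, human_scores: percent correct on each test, in the
    same order. Thresholds follow the assumed reading of the rule.
    """
    return all(
        system >= min_score and (human - system) <= max_gap
        for system, human in zip(system_scores, human_scores)
    )


# Hypothetical scores, for illustration only.
passes = eligible_for_prize([92.0, 91.0], [95.0, 94.0])
fails_gap = eligible_for_prize([90.0, 90.0], [96.0, 94.0])
fails_floor = eligible_for_prize([48.0, 91.0], [95.0, 94.0])
```

Both conditions matter: a score of 90 percent clears the floor but would still fail if human subjects scored 96 percent on the same test.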
There were six systems entered into the 2016 competition, representing four different teams. Table 1
summarizes their results. The asterisks for Quan Liu’s
three systems are due to a problem with unexpected
punctuation in the XML input that affected a handful of questions. The starred scores represent performance on the corrected XML input files.
No team did well enough on the first test to qualify for the second test, so the second test was not given. The list of problems was posted on the Commonsense Reasoning website. The problems on both
parts of the competition were validated on human
subjects in advance. The human subjects achieved
better than 90 percent accuracy.
The next Winograd Schema Challenge will take
place at AAAI 2018. Further information will be available on the Commonsense Reasoning website.
Ernest Davis is a professor at the Courant Institute of Mathematical Sciences, New York University, New York, New York.
Leora Morgenstern is a principal research scientist and
technical fellow at Leidos in Arlington, Virginia.
Charles L. Ortiz is a scientist at the Laboratory for Natural
Language Processing and AI at Nuance Communications in
Contestant                                                   Number Correct  Percentage Correct
Patrick Dhondt, Independent Researcher                       27              45%
Denis Robert, Independent Researcher                         19              31.666%
Nicos Issak, Open University of Cyprus                       29              48.33%
Quan Liu (1), University of Science and Technology of China  28              46.9% (48.33)*
Quan Liu (2), University of Science and Technology of China  29              48.33% (58.33)*
Quan Liu (3), University of Science and Technology of China  27              45% (58.33)*
Table 1. Competition Results.