Indeed, significant efforts have been directed toward developing data sets that pose vision and language challenges (Antol et al. 2015; Chen et al. 2015). A key focus in existing resources has been diverse and realistic visual stimuli. For example, the Visual QA (VQA) data set (Antol et al. 2015) includes 265K COCO images (Lin et al. 2014), which contain dozens of object categories and over a million object instances. Questions were collected via crowdsourcing by asking workers to write questions given these images. While the collected questions are often challenging, answering them requires only relatively rudimentary reasoning beyond the complex grounding problem. Understanding how well proposed approaches handle complex reasoning, including resolving numerical quantities, comparing sets, and reasoning about negated properties, remains an open problem.
We address this challenge with the Cornell Natural Language Visual Reasoning (NLVR) data set (Suhr
et al. 2017; Zhou, Suhr, and Artzi 2017). NLVR focuses on the problem of understanding complex, linguistically diverse natural language statements that require significant reasoning to interpret. We
design a simple task: given an image and a statement,
the system must decide if the statement is true with
regard to the image. Similar to VQA, and unlike caption generation, this binary classification task allows
for straightforward evaluation. Figure 2 shows two
examples from our data.
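
To make this concrete, the following minimal sketch shows how accuracy would be computed for such a truth-judgment task; the predict callable and the example records are hypothetical placeholders, not the released data or evaluation code.

    def accuracy(examples, predict):
        """Fraction of (image, statement) pairs whose truth value is judged correctly."""
        correct = sum(
            predict(ex["image"], ex["statement"]) == ex["label"]
            for ex in examples
        )
        return correct / len(examples)

    # Hypothetical examples and a trivial always-true baseline
    # (placeholders for illustration, not NLVR data).
    examples = [
        {"image": "img_0.png", "statement": "There is a yellow triangle.", "label": True},
        {"image": "img_1.png", "statement": "Each box contains a black object.", "label": False},
    ]
    print(accuracy(examples, lambda image, statement: True))  # 0.5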
We use synthetic images to control the visual
input during data collection. Each image shows an
environment divided into three boxes. Each box contains various objects, either scattered about or
stacked on one another. We use a small set of objects
with few properties. This restriction enables us to
simplify the recognition problem, and instead focus
on reasoning about sets, counts, and spatial relations.
The grouping into three sets is designed to support
descriptions that contain set-theoretic language and comparisons between sets.
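
For concreteness, here is a minimal sketch of the kind of structured environment such an image renders, with the number of objects and their properties sampled randomly per box; the specific shapes, colors, and sizes are illustrative assumptions rather than the data set's exact schema.

    import random

    # Illustrative closed vocabularies; assumed for this sketch,
    # not necessarily the data set's exact property values.
    SHAPES = ["circle", "square", "triangle"]
    COLORS = ["black", "blue", "yellow"]
    SIZES = ["small", "medium", "large"]

    def sample_environment(num_boxes=3, max_objects=4):
        """Sample three boxes, each holding a few objects with simple properties."""
        return [
            [
                {
                    "shape": random.choice(SHAPES),
                    "color": random.choice(COLORS),
                    "size": random.choice(SIZES),
                }
                for _ in range(random.randint(1, max_objects))
            ]
            for _ in range(num_boxes)
        ]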
The key challenge is collecting natural language
descriptions that take advantage of the full complexity of the image, rather than focusing on simple
properties, such as the existence of one object or
another. The images support rich descriptions that
include comparisons of sets, descriptions of spatial
relations, counting of objects, and comparison of
their properties. But how do we design a scalable
process to collect such language?
Collecting the Data
We use crowdsourcing to collect descriptions from
nonexperts. The key challenge is defining a task that
will require the complexity of reasoning we aim to
reflect. If we display a single image, workers will easily complete the task with sentences that contain simple references (for example, "there is a yellow triangle"). A key observation that underlies our process
design is that discriminating between similar images
is significantly harder and requires more complex
reasoning. Furthermore, if instead of discriminating
between images, the worker is asked to discriminate
between sets of images, the task becomes more complex, and therefore requires the language to capture
even finer distinctions.
These observations are at the foundation of a simple, yet surprisingly effective, data collection process.
We generate four images to collect a description. We
first generate two images separately by randomly
sampling the number of objects and their properties.
For each of the two images, we generate an additional image by shuffling the objects across the image.
This gives us two pairs. The first pair includes the initial image and its shuffled variant, and the second pair includes the other sampled image and its shuffled variant.
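
A sketch of this pairing step appears below, reusing the hypothetical sample_environment from the earlier sketch; the shuffle here simply redistributes the same objects across the boxes, a simplification of the actual generation procedure.

    import random

    def shuffle_objects(environment):
        """Create a companion image by redistributing the same objects across boxes."""
        objects = [obj for box in environment for obj in box]
        random.shuffle(objects)
        shuffled = [[] for _ in environment]
        for obj in objects:
            # A box may end up empty; acceptable for this sketch.
            shuffled[random.randrange(len(shuffled))].append(obj)
        return shuffled

    def generate_image_set():
        """Two pairs: each couples a freshly sampled image with its shuffled variant."""
        return [
            (env, shuffle_objects(env))
            for env in (sample_environment(), sample_environment())
        ]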
Figure 1. An Example Observation and Instruction Given to a Household Assistance Robot: "Take four of the larger plates from the middle shelf and put them on the table."