examples with multiple labels, where disagreement is high. In practice, the pruning process removes only 3.3 percent of the original data, yet it increases agreement to α = 0.831 and κ = 0.808.
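As a rough illustration, pruning can be implemented as dropping sentences whose validation labels fall below a majority threshold. The data layout and the 0.75 cutoff in the sketch below are our own illustrative assumptions, not the exact criterion used for the corpus.

```python
from collections import Counter

# Hypothetical layout: each sentence maps to the true/false labels it
# received from different validation workers.
labels_by_sentence = {
    "There is exactly one black triangle.": [True, True, True, False],
    "Each tower has at least two blocks.": [True, False, True, False],
}

def prune_high_disagreement(labels_by_sentence, min_majority=0.75):
    """Keep only sentences whose majority label reaches min_majority."""
    kept = {}
    for sentence, labels in labels_by_sentence.items():
        majority = Counter(labels).most_common(1)[0][1]
        if majority / len(labels) >= min_majority:
            kept[sentence] = labels
    return kept

print(prune_high_disagreement(labels_by_sentence))
```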
We use the crowdsourcing platform Upwork. We collected 92,244 image-sentence pairs, each labeled with whether the sentence is true about the image. The data contains 3,962 unique sentences. We split the data into four sets: a training set containing 80.7 percent of examples, a development set containing 6.4 percent, and two test sets each containing 6.4 percent. We keep one test set unreleased and use it to maintain a leaderboard. We invite everyone working on the data to submit their models for evaluation on the unreleased set. Performance on the unreleased set is listed on the public leaderboard.
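A minimal sketch of such a split, assuming a simple shuffled partition with the stated proportions; any grouping constraints (for example, keeping all pairs for a sentence in one split) are not described here and are omitted.

```python
import random

def split_dataset(pairs, seed=0):
    """Shuffle image-sentence pairs and split them into four sets.

    Proportions follow the paper (80.7 / 6.4 / 6.4 / 6.4 percent);
    the shuffling and rounding details are illustrative assumptions.
    """
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)
    n = len(pairs)
    n_train = round(0.807 * n)
    n_dev = round(0.064 * n)
    n_test = round(0.064 * n)
    train = pairs[:n_train]
    dev = pairs[n_train:n_train + n_dev]
    test = pairs[n_train + n_dev:n_train + n_dev + n_test]
    unreleased = pairs[n_train + n_dev + n_test:]  # held out for the leaderboard
    return train, dev, test, unreleased
```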
What Kind of Data Do We Get?
Our goal with NLVR is to represent linguistic diversity and complex reasoning. To gain insight into the data we collected, we perform a linguistic analysis of the data and compare our findings with several existing, related corpora. Our comparison focuses on VQA (Antol et al. 2015), which contains natural language questions about real photographs and synthetic abstract images; Microsoft COCO Captions (Chen et al. 2015), which contains natural language captions of photographs; and CLEVR (Johnson et al., CLEVR: A Diagnostic Dataset, 2017; Johnson et al., Inferring and Executing Programs, 2017), which contains both synthetic (CLEVR) and, more recently, human-written (CLEVR-Humans) questions about synthetic images.
We observe that sentence length in NLVR follows a distribution similar to that of Microsoft COCO Captions (figure 4). Longer sentences are often more challenging to understand and display more compositionality. NLVR sentences are on average longer than those in VQA and CLEVR-Humans, but shorter than the synthetic sentences of CLEVR. We suspect the synthetic CLEVR sentences are longer due to the setup of the generation process.
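The length statistics behind such a comparison are simple to compute; the sketch below measures length in whitespace tokens, which is an assumption on our part, since the tokenization used for figure 4 is not specified here.

```python
from collections import Counter
from statistics import mean

def length_distribution(sentences):
    """Return the mean length and a histogram of sentence lengths,
    measured in whitespace-separated tokens."""
    lengths = [len(s.split()) for s in sentences]
    return mean(lengths), Counter(lengths)

# Applied to each corpus, the normalized histograms can be overlaid
# to reproduce a comparison like figure 4.
avg, hist = length_distribution(["there is a black item touching the edge"])
```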
We also study the presence of various linguistic
phenomena in NLVR and the related corpora. This
analysis is key to understanding the linguistic diversity and the type of reasoning required to solve the
task. For example, a corpus with no reference to
numbers or comparatives is likely not to require
much cardinal reasoning. We choose 12 linguistic
features directly related to our original goals, including counting, references to sets, and spatial relations.
Table 1 lists the features we study, along with examples and their frequency in NLVR, VQA, and CLEVR-Humans. We analyze 200 examples in each corpus.
We find that NLVR is remarkably diverse when compared to existing vision and language resources. For
10 out of the 12 categories, it shows higher representation than VQA. Even when compared to CLEVR-Humans, which was designed with a similar goal of
benchmarking visual reasoning, NLVR contains more
occurrences for 9 of the 12 features.
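The feature counts in table 1 come from manual inspection of the sampled examples. Purely as a loose illustration, an automatic first pass might flag features with keyword patterns like the following; the patterns and the feature subset are our own crude approximations, not the annotation guidelines used for the table.

```python
import random
import re

# Crude keyword patterns for three of the twelve features; these regexes
# are illustrative approximations, not the criteria behind table 1.
FEATURE_PATTERNS = {
    "counting": re.compile(r"\b(one|two|three|exactly|at least|at most)\b", re.I),
    "spatial relations": re.compile(r"\b(above|below|left|right|touching)\b", re.I),
    "negation": re.compile(r"\b(no|not|none|without)\b", re.I),
}

def feature_frequencies(sentences, sample_size=200, seed=0):
    """Estimate per-feature frequency on a random sample of sentences."""
    sample = random.Random(seed).sample(sentences, min(sample_size, len(sentences)))
    return {
        name: sum(bool(pattern.search(s)) for s in sample) / len(sample)
        for name, pattern in FEATURE_PATTERNS.items()
    }
```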
Biases, Baselines, and Challenges
The linguistic complexity of NLVR indicates that a
variety of skills are required to solve the task. But
how challenging is NLVR for existing methods? And
what can we learn about the corpus from the per-
Figure 4. Distribution of Sentence Lengths.