SUMMER 2018 47
tial randomly generated images. The second pair is
made up of their shuffled versions. We then ask the
worker to write a sentence that is true for each of the
images in the first pair, but false for each of the
images in the second pair. To complete this task,
workers must identify similarities between the first
pair of images that do not hold for the second pair.
The complexity of the task encourages language that
expresses complex reasoning. Generating the second
pair of images by shuffling the objects in the first pair
prevents sentences that simply state the presence of
a specific object. Our task encourages workers to
write linguistically diverse and complex sentences by
juxtaposing images that are similar to one another,
yet contain minor differences.
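The writing constraint can be checked mechanically: a candidate sentence is acceptable only if it is true for both original images and false for both shuffled images. A minimal sketch, where the predicate `holds` and the string-valued images are hypothetical stand-ins for a real truth evaluation:

```python
def valid_task_sentence(holds, originals, shuffled):
    """Return True if the sentence (represented by the predicate `holds`)
    is true for every original image and false for every shuffled one.
    `holds`, `originals`, and `shuffled` are illustrative placeholders."""
    return (all(holds(img) for img in originals)
            and not any(holds(img) for img in shuffled))

# Toy images as strings; the "sentence" checks for a black object.
holds = lambda img: "black" in img
print(valid_task_sentence(holds,
                          ["black square", "black circle"],
                          ["red square", "blue circle"]))  # True
```

A sentence that merely states the presence of an object shared by all four images would fail this check, which is exactly what shuffling is designed to enforce.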
We also asked the workers to follow two additional constraints. First, the sentence should not contain
references to the image labels themselves. Second,
the sentence should not refer to the horizontal order
of the boxes. Treating the image as a set of three unordered boxes encourages set-theoretic descriptions. In addition to improving the language collected, these two constraints also allow us to generate a
large number of examples. We can divide the results
of each task into four independent image-sentence
pairs, and then generate six images for each labeled
sentence-image pair by permuting the boxes while
maintaining the description’s truth value. Figure 3
shows the prompt that was presented to the user.
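The multiplication step can be sketched in code: because sentences may not refer to the horizontal order of the boxes, each labeled pair yields all 3! = 6 orderings of its three boxes with the truth value unchanged. A minimal sketch with placeholder box contents:

```python
from itertools import permutations

def expand_example(boxes, sentence, label):
    """Generate one example per ordering of the three boxes. Since
    sentences cannot mention box order, every permutation preserves
    the label. Box contents here are illustrative placeholders."""
    return [(list(p), sentence, label) for p in permutations(boxes)]

examples = expand_example(["box_a", "box_b", "box_c"],
                          "Each box has at least 1 black item.", True)
print(len(examples))  # 6
```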
This process already provides high-quality data. We quantify this quality by measuring agreement among
annotators asked to solve the task given a sentence-
image pair. We present annotators with an image and
a sentence, and ask them to judge whether the sen-
tence is true or false about the image. We also allow
workers to mark examples as invalid. To validate the
constraints, we randomly permute the boxes in the
image before displaying it to the user. To compute
agreement, we collect five judgments for examples in
the development and test sets, and compute Krippendorff's α and Fleiss' κ (Cocos et al. 2015), two common agreement statistics. Our process yields α =
0.768 and κ = 0.709, indicating substantial agreement (Landis and Koch 1977). We further increase
the data quality by pruning examples that were
marked as invalid and, for development and test
Figure 2. Example Sentences and Images from NLVR. Each image includes three boxes with different object types. The truth value of the top sentence ("Each box has at least 1 black item.") is true, while the bottom ("There are exactly two black squares not touching any edge.") is false.
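Agreement statistics like those reported above can be computed directly from raw judgments. A minimal sketch of Fleiss' κ (not the authors' code; the five-judgment matrix below is a toy stand-in for the real annotations):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a matrix `counts`, where counts[i][j] is the
    number of raters assigning item i to category j, with the same
    number of raters per item (five judgments per example here)."""
    N = len(counts)                    # number of items
    n = sum(counts[0])                 # raters per item
    k = len(counts[0])                 # number of categories
    # Mean per-item observed agreement
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in counts) / N
    # Chance agreement from marginal category proportions
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)

# Toy matrix: 3 examples, 5 judges each, categories (true, false)
print(fleiss_kappa([[5, 0], [0, 5], [5, 0]]))  # 1.0, perfect agreement
```

Values between 0.61 and 0.80 are conventionally read as substantial agreement under the Landis and Koch (1977) scale cited above.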