high performance while ignoring the input image
(Zhou et al. 2015; Jabri, Joulin, and van der Maaten
2016; Agrawal, Batra, and Parikh 2016; Kafle and
Kanan 2017).3 Does NLVR suffer from such a bias?
NLVR is relatively balanced. Simply guessing true
gives an accuracy of 55.4 percent on the unreleased
test set. Using only one of the modalities provides
similar results to this majority baseline. Encoding the
image only with a convolutional neural network
(CNN) to predict the truth value results in an accuracy of 55.3 percent. Similarly, encoding the text only with a recurrent neural network (RNN) to predict the truth value results in 56.2 percent accuracy.
These results indicate that both the text and the
image are necessary to solve the task.
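For concreteness, a text-only baseline can be sketched as a simple classifier over sentence features. The toy sentences, labels, and bag-of-words features below are hypothetical (the baseline described above used an RNN encoder), but they illustrate how such a model fits the text while never seeing an image:

```python
import numpy as np

def bow(sentence, vocab):
    """Bag-of-words feature vector for a sentence."""
    v = np.zeros(len(vocab))
    for tok in sentence.lower().split():
        if tok in vocab:
            v[vocab[tok]] += 1.0
    return v

# Hypothetical sentences and truth values, standing in for NLVR examples.
sentences = [
    "there is exactly one yellow square",
    "there are two black triangles",
    "a blue circle touches the wall",
    "there are three towers",
]
labels = np.array([1.0, 0.0, 1.0, 0.0])

vocab = {w: i for i, w in enumerate(sorted({t for s in sentences for t in s.split()}))}
X = np.stack([bow(s, vocab) for s in sentences])
w = np.zeros(len(vocab))
b = 0.0

def predict(X, w, b):
    # Sigmoid over a linear score: probability that the sentence is true.
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# Plain gradient descent on the logistic loss.
for _ in range(500):
    p = predict(X, w, b)
    w -= 0.5 * (X.T @ (p - labels)) / len(labels)
    b -= 0.5 * np.mean(p - labels)

train_acc = np.mean((predict(X, w, b) > 0.5) == labels)
```

A classifier like this can memorize its training sentences, but at test time it can only exploit language priors, which is why text-only accuracy stays near the majority baseline.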
A simple baseline that uses both text and image,
however, provides disappointing results. We showed
this by concatenating the outputs of the CNN and
RNN models to predict the truth value. This model
achieves 56.3 percent accuracy on the unreleased test
set. In contrast, the neural module networks (NMN)
approach (Andreas et al. 2016) achieves 62.0 percent.
While performance is still low, this first success at
outperforming the majority baseline is quite interesting. NMNs explicitly model compositionality. Different neural networks are composed together
according to the structure of the sentence to process
the image and generate the final prediction. The
higher performance of this model indicates that
understanding highly compositional language is necessary for solving the task.
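The composition idea can be illustrated with a toy sketch. In the actual NMN approach the modules are small neural networks assembled according to a parse of the sentence; here, hand-written functions over a hypothetical grid of color ids stand in for them:

```python
import numpy as np

# Toy 4x4 "image": 0 = background, 1 = yellow, 2 = black (illustrative ids).
image = np.array([
    [0, 1, 0, 0],
    [0, 0, 2, 0],
    [1, 0, 0, 0],
    [0, 2, 0, 2],
])

def attend_color(img, color):
    """Attention module: a mask over cells matching a color."""
    return (img == color).astype(float)

def count(mask):
    """Counting module: number of attended cells."""
    return int(mask.sum())

def exists(mask):
    """Existence module: is anything attended at all?"""
    return bool(mask.sum() > 0)

# A sentence like "there are exactly two yellow items" would be parsed
# into a composition of modules, roughly:
yellow_count = count(attend_color(image, 1))
```

The key point is that the network's structure mirrors the sentence's structure: a different sentence yields a different composition of the same reusable modules.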
An interesting property of NLVR is the availability
of structured representations of the images. When
generating images, we first generate a structured representation, which is then rendered to create the
image. This representation contains the complete
information about an image, including the items
contained in the boxes, their properties, and exact
positions. This representation can be considered a small spatial database describing the environment, and it enables experiments that do not require solving the vision problem.
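To make this concrete, a structured representation might look like the following sketch; the field names and values are illustrative, not the released schema:

```python
# Hypothetical structured representation: a list of boxes, each a list of
# items with a type, color, size, and exact position.
image_rep = [
    [  # box 0
        {"type": "square", "color": "yellow", "size": 20, "x": 5, "y": 40},
        {"type": "circle", "color": "black", "size": 10, "x": 60, "y": 12},
    ],
    [  # box 1
        {"type": "triangle", "color": "blue", "size": 30, "x": 15, "y": 70},
    ],
    [],  # box 2 is empty
]

def count_items(rep, **constraints):
    """Query the 'spatial database': count items matching all constraints."""
    return sum(
        all(item.get(k) == v for k, v in constraints.items())
        for box in rep
        for item in box
    )
```

Treating the representation this way, a query such as `count_items(image_rep, color="yellow")` answers a counting question directly, with no vision component involved.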
Experimenting with the structured representation
confirms an important property of the problem:
counting is a necessary skill for solving NLVR. We use
the sentence and structured representation to compute features and train a maximum entropy classifier. The classifier achieves an accuracy of 67.8 percent
on the unreleased test set. Ablating all the features
that consider counts reduces performance on the
development set from 68.0 percent to 57.5 percent.
This result clearly indicates the importance of counting in solving the task.
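A hypothetical example of such count features: pair each number word in the sentence with a color word that follows it, and test whether the structured representation contains exactly that many items of that color. The feature names, vocabularies, and pairing rule below are illustrative, not the authors' feature set:

```python
# Illustrative vocabularies, not the released feature templates.
NUM_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4}
COLORS = {"yellow", "black", "blue"}

def count_features(sentence, rep):
    """Map a (sentence, structured representation) pair to count features."""
    toks = sentence.lower().split()
    # Tally items of each color across all boxes of the representation.
    counts = {}
    for box in rep:
        for item in box:
            counts[item["color"]] = counts.get(item["color"], 0) + 1
    feats = {}
    for i, tok in enumerate(toks):
        if tok in NUM_WORDS:
            n = NUM_WORDS[tok]
            # Pair the number word with any color word appearing after it.
            for other in toks[i + 1:]:
                if other in COLORS:
                    feats[f"exactly_{n}_{other}"] = float(counts.get(other, 0) == n)
    return feats

rep = [[{"color": "yellow"}, {"color": "yellow"}], [{"color": "black"}], []]
feats = count_features("there are two yellow items", rep)
```

Features like these give a maximum entropy classifier direct access to whether the counts stated in the sentence match the environment, which is exactly the signal the ablation above removes.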
Treating the structured representation as a small
database creates an interesting opportunity for
semantic parsing techniques, where sentences are
mapped to symbolic representations, as shown, for example, by Zelle and Mooney (1996) and by Zettlemoyer and Collins (2005, 2007).
Developing systems that rely on robust language and
vision understanding requires data sets and tasks to
evaluate their performance. The goal of NLVR is to
present a challenging benchmark with linguistically
diverse language that requires complex reasoning
skills. Key to building NLVR is a carefully designed
data collection process. The goal of the process is to
challenge annotators to write sentences that distinguish several images. A study of the corpus shows it is more linguistically diverse than contemporary corpora. Our empirical analysis illustrates key
challenges that must be addressed to solve NLVR,
including counting and compositionality. While
NLVR presents open challenges to the research community, its complexity is deliberately limited by our use
of synthetically generated images with a limited
number of shapes and properties. While we hope
that NLVR will facilitate developing models that can
better reason about vision and language, real-world
applications require studying realistic visual inputs.
An important direction we are currently pursuing is
collecting a corpus that includes real images while
preserving the complexity and diversity NLVR
demonstrates. NLVR and leaderboards for both the
image and structured representations are available.4
2. The leaderboard is at lic.nlp.cornell.edu/nlvr.
3. Partially to address this bias, a new version of VQA was
recently released (Goyal et al. 2017).
Agrawal, A.; Batra, D.; and Parikh, D. 2016. Analyzing the Behavior of Visual Question Answering Models. arXiv Preprint. arXiv:1606.07356v2 [cs.CL]. Ithaca, NY: Cornell University Library.
Andreas, J.; Rohrbach, M.; Darrell, T.; and Klein, D. 2016.
Neural Module Networks. In Proceedings of the 2016 IEEE
Conference on Computer Vision and Pattern Recognition. Los Alamitos, CA: IEEE Computer Society.
Recognition. Los Alamitos, CA: IEEE Computer Society.