Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick,
C. L.; and Parikh, D. 2015. VQA: Visual Question Answering. International Journal of Computer Vision 123(1): 4–31.
Artzi, Y., and Zettlemoyer, L. 2011. Bootstrapping Semantic
Parsers from Conversations. In Proceedings of the Conference
on Empirical Methods in Natural Language Processing, 421–432.
Stroudsburg, PA: Association for Computational Linguistics.
Artzi, Y., and Zettlemoyer, L. 2013. Weakly Supervised
Learning of Semantic Parsers for Mapping Instructions to
Actions. Transactions of the Association for Computational Linguistics 1(1): 49–62.
Chen, X.; Fang, H.; Lin, T.-Y.; Vedantam, R.; Gupta, S.; Dollár, P.; and Zitnick, C. L. 2015. Microsoft COCO Captions:
Data Collection and Evaluation Server. arXiv Preprint. arXiv:1504.00325
[cs.CV]. Ithaca, NY: Cornell University Library.
Clarke, J.; Goldwasser, D.; Chang, M.-W.; and Roth, D. 2010.
Driving Semantic Parsing from the World’s Response. In
Proceedings of the 14th Conference on Computational Natural Language Learning. Stroudsburg, PA: Association for Computational Linguistics.
Cocos, A.; Masino, A.; Qian, T.; Pavlick, E.; and Callison-Burch, C. 2015. Effectively Crowdsourcing Radiology Report
Annotations. In Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis, 109–114.
Stroudsburg, PA: Association for Computational Linguistics.
Goldman, O.; Latcinnik, V.; Naveh, U.; Globerson, A.; and
Berant, J. 2017. Weakly-Supervised Semantic Parsing with
Abstract Examples. arXiv Preprint. arXiv:1711.05240
[cs.CL]. Ithaca, NY: Cornell University Library.
Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; and Parikh,
D. 2017. Making the V in VQA Matter: Elevating the Role of
Image Understanding in Visual Question Answering. In
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos, CA: IEEE Computer Society.
Jabri, A.; Joulin, A.; and van der Maaten, L. 2016. Revisiting
Visual Question Answering Baselines. In Computer Vision –
ECCV 2016: 14th European Conference. Lecture Notes in
Computer Science 9905. Berlin: Springer.
Johnson, J.; Hariharan, B.; van der Maaten, L.; Fei-Fei, L.; Zitnick, C. L.; and Girshick, R. B. 2017. CLEVR: A Diagnostic
Dataset for Compositional Language and Elementary Visual Reasoning. In Proceedings of the 2017 IEEE Conference on
Computer Vision and Pattern Recognition. Los Alamitos, CA:
IEEE Computer Society.
Johnson, J.; Hariharan, B.; van der Maaten, L.; Hoffman, J.;
Fei-Fei, L.; Zitnick, C. L.; and Girshick, R. B. 2017. Inferring
and Executing Programs for Visual Reasoning. In Proceedings
of the IEEE International Conference on Computer Vision. Los
Alamitos, CA: IEEE Computer Society.
Kafle, K., and Kanan, C. 2017. Visual Question Answering:
Datasets, Algorithms, and Future Challenges. Computer
Vision and Image Understanding 163: 3–20.
Landis, J. R., and Koch, G. G. 1977. The Measurement of
Observer Agreement for Categorical Data. Biometrics
Lin, T.-Y.; Maire, M.; Belongie, S. J.; Hays, J.; Perona, P.;
Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft
COCO: Common Objects in Context. In Computer Vision –
ECCV 2014: 13th European Conference. Lecture Notes in
Computer Science 8689. Berlin: Springer.
Suhr, A.; Lewis, M.; Yeh, J.; and Artzi, Y. 2017. A Corpus of
Natural Language for Visual Reasoning. In Proceedings of the
55th Annual Meeting of the Association for Computational Linguistics 2, 217–223. Stroudsburg, PA: Association for Computational Linguistics.
Zelle, J. M., and Mooney, R. J. 1993. Learning Semantic
Grammars with Constructive Inductive Logic Programming. In Proceedings of the 11th National Conference on Artificial Intelligence, 817–822. Menlo Park, CA: AAAI Press.
Zettlemoyer, L. S., and Collins, M. 2005. Learning to Map
Sentences to Logical Form: Structured Classification with
Probabilistic Categorial Grammars. In Proceedings of the 21st
Conference on Uncertainty in Artificial Intelligence, 658–666.
Seattle, WA: AUAI Press.
Zettlemoyer, L. S., and Collins, M. 2007. Online Learning of
Relaxed CCG Grammars for Parsing to Logical Form. In
Proceedings of the 2007 Joint Conference on Empirical Methods in
Natural Language Processing and Computational Natural Language Learning, 678–687. Stroudsburg, PA: Association for Computational Linguistics.
Zhou, S.; Suhr, A.; and Artzi, Y. 2017. Visual Reasoning with
Natural Language. Paper presented at the AAAI 2017 Fall
Symposium on Natural Communication for Human-Robot
Collaboration. Arlington, VA, Nov. 9–11.
Zhou, B.; Tian, Y.; Sukhbaatar, S.; Szlam, A.; and Fergus, R.
2015. Simple Baseline for Visual Question Answering. arXiv Preprint. arXiv:1512.02167 [cs.CV]. Ithaca, NY: Cornell University Library.
Alane Suhr is a PhD student in the Department of Computer Science at Cornell Tech, Cornell University, focusing
on building agents that understand natural language
grounded in complex interactions. She is the recipient of an
AI2 Key Scientific Challenges Award, a Microsoft Research
Women’s Fellowship, a Best Paper award at ACL 2017, and
an Outstanding Paper award at NAACL 2018. Suhr received
a bachelor’s degree in computer science and engineering
from Ohio State University in 2016.
Mike Lewis is a scientist at Facebook AI Research, working
on connecting language and reasoning. Previously, he was
a postdoc at the University of Washington, developing
search algorithms for neural structured prediction. Lewis
has a PhD from the University of Edinburgh on combining
symbolic and distributed representations of meaning.
James Yeh is a software engineer at Evidation Health,
working on enabling people to participate in better health
outcomes. Yeh received his master’s in operations research
and information engineering from Cornell Tech, where he
contributed to NLP research under the supervision of Yoav
Artzi and pursued his interests in AI and using data to
enhance decision making. He received his bachelor’s degree
in applied science under the Department of Systems Design
Engineering at the University of Waterloo, where he focused
on the study of intelligent systems.
Yoav Artzi is an assistant professor in the Department of
Computer Science at Cornell Tech, Cornell University. His
research focuses on learning expressive models for natural
language understanding, most recently in situated interactive
scenarios. He received an NSF CAREER award, Best Paper
awards at EMNLP 2015 and ACL 2017, and a Google faculty
award. Artzi holds a BSc summa cum laude from Tel Aviv University and a PhD from the University of Washington.