These platforms are different from the Turing test —
and other more traditional AI evaluation benchmarks
proposed to replace it — as summarized by an AAAI
2015 workshop1 and a recent special issue of the AI
Magazine.2 Indeed, some of these platforms can integrate any task and hence, in principle, supersede
many existing AI benchmarks (Hernández-Orallo 2016)
in their aim to test general problem-solving ability.
This topic has also attracted mainstream attention.
For instance, the journal Nature recently featured a
news article on the topic (Castelvecchi 2016). In summary, a new and uncharted territory for AI is emerging, which deserves more attention and effort within
AI research itself.
In this report, we first give a short overview of the
new platforms and then briefly report on two 2016
events focused on (general-purpose) AI evaluation
(using these platforms or others).
New Playground, New Benchmarks
Many different general-purpose benchmarks and
platforms have recently been introduced, and they
are increasingly adopted in research and competitions to drive and evaluate AI progress.
The Arcade Learning Environment3 is a platform
for developing and evaluating general AI agents using
a variety of Atari 2600 games. The platform is used to
compare approaches such as reinforcement learning (RL) (see,
for example, Mnih et al.), model learning,
model-based planning, imitation learning, and transfer learning. A limitation of this environment is the
relatively small number of games, which can lead to overspecialization. The video game description language (VGDL)4
follows a similar philosophy, but new two-dimensional (2D) arcade games can be generated using a
flexible set of rules.
OpenAI Gym5 (Brockman et al. 2016) provides a
diverse collection of RL tasks and an open-source
interface for agents to interact with them, as well as
tools and a curated web service for monitoring and
comparing RL algorithms. The environments, formalized as partially observable Markov decision
processes, range from classic control and toy text problems to
algorithmic tasks, 2D and three-dimensional
(3D) robot simulations, and Doom, board, and Atari games.
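To give a flavor of the interface, the following is a minimal sketch of the agent-environment loop in Gym, using the reset/step interface described by Brockman et al. (2016); the environment name CartPole-v0 is merely an illustrative choice, and a random agent stands in for a learning algorithm.

import gym

env = gym.make("CartPole-v0")  # illustrative environment choice
observation = env.reset()
episode_return = 0.0
done = False
while not done:
    # A real agent would map the observation to an action;
    # here we simply sample uniformly from the action space.
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    episode_return += reward
print("Episode return:", episode_return)
env.close()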
OpenAI Universe6 is a software platform intended
for training and measuring the performance of AI systems on any task that a human can complete with a
computer, and in the way a human does: looking at
screen pixels and operating a (virtual) keyboard and
mouse. In Universe, any program can be turned into a
Gym environment, including Flash games, browser
tasks, and games like slither.io and GTA V. The current
release consists of 1000 environments ready for RL.
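In practice, a Universe environment is driven through the same Gym interface, except that observations are raw screen pixels and actions are keyboard and mouse events. The sketch below loosely follows the project's example usage; the Flash-game environment name and the remotes setting are illustrative assumptions.

import gym
import universe  # importing universe registers its environments with Gym

env = gym.make("flashgames.DuskDrive-v0")  # illustrative environment name
env.configure(remotes=1)  # start one local (Docker-based) remote environment
observation_n = env.reset()
while True:
    # Actions are lists of VNC-style events (here: holding the up-arrow key),
    # one action list per remote environment.
    action_n = [[("KeyEvent", "ArrowUp", True)] for _ in observation_n]
    observation_n, reward_n, done_n, info = env.step(action_n)
    env.render()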
Microsoft’s Project Malmo7 (Johnson et al. 2016)
gives users complete freedom to build complex 3D
environments within the block-based world of the
Minecraft video game. It supports a wide range of
experimentation scenarios for evaluating RL agents
and provides a playground for general AI research.
Tasks range from navigation and survival to collaboration and problem solving.
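A rough sketch of how an agent is hooked up to a Malmo mission through the platform's Python bindings is shown below. It assumes a default mission specification and a Minecraft client already running with the Malmo mod; a real agent would choose commands based on observations rather than issuing a fixed one.

import time
import MalmoPython

agent_host = MalmoPython.AgentHost()
mission = MalmoPython.MissionSpec()          # default (illustrative) mission
mission_record = MalmoPython.MissionRecordSpec()
agent_host.startMission(mission, mission_record)

# Wait for the mission to start.
world_state = agent_host.getWorldState()
while not world_state.has_mission_begun:
    time.sleep(0.1)
    world_state = agent_host.getWorldState()

# Act until the mission ends; a fixed command stands in for an agent's policy.
while world_state.is_mission_running:
    agent_host.sendCommand("move 1")         # walk forward
    time.sleep(0.5)
    world_state = agent_host.getWorldState()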
GoodAI’s Brain Simulator8 and School is a collaborative platform for simulating artificial brain architectures using existing AI modules, such as image recognition and working memory.
DeepMind Lab9 is a highly customizable and extensible 3D gamelike platform for agent-based AI
research. Agents operate in 3D environments using a
first-person viewpoint and can be evaluated over a
wide range of planning and strategy tasks, from maze
navigation to playing laser tag. Somewhat similarly,
the ViZDoom (Kempka et al. 2016) research platform
allows RL agents to interact with customizable scenarios in the world of the 1993 first-person shooter
Doom, using only the screen buffer as input.
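For illustration, a single episode in one of ViZDoom's bundled scenarios can be run roughly as follows using the platform's Python bindings. The basic.cfg configuration and the three-button action set are assumptions matching the standard "basic" scenario, and the random choice again stands in for a learning agent.

import random
from vizdoom import DoomGame

game = DoomGame()
game.load_config("basic.cfg")  # illustrative scenario configuration
game.init()

# In the "basic" scenario the available buttons are move left, move right, attack.
actions = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

game.new_episode()
while not game.is_episode_finished():
    state = game.get_state()
    pixels = state.screen_buffer   # the raw screen buffer is the only observation
    reward = game.make_action(random.choice(actions))
print("Total reward:", game.get_total_reward())
game.close()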
Facebook’s TorchCraft (Synnaeve et al. 2016) is a
library enabling machine-learning research on real-time strategy games. The high-dimensional action
space of these games is quite different from those previously investigated in RL research and provides a
useful bridge to the richness of the real world. To execute something as simple as “attack this enemy base,”
one must coordinate mouse clicks, camera, and available resources. This makes actions and planning hierarchical, which is challenging in RL. TorchCraft’s current implementation connects the Torch machine
learning library to StarCraft: Brood War, but the same
idea can be applied to any video game and library.
Meanwhile, DeepMind is also collaborating with Blizzard Entertainment to open up StarCraft II as a testing environment for AI research.
Facebook’s CommAI-env10 (Mikolov, Joulin, and
Baroni 2015) is a platform for training and evaluating,
from the ground up, AI systems that can interact
with humans through language. In a communication-based setup, an AI learner interacts through a bit-level interface with an environment that asks the
learner to solve tasks of incremental difficulty. Some tasks currently implemented include
counting problems, memorizing lists and answering
questions about them, and navigating from text-based instructions.
The introduction of these platforms offers many
new possibilities for AI evaluation and experimentation, but it also poses many questions about how
benchmarks and competitions can be created using
such platforms, especially if the goal is to assess more
general AI. Two new venues were set up to explore
these issues in 2016, as we discuss next.
The Evaluating General-Purpose AI Workshop
The 2016 Workshop on Evaluating General-Purpose
AI (EGPAI 2016)11 was the first workshop focusing on
the evaluation of general-purpose artificial intelli-