The most relevant outcome of this workshop was the identification of the challenging and urgent demands relevant to general-purpose AI evaluation, such as understanding the relation between tasks (or classes of tasks), the notion of (task and environment) difficulty, and the relevance of how observations are presented to AI agents, including rewards and penalties. The workshop also served to illustrate how several algorithms compare in terms of their
The Machine Intelligence Workshop
The Machine Intelligence Workshop12, held at the December 2016 Conference on Neural Information Processing Systems (NIPS 2016), focused on the parallel questions of what general AI is and how to evaluate it. Concerning evaluation, there was general agreement that we need to test systems for their ability to tackle new tasks that they did not encounter in their training phase. The speakers also agreed that an important characteristic to be tested is the degree to which systems are compositional, in the sense that they can creatively recombine skills learned in previous tasks to solve a new problem.
Some speakers argued for tasks to be defined from first principles in a top-down manner, whereas others suggested looking to nature (humans and other intelligent beings) for inspiration in formulating the tasks (with further discussion on whether that inspiration should come from ontogenesis or phylogenesis).
The role of human language was also debated, with some speakers stressing that it is hard to conceive of useful AI without a linguistic communication channel, while others pointed to animal intelligence as a more realistic goal, and to possible applications for
AI and Evaluation — The Future
A recurrent issue in general intelligence evaluation stems from the old view of intelligence as the capability to succeed in a range of tasks or, ultimately, to perform relatively well in all possible tasks. Nevertheless, the notion of all possible tasks is meaningless unless it is accompanied by a probability distribution over tasks. While Legg and Hutter (2007) advocate a distribution based on Solomonoff’s universal prior over task descriptions (assigning higher probability to tasks with short encodings), Hernández-Orallo (2017) advocates a distribution based on task difficulty (measuring difficulty as the complexity of the simplest solution for each task, and ensuring solution diversity at each difficulty). Alternative distributions could be derived from the set of tasks that humans and other animals face on a daily basis.
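To make the role of the distribution concrete, the Legg and Hutter (2007) measure aggregates an agent’s expected reward over all environments, weighted by the universal prior; a difficulty-conditioned alternative in the spirit of Hernández-Orallo (2017) can be sketched schematically as follows (the weighting w(h) and the conditional task distribution p(· | h) are illustrative notation here, not the exact formulation):

\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)}\, V_\mu^\pi
\qquad \text{as opposed to} \qquad
\Upsilon'(\pi) \;=\; \sum_{h} w(h)\; \mathbb{E}_{\mu \sim p(\cdot \mid h)}\!\left[ V_\mu^\pi \right],

where K(\mu) is the Kolmogorov complexity of the description of environment \mu, V_\mu^\pi is the expected cumulative reward of agent \pi in \mu, and h ranges over difficulty levels, with p(\cdot \mid h) spreading probability over a diverse set of tasks of difficulty h.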
When compared to these theoretical distributions, can we say anything about the distribution of tasks that compose any of the new platforms? Is their actual diversity really covering general abilities? And what about their properties with respect to transfer, or
As more tasks are integrated, different universes of tasks are created, and the whole set of tasks across all platforms configures the cosmos for AI. At present, this is just an unstructured collection of tasks with no clear criteria for inclusion, exclusion, or relative weight. This situation bears similarity to the early years of psychometrics (among other disciplines), which has been dealing with behavioral evaluation for over a century and has gradually put some order in the space of tasks and abilities.
To move ahead, the space of tasks must be analyzed. This can be done in terms of a hierarchy linking tasks and abilities (Hernández-Orallo 2017) or in terms of a task theory (Thórisson et al. 2016), using theoretical approaches to task similarity and difficulty, or through a more empirical strategy, by analyzing the results of a population of AI systems with item response theory (IRT) or other psychometric techniques (De Ayala 2009).
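To illustrate the empirical route, the two-parameter logistic model commonly used in IRT (De Ayala 2009) gives the probability that a system with ability \theta solves task i as

P_i(\theta) \;=\; \frac{1}{1 + e^{-a_i(\theta - b_i)}},

where b_i is the difficulty of the task and a_i its discrimination. Fitting this model to the results of a population of AI systems on a benchmark yields difficulty and discrimination estimates for each task, and ability estimates for each system, providing one way of putting structure on an otherwise unordered collection of tasks.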
In summary, evaluation is becoming crucial in AI and will become much more sophisticated and relevant in the years to come. New events in 2017, including challenges (such as the General AI Challenge13), competitions, and workshops (such as the Evaluating General-Purpose AI 2017 workshop14 at IJCAI 2017), will delve much further into how general-purpose AI should be evaluated now and in the future.
Notes
1. See the AAAI 2015 workshop, Beyond the Turing Test, chaired by Gary Marcus, Francesca Rossi, and Manuela Veloso.
2. See the spring 2016 issue of AI Magazine, volume 37, number 1. The 13 special issue articles were edited by Gary Marcus, Francesca Rossi, and Manuela Veloso.
3. See the Arcade Learning Environment website.
4. See www.gvgai.net/vgdl.php.
5. See gym.openai.com.
6. See blog.openai.com/universe.
7. See www.microsoft.com/en-us/research/project/project-malmo.
8. See www.goodai.com/brain-simulator.
9. See deepmind.com/blog/open-sourcing-deepmind-lab.