[Figure 3. Change Over Time of the Three Reproducibility Metrics for Selected Years of the Two Conferences AAAI and IJCAI. Gundersen and Kjensmo (2018). Horizontal axis: 2013, 2014, 2015, 2016.]

[Figure 4. The Three Degrees of Reproducibility Are Defined by Which Documentation Is Used to Reproduce the Results. The three degrees of reproducibility each require a different set of factors to be documented. Labels: Method, Data, Experiment.]
Gundersen and Kjensmo (2018) surveyed 400 papers from the AAAI 2014, AAAI 2016, IJCAI 2013,
and IJCAI 2016 conferences. Among these, 325 papers
describe empirical studies, while the remaining 75
papers do not. Figure 1 displays the percentage of the
surveyed papers that documented the different variables, while figure 2 summarizes how many of the
variables were documented for each factor per paper.
We make a few observations. As seen in figure 1,
few of the papers explicitly mention the research
method that is used, and only around half explicitly
mention which problem is being solved. Only about
a third of the papers share the test dataset and only 4
percent share the result produced by the AI program.
Only 8 percent of the papers share the source code of
the AI method that is being investigated, while only
5 percent explicitly specify the hypothesis and 1 percent specify their prediction. Figure 2 shows that 67
papers do not explicitly document any of the variables for the factor method; only one paper documents and shares training, validation, and test sets as
well as the results; and approximately 90 percent of
the papers document no more than three of the seven variables of the factor experiment.
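To make the figure 2 tally concrete, the following minimal Python sketch counts how many of each factor's variables a single paper documents. The grouping of variables under factors is illustrative, drawn only from variables named in this article; Gundersen and Kjensmo (2018) define the full lists (the experiment factor, for instance, has seven variables, of which only two appear here).

# Illustrative factor-to-variable grouping; not the authors' full lists.
FACTORS = {
    "method": ["problem", "research method", "hypothesis", "prediction", "pseudocode"],
    "data": ["training set", "validation set", "test set", "results"],
    "experiment": ["source code", "experiment setup"],  # the study lists 7 variables
}

def documented_per_factor(documented: set[str]) -> dict[str, int]:
    """Count how many of each factor's variables a paper documents."""
    return {
        factor: sum(variable in documented for variable in variables)
        for factor, variables in FACTORS.items()
    }

# A paper that states its problem and shares the test set, nothing else:
print(documented_per_factor({"problem", "test set"}))
# -> {'method': 1, 'data': 1, 'experiment': 0}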
As seen in figure 3, the trends are unclear. Statistical analysis showed that only two of the metrics for IJCAI, R1D and R2D, had a statistically significant increase over time. While R2D and R3D for AAAI decrease over time, the decrease is not statistically significant.
The study by Gundersen and Kjensmo (2018) has some limitations. For example, for the variable problem to be set to yes (true), the study required the paper to state explicitly which problem is being solved. Another shortcoming is that not every AI method documented in the research papers is necessarily described better with pseudocode than without, but this was not taken into consideration: if a paper described an AI method and pseudocode was not provided, the pseudocode variable was set to false for that paper. Finally, some of the variables might be redundant (for example, problem, goal, or research questions). Still, despite these shortcomings, the findings indicate that computational AI research is not documented systematically and with enough information to support reproducibility.
Degrees of Reproducibility
Gundersen and Kjensmo (2018) distinguish between three degrees of reproducibility, defined as follows:
R1: Experiment Reproducible. The results of an experiment are experiment reproducible when the same implementation of an AI method produces the same results when executed on the same data. This is often called repeatability.
R2: Data Reproducible. The results of an experiment are data reproducible when an alternative implementation of the AI method produces the same results when executed on the same data. This is often called replicability.
R3: Method Reproducible. The results of an experiment are method reproducible when an alternative implementation of the AI method produces consistent results when executed on different data. This is often called reproducibility.
Empirical research that is R1 (experiment reproducible) must document the AI method, the data used to conduct the experiment, and the experiment itself, including the source code for the AI method and the experiment setup. R2 (data reproducible) research need only document the AI method and the data, and R3 (method reproducible) research need only document the AI method. Figure 4 illustrates the different factors that must be documented for the three reproducibility degrees.
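Read as a checklist, this mapping from documented factors to the highest degree of reproducibility a paper supports is mechanical. Here is a minimal Python sketch of it, assuming the three factors are recorded as booleans; the function name and return labels are illustrative, not from the paper.

def degree_of_reproducibility(method: bool, data: bool, experiment: bool) -> str | None:
    """Highest degree supported by the documented factors (figure 4):
    R3 needs the method, R2 also needs the data, and R1 additionally
    needs the experiment (source code and setup)."""
    if not method:
        return None  # without the method, no degree is supported
    if data and experiment:
        return "R1"  # experiment reproducible (repeatability)
    if data:
        return "R2"  # data reproducible (replicability)
    return "R3"      # method reproducible (reproducibility)

assert degree_of_reproducibility(True, True, True) == "R1"
assert degree_of_reproducibility(True, True, False) == "R2"
assert degree_of_reproducibility(True, False, False) == "R3"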