is a divide in documentation quality between industry and academia, how could we reduce or remove this gap? Based on what we know about reproducibility, should we make more detailed checklists for peer review that have check boxes for whether the problem is described well enough, a hypothesis is stated, or the code and data are shared?
If so, it will become clear what is expected from an IJCAI or AAAI paper, and that reproducibility is important for getting one accepted. Extending the acceptance criteria to include explicit items related to reproducibility might help reduce the gap between industry and academia. However, if industry is required to share code or data, industry researchers might stop presenting their results at the conferences and journals that introduce such criteria, an outcome we should avoid. Could we have authors register their research as R1-, R2-, or R3-reproducible, so that it is clear what information the papers contain? This would require researchers to become aware of the documentation quality of their research, if they are not already.
One could also imagine setting a quota for the percentage of accepted research that must be R2- or R3-reproducible. Researchers in industry, and any others who would not or could not share everything, could then publish as much as they are able to. This would arguably make such research harder to get accepted, so the incentive would be to share.
To increase the reproducibility of AI research, the culture must change. The high-impact conferences and journals have the power to make this change, together with the grant makers that fund research. Although low-impact conferences and journals could see the need for reproducibility as an opportunity to increase their impact, they may fear scaring researchers away.
Increased Interest in Reproducibility
In this survey, I have analyzed papers presented at IJCAI and AAAI between 2012 and 2016. However, over the last few years, the AI and machine learning communities have shown increased interest in reproducible research. A few workshops that focused wholly or partly on reproducibility were organized before 2016, such as the Workshop on Replicability and Reusability in Natural Language Processing: From Data to Software Sharing at IJCAI in 2015. In 2017, the workshop Reproducibility in Machine Learning Research was organized at that year’s International Conference on Machine Learning, and the workshop Enabling Reproducibility in Machine Learning (MLTrain@RML) was held at the 2018 International Conference on Machine Learning. The Reproducibility Challenge was organized at the 2018 International Conference on Learning Representations. I organized the AAAI Workshop on Reproducibility in 2019, where the participants discussed how to improve the reproducibility of papers published by AAAI. At AAAI 2017, the tutorial Learn to Write a Scientific Paper of the Future: Reproducible Research, Open Science, and Digital Scholarship was given.
This increased interest has resulted in several highly relevant papers, of which a few are mentioned here. Sculley et al. (2018) discuss empirical rigor and stress its importance for work that presents “methods that yield impressive empirical results, but are difficult to analyze theoretically” (p. 1). Mannarswamy and Roy (2018) suggest that we need to build AI software that can perform the verification task given a research paper that presents a technique and details on where to find the code and data used in the paper. This could help reduce the workload of reproducing research results. Exactly such a tool is presented by Sethi et al. (2018), who have made software that autogenerates code from deep learning papers with 93 percent accuracy. Henderson et al. (2018) show that “both intrinsic (for example, random seeds, environment properties) and extrinsic sources (for example, hyperparameters, codebases) of nondeterminism can contribute to difficulties in reproducing baseline algorithms” (p. 3213).
Conclusion
We are not standing on each other’s shoulders; it is more like we are standing on each other’s feet. The quality of the documentation of empirical AI research clearly must improve.
My findings indicate that the hypothesis that industry and academic research presented at top AI conferences is equally well documented is not supported. Academic research scores higher on the three reproducibility metrics than research to which industry has contributed. Academia also scores higher on all three factors, but these results are not statistically significant. Furthermore, academic research scores higher than industry research on 15 of the 16 surveyed variables, while the two groups score the same on the last variable; the result is statistically significant for only three of the variables investigated. The difference in documentation quality between industry and academia is surprising, as the conferences use double-blind peer review and all research is judged according to the same standards.
I discussed three barriers that keep individual researchers from making research reproducible: It is time-consuming, there are no incentives, and future work is put at risk. Some suggestions for how to overcome these barriers were made: Infrastructure that reduces the time and effort of making research reproducible should be built and provided to researchers; funding sources could start demanding that researchers make the research they fund reproducible; and sharing of code and data should be rewarded, as should making the research reproducible. Some ideas for why there is a discrepancy in documentation quality between academia and industry were also discussed. Industry has many incentives not to share data or code, as both