The Unreasonable Effectiveness of Open Science in AI: A Replication Study

Authors: Odd Erik Gundersen, Odd Cappelen, Martin Mølnå, Nicklas Grimstad Nilsen

AAAI 2025

Reproducibility variables, assessed results, and LLM responses:
Research Type: Experimental. Therefore, we performed a systematic replication study including 30 highly cited AI studies, relying on original materials when available. In the end, eight articles were rejected because they required access to data or hardware that was practically impossible to acquire as part of the project. Six articles were successfully reproduced, while five were partially reproduced. In total, 50% of the included articles were reproduced to some extent. The availability of code and data correlates strongly with reproducibility: 86% of articles that shared both code and data were fully or partly reproduced, compared with 33% of articles that shared only data. The quality of the data documentation also correlates with successful replication.
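The headline "50% reproduced to some extent" follows directly from the counts reported above; a quick sketch of the arithmetic, using only the numbers stated in the abstract:

```python
# Counts taken from the abstract of the replication study.
total_selected = 30   # highly cited AI studies initially selected
rejected = 8          # required data/hardware that was impossible to acquire
fully = 6             # successfully reproduced
partially = 5         # partially reproduced

included = total_selected - rejected           # articles actually attempted
share = (fully + partially) / included         # fraction reproduced to some extent

print(included, share)  # 22 0.5
```

The 86% and 33% figures depend on per-category counts not restated here, so they are not recomputed in this sketch.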
Researcher Affiliation: Collaboration. Odd Erik Gundersen1,2, Odd Cappelen1,3, Martin Mølnå1,2, Nicklas Grimstad Nilsen1,2. 1Norwegian University of Science and Technology, Trondheim, Norway; 2Aneo AS, Trondheim, Norway; 3Minus 1, Oslo, Norway. EMAIL, EMAIL, EMAIL
Pseudocode: No. The paper describes its methodologies in prose and through discussions of findings. There are no explicitly labeled sections such as "Pseudocode" or "Algorithm," nor are there structured, code-like procedural steps presented in a formal block format.
Open Source Code: Yes. Code: https://github.com/AIReproducibility2018
Open Datasets: No. The paper discusses the availability and issues of datasets used in the research papers it aims to replicate, categorizing them as R3 (data available) or R4 (code and data available). However, the paper does not provide concrete access information (link, DOI, or formal citation) for a specific dataset that was generated or directly used as input for its own replication study's methodology.
Dataset Splits: No. This paper is a replication study of other AI research. Its methodology involves evaluating the reproducibility of these external studies. Therefore, it does not involve the creation or use of a single dataset with explicit train/test/validation splits for its own experimental process. While it mentions dataset partitioning as a problem in other papers (P18), it does not provide such details for its own study.
Hardware Specification: No. We used personal computers and a high-end GPU cluster to execute experiments.
Software Dependencies: No. The paper discusses using the same programming language and third-party libraries as the original studies when reimplementing. It also mentions choosing suitable languages and substitute libraries if necessary. However, it does not provide a specific list of software dependencies (e.g., library names with version numbers) for its own experimental environment or tools.
Experiment Setup: Yes. The maximum time we spent on a reproducibility study was 40 hours of focused work. Breaks did not count towards the limit. The limit was set for practical reasons, and we considered 40 hours a reasonable effort. To some extent, we based this decision on the prediction that well-documented studies should be reproducible within this time frame. Many published articles include more than one experiment, so one reproducibility study could contain several reproducibility experiments. We focused on one experiment at a time. When deciding which experiment to start with, we emphasized the importance of the experiment to the article, i.e., how much it is discussed, as well as the order in which the experiments were presented. In most cases, the first eligible experiment was conducted first. Some experiments described in the articles were excluded on the basis of which material was available for exactly that experiment. For example, when deciding that an article could be reproduced at the R4 reproducibility level, only the experiments in the article covered by the provided method code were considered eligible. If, after having obtained results for the first experiment, there was still time left within the 40 hours, we moved on to the next eligible experiment. ... Whenever a random number generator was used, we explicitly set the seed in the code.
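The final point, explicitly setting the seed whenever a random number generator was used, can be sketched as follows. This is a minimal illustration, not the authors' actual code; the seed value and the choice of libraries are assumptions for the example.

```python
import random

SEED = 42  # illustrative value, not taken from the paper


def set_seeds(seed: int = SEED) -> None:
    """Fix the seeds of common random number generators so that
    repeated runs produce the same sequence of random draws."""
    random.seed(seed)

    # NumPy is only seeded if it is installed; the paper does not
    # state which libraries were used in each reimplementation.
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass


set_seeds()
first_draw = random.random()
set_seeds()
second_draw = random.random()
print(first_draw == second_draw)  # True: same seed, same sequence
```

Seeding alone does not guarantee bit-identical results across hardware or library versions, which is one reason the study distinguishes full from partial reproduction.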