Documentation and Reproducibility Scores

To quantify reproducibility at scale, we define two complementary measures: the Documentation Score and the Reproducibility Score. The Documentation Score captures the extent to which a paper reports information relevant for independent reproduction, while the Reproducibility Score estimates the probability that a study can in practice be reproduced from the artifacts the authors provide. Together, the two measures characterise both reproducibility-relevant documentation and the estimated likelihood that the research itself can be reproduced.

These two metrics build on the reproducibility types and degrees introduced by [1]. Reproducibility types describe what documentation is used when reproducing an experiment: a textual description only (R1), code and description (R2), data and description (R3), or the complete experiment (R4). Reproducibility degrees describe the level at which results agree: an experiment is outcome reproducible if the reproduced outcome is identical (for example, the same class), analysis reproducible if a different outcome yields the same conclusion under the same analysis, and interpretation reproducible if a different analysis still supports the same conclusion. The Documentation Score measures reporting practices along the R1–R4 axis, while the Reproducibility Score estimates the likelihood of reproduction along the outcome and analysis dimensions, using observed reproduction rates under different artifact-sharing conditions [2].

Documentation Score

The Documentation Score \(D_p \in [0, 7]\) is based on seven reproducibility variables: pseudocode, open code, open data, dataset splits, software dependencies, hardware specifications, and experiment details. These variables were derived from the broader framework introduced by [3] and narrowed to a subset that could be identified accurately and consistently at scale using an LLM [4]. The selection was motivated by two considerations: each variable captures information that is substantively important for reproducing machine learning research, and each can be detected with sufficient reliability in paper text to support large-scale automated analysis. For each paper, the LLM assigns a value of 1 if a variable is present and 0 otherwise. The Documentation Score is the unweighted sum of these seven binary indicators.
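As a minimal sketch (not the authors' implementation; the variable names are illustrative), the Documentation Score reduces to summing seven 0/1 indicators per paper:

```python
# Illustrative sketch: Documentation Score D_p as the unweighted sum of
# seven binary reproducibility variables. Variable names are our own labels.
VARIABLES = [
    "pseudocode", "open_code", "open_data", "dataset_splits",
    "software_dependencies", "hardware_specifications", "experiment_details",
]

def documentation_score(indicators: dict) -> int:
    """Return D_p in [0, 7]: 1 point per variable reported as present."""
    return sum(int(bool(indicators.get(v, 0))) for v in VARIABLES)

# A paper reporting four of the seven variables scores 4.
paper = {"open_code": 1, "open_data": 1, "dataset_splits": 1,
         "experiment_details": 1}
score = documentation_score(paper)  # -> 4
```

A missing variable is treated the same as one explicitly marked absent, matching the binary present/absent coding described above.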

We motivate this approach with three main reasons. First, the selected variables reflect core dimensions of experimental transparency. Access to code and data directly reduces barriers to reproduction, while information about dataset splits, software dependencies, hardware, and experimental details reduces ambiguity about how the results were obtained. Pseudocode further helps clarify the intended method when implementation details are not fully recoverable from the text alone. Second, prior work has shown that these factors are closely related to whether AI research can be reproduced in practice; [5] found that pseudocode, readability, and specification of hyperparameters were among the features most predictive of a successful independent replication. Differences in software framework [6], [7], software versions [8], [9], [10], [11], ancillary software [8], [9], [10], [12], [13], processing units [8], [10], [14], [15], random seeds [12], [16], [17], [18], [19], and dataset splits [16], [20], [21] have each been shown to produce substantially different results when reproducing experiments. Third, a restricted set of variables is appropriate in a large-scale study; the purpose of the Documentation Score is not to exhaustively characterise all determinants of reproducibility, but to measure a reproducibility-relevant subset of reporting practices that can be observed consistently across the vast majority of published papers. [4] reported F1 scores exceeding 90% for most variables on an independently labelled evaluation dataset of 160 papers. In this sense, the Documentation Score should be interpreted as a measure of documentation quality with respect to reproducibility, rather than a direct measure of successful reproduction.

Reproducibility Score

The Reproducibility Score \(R_p \in [0, 1]\) complements the Documentation Score by estimating the extent to which a collection of papers is reproducible in practice. It is based on empirical results reported by [2], who found that papers sharing both open code and data could be reproduced 86% of the time, whereas papers sharing open data but not code could be reproduced 33% of the time. Papers without shared data were not attempted and are assigned a score of 0. Following this approach, we assign each paper a score according to:

\[\begin{equation} R_p = \begin{cases} 0.86 & \text{if both open code and data are available,} \\ 0.33 & \text{if open data is available but code is not,} \\ 0 & \text{if open data is not available.} \end{cases} \label{eq:rep-score} \end{equation}\]

The Reproducibility Score for a set of papers is the average of these values and therefore ranges from 0 to 1. Note that under the rule in [eq:rep-score] the range is effectively 0 to 0.86, since the empirical data do not support assigning any paper a score of 1.
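A minimal sketch of the piecewise rule and the set-level average (illustrative code, not the authors' implementation):

```python
def reproducibility_score(open_code: bool, open_data: bool) -> float:
    """Per-paper R_p following the piecewise rule in Eq. [eq:rep-score]."""
    if open_data and open_code:
        return 0.86   # both code and data shared
    if open_data:
        return 0.33   # data shared, code not
    return 0.0        # no shared data: reproduction not attempted in [2]

# Three papers: (open_code, open_data) flags.
papers = [(True, True), (False, True), (False, False)]
scores = [reproducibility_score(c, d) for c, d in papers]
mean_score = sum(scores) / len(scores)  # (0.86 + 0.33 + 0.0) / 3
```

Note that sharing code without data yields 0, reflecting the central role of data availability in the rule above.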

This formulation is preferable to simpler additive alternatives because it is empirically calibrated rather than normatively assigned. A scheme that awards one point for open code and one for open data would measure artifact sharing, but it would not estimate reproducibility. The present weighting reflects observed differences in reproduction outcomes under different sharing conditions. It also captures an important asymmetry between code and data. In many machine learning settings, the absence of open data is a more severe obstacle than the absence of code, because using a different dataset introduces its own sources of irreproducibility: dataset bias from collection methods [13], [22], [23], [24], [25], inconsistencies introduced by pre-processing [26] and annotation quality [27], and differences in how data is split across training, validation, and test sets [7], [16], [20], [21], [28]. At the same time, sharing data without code does not ensure reproducibility, since reproduction may still fail because of missing algorithmic and implementation details such as hyperparameter choices, optimisation procedures [6], [16], [18], [21], initialisation [12], [19], random seeds [12], [16], [17], [18], [19], or undocumented engineering decisions. The Reproducibility Score therefore reflects both the central role of dataset availability and the additional contribution of code availability to successful reproduction.

At the same time, this score should be interpreted with appropriate caution. The underlying empirical estimates are based on a relatively small sample of reproduction attempts [2] and should therefore be understood as approximate probabilities rather than universal constants. Nevertheless, they provide a more defensible basis for large-scale estimation than arbitrary equal weighting, which, for the purposes of the AI Reproducibility Index, is an important advantage. The objective of the Index is to compare venues, countries, and institutions systematically with respect to the extent to which their research is documented and reproducible. The Documentation Score serves this objective by measuring the presence of key reproducibility-related reporting practices, while the Reproducibility Score translates the most consequential artifact-sharing practices into an empirically grounded estimate of reproducibility potential.

Entity-Level Attribution and Ranking

We distribute credit for each paper proportionally among its authors’ affiliated entities, following the fractional counting approach [29], [30]. In the description below, we use the term entity to refer to either a country or an institution. If an author has more than one affiliation, only the first affiliation is used for attribution.

For a paper with \(N\) authors, each author’s affiliated entity receives a share equal to \(1/N\) of that paper’s score. For example, a paper with four authors, two from entity A and one each from entities B and C, assigns half the paper’s score to entity A and one quarter each to entities B and C. Authors with unknown affiliations are retained in the denominator to prevent artificial score inflation. The entity’s overall score is the weighted mean of these fractional contributions across all papers to which it contributed. Only countries with at least 25 contributing papers and institutions with at least 100 contributing papers are included. Entities are ranked by this weighted mean score.
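The fractional-counting attribution can be sketched as follows (illustrative code; entity labels are hypothetical, and `None` marks an unknown affiliation that stays in the denominator):

```python
from collections import defaultdict

def fractional_credit(author_entities: list, score: float) -> dict:
    """Distribute a paper's score across its N authors' first affiliations.

    Each author contributes 1/N of the score to their entity. Authors with
    unknown affiliation (None) are kept in the denominator but credit no
    entity, so the total distributed credit can be less than the score.
    """
    n = len(author_entities)
    credit = defaultdict(float)
    for entity in author_entities:
        if entity is not None:
            credit[entity] += score / n
    return dict(credit)

# Four authors: two from A, one each from B and C.
credit = fractional_credit(["A", "A", "B", "C"], 1.0)
# -> {"A": 0.5, "B": 0.25, "C": 0.25}
```

Keeping unknown affiliations in the denominator means a paper with many unidentified authors contributes less total credit, which is the intended guard against score inflation.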

Because entities with few contributing papers may exhibit extreme scores due to variance alone, we account for statistical uncertainty using a robust standard error that applies Bessel’s correction based on the unweighted count of distinct papers per entity. We tested the score distributions for normality [31], [32] and rejected the null hypothesis of normality. We therefore report 95% confidence intervals using the bias-corrected and accelerated (BCa) bootstrap method [33] with 10,000 resamples, which does not assume normality of the underlying distribution.
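The BCa interval described above can be obtained with SciPy's `bootstrap` routine; the sketch below uses synthetic placeholder scores in place of real entity-level data and is not the authors' implementation:

```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(0)
# Placeholder: fractional per-paper scores for one entity (synthetic data).
scores = rng.beta(2, 5, size=200)

# 95% BCa bootstrap confidence interval for the mean, 10,000 resamples;
# method="BCa" makes no normality assumption about the score distribution.
res = bootstrap((scores,), np.mean, n_resamples=10_000,
                confidence_level=0.95, method="BCa", random_state=rng)
lo, hi = res.confidence_interval
```

The sample mean always lies inside the BCa interval here, and the interval width shrinks as the number of distinct papers per entity grows, which is why low-count entities are excluded from the rankings.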

[1]
O. E. Gundersen, “The fundamental principles of reproducibility,” Philosophical Transactions of the Royal Society A, vol. 379, no. 2197, p. 20200210, 2021.
[2]
O. E. Gundersen, O. Cappelen, M. Mølnå, and N. G. Nilsen, “The unreasonable effectiveness of open science in AI: A replication study,” in Proceedings of the 39th AAAI Conference on Artificial Intelligence, 2025, pp. 26211–26219. doi: 10.1609/aaai.v39i25.34818.
[3]
O. E. Gundersen and S. Kjensmo, “State of the art: Reproducibility in artificial intelligence,” in Proceedings of the 32nd AAAI conference on artificial intelligence, 2018, pp. 1644–1651. doi: 10.1609/aaai.v32i1.11503.
[4]
Anonymous, “The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers,” Under Review.
[5]
E. Raff, “A step toward quantifying independently reproducible machine learning research,” in Advances in Neural Information Processing Systems, vol. 32, pp. 1–11, 2019, Available: https://proceedings.neurips.cc/paper_files/paper/2019/file/c429429bf1f2af051f2021dc92a8ebea-Paper.pdf
[6]
P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” in Proceedings of the thirty-second AAAI conference on artificial intelligence, 2018, pp. 3207–3214.
[7]
L. Pouchard, Y. Lin, and H. Van Dam, “Replicating machine learning experiments in materials science,” in Parallel computing: Technology trends, vol. 36, in Advances in parallel computing, vol. 36., IOS Press, 2020, pp. 743–755. doi: 10.3233/APC200105.
[8]
K. Coakley, C. R. Kirkpatrick, and O. E. Gundersen, “Examining the effect of implementation factors on deep learning reproducibility,” in Proceedings of the IEEE 18th international conference on e-science (e-science), IEEE, 2022, pp. 397–398.
[9]
M. Crane, “Questionable answers in question answering research: Reproducibility and variability of published results,” Transactions of the Association for Computational Linguistics, vol. 6, pp. 241–252, 2018.
[10]
O. E. Gundersen, S. Shamsaliei, and R. J. Isdahl, “Do machine learning platforms provide out-of-the-box reproducibility?” Future Generation Computer Systems, vol. 126, pp. 34–47, 2022, doi: 10.1016/j.future.2021.06.014.
[11]
M. Shahriari, R. Ramler, and L. Fischer, “How do deep-learning framework versions affect the reproducibility of neural network models?” Machine Learning and Knowledge Extraction, vol. 4, no. 4, pp. 888–911, 2022.
[12]
H. V. Pham et al., “Problems and opportunities in training deep learning software systems: An analysis of variance,” in Proceedings of the 35th IEEE/ACM international conference on automated software engineering, New York, NY, USA: Association for Computing Machinery, 2020, pp. 771–783. doi: 10.1145/3324884.3416545.
[13]
J. Pineau et al., “Improving reproducibility in machine learning research (a report from the NeurIPS 2019 reproducibility program),” Journal of Machine Learning Research, vol. 22, no. 1, 2021.
[14]
S.-Y. Hong et al., “An evaluation of the software system dependency of a global atmospheric model,” Monthly Weather Review, vol. 141, no. 11, pp. 4165–4172, 2013, doi: 10.1175/MWR-D-12-00352.1.
[15]
P. Nagarajan, G. Warnell, and P. Stone, “The impact of nondeterminism on reproducibility in deep reinforcement learning,” in 2nd reproducibility in machine learning workshop at ICML 2018, Stockholm, Sweden, 2019.
[16]
X. Bouthillier et al., “Accounting for variance in machine learning benchmarks,” Proceedings of the Fourth Conference on Machine Learning and Systems, vol. 3, pp. 747–769, 2021.
[17]
G. Melis, C. Dyer, and P. Blunsom, “On the state of the art of evaluation in neural language models,” in Proceedings of the sixth international conference on learning representations, 2018. Available: https://openreview.net/forum?id=ByJHuTgA-
[18]
N. Reimers and I. Gurevych, “Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging,” in Proceedings of the 22nd conference on empirical methods in natural language processing, M. Palmer, R. Hwa, and S. Riedel, Eds., Copenhagen, Denmark: Association for Computational Linguistics, Sep. 2017, pp. 338–348. doi: 10.18653/v1/D17-1035.
[19]
D. Zhuang, X. Zhang, S. Song, and S. Hooker, “Randomness in neural network training: Characterizing the impact of tooling,” in Proceedings of the fourth conference on machine learning and systems, D. Marculescu, Y. Chi, and C. Wu, Eds., 2022, pp. 316–336. Available: https://proceedings.mlsys.org/paper_files/paper/2022/file/427e0e886ebf87538afdf0badb805b7f-Paper.pdf
[20]
S. Makridakis, E. Spiliotis, and V. Assimakopoulos, “Statistical and machine learning forecasting methods: Concerns and ways forward,” PloS one, vol. 13, no. 3, p. e0194889, 2018.
[21]
X. Bouthillier, C. Laurent, and P. Vincent, “Unreproducible research is reproducible,” in Proceedings of the 36th international conference on machine learning, PMLR, 2019, pp. 725–734.
[22]
M. F. Goodchild and W. Li, “Replication across space and time must be weak in the social and environmental sciences,” Proceedings of the National Academy of Sciences, vol. 118, no. 35, p. e2015759118, 2021.
[23]
S. G. Finlayson et al., “The clinician and dataset shift in artificial intelligence,” New England Journal of Medicine, vol. 385, no. 3, pp. 283–286, 2021, doi: 10.1056/NEJMc2104626.
[24]
A. Torralba and A. A. Efros, “Unbiased look at dataset bias,” in Proceedings of the 29th IEEE/CVF conference on computer vision and pattern recognition, Colorado Springs, CO, USA: IEEE, 2011, pp. 1521–1528. doi: 10.1109/CVPR.2011.5995347.
[25]
B. Recht, R. Roelofs, L. Schmidt, and V. Shankar, “Do ImageNet classifiers generalize to ImageNet?” in Proceedings of the 36th international conference on machine learning, K. Chaudhuri and R. Salakhutdinov, Eds., in Proceedings of machine learning research, vol. 97. PMLR, 2019, pp. 5389–5400. Available: https://proceedings.mlr.press/v97/recht19a.html
[26]
M. Ferrari Dacrema, P. Cremonesi, and D. Jannach, “Are we really making much progress? A worrying analysis of recent neural recommendation approaches,” in Proceedings of the 13th ACM conference on recommender systems, New York, NY, USA: Association for Computing Machinery, 2019, pp. 101–109. doi: 10.1145/3298689.3347058.
[27]
A. Belz, A. Shimorina, S. Agarwal, and E. Reiter, “The ReproGen shared task on reproducibility of human evaluations in NLG: Overview and results,” in Proceedings of the 14th international conference on natural language generation, Aberdeen, Scotland, UK: Association for Computational Linguistics, Aug. 2021, pp. 249–258.
[28]
Ç. Çöltekin, “Verification, reproduction and replication of NLP experiments: A case study on parsing Universal Dependencies,” in Proceedings of the fourth workshop on universal dependencies (UDW 2020), Barcelona, Spain (Online): Association for Computational Linguistics, Dec. 2020, pp. 46–56. Available: https://aclanthology.org/2020.udw-1.6/
[29]
M. Gauffriau and P. O. Larsen, “Counting methods are decisive for rankings based on publication and citation studies,” Scientometrics, vol. 64, no. 1, pp. 85–93, 2005.
[30]
M. Gauffriau, P. Larsen, I. Maye, A. Roulin-Perriard, and M. von Ins, “Comparisons of results of publication counting using different methods,” Scientometrics, vol. 77, no. 1, pp. 147–176, 2008.
[31]
R. B. d’Agostino, “An omnibus test of normality for moderate and large size samples,” Biometrika, vol. 58, no. 2, pp. 341–348, 1971.
[32]
R. D’Agostino and E. S. Pearson, “Tests for departure from normality: Empirical results for the distributions of \(b_2\) and \(\sqrt{b_1}\),” Biometrika, vol. 60, no. 3, pp. 613–622, 1973.
[33]
B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. Chapman & Hall/CRC, 1994.