Documentation and Reproducibility Scores
To quantify reproducibility at scale, we define two complementary measures: the Documentation Score and the Reproducibility Score. The Documentation Score captures the extent to which a paper reports information relevant for independent reproduction, while the Reproducibility Score estimates the probability that a study can in practice be reproduced from the artifacts provided by the authors. Together, these measures characterise both reproducibility-relevant documentation and the estimated likelihood that the research itself can be reproduced.
These two metrics build on the reproducibility types and degrees introduced by [1]. Reproducibility types describe what documentation is used when reproducing an experiment: a textual description only (R1), code and description (R2), data and description (R3), or the complete experiment (R4). Reproducibility degrees describe the level at which results agree: an experiment is outcome reproducible if the reproduced outcome is identical (for example, the same class), analysis reproducible if a different outcome yields the same conclusion under the same analysis, and interpretation reproducible if a different analysis still supports the same conclusion. The Documentation Score measures reporting practices along the R1–R4 axis, while the Reproducibility Score estimates the likelihood of reproduction along the outcome/analysis dimension using observed reproduction rates under different artifact-sharing conditions [2].
Documentation Score
The Documentation Score \(D_p \in [0, 7]\) is based on seven reproducibility variables: pseudocode, open code, open data, dataset splits, software dependencies, hardware specifications, and experiment details. These variables were derived from the broader framework introduced by [3] and narrowed to a subset that could be identified accurately and consistently at scale using an LLM [4]. The selection was motivated by two considerations: each variable captures information that is substantively important for reproducing machine learning research, and each can be detected with sufficient reliability in paper text to support large-scale automated analysis. For each paper, the LLM assigns a value of 1 if a variable is present and 0 otherwise. The Documentation Score is the unweighted sum of these seven binary indicators.
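The scoring rule above can be sketched as follows; the variable names and the dictionary representation of a paper are illustrative assumptions, not the exact schema used in the study.

```python
# Hypothetical sketch of the Documentation Score: an unweighted sum of
# seven binary indicators, one per reproducibility variable.
VARIABLES = [
    "pseudocode", "open_code", "open_data", "dataset_splits",
    "software_dependencies", "hardware_specifications", "experiment_details",
]

def documentation_score(paper: dict) -> int:
    """Sum of the seven binary indicators, giving a score in [0, 7]."""
    return sum(1 if paper.get(v, 0) else 0 for v in VARIABLES)

# Example paper with five of the seven variables reported.
paper = {"open_code": 1, "open_data": 1, "dataset_splits": 1,
         "software_dependencies": 0, "hardware_specifications": 1,
         "pseudocode": 0, "experiment_details": 1}
print(documentation_score(paper))  # → 5
```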
Three considerations motivate this approach. First, the selected variables reflect core dimensions of experimental transparency. Access to code and data directly reduces barriers to reproduction, while information about dataset splits, software dependencies, hardware, and experimental details reduces ambiguity surrounding the acquisition of results. Pseudocode further helps clarify the intended method when implementation details are not fully recoverable from the text alone. Second, prior work has shown that these factors are closely related to whether AI research can be reproduced in practice; [5] found that pseudocode, readability, and specification of hyperparameters were among the features most predictive of a successful independent replication. Differences in software framework [6], [7], software versions [8], [9], [10], [11], ancillary software [8], [9], [10], [12], [13], processing units [8], [10], [14], [15], random seeds [12], [16], [17], [18], [19], and dataset splits [16], [20], [21] have each been shown to produce substantially different results when reproducing experiments. Third, a restricted set of variables is appropriate in a large-scale study; the purpose of the Documentation Score is not to exhaustively characterise all determinants of reproducibility, but to measure a reproducibility-relevant subset of reporting practices that can be observed consistently across the vast majority of published papers. [4] reported F1 scores exceeding 90% for most variables on an independently labelled evaluation dataset of 160 papers. In this sense, the Documentation Score should be interpreted as a measure of documentation quality with respect to reproducibility, rather than a direct measure of successful reproduction.
Reproducibility Score
The Reproducibility Score \(R_p \in [0, 1]\) complements the Documentation Score by estimating the extent to which a collection of papers is reproducible in practice. It is based on empirical results reported by [2], who found that papers sharing both open code and data could be reproduced 86% of the time, whereas papers sharing open data but not code could be reproduced 33% of the time. Papers without shared data were not attempted and are assigned a score of 0. Following this approach, we assign each paper a score according to:
\[\begin{equation} R_p = \begin{cases} 0.86 & \text{if both open code and data are available,} \\ 0.33 & \text{if open data is available but code is not,} \\ 0 & \text{if open data is not available.} \end{cases} \label{eq:rep-score} \end{equation}\]
The Reproducibility Score for a set of papers is the average of these values and therefore ranges from 0 to 1. Note that under the rule set in [eq:rep-score] the range is effectively 0 to 0.86, since no paper can currently be awarded a score of 1 given the available empirical data.
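The piecewise rule in [eq:rep-score] and the averaging step can be sketched as below; the function and the example paper list are illustrative, not the study's implementation.

```python
def reproducibility_score(open_code: bool, open_data: bool) -> float:
    """Empirically calibrated R_p, following the piecewise rule above."""
    if open_data and open_code:
        return 0.86   # both code and data shared
    if open_data:
        return 0.33   # data shared, code not shared
    return 0.0        # no shared data: reproduction not attempted

# Hypothetical set of four papers as (open_code, open_data) pairs.
# Note that code without data still scores 0 under this rule.
papers = [(True, True), (False, True), (False, False), (True, False)]
scores = [reproducibility_score(c, d) for c, d in papers]
mean_score = sum(scores) / len(scores)  # (0.86 + 0.33 + 0 + 0) / 4 = 0.2975
```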
This formulation is preferable to simpler additive alternatives because it is empirically calibrated rather than normatively assigned. A scheme that awards one point for open code and one for open data would measure artifact sharing, but it would not estimate reproducibility. The present weighting reflects observed differences in reproduction outcomes under different sharing conditions. It also captures an important asymmetry between code and data. In many machine learning settings, the absence of open data is a more severe obstacle than the absence of code, because using a different dataset introduces its own sources of irreproducibility: dataset bias from collection methods [13], [22], [23], [24], [25], inconsistencies introduced by pre-processing [26] and annotation quality [27], and differences in how data is split across training, validation, and test sets [7], [16], [20], [21], [28]. At the same time, sharing data without code does not ensure reproducibility, since reproduction may still fail because of missing algorithmic and implementation details such as hyperparameter choices, optimisation procedures [6], [16], [18], [21], initialisation [12], [19], random seeds [12], [16], [17], [18], [19], or undocumented engineering decisions. The Reproducibility Score therefore reflects both the central role of dataset availability and the additional contribution of code availability to successful reproduction.
At the same time, this score should be interpreted with appropriate caution. The underlying empirical estimates are based on a relatively small sample of reproduction attempts [2] and should therefore be understood as approximate probabilities rather than universal constants. Nevertheless, they provide a more defensible basis for large-scale estimation than arbitrary equal weighting, which, for the purposes of the AI Reproducibility Index, is an important advantage. The objective of the Index is to compare venues, countries, and institutions systematically with respect to the extent to which their research is documented and reproducible. The Documentation Score serves this objective by measuring the presence of key reproducibility-related reporting practices, while the Reproducibility Score translates the most consequential artifact-sharing practices into an empirically grounded estimate of reproducibility potential.
Entity-Level Attribution and Ranking
We distribute credit for each paper proportionally among its authors’ affiliated entities, following the fractional counting approach [29], [30]. In the description below, we use the term entity to refer to either a country or an institution. If an author has more than one affiliation, only the first affiliation is used for attribution.
For a paper with \(N\) authors, each author’s affiliated entity receives a share equal to \(1/N\) of that paper’s score. For example, a paper with four authors, two from entity A and one each from entities B and C, assigns half the paper’s score to entity A and one quarter each to entities B and C. Authors with unknown affiliations are retained in the denominator to prevent artificial score inflation. The entity’s overall score is the weighted mean of these fractional contributions across all papers to which it contributed. Only countries with at least 25 contributing papers and institutions with at least 100 are included. Entities are ranked by this weighted mean score.
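The fractional-counting and aggregation steps can be sketched as follows; the function names and data layout are illustrative assumptions.

```python
from collections import defaultdict

def fractional_shares(affiliations):
    """Each of the N authors contributes 1/N of the paper's score to their
    first-listed entity. Unknown affiliations (None) remain in the
    denominator but earn no share, preventing artificial inflation."""
    n = len(affiliations)
    shares = defaultdict(float)
    for entity in affiliations:
        if entity is not None:
            shares[entity] += 1.0 / n
    return dict(shares)

def entity_score(contributions):
    """Weighted mean of paper scores, weighted by the entity's fractional
    share in each paper. contributions: list of (share, paper_score)."""
    total = sum(w for w, _ in contributions)
    return sum(w * s for w, s in contributions) / total

# The four-author example from the text: two authors from A, one each from B and C.
print(fractional_shares(["A", "A", "B", "C"]))  # {'A': 0.5, 'B': 0.25, 'C': 0.25}
```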
Because entities with few contributing papers may exhibit extreme scores due to variance alone, we account for statistical uncertainty using a robust standard error that applies Bessel’s correction based on the unweighted count of distinct papers per entity. We tested the score distributions for normality [31], [32] and rejected the null hypothesis of normality. We therefore report 95% confidence intervals using the bias-corrected and accelerated (BCa) bootstrap method [33] with 10,000 resamples, which does not assume normality of the underlying distribution.
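A BCa bootstrap confidence interval of the kind described above can be computed with SciPy; the synthetic score sample below is an assumption for illustration only.

```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(0)
# Hypothetical per-paper fractional scores for one entity (skewed, non-normal).
scores = rng.exponential(scale=0.3, size=200).clip(0, 1)

# 95% bias-corrected and accelerated (BCa) bootstrap CI for the mean score,
# with 10,000 resamples, matching the procedure described in the text.
res = bootstrap((scores,), np.mean, confidence_level=0.95,
                n_resamples=10_000, method="BCa", random_state=rng)
lo, hi = res.confidence_interval
print(f"95% BCa CI for the mean score: [{lo:.3f}, {hi:.3f}]")
```

Unlike normal-approximation intervals, the BCa interval adjusts for both bias and skewness in the bootstrap distribution, which matters here because the score distributions failed the normality tests.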