Position: Machine Learning Models Have a Supply Chain Problem
Authors: Sarah Meiklejohn, Hayden Blauzvern, Mihai Maruseac, Spencer Schrock, Laurent Simon, Ilia Shumailov
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implemented model signing in Python, on top of the sigstore-python library. Our implementation is 4500 lines of code and available as an open-source library. We benchmarked our hashing code, using SHA-256 and both the naïve and list-based approaches, for file sizes ranging from 1 B to 1 TB, on three machines. We also summarize in Table 1 the costs associated with hashing and signing a variety of large open models (as obtained from Hugging Face). To benchmark the costs associated with this type of training data commitment, we use the available Rust code for the Parakeet verifiable registry (Malvai et al., 2023). We measured the costs of computing a commitment and proving and verifying against it for datasets ranging from 1000 to 10 billion data points. |
| Researcher Affiliation | Industry | 1Google 2Google DeepMind. Correspondence to: Sarah Meiklejohn <EMAIL>. |
| Pseudocode | Yes | The algorithm for forming this type of commitment, ZKS.Commit, can be found in Figure 3. (Figure 3: Algorithms for our zero-knowledge set, assuming an underlying accumulator Acc and VRF VRF.) |
| Open Source Code | Yes | We have released this work as an open-source library and are working to integrate it into existing model hubs. Our implementation is 4500 lines of code and available as an open-source library.12 (footnote 12: https://github.com/sigstore/model-transparency/) |
| Open Datasets | Yes | smaller image models might be trained on well-known datasets like CIFAR-10 (which has 50K rows)13 or MNIST (60K rows)14, while larger datasets like YouTube-Commons (400K rows)15 are used for finetuning language models for Q&A tasks. (Footnotes 13, 14, 15 point to Hugging Face datasets: https://huggingface.co/datasets/cifar10, https://huggingface.co/datasets/mnist, https://huggingface.co/datasets/PleIAs/YouTube-Commons) |
| Dataset Splits | No | The paper mentions datasets like CIFAR-10 (which has 50K rows), MNIST (60K rows), and YouTube-Commons (400K rows) in Section 6.5, but does not provide specific training/test/validation splits used for any experimental setup within this paper. |
| Hardware Specification | Yes | We benchmarked our hashing code... on three machines: (1) M1 with 24 vCPUs running on AMD EPYC 7B12 at 2.25 GHz and 96 GB of RAM; (2) M2 with 64 vCPUs running on AMD EPYC 7B13 at 2.45 GHz and with 120 GB of RAM; and (3) M3 with 128 vCPUs running on AMD EPYC 7B13 CPUs at 2.45 GHz and with 240 GB of RAM. |
| Software Dependencies | No | The paper mentions implementing model signing in Python on top of the sigstore-python library and using available Rust code for the Parakeet verifiable registry, but does not provide specific version numbers for any of these software components. |
| Experiment Setup | No | The paper details the setup for benchmarking its cryptographic signing and verification process, including hashing approaches ('naïve' and 'list-based'), file sizes (1 B to 1 TB), chunk size (default 1 GB), and signature scheme (ECDSA P256). However, it does not specify machine learning hyperparameters or training configurations for models like learning rate, batch size, or epochs, as the core experiment is on the cryptographic transparency solution rather than ML model training itself. |
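The list-based hashing approach that the paper benchmarks can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' implementation (which lives in the sigstore/model-transparency repository): the file is read in fixed-size chunks (the paper's default chunk size is 1 GB), each chunk is hashed with SHA-256, and the concatenation of the chunk digests is hashed once more. The function name and signature below are invented for this sketch.

```python
import hashlib

def list_based_digest(path: str, chunk_size: int = 1 << 30) -> str:
    """Illustrative list-based file digest: hash each fixed-size chunk,
    then hash the concatenated chunk digests. Chunks can be hashed in
    parallel in a real implementation, which is the point of the scheme."""
    chunk_digests = []
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            chunk_digests.append(hashlib.sha256(chunk).digest())
    return hashlib.sha256(b"".join(chunk_digests)).hexdigest()
```

Unlike the naïve approach (a single SHA-256 pass over the whole file), the chunk digests here are independent, so a real implementation can fan the chunks out across the vCPUs of the benchmark machines.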
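The signature scheme named in the Experiment Setup row, ECDSA over P-256, can be exercised with the pyca/cryptography library. This is a generic sign/verify sketch over a manifest of model-file digests, not the paper's Sigstore-based flow, in which keys and certificates are managed by the Sigstore infrastructure rather than generated locally.

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

# Ephemeral P-256 key pair for illustration only.
private_key = ec.generate_private_key(ec.SECP256R1())
public_key = private_key.public_key()

# In the paper's setting this payload would be the serialized manifest of
# per-file SHA-256 digests produced by the hashing step.
manifest = b"placeholder manifest of model file digests"
signature = private_key.sign(manifest, ec.ECDSA(hashes.SHA256()))

# verify() returns None on success and raises InvalidSignature on failure.
public_key.verify(signature, manifest, ec.ECDSA(hashes.SHA256()))
```

This also makes concrete why the paper's signing cost is dominated by hashing: the ECDSA operation itself signs only a short digest, regardless of model size.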
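The training-data commitment that the paper benchmarks is a zero-knowledge set built from an accumulator and a VRF (its ZKS.Commit algorithm appears in Figure 3 of the paper, and the benchmarks use the Parakeet registry's Rust code). As a much simpler stand-in that conveys the same commit/prove/verify flow, the sketch below commits to a dataset with a Merkle tree; it is not the authors' construction and offers none of the zero-knowledge properties.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def commit(leaves: list[bytes]) -> bytes:
    """Merkle-root commitment to a list of data points."""
    level = [_h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:          # pad odd levels by duplicating the last node
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def prove(leaves: list[bytes], index: int) -> list[tuple[bytes, int]]:
    """Membership proof: sibling hashes from the leaf up to the root."""
    level = [_h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        proof.append((level[index ^ 1], index % 2))   # (sibling, am-I-right-child)
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(root: bytes, leaf: bytes, proof: list[tuple[bytes, int]]) -> bool:
    node = _h(leaf)
    for sibling, is_right in proof:
        node = _h(sibling + node) if is_right else _h(node + sibling)
    return node == root
```

Proof size and verification cost grow logarithmically in the dataset size, which is why commitments of this general shape stay practical even at the 10-billion-point scale the paper measures.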