Commit0: Library Generation from Scratch

Authors: Wenting Zhao, Nan Jiang, Celine Lee, Justin Chiu, Claire Cardie, Matthias Gallé, Alexander Rush

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments demonstrate that while current agents can pass some unit tests, none can yet fully reproduce full libraries. Results also show that interactive feedback is quite useful for models to generate code that passes more unit tests, validating the benchmarks that facilitate its use."
Researcher Affiliation | Collaboration | "1Cornell University 2Cohere EMAIL"
Pseudocode | No | The paper describes the stages of the SDE-I agent conceptually in Figure 3, titled "Overview of SDE-I," and in the text, but it does not present a formal pseudocode or algorithm block with structured steps.
Open Source Code | Yes | "We publicly release the benchmark1, the interactive environment2, and the leaderboard3. ... 2https://github.com/commit-0/commit0 ... We release the COMMIT0 benchmark in its entirety, along with all methods and results. We also provide the code for reproducing the dataset, so that it may be used for synthesizing data."
Open Datasets | Yes | "We publicly release the benchmark1, the interactive environment2, and the leaderboard3. ... 1https://huggingface.co/datasets/commit0/commit0 ... We release the COMMIT0 benchmark in its entirety, along with all methods and results."
Dataset Splits | Yes | "We create two dataset splits: lite, which includes libraries with fewer functions to implement, and all, which contains all libraries. Lite has a total of 16 libraries. Due to the complexity of COMMIT0 and budget constraints, we focus most of our evaluation on COMMIT0 lite."
Hardware Specification | No | The paper mentions "Modal for providing us with credits to run unit tests using their service" and refers to "a single CPU" for test run-time limits, but it does not specify particular GPU or CPU models, memory, or other hardware details used for training or inference.
Software Dependencies | No | The paper mentions using "Python libraries," "ruff as our linter," and "pytest-cov," but it does not provide version numbers for these components. It also lists several LLMs (GPT-4o-mini, Claude 3.5 Sonnet, DeepSeek-V2.5, Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct, Llama-3.1-405B-Instruct, and Codestral), but these are models, not software dependency versions.
Experiment Setup | Yes | "To assess the effectiveness of each stage in the SDE-I agent, we evaluate ablated versions of the method where we apply a fixed number of stages. ... First, we sample a module 1, 3, and 10 times, picking the best implementation based on pass rates before proceeding to the next module. Additionally, we test whether continuous iterations on unit test feedback will eventually enable agents to pass all unit tests. We conducted an experiment where we applied unit test feedback over different numbers of iterations: 1, 3, and 10."
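The two ablations quoted in the Experiment Setup row (best-of-n sampling per module, and repeated unit-test feedback) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate`, `refine`, and `run_unit_tests` are hypothetical stand-ins for the agent's LLM calls and the Commit0 test harness, and an implementation's pass rate is assumed to be a float in [0, 1].

```python
def best_of_n(generate, run_unit_tests, n):
    """Sample a module implementation n times (e.g. n = 1, 3, or 10) and
    keep the candidate with the highest unit-test pass rate before the
    agent moves on to the next module."""
    best_impl, best_rate = None, -1.0
    for _ in range(n):
        impl = generate()
        rate = run_unit_tests(impl)
        if rate > best_rate:
            best_impl, best_rate = impl, rate
    return best_impl, best_rate


def iterate_with_feedback(generate, refine, run_unit_tests, iterations):
    """Apply unit-test feedback over a fixed number of iterations
    (e.g. 1, 3, or 10), stopping early once every test passes."""
    impl = generate()
    rate = run_unit_tests(impl)
    for _ in range(iterations - 1):
        if rate == 1.0:  # all unit tests pass; no further feedback needed
            break
        impl = refine(impl, rate)  # feed the test results back to the model
        rate = run_unit_tests(impl)
    return impl, rate
```

In this sketch, raising `n` trades extra generation cost for a better starting implementation, while raising `iterations` trades extra test runs for convergence on the remaining failures, which mirrors the 1/3/10 grids in the quoted setup.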