Commit0: Library Generation from Scratch

Authors: Wenting Zhao, Nan Jiang, Celine Lee, Justin Chiu, Claire Cardie, Matthias Gallé, Alexander Rush

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments demonstrate that while current agents can pass some unit tests, none can yet fully reproduce full libraries. Results also show that interactive feedback is quite useful for models to generate code that passes more unit tests, validating the benchmarks that facilitate its use."
Researcher Affiliation | Collaboration | "1Cornell University 2Cohere EMAIL"
Pseudocode | No | The paper describes the stages of the SDE-I agent conceptually in Figure 3, titled "Overview of SDE-I," and in the text, but it does not present a formal pseudocode or algorithm block with structured steps.
Open Source Code | Yes | "We publicly release the benchmark1, the interactive environment2, and the leaderboard3. ... 2https://github.com/commit-0/commit0 ... We release the COMMIT0 benchmark in its entirety, along with all methods and results. We also provide the code for reproducing the dataset, so that it may be used for synthesizing data."
Open Datasets | Yes | "We publicly release the benchmark1, the interactive environment2, and the leaderboard3. ... 1https://huggingface.co/datasets/commit0/commit0 ... We release the COMMIT0 benchmark in its entirety, along with all methods and results."
Dataset Splits | Yes | "We create two dataset splits: lite, which includes libraries with fewer functions to implement, and all, which contains all libraries. Lite has a total of 16 libraries. Due to the complexity of COMMIT0 and budget constraints, we focus most of our evaluation on COMMIT0 lite."
Hardware Specification | No | The paper mentions "Modal for providing us with credits to run unit tests using their service" and refers to "a single CPU" for test run-time limits, but it does not specify particular GPU or CPU models, memory, or other hardware details used for training or inference.
Software Dependencies | No | The paper mentions using "Python libraries," "ruff as our linter," and "pytest-cov," but it does not provide version numbers for these components. It also lists several LLMs (GPT-4o-mini, Claude 3.5 Sonnet, DeepSeek-V2.5, Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct, Llama-3.1-405B-Instruct, and Codestral), but these are models, not software dependency versions.
Experiment Setup | Yes | "To assess the effectiveness of each stage in the SDE-I agent, we evaluate ablated versions of the method where we apply a fixed number of stages. ... First, we sample a module 1, 3, and 10 times, picking the best implementation based on pass rates before proceeding to the next module. Additionally, we test whether continuous iterations on unit test feedback will eventually enable agents to pass all unit tests. We conducted an experiment where we applied unit test feedback over different numbers of iterations: 1, 3, and 10."
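The two ablations quoted in the Experiment Setup row (best-of-n sampling per module, and repeated unit-test feedback) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate`, `refine`, and `run_unit_tests` are hypothetical stand-ins for the agent's LLM calls and the Commit0 test harness, and an implementation's pass rate is assumed to be a float in [0, 1].

```python
def best_of_n(generate, run_unit_tests, n):
    """Sample a module implementation n times (e.g. n = 1, 3, or 10) and
    keep the candidate with the highest unit-test pass rate before the
    agent moves on to the next module."""
    best_impl, best_rate = None, -1.0
    for _ in range(n):
        impl = generate()
        rate = run_unit_tests(impl)
        if rate > best_rate:
            best_impl, best_rate = impl, rate
    return best_impl, best_rate


def iterate_with_feedback(generate, refine, run_unit_tests, iterations):
    """Apply unit-test feedback over a fixed number of iterations
    (e.g. 1, 3, or 10), stopping early once every test passes."""
    impl = generate()
    rate = run_unit_tests(impl)
    for _ in range(iterations - 1):
        if rate == 1.0:  # all unit tests pass; no further feedback needed
            break
        impl = refine(impl, rate)  # feed the test results back to the model
        rate = run_unit_tests(impl)
    return impl, rate
```

In this sketch, raising `n` trades extra generation cost for a better starting implementation, while raising `iterations` trades extra test runs for convergence on the remaining failures, which mirrors the 1/3/10 grids in the quoted setup.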