Notice: The reproducibility variables underlying each score are classified by an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

AuPair: Golden Example Pairs for Code Repair

Authors: Aditi Mavalankar, Hassan Mansoor, Zita Marinho, Mariia Samsikova, Tom Schaul

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the results of our algorithm on 5 LLMs across 7 competitive programming datasets for the code repair task. Our algorithm yields a significant boost in performance compared to best-of-N and self-repair, and also exhibits strong generalisation across datasets and models.
Researcher Affiliation | Industry | 1Google DeepMind. Correspondence to: Aditi Mavalankar <EMAIL>.
Pseudocode | Yes | Algorithm 1: Fix quality matrix computation; Algorithm 2: Submodular AuPair extraction; Algorithm 3: Pair generation.
Open Source Code | No | The paper does not contain an explicit statement about releasing code, nor a link to a code repository for the methodology described.
Open Datasets | Yes | We use 7 datasets that contain problems and test cases from competitive programming contests: 1) Codeforces (8.8k problems), 2) AtCoder (1.3k problems), 3) HackerEarth (1.2k problems), 4) CodeChef (768 problems), 5) LiveCodeBench (400 problems), 6) Code Jam (180 problems), and 7) Aizu (2.2k problems) (Li et al., 2022b; Jain et al., 2024).
Dataset Splits | Yes | Our training/validation/test split proportions for the Codeforces and AtCoder datasets are 37.5/12.5/50%.
Hardware Specification | No | The paper discusses various LLM models (Gemini-1.5-Pro, GPT-4o-mini, etc.) but does not specify the underlying hardware (e.g., GPU models, CPU types) used to run the experiments.
Software Dependencies | No | The paper names specific LLM models (e.g., Gemini-1.5-Pro, GPT-4o-mini, Gemma-27B, Gemma-9B) and refers to BERT for embeddings, but it does not provide version numbers for the software, programming languages, or libraries used to implement the methodology.
Experiment Setup | Yes | Our compute budget is N = 32, of which 4 LLM calls are used to generate verbal feedback and 7 LLM calls to generate repaired code for each verbal feedback. To ensure the sampling of high-quality diverse responses in best-of-N, we set the temperature to 1.0 (Renze & Guven, 2024). We set k = 32 in this algorithm.
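The compute-budget arithmetic quoted in the Experiment Setup row can be checked in a few lines: 4 feedback calls, each seeding 7 repair calls, account for all N = 32 LLM calls. This is a minimal sketch of that accounting only, not the paper's implementation; the variable names are illustrative.

```python
# Budget accounting from the quoted setup: N = 32 total LLM calls,
# 4 calls produce verbal feedback, and each feedback seeds 7 repair calls.
N = 32
feedback_calls = 4
repairs_per_feedback = 7

repair_calls = feedback_calls * repairs_per_feedback  # 4 * 7 = 28
total_calls = feedback_calls + repair_calls           # 4 + 28 = 32

assert total_calls == N  # the stated split exhausts the budget exactly
print(f"feedback={feedback_calls} repairs={repair_calls} total={total_calls}")
```

The check confirms the split is exact: the feedback and repair calls together consume the full budget, leaving no calls unaccounted for.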