Counterexample Guided Program Repair Using Zero-Shot Learning and MaxSAT-based Fault Localization
Authors: Pedro Orvalho, Mikoláš Janota, Vasco M. Manquinho
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on 1,431 incorrect student programs show that our counterexample guided approach, using MaxSAT-based bug-free program sketches, significantly improves the repair capabilities of all six evaluated LLMs. This method allows LLMs to repair more programs and produce smaller fixes, outperforming other configurations and state-of-the-art symbolic program repair tools. |
| Researcher Affiliation | Academia | (1) Department of Computer Science, University of Oxford, Oxford, UK; (2) CIIRC, Czech Technical University in Prague, Czechia; (3) INESC-ID, IST, Universidade de Lisboa, Portugal. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper includes code listings (Listing 1, 2, 3, 4) and a diagram (Figure 1), but no explicitly labeled pseudocode or algorithm blocks. The CEGIS loop is described in text. |
| Open Source Code | Yes | Code https://doi.org/10.5281/zenodo.14517771 |
| Open Datasets | Yes | We evaluated our work using C-PACK-IPAS (Orvalho, Janota, and Manquinho 2024a), which consists of 1431 semantically incorrect student C programs that compile successfully but fail at least one test. |
| Dataset Splits | No | The paper uses the C-PACK-IPAS dataset of 1431 programs as an evaluation benchmark. It does not describe any training/validation/test splits for this dataset, as the methodology uses pre-trained LLMs in a zero-shot setting. |
| Hardware Specification | Yes | All LLMs were run using NVIDIA RTX A4000 graphics cards with 16GB of memory on an Intel(R) Xeon(R) Silver 4130 CPU @ 2.10GHz with 48 CPUs and 128GB RAM. All the experiments related to the program repair tasks were conducted on an Intel(R) Xeon(R) Silver computer with 4210R CPUs @ 2.40GHz, using a memory limit of 10GB and a timeout of 90 seconds. |
| Software Dependencies | No | The paper mentions using 'Hugging Face's Pipeline architecture' and a tool called 'CFAULTS', but it does not specify any version numbers for these software components or for any other libraries/frameworks (e.g., PyTorch, TensorFlow, Python version) that would be needed for full reproducibility. |
| Experiment Setup | Yes | We used only open-access LLMs available on Hugging Face (Hugging Face 2024) with approximately 7 billion parameters... For Meta's LLAMA3, we utilized the 8B-parameter instruction-tuned variant. For CODELLAMA, we used the 7B-parameter instruct-tuned version... We employed GRANITE model with 8B-parameters... For PHI3, we opted for the mini version, which has 3.8B-parameters and a context length of 128K... For GEMMA, we used the 7B-parameter instruction-tuned version... For CODEGEMMA, we selected the 7B-parameter instruction-tuned variant... To fit all LLMs into 16GB GPUs, we used model quantization of 4-bit. All the experiments related to the program repair tasks were conducted on an Intel(R) Xeon(R) Silver computer... using a memory limit of 10GB and a timeout of 90 seconds. |
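The paper describes its method as a CEGIS-style counterexample guided loop (noted in the Pseudocode row above): run the candidate program against the test suite, and feed any failing test back to the repair step. A minimal self-contained sketch of that loop, where `run_tests` and `propose_fix` are hypothetical stand-ins for the paper's test harness and LLM repair call:

```python
def run_tests(program, tests):
    """Return the first failing (input, expected) pair, or None if all pass."""
    for inp, expected in tests:
        if program(inp) != expected:
            return inp, expected
    return None

def repair(program, tests, propose_fix, max_iters=10):
    """Counterexample-guided loop: feed each failing test back to the fixer."""
    for _ in range(max_iters):
        counterexample = run_tests(program, tests)
        if counterexample is None:
            return program   # all tests pass: program considered repaired
        program = propose_fix(program, counterexample)
    return None              # gave up within the iteration budget

# Toy usage: the "buggy program" doubles instead of squaring; the stub
# fixer ignores the counterexample and returns the correct implementation.
tests = [(2, 4), (3, 9)]
buggy = lambda x: 2 * x
fixed = repair(buggy, tests, lambda prog, cex: (lambda x: x * x))
assert fixed is not None and fixed(3) == 9
```

In the paper's actual pipeline, `propose_fix` corresponds to prompting the LLM with the MaxSAT-localized bug-free sketch plus the counterexample, rather than a stub.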
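The Experiment Setup row states that ~7B-parameter models were loaded via Hugging Face with 4-bit quantization to fit 16GB GPUs. One plausible way to reproduce that configuration with the `transformers` and `bitsandbytes` libraries is sketched below (a configuration sketch only, not the paper's released code; the model identifier is illustrative, chosen from one of the evaluated model families):

```python
# Assumes `transformers`, `accelerate`, and `bitsandbytes` are installed
# and a CUDA GPU is available; versions are unspecified in the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True)  # 4-bit weight quantization
model_id = "codellama/CodeLlama-7b-Instruct-hf"       # illustrative model choice

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers across available GPU memory
)
```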