Counterexample Guided Program Repair Using Zero-Shot Learning and MaxSAT-based Fault Localization
Authors: Pedro Orvalho, Mikoláš Janota, Vasco M. Manquinho
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on 1,431 incorrect student programs show that our counterexample guided approach, using MaxSAT-based bug-free program sketches, significantly improves the repair capabilities of all six evaluated LLMs. This method allows LLMs to repair more programs and produce smaller fixes, outperforming other configurations and state-of-the-art symbolic program repair tools. |
| Researcher Affiliation | Academia | (1) Department of Computer Science, University of Oxford, Oxford, UK; (2) CIIRC, Czech Technical University in Prague, Czechia; (3) INESC-ID, IST, Universidade de Lisboa, Portugal. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper includes code listings (Listing 1, 2, 3, 4) and a diagram (Figure 1), but no explicitly labeled pseudocode or algorithm blocks. The CEGIS loop is described in text. |
| Open Source Code | Yes | Code https://doi.org/10.5281/zenodo.14517771 |
| Open Datasets | Yes | We evaluated our work using C-PACK-IPAS (Orvalho, Janota, and Manquinho 2024a), which consists of 1431 semantically incorrect student C programs that compile successfully but fail at least one test. |
| Dataset Splits | No | The paper uses the C-PACK-IPAS dataset of 1431 programs as an evaluation benchmark. It does not describe any training/validation/test splits for this dataset, as the methodology uses pre-trained LLMs in a zero-shot setting. |
| Hardware Specification | Yes | All LLMs were run using NVIDIA RTX A4000 graphics cards with 16GB of memory on an Intel(R) Xeon(R) Silver 4130 CPU @ 2.10GHz with 48 CPUs and 128GB RAM. All the experiments related to the program repair tasks were conducted on an Intel(R) Xeon(R) Silver computer with 4210R CPUs @ 2.40GHz, using a memory limit of 10GB and a timeout of 90 seconds. |
| Software Dependencies | No | The paper mentions using 'Hugging Face's Pipeline architecture' and a tool called 'CFAULTS', but it does not specify any version numbers for these software components or for any other libraries/frameworks (e.g., PyTorch, TensorFlow, Python version) that would be needed for full reproducibility. |
| Experiment Setup | Yes | We used only open-access LLMs available on Hugging Face (Hugging Face 2024) with approximately 7 billion parameters... For Meta's LLAMA3, we utilized the 8B-parameter instruction-tuned variant. For CODELLAMA, we used the 7B-parameter instruct-tuned version... We employed GRANITE model with 8B-parameters... For PHI3, we opted for the mini version, which has 3.8B-parameters and a context length of 128K... For GEMMA, we used the 7B-parameter instruction-tuned version... For CODEGEMMA, we selected the 7B-parameter instruction-tuned variant... To fit all LLMs into 16GB GPUs, we used model quantization of 4-bit. All the experiments related to the program repair tasks were conducted on an Intel(R) Xeon(R) Silver computer... using a memory limit of 10GB and a timeout of 90 seconds. |
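The paper describes its method as a CEGIS-style counterexample guided loop (noted in the Pseudocode row above): run the candidate program against the test suite, and feed any failing test back to the repair step. A minimal self-contained sketch of that loop, where `run_tests` and `propose_fix` are hypothetical stand-ins for the paper's test harness and LLM repair call:

```python
def run_tests(program, tests):
    """Return the first failing (input, expected) pair, or None if all pass."""
    for inp, expected in tests:
        if program(inp) != expected:
            return inp, expected
    return None

def repair(program, tests, propose_fix, max_iters=10):
    """Counterexample-guided loop: feed each failing test back to the fixer."""
    for _ in range(max_iters):
        counterexample = run_tests(program, tests)
        if counterexample is None:
            return program   # all tests pass: program considered repaired
        program = propose_fix(program, counterexample)
    return None              # gave up within the iteration budget

# Toy usage: the "buggy program" doubles instead of squaring; the stub
# fixer ignores the counterexample and returns the correct implementation.
tests = [(2, 4), (3, 9)]
buggy = lambda x: 2 * x
fixed = repair(buggy, tests, lambda prog, cex: (lambda x: x * x))
assert fixed is not None and fixed(3) == 9
```

In the paper's actual pipeline, `propose_fix` corresponds to prompting the LLM with the MaxSAT-localized bug-free sketch plus the counterexample, rather than a stub.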
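The Experiment Setup row states that ~7B-parameter models were loaded via Hugging Face with 4-bit quantization to fit 16GB GPUs. One plausible way to reproduce that configuration with the `transformers` and `bitsandbytes` libraries is sketched below (a configuration sketch only, not the paper's released code; the model identifier is illustrative, chosen from one of the evaluated model families):

```python
# Assumes `transformers`, `accelerate`, and `bitsandbytes` are installed
# and a CUDA GPU is available; versions are unspecified in the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True)  # 4-bit weight quantization
model_id = "codellama/CodeLlama-7b-Instruct-hf"       # illustrative model choice

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers across available GPU memory
)
```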