CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement
Authors: Leitian Tao, Xiang Chen, Tong Yu, Tung Mai, Ryan A. Rossi, Yixuan Li, Saayan Mitra
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We comprehensively evaluate the effectiveness of CodeLutra on challenging data query and data science tasks, where the LLM is tasked with generating the correct SQL or Python code to solve a given problem. We compare CodeLutra with 13 open-source and closed-source LLMs that are competitive in code generation. Notably, on the data query task, our framework allows Llama-3-8B (Dubey et al., 2024) to achieve an execution accuracy of 76.6%, which exceeds GPT-4's 74.4%. |
| Researcher Affiliation | Collaboration | Leitian Tao (University of Wisconsin-Madison), Xiang Chen (Adobe Research), Tong Yu (Adobe Research), Tung Mai (Adobe Research), Ryan A. Rossi (Adobe Research), Yixuan Li (University of Wisconsin-Madison), Saayan Mitra (Adobe Research) |
| Pseudocode | Yes | We summarize our algorithm and implementation in Algorithm 1 (Algorithm 1: CodeLutra). |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for their methodology, nor does it provide a link to a code repository or mention code in supplementary materials. |
| Open Datasets | Yes | We conduct our experiments on two cross-domain datasets for data query, Spider (Yu et al., 2018) and BIRD (Li et al., 2024), and a data science dataset, DS-1000 (Lai et al., 2023). |
| Dataset Splits | Yes | We split DS-1000 into 500 samples for training and 500 for evaluation. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running the experiments. |
| Software Dependencies | No | The paper mentions hyperparameters and a framework (DeepSpeed ZeRO Stage 2) but does not provide specific version numbers for software dependencies or libraries used to replicate the experiment. |
| Experiment Setup | Yes | Table 4: Summary of training hyperparameters for data query and data science for each iteration. Parameters: number of epochs 1, learning rate 5 × 10^-5, beta 0.1 (data query) / 0.5 (data science), batch size 16, gradient accumulation steps 1, maximum sequence length 2048 (data query) / 512 (data science), DeepSpeed ZeRO Stage 2, weight decay 0.0001, LoRA rank 8, lambda 1.0 (data query) / 0.5 (data science). |
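For a reproduction attempt, the Table 4 hyperparameters above can be collected into a small config sketch. This is a minimal illustration only: the `TrainConfig` dataclass and its field names are hypothetical (not from the paper), and the paper does not specify the training library these values would be passed to.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Settings shared by both tasks (Table 4 of the paper)
    epochs: int = 1
    learning_rate: float = 5e-5       # 5 * 10^-5
    batch_size: int = 16
    grad_accum_steps: int = 1
    weight_decay: float = 1e-4
    lora_rank: int = 8
    # Task-specific settings; defaults below are the data-query values
    beta: float = 0.1                 # 0.1 (data query) / 0.5 (data science)
    max_seq_len: int = 2048           # 2048 (data query) / 512 (data science)
    lam: float = 1.0                  # lambda: 1.0 (data query) / 0.5 (data science)

# One config per task, differing only in the task-specific fields
data_query = TrainConfig()
data_science = TrainConfig(beta=0.5, max_seq_len=512, lam=0.5)
```

Keeping the shared values as dataclass defaults and overriding only the three task-specific fields makes the per-iteration setup for each task explicit and easy to diff.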