VERSE: Verification-based Self-Play for Code Instructions
Authors: Hao Jiang, Qi Liu, Rui Li, Yuze Zhao, Yixiao Ma, Shengyu Ye, Junyu Lu, Yu Su
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that VERSE improves multiple base Code LLMs (average 7.6%) across various languages and tasks on many benchmarks, affirming its effectiveness. To assess the efficacy of VERSE, we conduct experiments across various tasks, encompassing clone detection, defect detection, program synthesis, automated program repair, and code explanation. VERSE exhibits noteworthy enhancements across these diverse tasks. |
| Researcher Affiliation | Academia | 1State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China 2Institute of Artificial Intelligence, Hefei Comprehensive National Science Center 3School of Computer Science and Artificial Intelligence, Hefei Normal University EMAIL, {qiliuql}@ustc.edu.cn, EMAIL |
| Pseudocode | No | The paper describes its methodology in Section 3 "Verification-based Self-Play" with text and figures (Figure 4 shows a pipeline), but it does not present any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | Code https://github.com/TechxGenus/VERSE |
| Open Datasets | Yes | Our experiments encompass five tasks from two widely recognized benchmarks, CODEXGLUE (Lu et al. 2021) and HUMANEVALPACK (Muennighoff et al. 2023). These tasks aim to evaluate the quality of responses to diverse code instructions. The first two tasks are collected from the CODEXGLUE benchmark, while the last three tasks are collected from the HUMANEVALPACK benchmark. Three tasks in HUMANEVALPACK are extended from the HUMANEVAL dataset (Cassano et al. 2023). |
| Dataset Splits | No | The paper states: "Ultimately, for different models, we obtain around 50,000 valid instructions with corresponding responses and verification scores. To ensure a fair comparison, we retain 40,000 valid instructions for training, adjusting for variations in the amount of valid data obtained by different models." While this describes the training set size for their self-generated data, it does not provide explicit train/test/validation splits for the external benchmark datasets (CODEXGLUE, HUMANEVALPACK) used for evaluation, which is required for reproducibility. |
| Hardware Specification | Yes | Our models are trained on 2 Nvidia A100 GPUs for 2 epochs using the Transformers library. |
| Software Dependencies | No | The paper mentions the "Transformers library", "DeepSpeed ZeRO3", "FlashAttention-2", and the "Adafactor optimizer" but does not provide specific version numbers for any of these software components, which is necessary for a reproducible description of ancillary software. |
| Experiment Setup | Yes | Our models are trained on 2 Nvidia A100 GPUs for 2 epochs using the Transformers library. We employ Alpaca-style instruction templates (Taori et al. 2023) for training, and we set the hyperparameter α for calculating the verification score to 4. Memory efficiency and speed are enhanced through techniques including DeepSpeed ZeRO3 (Rajbhandari et al. 2019) and FlashAttention-2 (Dao 2023). We configure a batch size per GPU of 32, a maximum sequence length of 2048, and a learning rate of 5e-5. Training employs the Adafactor optimizer (Shazeer and Stern 2018), coupled with a cosine scheduler featuring 15 warm-up steps. |
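The reported hyperparameters can be collected into a single configuration sketch. The dict and the warmup/cosine helper below are illustrative assumptions for a reproduction attempt, not the authors' released code; only the numeric values come from the paper.

```python
import math

# Hyperparameters as reported in the paper's experiment setup.
# The structure of this dict is an assumption made for illustration.
VERSE_CONFIG = {
    "alpha": 4,                # verification-score hyperparameter
    "epochs": 2,
    "per_gpu_batch_size": 32,  # on 2x Nvidia A100 GPUs
    "max_seq_len": 2048,
    "learning_rate": 5e-5,     # Adafactor optimizer
    "warmup_steps": 15,        # cosine scheduler with 15 warm-up steps
}

def cosine_lr(step: int, total_steps: int,
              base_lr: float = VERSE_CONFIG["learning_rate"],
              warmup: int = VERSE_CONFIG["warmup_steps"]) -> float:
    """Linear warmup followed by cosine decay, matching the reported
    '15 warm-up steps + cosine scheduler' description."""
    if step < warmup:
        # Linear ramp from base_lr/warmup up to base_lr.
        return base_lr * (step + 1) / warmup
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

A reproduction would still need the unreported details (library versions, Adafactor settings beyond the learning rate) flagged as missing in the rows above.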