VERSE: Verification-based Self-Play for Code Instructions
Authors: Hao Jiang, Qi Liu, Rui Li, Yuze Zhao, Yixiao Ma, Shengyu Ye, Junyu Lu, Yu Su
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that VERSE improves multiple base Code LLMs (average 7.6%) across various languages and tasks on many benchmarks, affirming its effectiveness. To assess the efficacy of VERSE, we conduct experiments across various tasks, encompassing clone detection, defect detection, program synthesis, automated program repair, and code explanation. VERSE exhibits noteworthy enhancements across these diverse tasks. |
| Researcher Affiliation | Academia | 1State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China 2Institute of Artificial Intelligence, Hefei Comprehensive National Science Center 3School of Computer Science and Artificial Intelligence, Hefei Normal University EMAIL, {qiliuql}@ustc.edu.cn, EMAIL |
| Pseudocode | No | The paper describes its methodology in Section 3 "Verification-based Self-Play" with text and figures (Figure 4 shows a pipeline), but it does not present any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | Code https://github.com/TechxGenus/VERSE |
| Open Datasets | Yes | Our experiments encompass five tasks from two widely recognized benchmarks, CODEXGLUE (Lu et al. 2021) and HUMANEVALPACK (Muennighoff et al. 2023). These tasks aim to evaluate the quality of responses to diverse code instructions. The first two tasks are collected from the CODEXGLUE benchmark, while the last three tasks are collected from the HUMANEVALPACK benchmark. Three tasks in HUMANEVALPACK are extended from the HUMANEVAL dataset (Cassano et al. 2023). |
| Dataset Splits | No | The paper states: "Ultimately, for different models, we obtain around 50,000 valid instructions with corresponding responses and verification scores. To ensure a fair comparison, we retain 40,000 valid instructions for training, adjusting for variations in the amount of valid data obtained by different models." While this describes the training set size for their self-generated data, it does not provide explicit train/test/validation splits for the external benchmark datasets (CODEXGLUE, HUMANEVALPACK) used for evaluation, which is required for reproducibility. |
| Hardware Specification | Yes | Our models are trained on 2 Nvidia A100 GPUs for 2 epochs using the Transformers library. |
| Software Dependencies | No | The paper mentions the "Transformers library", "DeepSpeed ZeRO3", "FlashAttention-2", and the "Adafactor optimizer" but does not provide specific version numbers for any of these software components, which is necessary for a reproducible description of ancillary software. |
| Experiment Setup | Yes | Our models are trained on 2 Nvidia A100 GPUs for 2 epochs using the Transformers library. We employ Alpaca-style instruction templates (Taori et al. 2023) for training, and we set the hyperparameter α for calculating the verification score to 4. Memory efficiency and speed are enhanced through techniques including DeepSpeed ZeRO3 (Rajbhandari et al. 2019) and FlashAttention-2 (Dao 2023). We configure a batch size per GPU of 32, a maximum sequence length of 2048, and a learning rate of 5e-5. Training employs the Adafactor optimizer (Shazeer and Stern 2018), coupled with a cosine scheduler featuring 15 warm-up steps. |
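The reported hyperparameters can be collected into a single configuration sketch. The dict and the warmup/cosine helper below are illustrative assumptions for a reproduction attempt, not the authors' released code; only the numeric values come from the paper.

```python
import math

# Hyperparameters as reported in the paper's experiment setup.
# The structure of this dict is an assumption made for illustration.
VERSE_CONFIG = {
    "alpha": 4,                # verification-score hyperparameter
    "epochs": 2,
    "per_gpu_batch_size": 32,  # on 2x Nvidia A100 GPUs
    "max_seq_len": 2048,
    "learning_rate": 5e-5,     # Adafactor optimizer
    "warmup_steps": 15,        # cosine scheduler with 15 warm-up steps
}

def cosine_lr(step: int, total_steps: int,
              base_lr: float = VERSE_CONFIG["learning_rate"],
              warmup: int = VERSE_CONFIG["warmup_steps"]) -> float:
    """Linear warmup followed by cosine decay, matching the reported
    '15 warm-up steps + cosine scheduler' description."""
    if step < warmup:
        # Linear ramp from base_lr/warmup up to base_lr.
        return base_lr * (step + 1) / warmup
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

A reproduction would still need the unreported details (library versions, Adafactor settings beyond the learning rate) flagged as missing in the rows above.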