Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
RLTF: Reinforcement Learning from Unit Test Feedback
Authors: Jiate Liu, Yiqin Zhu, Kaiwen Xiao, Qiang Fu, Xiao Han, Wei Yang, Deheng Ye
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that RLTF achieves state-of-the-art performance on the APPS and the MBPP benchmarks. Our code is available at: https://github.com/Zyq-scut/RLTF. A detailed ablation study demonstrates the effectiveness of our approach. Additionally, we perform tests on different LLMs (e.g., CodeT5, CodeGen), illustrating the robustness of our method and its applicability to different base models. |
| Researcher Affiliation | Industry | Jiate Liu EMAIL Tencent Yiqin Zhu EMAIL Tencent Kaiwen Xiao EMAIL Tencent Qiang Fu EMAIL Tencent Xiao Han EMAIL Tencent Wei Yang EMAIL Tencent Deheng Ye EMAIL Tencent |
| Pseudocode | No | The paper describes algorithms and frameworks using prose and mathematical equations, and provides a diagram (Figure 1), but does not contain explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | Our code is available at: https://github.com/Zyq-scut/RLTF. |
| Open Datasets | Yes | APPS Benchmark. We first evaluate using the challenging APPS (Automated Programming Progress Standard) program synthesis benchmark presented by (Hendrycks et al., 2021) ... MBPP Benchmark. To further evaluate our framework, we also employ an additional, smaller, and simpler Python program synthesis dataset called MBPP (Mostly Basic Programming Problems), introduced by (Austin et al., 2021). |
| Dataset Splits | Yes | APPS Benchmark. The benchmark consists of a total of 10,000 coding problems, with an equal train-test split. The dataset is classified into three difficulty levels: Introductory (3,639; train/test = 2,639/1,000), Interview (5,000; train/test = 2,000/3,000), and Competition (1,361; train/test = 361/1,000). ... MBPP Benchmark. The dataset consists of 974 instances, with 374 instances for training, 90 instances for validation, and 500 instances for testing, while reserving 10 instances for few-shot learning evaluations. |
| Hardware Specification | Yes | For the APPS benchmark, we employ the RLTF framework to fine-tune the pretrained Code T5 model. We utilized a machine equipped with 8 NVIDIA V100 GPUs, each with 32GB of memory, for training purposes. ... Concurrently, three additional machines with similar 8-card V100 GPU configurations were used to generate the latest samples. |
| Software Dependencies | No | The paper mentions using Python and models like CodeT5 and CodeGen, but it does not specify version numbers for any software libraries, frameworks, or specific tools used for implementation. |
| Experiment Setup | Yes | Each GPU carried a batch size of 32, and a learning rate of 2e-6 was employed. ... We updated the online buffer every 50 steps. Following the same approach as Code RL, half of the steps were for SL training, while the other half focused on RL training. The length of the online buffer was set to 6400. ... During testing, we employed Nucleus sampling with a top value of 0.95 and set the temperature parameter to 0.6. ... While generating samples, we also used Nucleus sampling with a top value of 0.95 and set the temperature parameter to 1.2. |