Execution-based Code Generation using Deep Reinforcement Learning
Authors: Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, Chandan K. Reddy
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on three code generation tasks demonstrate the effectiveness of our proposed approach compared to SOTA methods, achieving significant improvements in compilation success rates and functional correctness across different PLs. We evaluate PPOCoder on three different code generation tasks: (i) Code Completion, which automatically completes partial Python code snippets; (ii) Code Translation, which involves translating between any language pair among six different PLs (Python, Java, C#, C++, PHP, C); and (iii) Program Synthesis (NL2Code), which generates a Python function given a natural language (NL) description. |
| Researcher Affiliation | Academia | Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, Chandan K. Reddy; Department of Computer Science, Virginia Tech, Arlington, VA |
| Pseudocode | Yes | Alg. 1 provides the pseudocode of PPOCoder. (Algorithm 1: PPOCoder block is present on page 7) |
| Open Source Code | Yes | The source code for PPOCoder can be found at https://github.com/reddy-lab-code-research/PPOCoder. |
| Open Datasets | Yes | For the code completion task, we employ the Python corpus in CodeSearchNet (CSN) (Husain et al., 2019). We use the XLCoST (Zhu et al., 2022a) dataset for the code translation task. For program synthesis, we use the APPS (Hendrycks et al., 2021) dataset; the zero-shot performance of the APPS fine-tuned models was also examined on the MBPP (Austin et al., 2021) program synthesis benchmark. |
| Dataset Splits | Yes | We extract 50K compilable Python methods of sufficient length (at least 64 tokens) and randomly split the data into train/val/test sets with 40K/5K/5K samples. The APPS (Hendrycks et al., 2021) dataset comprises 10k coding problems of varying difficulty levels, split 50/50 into train/test sets. Table 6 in Appendix B shows the detailed statistics of the compilable filtered samples across all six PLs. |
| Hardware Specification | Yes | All of our experiments are implemented with PyTorch and trained using 4 Quadro RTX 8000 GPUs, with 48GB of RAM. |
| Software Dependencies | Yes | All of our experiments are implemented with PyTorch... For Java compilation, we use the javac compiler, version 1.8.0. We use gcc version 7.5.0 for C and C++ compilations. Syntax checking for PHP is performed using the php -l command, PHP version 7.2.24. C# compilation is also checked using the Mono C# compiler, version 4.6.2.0. |
| Experiment Setup | Yes | In all our experiments, we employ a batch size of 32, the AdamW optimizer with a weight decay of 0.05, and a learning rate that warms up from 1e-7 to 2e-5 over the first 1000 steps, then decays based on the inverse square root of the number of steps, as outlined in (Loshchilov & Hutter, 2019). PPOCoder is implemented with the discount rate γ = 1, KL divergence penalty coefficient β = 0.1, policy ratio clip range ϵ = 0.2, and value error coefficient α = 0.001. To sample synthetic hypotheses from the stochastic policy, we use top-k sampling with k = 5 as the action space size. We train PPOCoder + CodeT5 with num_samples = 3 synthetic samples generated for each sample of the CSN dataset, so PPOCoder observes 40K × 3 = 120K input-output sample pairs with synthetic outputs during RL optimization for this task. In all code completion experiments on CSN, we set the maximum source and target sequence length to 400 and the maximum number of RL optimization epochs to 6. |
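The hyperparameters quoted in the Experiment Setup row can be collected into a minimal configuration sketch. This is not from the PPOCoder codebase; the field and function names are illustrative, and the learning-rate schedule is a straightforward reading of "warm up from 1e-7 to 2e-5 over the first 1000 steps, then decay based on the inverse square root of the number of steps":

```python
from dataclasses import dataclass

@dataclass
class PPOCoderConfig:
    # Values quoted from the paper; field names are illustrative.
    batch_size: int = 32
    weight_decay: float = 0.05   # AdamW weight decay
    lr_start: float = 1e-7       # warmup start
    lr_peak: float = 2e-5        # reached after warmup_steps
    warmup_steps: int = 1000
    gamma: float = 1.0           # discount rate
    kl_coef: float = 0.1         # beta, KL divergence penalty coefficient
    clip_range: float = 0.2      # epsilon, policy ratio clip range
    value_coef: float = 0.001    # alpha, value error coefficient
    top_k: int = 5               # action space size for sampling
    num_samples: int = 3         # synthetic samples per CSN input
    max_seq_len: int = 400       # max source/target length (code completion)
    max_rl_epochs: int = 6

def learning_rate(step: int, cfg: PPOCoderConfig) -> float:
    """Linear warmup to lr_peak, then inverse-square-root decay."""
    if step < cfg.warmup_steps:
        frac = step / cfg.warmup_steps
        return cfg.lr_start + frac * (cfg.lr_peak - cfg.lr_start)
    return cfg.lr_peak * (cfg.warmup_steps / step) ** 0.5
```

With this reading of the schedule, the rate peaks at 2e-5 at step 1000 and halves every 4x increase in step count thereafter.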
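The per-language compilation and syntax checks listed under Software Dependencies can be sketched as a small helper. The command table follows the tools the paper names (javac, gcc/g++, php -l, Mono's mcs); the helper itself and the Python byte-compile entry are assumptions, not the paper's implementation:

```python
import subprocess

# One check command per language; versions per the paper:
# javac 1.8.0, gcc 7.5.0, PHP 7.2.24, Mono C# compiler 4.6.2.0.
CHECK_CMDS = {
    "python": ["python", "-m", "py_compile"],  # byte-compile check
    "java":   ["javac"],
    "c":      ["gcc", "-fsyntax-only"],
    "cpp":    ["g++", "-fsyntax-only"],
    "php":    ["php", "-l"],                   # lint mode
    "csharp": ["mcs"],                         # Mono C# compiler
}

def compiles(path: str, lang: str) -> bool:
    """Return True if the source file passes the language's compile/syntax check."""
    cmd = CHECK_CMDS[lang] + [path]
    result = subprocess.run(cmd, capture_output=True)
    return result.returncode == 0
```

A zero exit code from any of these tools indicates the snippet is at least syntactically valid, which is the signal the compilation-success metric needs.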