Execution-based Code Generation using Deep Reinforcement Learning
Authors: Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, Chandan K. Reddy
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on three code generation tasks demonstrate the effectiveness of our proposed approach compared to SOTA methods, achieving significant improvements in compilation success rates and functional correctness across different PLs. We evaluate PPOCoder on three different code generation tasks: (i) Code Completion, which automatically completes partial Python code snippets; (ii) Code Translation, which involves translating between any language pair among six different PLs (Python, Java, C#, C++, PHP, C); and (iii) Program Synthesis (NL2Code), which generates a Python function given a natural language (NL) description. |
| Researcher Affiliation | Academia | Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, Chandan K. Reddy; Department of Computer Science, Virginia Tech, Arlington, VA |
| Pseudocode | Yes | Alg. 1 provides the pseudocode of PPOCoder. (Algorithm 1: PPOCoder block is present on page 7) |
| Open Source Code | Yes | The source code for PPOCoder can be found at https://github.com/reddy-lab-code-research/PPOCoder. |
| Open Datasets | Yes | For the code completion task, we employ the Python corpus in CodeSearchNet (CSN) (Husain et al., 2019). We use the XLCoST (Zhu et al., 2022a) dataset for the code translation task. For program synthesis, we use the APPS (Hendrycks et al., 2021) dataset; the zero-shot performance of the APPS fine-tuned models was also examined on the MBPP (Austin et al., 2021) program synthesis benchmark. |
| Dataset Splits | Yes | We extract 50K compilable Python methods of sufficient length (at least 64 tokens) and randomly split the data into train/val/test sets with 40K/5K/5K samples. The APPS (Hendrycks et al., 2021) dataset comprises 10k coding problems of varying difficulty levels, split 50/50 into train/test sets. Table 6 in Appendix B shows the detailed statistics of the compilable filtered samples across all six PLs. |
| Hardware Specification | Yes | All of our experiments are implemented with PyTorch and trained using 4 Quadro RTX 8000 GPUs, with 48GB of RAM. |
| Software Dependencies | Yes | All of our experiments are implemented with PyTorch... For Java compilation, we use the javac compiler, version 1.8.0. We use gcc version 7.5.0 for C and C++ compilations. Syntax checking for PHP is performed using the php -l command, PHP version 7.2.24. C# compilation is also checked using the Mono C# compiler, version 4.6.2.0. |
| Experiment Setup | Yes | In all our experiments, we employ a batch size of 32, the AdamW optimizer with a weight decay of 0.05, and a learning rate that warms up from 1e-7 to 2e-5 over the first 1000 steps, then decays based on the inverse square root of the number of steps, as outlined in (Loshchilov & Hutter, 2019). PPOCoder is implemented with the discount rate γ = 1, KL divergence penalty coefficient β = 0.1, policy ratio clip range ϵ = 0.2, and value error coefficient α = 0.001. To sample synthetic hypotheses from the stochastic policy, we use top-k sampling with k = 5 as the action space size. We train PPOCoder + CodeT5 with num_samples = 3 synthetic samples generated for each sample of the CSN dataset, so PPOCoder observes 40K × 3 = 120K input-output sample pairs with synthetic outputs during RL optimization for this task. In all code completion experiments on CSN, we set the maximum source and target sequence length to 400 and the maximum number of RL optimization epochs to 6. |
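The hyperparameters quoted in the Experiment Setup row can be collected into a minimal configuration sketch. This is not from the PPOCoder codebase; the field and function names are illustrative, and the learning-rate schedule is a straightforward reading of "warm up from 1e-7 to 2e-5 over the first 1000 steps, then decay based on the inverse square root of the number of steps":

```python
from dataclasses import dataclass

@dataclass
class PPOCoderConfig:
    # Values quoted from the paper; field names are illustrative.
    batch_size: int = 32
    weight_decay: float = 0.05   # AdamW weight decay
    lr_start: float = 1e-7       # warmup start
    lr_peak: float = 2e-5        # reached after warmup_steps
    warmup_steps: int = 1000
    gamma: float = 1.0           # discount rate
    kl_coef: float = 0.1         # beta, KL divergence penalty coefficient
    clip_range: float = 0.2      # epsilon, policy ratio clip range
    value_coef: float = 0.001    # alpha, value error coefficient
    top_k: int = 5               # action space size for sampling
    num_samples: int = 3         # synthetic samples per CSN input
    max_seq_len: int = 400       # max source/target length (code completion)
    max_rl_epochs: int = 6

def learning_rate(step: int, cfg: PPOCoderConfig) -> float:
    """Linear warmup to lr_peak, then inverse-square-root decay."""
    if step < cfg.warmup_steps:
        frac = step / cfg.warmup_steps
        return cfg.lr_start + frac * (cfg.lr_peak - cfg.lr_start)
    return cfg.lr_peak * (cfg.warmup_steps / step) ** 0.5
```

With this reading of the schedule, the rate peaks at 2e-5 at step 1000 and halves every 4x increase in step count thereafter.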
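The per-language compilation and syntax checks listed under Software Dependencies can be sketched as a small helper. The command table follows the tools the paper names (javac, gcc/g++, php -l, Mono's mcs); the helper itself and the Python byte-compile entry are assumptions, not the paper's implementation:

```python
import subprocess

# One check command per language; versions per the paper:
# javac 1.8.0, gcc 7.5.0, PHP 7.2.24, Mono C# compiler 4.6.2.0.
CHECK_CMDS = {
    "python": ["python", "-m", "py_compile"],  # byte-compile check
    "java":   ["javac"],
    "c":      ["gcc", "-fsyntax-only"],
    "cpp":    ["g++", "-fsyntax-only"],
    "php":    ["php", "-l"],                   # lint mode
    "csharp": ["mcs"],                         # Mono C# compiler
}

def compiles(path: str, lang: str) -> bool:
    """Return True if the source file passes the language's compile/syntax check."""
    cmd = CHECK_CMDS[lang] + [path]
    result = subprocess.run(cmd, capture_output=True)
    return result.returncode == 0
```

A zero exit code from any of these tools indicates the snippet is at least syntactically valid, which is the signal the compilation-success metric needs.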