Fault-Aware Neural Code Rankers

Authors: Jeevana Priya Inala, Chenglong Wang, Mei Yang, Andres Codas, Mark Encarnación, Shuvendu Lahiri, Madanlal Musuvathi, Jianfeng Gao

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that CODERANKER can significantly increase the pass@1 accuracy of various code generation models (including Codex [11], GPT-Neo, GPT-J) on APPS [25], HumanEval [11] and MBPP [3] datasets.
Researcher Affiliation | Industry | Microsoft Research
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | The code and data are released on GitHub: https://github.com/microsoft/CodeRanker.
Open Datasets | Yes | We consider three existing code generation datasets for our evaluation: (1) APPS [25]: a collection of 5000 training and 5000 test tasks collected from coding competitions and interview problems, (2) HumanEval [11]: a set of 164 test tasks, and (3) MBPP [3]: a set of 974 mostly basic Python programming tasks with 474 training problems and 500 test problems.
Dataset Splits | Yes | The APPS dataset does not come with a validation dataset, so we used a set of 600 tasks from the original training dataset for validation; these are then excluded from the training dataset.
Hardware Specification | Yes | All experiments are conducted on V100-32GB GPUs.
Software Dependencies | No | No specific software versions (e.g., Python 3.x, PyTorch 1.x) or library versions were explicitly mentioned.
Experiment Setup | Yes | We finetuned GPT-J and GPT-Neo code generation models on the APPS training dataset for 2 epochs with a batch size of 256 and a learning rate of 1e-5, and chose the checkpoint that has the lowest validation loss. For inference, we used temperature sampling with T = 0.8 for the Codex model and T = 0.9 for the GPT-J and GPT-Neo models unless specified otherwise. We finetuned the CODERANKER models for 30 epochs with a batch size of 512 and a learning rate of 1e-4, and chose the checkpoint that results in the best ranked pass@1 metric on the validation dataset.
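
The dataset split and experiment setup described above can be sketched in Python. This is a minimal illustration, not the authors' code: the names GENERATOR_FINETUNE, RANKER_FINETUNE, SAMPLING_TEMPERATURE, hold_out_validation, ranked_pass_at_1 and select_checkpoint are hypothetical. The seeded random split is an assumption, since the text does not say how the 600 validation tasks were chosen. Ranked pass@1 is assumed here to mean the fraction of tasks whose highest-scored sample passes all tests.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Hyperparameters as reported in the table above (dict names are hypothetical).
GENERATOR_FINETUNE = {"epochs": 2, "batch_size": 256, "learning_rate": 1e-5}
RANKER_FINETUNE = {"epochs": 30, "batch_size": 512, "learning_rate": 1e-4}
SAMPLING_TEMPERATURE = {"codex": 0.8, "gpt-j": 0.9, "gpt-neo": 0.9}


def hold_out_validation(
    task_ids: Sequence[int], n_val: int = 600, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train, val) task ids; validation tasks are excluded from train.

    Assumption: the split is a seeded random sample.
    """
    shuffled = list(task_ids)
    random.Random(seed).shuffle(shuffled)
    return sorted(shuffled[n_val:]), sorted(shuffled[:n_val])


def ranked_pass_at_1(
    scores: Dict[int, List[float]], passed: Dict[int, List[bool]]
) -> float:
    """Fraction of tasks whose highest-scored sample passes its tests.

    scores[t][i] is the ranker score of sample i for task t;
    passed[t][i] says whether that sample passed all tests.
    """
    if not scores:
        return 0.0
    hits = 0
    for task_id, task_scores in scores.items():
        # Index of the sample the ranker would place first.
        best = max(range(len(task_scores)), key=task_scores.__getitem__)
        hits += int(passed[task_id][best])
    return hits / len(scores)


def select_checkpoint(metric_by_checkpoint: Dict[str, float]) -> str:
    """Pick the checkpoint with the highest validation metric."""
    return max(metric_by_checkpoint, key=metric_by_checkpoint.get)
```

For the code generation models, the same selection step would instead minimise validation loss, so `min` would replace `max` in select_checkpoint.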