reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Fault-Aware Neural Code Rankers

Authors: Jeevana Priya Inala, Chenglong Wang, Mei Yang, Andres Codas, Mark Encarnación, Shuvendu Lahiri, Madanlal Musuvathi, Jianfeng Gao

NeurIPS 2022

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We show that CODERANKER can significantly increase the pass@1 accuracy of various code generation models (including Codex [11], GPT-Neo, GPT-J) on APPS [25], HumanEval [11] and MBPP [3] datasets.
Researcher Affiliation	Industry	Microsoft Research EMAIL
Pseudocode	No	No pseudocode or algorithm blocks were found in the paper.
Open Source Code	Yes	The code and data is released on GitHub https://github.com/microsoft/CodeRanker.
Open Datasets	Yes	We consider three existing code generation datasets for our evaluation: (1) APPS [25]: a collection of 5000 training and 5000 test tasks collected from coding competitions and interview problems, (2) HumanEval [11]: a set of 164 test tasks, and (3) MBPP [3]: a set of 974 mostly basic python programming tasks with 474 training problems and 500 test problems.
Dataset Splits	Yes	The APPS dataset does not come with a validation dataset, so we used a set of 600 tasks from the original training dataset for validation; these are, then, excluded from the training dataset.
Hardware Specification	Yes	All experiments are conducted on V100-32GB GPUs.
Software Dependencies	No	No specific software versions (e.g., Python 3.x, PyTorch 1.x) or library versions were explicitly mentioned.
Experiment Setup	Yes	We finetuned GPT-J and GPT-Neo code generation models on the APPS training dataset for 2 epochs with a batch size of 256 and a learning rate of 1e-5, and chose the checkpoint that has the lowest validation loss. For inference, we used temperature sampling with T = 0.8 for Codex model and T = 0.9 for the GPT-J and GPT-Neo models unless specified otherwise. We finetuned the CODERANKER models for 30 epochs with a batch size of 512 and a learning rate of 1e-4, and chose the checkpoint that results in the best ranked pass@1 metric on the validation dataset.
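
The settings quoted in the Dataset Splits and Experiment Setup rows can be collected into a small configuration sketch for anyone attempting a rerun. This is only an illustration, not the layout used in the released repository: the names `FinetuneConfig`, `GENERATOR_FINETUNE`, `RANKER_FINETUNE`, `SAMPLING_TEMPERATURE`, and `hold_out_validation` are hypothetical, and the random selection with a fixed seed is an assumption, since the quoted text does not say how the 600 validation tasks were chosen.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    """Hypothetical container for the hyperparameters quoted in the table."""
    epochs: int
    batch_size: int
    learning_rate: float
    checkpoint_criterion: str


# Values come from the Experiment Setup row; the names are illustrative.
GENERATOR_FINETUNE = FinetuneConfig(2, 256, 1e-5, "lowest validation loss")
RANKER_FINETUNE = FinetuneConfig(30, 512, 1e-4, "best ranked pass@1 on validation")

# Sampling temperatures used at inference, per the same row.
SAMPLING_TEMPERATURE = {"Codex": 0.8, "GPT-J": 0.9, "GPT-Neo": 0.9}


def hold_out_validation(task_ids, n_validation=600, seed=0):
    """Split task ids into (train, validation), holding out n_validation ids.

    Assumption: the held-out tasks are drawn uniformly at random with a fixed
    seed. The source only states that 600 training tasks were set aside.
    """
    ids = list(task_ids)
    if n_validation > len(ids):
        raise ValueError("n_validation exceeds the number of available tasks")
    rng = random.Random(seed)
    validation = set(rng.sample(ids, n_validation))
    # Validation tasks are excluded from the training set, as the table states.
    train = [t for t in ids if t not in validation]
    return train, sorted(validation)


# APPS ships 5000 training tasks; integer ids stand in for the real task ids.
train_ids, val_ids = hold_out_validation(range(5000))
```

With 5000 APPS training tasks, this leaves 4400 tasks for finetuning and 600 for checkpoint selection.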