InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct

Authors: Yutong Wu, Di Huang, Wenxuan Shi, Wei Wang, Yewen Pu, Lingzhe Gao, Shihao Liu, Ziyuan Nan, Kaizhao Yuan, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Dawei Yin, Xing Hu, Yunji Chen

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate Inverse-Instruct on a range of open-source code models (e.g., Code Llama-Python and DeepSeek-Coder) and benchmarks (e.g., HumanEval(+), MBPP(+), DS-1000 and MultiPL-E), showing it consistently improves the base models. We evaluated InverseCoder on a wide range of benchmarks (Section 6), including HumanEval(+) (Chen et al. 2021; Liu et al. 2023), MBPP(+) (Austin et al. 2021; Liu et al. 2023), MultiPL-E (Cassano et al. 2023), and DS-1000 (Lai et al. 2023).
Researcher Affiliation | Collaboration | 1SKL of Processors, Institute of Computing Technology, CAS; 2University of Chinese Academy of Sciences; 3Baidu Inc., Beijing, China; 4Autodesk Research
Pseudocode | No | The paper describes the method Inverse-Instruct in Section 4, detailing 'Code Preprocessing', 'Code Summarization', and 'Self-evaluation and Data Selection'. It also includes 'Figure 1: The overview of Inverse-Instruct', which is a flowchart-like diagram. However, it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps formatted like code.
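Although the paper provides no pseudocode, the three stages it names can be sketched as a simple data-generation loop. This is a hypothetical illustration, not the authors' implementation: all function names are stand-ins, the "model" is a stub for a real code LLM, and the preprocessing and scoring rules are placeholders.

```python
# Hypothetical sketch of the Inverse-Instruct loop described in Section 4:
# clean code responses, summarize them back into candidate instructions,
# then let the model score and select the best (instruction, code) pairs.

def preprocess_code(response: str) -> str:
    """Code preprocessing (placeholder): strip comment lines, keep the code."""
    lines = [ln for ln in response.splitlines() if not ln.startswith("#")]
    return "\n".join(lines).strip()

def summarize_code(model, code: str) -> list[str]:
    """Code summarization: ask the model for candidate instructions."""
    return [model(f"Summarize this code as an instruction:\n{code}")]

def self_evaluate(model, instruction: str, code: str) -> float:
    """Self-evaluation (placeholder scoring): word overlap as a crude proxy."""
    return float(len(set(instruction.split()) & set(code.split())))

def inverse_instruct(model, responses: list[str]) -> list[tuple[str, str]]:
    """Build an instruction-tuning dataset from code responses alone."""
    dataset = []
    for resp in responses:
        code = preprocess_code(resp)
        candidates = summarize_code(model, code)
        best = max(candidates, key=lambda ins: self_evaluate(model, ins, code))
        dataset.append((best, code))
    return dataset
```

In the paper, summarization and self-evaluation are performed by the code LLM itself; here a single callable stands in for both roles.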
Open Source Code | Yes | Code: https://github.com/wyt2000/InverseCoder
Open Datasets | Yes | In this work, we mainly use evol-codealpaca-v1 as our original instruction tuning dataset {(xi, yi)}, which is widely used for instruction tuning of code LLMs (Wei et al. 2023; Yu et al. 2023; Song et al. 2024). It contains 111183 instruction-response pairs generated by Evol-Instruct using GPT-4. ... We evaluated InverseCoder on a wide range of benchmarks (Section 6), including HumanEval(+) (Chen et al. 2021; Liu et al. 2023), MBPP(+) (Austin et al. 2021; Liu et al. 2023), MultiPL-E (Cassano et al. 2023), and DS-1000 (Lai et al. 2023). ... theblackcat102. 2023. The evolved code alpaca dataset. https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1.
Dataset Splits | Yes | Following Magicoder (Wei et al. 2023), evol-codealpaca-v1 is decontaminated by removing data that contain docstrings or solutions from HumanEval (Chen et al. 2021), MBPP (Austin et al. 2021), MultiPL-E (Cassano et al. 2023), and DS-1000 (Lai et al. 2023), which are used to evaluate InverseCoder.
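The decontamination step quoted above amounts to a substring-overlap filter against the evaluation sets. A minimal sketch, assuming a simple "snippet appears in the response" rule; the benchmark snippets below are placeholders, not real HumanEval or MBPP content:

```python
# Minimal sketch of benchmark decontamination: drop any training pair whose
# response contains a docstring or solution snippet from an evaluation set.

def decontaminate(train_pairs, benchmark_snippets):
    """Keep only (instruction, response) pairs with no benchmark overlap."""
    def contaminated(response: str) -> bool:
        return any(snippet in response for snippet in benchmark_snippets)
    return [(ins, resp) for ins, resp in train_pairs if not contaminated(resp)]
```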
Hardware Specification | Yes | To obtain the beginning code LLM M (hereinafter called WizardCoder-GPT4), we fine-tune the base models on evol-codealpaca-v1 for 2 epochs using 8 NVIDIA A100-40GB SXM GPUs.
Software Dependencies | No | The paper mentions using 'the vLLM inference framework (Kwon et al. 2023)' but does not provide a specific version number for it or any other software components.
Experiment Setup | Yes | To obtain the beginning code LLM M (hereinafter called WizardCoder-GPT4), we fine-tune the base models on evol-codealpaca-v1 for 2 epochs using 8 NVIDIA A100-40GB SXM GPUs. We set the initial learning rate at 5e-5 with 15 warmup steps and a linear learning rate scheduler. We use Adafactor (Shazeer and Stern 2018) as our optimizer and choose a batch size of 512 with a sequence truncation length of 1024.
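The learning-rate schedule described in that setup (linear warmup over 15 steps to a 5e-5 peak, then linear decay) can be written out in a few lines. The total step count here is illustrative, not taken from the paper:

```python
# Sketch of a linear warmup + linear decay schedule matching the reported
# settings: peak learning rate 5e-5 reached after 15 warmup steps, then a
# linear decay to zero. total_steps is an assumed value for illustration.

def linear_schedule(step: int, peak_lr: float = 5e-5,
                    warmup_steps: int = 15, total_steps: int = 1000) -> float:
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear warmup
    frac = (total_steps - step) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, frac)                   # linear decay to zero
```

In practice this corresponds to the standard linear scheduler with warmup found in common fine-tuning frameworks; the closed form above just makes the step-by-step behavior explicit.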