EffiCoder: Enhancing Code Generation in Large Language Models through Efficiency-Aware Fine-tuning
Authors: Dong Huang, Guangtao Zeng, Jianbo Dai, Meng Luo, Han Weng, Yuhao Qing, Heming Cui, Zhijiang Guo, Jie Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate significant improvements when fine-tuning with EFFIINSTRUCT. |
| Researcher Affiliation | Academia | 1 University of Hong Kong, 2 National University of Singapore, 3 Singapore University of Technology and Design, 4 University of Edinburgh, 5 Beijing University of Posts and Telecommunications, 6 University of Cambridge, 7 King's College London. Correspondence to: Zhijiang Guo <EMAIL>. |
| Pseudocode | No | The paper describes the methodology for constructing the EFFIINSTRUCT dataset and fine-tuning LLMs through narrative text and diagrams (Figure 1), but it does not include a formal pseudocode block or algorithm section with structured, code-like steps for any of the main processes. |
| Open Source Code | Yes | Dataset and Code are available at https://github.com/huangd1999/EffiCoder. |
| Open Datasets | Yes | We construct EFFIINSTRUCT, which is, to the best of our knowledge, the first instruction-tuning dataset designed to improve the efficiency of LLM-generated code, facilitating fine-tuning for more efficient code generation. ... Dataset and Code are available at https://github.com/huangd1999/EffiCoder. ... We collect the candidate tasks from the open-source code LLM training sets, which include SelfCodeAlign (Wei et al., 2024a), CodeFeedback-Filtered-Instruction (CodeFeed; MAP, 2023), Tested-143k-Python-Alpaca (Alpaca; Vezora, 2023), Glaive-Code-Assistant (Glaive; Computer, 2023), Magicoder-Evol-Instruct-110K (Evol-Ins; UIUC, 2023a), Dolphin-Coder (Dolphin; Computations, 2023), Magicoder-OSS-Instruct-75K (Oss-Ins; UIUC, 2023b), Self-OSS-Instruct-SC2-Exec-Filter-50K (Self-Oss; BigCode, 2023), and Apps (Hendrycks et al., 2021). |
| Dataset Splits | No | The paper mentions collecting candidate tasks from various open-source datasets and filtering them, resulting in a total of 65k tasks. It also refers to evaluating on existing benchmarks like EffiBench and HumanEval Plus. Footnote 2 states: 'Analysis shows no exact duplicates between training and evaluation sets, with only 0.20% of evaluation samples having minimal vocabulary overlap (5-10%).' However, specific percentages, absolute counts, or detailed methodologies for splitting the EFFIINSTRUCT dataset itself into training, validation, and test sets are not explicitly provided in the main text. |
| Hardware Specification | Yes | Firstly, we have evaluated the effectiveness of Effi-Code on seven different software-hardware setups, as shown in Rebuttal Table 2. The results demonstrate that Effi-Code fine-tuned LLMs achieve higher efficiency than the original LLMs across all setups. For example, in the environment of Python 3.11.10 on an Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30GHz, the average execution time decreases from 0.59s to 0.40s when using Effi-Code to fine-tune Qwen2.5-Coder-7B, reducing the average execution time by 32%. |
| Software Dependencies | Yes | We use Llama-factory (Zheng et al., 2024) to fine-tune LLMs with fully supervised fine-tuning with the same setup and train the models using EFFIINSTRUCT. ... Python 3.11.10, Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30GHz |
| Experiment Setup | Yes | The maximum sequence length is set to 2048 tokens. We use a batch size of 128 and set the learning rate to 5e-6 with a cosine learning rate scheduler and a warmup ratio of 0.03. We fine-tune all LLMs for 4 epochs under the bf16 data type. |
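The contamination footnote quoted in the Dataset Splits row (no exact duplicates, 0.20% of evaluation samples with 5-10% vocabulary overlap) implies a train-eval overlap check. The paper does not spell out its methodology, so the following is only a minimal sketch of one plausible implementation: whitespace tokenization and Jaccard overlap are assumptions, and `contamination_report` and its thresholds are hypothetical names, not the authors' code.

```python
def vocab_overlap(train_text: str, eval_text: str) -> float:
    """Jaccard overlap between the word vocabularies of two task texts."""
    train_vocab = set(train_text.lower().split())
    eval_vocab = set(eval_text.lower().split())
    if not train_vocab or not eval_vocab:
        return 0.0
    return len(train_vocab & eval_vocab) / len(train_vocab | eval_vocab)

def contamination_report(train_tasks, eval_tasks, low=0.05, high=0.10):
    """Count exact duplicates, plus eval tasks whose best overlap
    with any training task falls inside the 'minimal' band [low, high]."""
    train_set = set(train_tasks)
    exact = sum(1 for e in eval_tasks if e in train_set)
    minimal = 0
    for e in eval_tasks:
        best = max(vocab_overlap(t, e) for t in train_tasks)
        if low <= best <= high:
            minimal += 1
    return exact, minimal

# Toy illustration (not real EFFIINSTRUCT / benchmark data).
train = ["sort a list of integers ascending", "compute fibonacci numbers"]
evals = ["reverse a linked list in place", "sort a list of integers ascending"]
exact, minimal = contamination_report(train, evals)
```

On this toy data the duplicated task is flagged as an exact match, while the unrelated task's overlap falls outside the reported 5-10% band.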
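The efficiency numbers quoted in the Hardware Specification row (average execution time falling from 0.59s to 0.40s, a 32% reduction) rest on timing generated solutions under a fixed software-hardware setup. A minimal sketch of such a measurement using only the standard library; the two solution functions are hypothetical stand-ins, not code generated by the models in the paper:

```python
import timeit

# Hypothetical "before" and "after" solutions to the same task
# (illustrative only; not from the EffiCoder experiments).
def baseline_sum_of_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

def optimized_sum_of_squares(n):
    # Closed form for the sum of squares 0..n-1.
    return (n - 1) * n * (2 * n - 1) // 6

def execution_time(func, arg, repeats=5, number=100):
    """Best-of-repeats wall-clock time per call, in seconds."""
    timer = timeit.Timer(lambda: func(arg))
    runs = timer.repeat(repeat=repeats, number=number)
    return min(runs) / number  # min over repeats reduces scheduler noise

t_base = execution_time(baseline_sum_of_squares, 10_000)
t_opt = execution_time(optimized_sum_of_squares, 10_000)
print(f"baseline: {t_base:.2e}s  optimized: {t_opt:.2e}s  "
      f"reduction: {100 * (1 - t_opt / t_base):.0f}%")
```

Taking the minimum over repeats rather than the mean is a common choice for micro-benchmarks, since background load only ever inflates timings; absolute numbers will of course differ across the seven setups the rebuttal describes.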
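The hyperparameters in the Experiment Setup row map directly onto a trainer configuration. As a compact summary, here is the quoted setup as a plain config dict; the key names are generic illustrations, not LLaMA-Factory's exact schema:

```python
# Fine-tuning settings as reported in the paper; key names are
# illustrative, not tied to a specific trainer's configuration schema.
finetune_config = {
    "max_seq_length": 2048,        # maximum sequence length in tokens
    "global_batch_size": 128,
    "learning_rate": 5e-6,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.03,
    "num_train_epochs": 4,
    "dtype": "bf16",
}
```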