EpiCoder: Encompassing Diversity and Complexity in Code Generation
Authors: Yaoxiang Wang, Haoling Li, Xin Zhang, Jie Wu, Xiao Liu, Wenxiang Hu, Zhongxin Guo, Yangyu Huang, Ying Xin, Yujiu Yang, Jinsong Su, Qi Chen, Scarlett Li
ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We fine-tuned widely used base models to obtain the EpiCoder series, achieving state-of-the-art performance on multiple benchmarks at both the function and file levels. In particular, empirical evidence indicates that our approach shows significant potential in synthesizing repository-level code data. Our code and data are publicly available. |
| Researcher Affiliation | Collaboration | 1School of Informatics, Xiamen University 2Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China 3Tsinghua University 4Microsoft. Correspondence to: Xin Zhang <EMAIL>, Yujiu Yang <EMAIL>, Jinsong Su <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 (Feature Sampling for Task Generation). Input: current root node R, frequency library F, temperature t, sample size S. Output: selected feature set. selected_set ← ∅; for s = 1 to S do: C ← get_children(R); if C = ∅ then break; f_i ← F[i] for all i ∈ C; p_i ← f_i / Σ … Algorithm 2 (Feature Evolution with Frequency Estimation). Input: feature tree T, frequency library F containing the frequency of each node in T, maximum steps N. Output: updated frequency library F. for step = 1 to N do: t ← sample(T) {sample a subtree t from T}; expanded_t ← LLM.evolve(t) {evolve t along depth and breadth}; for each node ∈ expanded_t \ t do … |
| Open Source Code | Yes | Our code and data are publicly available: https://github.com/microsoft/EpiCoder and https://github.com/DeepLearnXMU/EpiCoder |
| Open Datasets | Yes | Raw Code Collection To ensure data diversity and comprehensiveness, we obtain seed data from The Stack v2 (Lozhkov et al., 2024), a publicly available large-scale dataset widely used for pre-training code LLMs. |
| Dataset Splits | No | The paper mentions synthesizing 380k function-level data samples and 53k file-level data samples for training, and evaluates on standard benchmarks like HumanEval, MBPP, BigCodeBench, EvoEval, FullStackBench, and XFileDep. However, it does not explicitly provide specific train/test/validation split percentages or counts for its own synthesized training data, nor does it explicitly detail the splits used for the evaluation benchmarks, only referring to them as being used for evaluation. |
| Hardware Specification | No | The paper states that 'All open-source models were accelerated using vLLM (Kwon et al., 2023)', but it does not specify any particular GPU models, CPU models, memory amounts, or other detailed computer specifications used for running the experiments. |
| Software Dependencies | Yes | We choose DeepSeek-Coder-Base-6.7B (Guo et al., 2024) and Qwen2.5-Coder-7B (Hui et al., 2024) as the base LLMs and obtain the EpiCoder-DS-6.7B and EpiCoder-Qwen-7B models after training. To extract features from the seed data, we leverage a powerful large language model (LLM), specifically GPT-4o. Unless otherwise specified, the strong LLM refers to GPT-4o. All open-source models were accelerated using vLLM (Kwon et al., 2023). |
| Experiment Setup | Yes | When testing different models, we consistently applied their default prompts and greedy decoding, with a maximum token length of 8192. A higher temperature value t leads to a smoother distribution, allowing less dominant features a higher probability of being selected. To further enhance the diversity of the generated data, we employed multiple temperature values during the data synthesis process for a wider range of feature distributions. |
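The excerpted Algorithm 1 walks the feature tree, sampling children in proportion to their recorded frequencies under a temperature t, where higher t flattens the distribution so rarer features are picked more often. The normalization step is truncated in the excerpt; the sketch below assumes a softmax-style weighting p_i ∝ f_i^(1/t), and the tree/frequency representations (`children_of`, `freq` dicts) are illustrative stand-ins, not the paper's data structures.

```python
import random


def sample_features(root, children_of, freq, t, sample_size, rng=random):
    """Sketch of temperature-smoothed feature sampling (Algorithm 1).

    Assumption: selection probability p_i is proportional to
    freq[i] ** (1 / t); the exact normalization is truncated in the
    excerpt. Larger t flattens the distribution, giving less dominant
    features a higher chance of being selected.
    """
    selected = []
    node = root
    for _ in range(sample_size):
        children = children_of.get(node, [])
        if not children:
            break  # reached a leaf; stop descending
        weights = [freq[c] ** (1.0 / t) for c in children]
        total = sum(weights)
        probs = [w / total for w in weights]
        # Draw one child according to the smoothed distribution, then descend.
        node = rng.choices(children, weights=probs, k=1)[0]
        selected.append(node)
    return selected
```

Running the sampler at several temperatures, as the setup row describes, yields feature sets drawn from differently smoothed distributions, which is what broadens the diversity of the synthesized tasks.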
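Algorithm 2 in the excerpt repeatedly samples a subtree, asks an LLM to evolve it along depth and breadth, and records the newly introduced nodes in the frequency library. A minimal control-flow sketch follows, with sets standing in for the feature tree and plain callables (`sample_fn`, `evolve_fn`) as hypothetical stubs for the paper's subtree sampler and LLM.evolve; the count-update rule for new nodes is an assumption, since the excerpt is truncated at that point.

```python
def evolve_features(tree, freq, evolve_fn, sample_fn, max_steps):
    """Sketch of the feature-evolution loop (Algorithm 2).

    tree: set of feature nodes (stand-in for the feature tree T).
    freq: dict mapping node -> frequency (the frequency library F).
    evolve_fn / sample_fn: hypothetical stubs for LLM.evolve and the
    subtree sampler. Assumption: each newly introduced node's count is
    incremented by one per appearance.
    """
    for _ in range(max_steps):
        subtree = sample_fn(tree)          # sample a subtree t from T
        expanded = evolve_fn(subtree)      # evolve t along depth and breadth
        for node in expanded - subtree:    # nodes in expanded_t \ t
            freq[node] = freq.get(node, 0) + 1
        tree |= expanded                   # fold the evolved nodes back into T
    return freq
```

With a real LLM behind `evolve_fn`, the same loop grows the feature tree while the frequency library tracks how often each feature has appeared, which is exactly what the temperature-based sampler above consumes.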