EpiCoder: Encompassing Diversity and Complexity in Code Generation
Authors: Yaoxiang Wang, Haoling Li, Xin Zhang, Jie Wu, Xiao Liu, Wenxiang Hu, Zhongxin Guo, Yangyu Huang, Ying Xin, Yujiu Yang, Jinsong Su, Qi Chen, Scarlett Li
ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We fine-tuned widely used base models to obtain the EpiCoder series, achieving state-of-the-art performance on multiple benchmarks at both the function and file levels. In particular, empirical evidence indicates that our approach shows significant potential in synthesizing repository-level code data. Our code and data are publicly available. |
| Researcher Affiliation | Collaboration | 1School of Informatics, Xiamen University 2Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China 3Tsinghua University 4Microsoft. Correspondence to: Xin Zhang <EMAIL>, Yujiu Yang <EMAIL>, Jinsong Su <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 (Feature Sampling for Task Generation). Input: current root node R, frequency library F, temperature t, sample size S. Output: selected feature set. selected_set ← ∅; for s = 1 to S do: C ← get_children(R); if C = ∅ then break; f_i ← F[i] for all i ∈ C; p_i ← f_i / Σ … Algorithm 2 (Feature Evolution with Frequency Estimation). Input: feature tree T, frequency library F containing the frequency of each node in T, maximum steps N. Output: updated frequency library F. for step = 1 to N do: t ← sample(T) {sample a subtree t from T}; expanded_t ← LLM.evolve(t) {evolve t along depth and breadth}; for each node ∈ expanded_t \ t do … |
| Open Source Code | Yes | Our code and data are publicly available: https://github.com/microsoft/EpiCoder and https://github.com/DeepLearnXMU/EpiCoder |
| Open Datasets | Yes | Raw Code Collection To ensure data diversity and comprehensiveness, we obtain seed data from The Stack v2 (Lozhkov et al., 2024), a publicly available large-scale dataset widely used for pre-training code LLMs. |
| Dataset Splits | No | The paper mentions synthesizing 380k function-level data samples and 53k file-level data samples for training, and evaluates on standard benchmarks like HumanEval, MBPP, BigCodeBench, EvoEval, FullStackBench, and XFileDep. However, it does not explicitly provide specific train/test/validation split percentages or counts for its own synthesized training data, nor does it explicitly detail the splits used for the evaluation benchmarks, only referring to them as being used for evaluation. |
| Hardware Specification | No | The paper states that 'All open-source models were accelerated using vLLM (Kwon et al., 2023)', but it does not specify any particular GPU models, CPU models, memory amounts, or other detailed computer specifications used for running the experiments. |
| Software Dependencies | Yes | We choose DeepSeek-Coder-Base-6.7B (Guo et al., 2024) and Qwen2.5-Coder-7B (Hui et al., 2024) as the base LLMs and obtain the EpiCoder-DS-6.7B and EpiCoder-Qwen-7B models after training. To extract features from the seed data, we leverage a powerful large language model (LLM), specifically GPT-4o. Unless otherwise specified, the strong LLM refers to GPT-4o. All open-source models were accelerated using vLLM (Kwon et al., 2023). |
| Experiment Setup | Yes | When testing different models, we consistently applied their default prompts and greedy decoding, with a maximum token length of 8192. A higher temperature value t leads to a smoother distribution, allowing less dominant features a higher probability of being selected. To further enhance the diversity of the generated data, we employed multiple temperature values during the data synthesis process for a wider range of feature distributions. |
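The excerpted Algorithm 1 walks the feature tree, sampling children in proportion to their recorded frequencies under a temperature t, where higher t flattens the distribution so rarer features are picked more often. The normalization step is truncated in the excerpt; the sketch below assumes a softmax-style weighting p_i ∝ f_i^(1/t), and the tree/frequency representations (`children_of`, `freq` dicts) are illustrative stand-ins, not the paper's data structures.

```python
import random


def sample_features(root, children_of, freq, t, sample_size, rng=random):
    """Sketch of temperature-smoothed feature sampling (Algorithm 1).

    Assumption: selection probability p_i is proportional to
    freq[i] ** (1 / t); the exact normalization is truncated in the
    excerpt. Larger t flattens the distribution, giving less dominant
    features a higher chance of being selected.
    """
    selected = []
    node = root
    for _ in range(sample_size):
        children = children_of.get(node, [])
        if not children:
            break  # reached a leaf; stop descending
        weights = [freq[c] ** (1.0 / t) for c in children]
        total = sum(weights)
        probs = [w / total for w in weights]
        # Draw one child according to the smoothed distribution, then descend.
        node = rng.choices(children, weights=probs, k=1)[0]
        selected.append(node)
    return selected
```

Running the sampler at several temperatures, as the setup row describes, yields feature sets drawn from differently smoothed distributions, which is what broadens the diversity of the synthesized tasks.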
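Algorithm 2 in the excerpt repeatedly samples a subtree, asks an LLM to evolve it along depth and breadth, and records the newly introduced nodes in the frequency library. A minimal control-flow sketch follows, with sets standing in for the feature tree and plain callables (`sample_fn`, `evolve_fn`) as hypothetical stubs for the paper's subtree sampler and LLM.evolve; the count-update rule for new nodes is an assumption, since the excerpt is truncated at that point.

```python
def evolve_features(tree, freq, evolve_fn, sample_fn, max_steps):
    """Sketch of the feature-evolution loop (Algorithm 2).

    tree: set of feature nodes (stand-in for the feature tree T).
    freq: dict mapping node -> frequency (the frequency library F).
    evolve_fn / sample_fn: hypothetical stubs for LLM.evolve and the
    subtree sampler. Assumption: each newly introduced node's count is
    incremented by one per appearance.
    """
    for _ in range(max_steps):
        subtree = sample_fn(tree)          # sample a subtree t from T
        expanded = evolve_fn(subtree)      # evolve t along depth and breadth
        for node in expanded - subtree:    # nodes in expanded_t \ t
            freq[node] = freq.get(node, 0) + 1
        tree |= expanded                   # fold the evolved nodes back into T
    return freq
```

With a real LLM behind `evolve_fn`, the same loop grows the feature tree while the frequency library tracks how often each feature has appeared, which is exactly what the temperature-based sampler above consumes.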