SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner

Authors: Lei Zhang, Jiaxi Yang, Min Yang, Jian Yang, Mouxiang Chen, Jiajun Zhang, Zeyu Cui, Binyuan Hui, Junyang Lin

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that fine-tuning an open model on this dataset significantly improves performance on TDD-based coding. We conducted a comprehensive evaluation of 11 mainstream LLMs on the SWE-Flow-Bench (Lite) benchmark.
Researcher Affiliation | Collaboration | 1. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; 2. University of the Chinese Academy of Sciences, Beijing, China; 3. Alibaba Group, Beijing, China; 4. Zhejiang University, Hangzhou, China; 5. University of Science and Technology of China, Hefei, China.
Pseudocode | Yes | Algorithm 1: The procedure of SWE-Flow-Trace; Algorithm 2: The procedure of SWE-Flow-Schedule.
Open Source Code | Yes | To facilitate further research, we release all code, datasets, models, and Docker images on GitHub. We publicly release all code, models, datasets, and Docker images, fostering further research in the community.
Open Datasets | Yes | With this approach, we generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the SWE-Flow-Bench benchmark. We publicly release all code, models, datasets, and Docker images, fostering further research in the community.
Dataset Splits | Yes | Following this process, we synthesized a comprehensive dataset that includes 16,061 training instances and 2,020 test instances, tailored to improve and evaluate the performance of AI systems in real-world software development scenarios. To facilitate efficient validation, SWE-Flow-Bench is divided into two splits: Full and Lite. The Full split includes all 2,020 development tasks. The Lite split contains only the first 50 development steps from each software project (or all available steps if a project has fewer than 50 development steps), resulting in a total of 589 development tasks.
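The Lite-split rule quoted above (keep the first 50 development steps per project, or all steps if a project has fewer) can be sketched in a few lines. The data layout and function name below are illustrative assumptions, not taken from the released code:

```python
# Sketch of the Lite-split construction: truncate each project's ordered
# list of development tasks to at most `max_steps` entries.
# The dict-of-lists layout is an assumption for illustration.

def build_lite_split(projects, max_steps=50):
    """Keep only the first `max_steps` development steps per project."""
    lite = []
    for tasks in projects.values():
        lite.extend(tasks[:max_steps])  # all steps if fewer than max_steps
    return lite

# Toy example: three projects with 60, 30, and 5 steps respectively.
projects = {
    "proj_a": [f"a-step-{i}" for i in range(60)],
    "proj_b": [f"b-step-{i}" for i in range(30)],
    "proj_c": [f"c-step-{i}" for i in range(5)],
}
lite = build_lite_split(projects)
print(len(lite))  # 50 + 30 + 5 = 85
```

Applied to the paper's 2,020 Full-split tasks, the same truncation yields the 589 Lite-split tasks.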
Hardware Specification | Yes | The entire training process was completed within two hours on 128 H800 GPUs using Megatron-LM (Shoeybi et al., 2019).
Software Dependencies | No | The entire training process was completed within two hours on 128 H800 GPUs using Megatron-LM (Shoeybi et al., 2019). The following command demonstrates how SWE-Flow-Trace executes unit tests in a Python project via the terminal: sweflow-trace pytest test_case_id. The text mentions 'Megatron-LM', 'Python', and 'pytest' but does not specify their version numbers.
Experiment Setup | Yes | For a detailed description of the training process and parameters, please refer to Appendix D. Table 4 (fine-tuning parameters): Max Seq-len 32,768; Batch Size 1,024; Training Steps 32; Warmup Steps 6; Learning Rate 7e-6; Min LR 7e-7; LR Decay Linear.
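The Table 4 parameters imply a linear warmup to the peak learning rate over 6 steps followed by a linear decay to the minimum learning rate by the final step. The sketch below is one plausible realization; the exact step indexing Megatron-LM uses internally is an assumption here:

```python
# Hedged sketch of a linear warmup + linear decay LR schedule using the
# Table 4 values. Step indexing (0-based, decay ending at the last step)
# is an assumption, not confirmed by the paper.

MAX_LR, MIN_LR = 7e-6, 7e-7
WARMUP_STEPS, TOTAL_STEPS = 6, 32

def lr_at(step):
    """Learning rate at a 0-based training step."""
    if step < WARMUP_STEPS:
        # Ramp linearly from MAX_LR / WARMUP_STEPS up to MAX_LR.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    # Decay linearly from MAX_LR down to MIN_LR by the final step.
    frac = (step - WARMUP_STEPS) / (TOTAL_STEPS - 1 - WARMUP_STEPS)
    return MAX_LR - frac * (MAX_LR - MIN_LR)

print(lr_at(5))   # end of warmup: peak LR (7e-6)
print(lr_at(31))  # final step: minimum LR (7e-7)
```

With only 32 total steps, the schedule spends roughly a fifth of training in warmup, which is consistent with the short two-hour fine-tuning run reported above.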