SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner

Authors: Lei Zhang, Jiaxi Yang, Min Yang, Jian Yang, Mouxiang Chen, Jiajun Zhang, Zeyu Cui, Binyuan Hui, Junyang Lin

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that fine-tuning an open model on this dataset significantly improves performance on TDD-based coding. We conducted a comprehensive evaluation of 11 mainstream LLMs on the SWE-Flow-Bench (Lite) benchmark.
Researcher Affiliation | Collaboration | 1. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; 2. University of the Chinese Academy of Sciences, Beijing, China; 3. Alibaba Group, Beijing, China; 4. Zhejiang University, Hangzhou, China; 5. University of Science and Technology of China, Hefei, China.
Pseudocode | Yes | Algorithm 1: The procedure of SWE-Flow-Trace; Algorithm 2: The procedure of SWE-Flow-Schedule.
Open Source Code | Yes | To facilitate further research, we release all code, datasets, models, and Docker images on GitHub. We publicly release all code, models, datasets, and Docker images, fostering further research in the community.
Open Datasets | Yes | With this approach, we generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the SWE-Flow-Bench benchmark. We publicly release all code, models, datasets, and Docker images, fostering further research in the community.
Dataset Splits | Yes | Following this process, we synthesized a comprehensive dataset that includes 16,061 training instances and 2,020 test instances, tailored to improve and evaluate the performance of AI systems in real-world software development scenarios. To facilitate efficient validation, SWE-Flow-Bench is divided into two splits: Full and Lite. The Full split includes all 2,020 development tasks. The Lite split contains only the first 50 development steps from each software project (or all available steps if a project has fewer than 50 development steps), resulting in a total of 589 development tasks.
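The Lite-split rule quoted above (keep the first 50 development steps per project, or all steps if a project has fewer) can be sketched in a few lines. The data layout and function name below are illustrative assumptions, not taken from the released code:

```python
# Sketch of the Lite-split construction: truncate each project's ordered
# list of development tasks to at most `max_steps` entries.
# The dict-of-lists layout is an assumption for illustration.

def build_lite_split(projects, max_steps=50):
    """Keep only the first `max_steps` development steps per project."""
    lite = []
    for tasks in projects.values():
        lite.extend(tasks[:max_steps])  # all steps if fewer than max_steps
    return lite

# Toy example: three projects with 60, 30, and 5 steps respectively.
projects = {
    "proj_a": [f"a-step-{i}" for i in range(60)],
    "proj_b": [f"b-step-{i}" for i in range(30)],
    "proj_c": [f"c-step-{i}" for i in range(5)],
}
lite = build_lite_split(projects)
print(len(lite))  # 50 + 30 + 5 = 85
```

Applied to the paper's 2,020 Full-split tasks, the same truncation yields the 589 Lite-split tasks.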
Hardware Specification | Yes | The entire training process was completed within two hours on 128 H800 GPUs using Megatron-LM (Shoeybi et al., 2019).
Software Dependencies | No | The entire training process was completed within two hours on 128 H800 GPUs using Megatron-LM (Shoeybi et al., 2019). The following command demonstrates how SWE-Flow-Trace executes unit tests in a Python project via the terminal: sweflow-trace pytest test_case_id. The text mentions 'Megatron-LM', 'Python', and 'pytest' but does not specify their version numbers.
Experiment Setup | Yes | For a detailed description of the training process and parameters, please refer to Appendix D. Table 4 (fine-tuning parameters): Max Seq-len 32,768; Batch Size 1,024; Training Steps 32; Warmup Steps 6; Learning Rate 7e-6; Min LR 7e-7; LR Decay Linear.
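The Table 4 parameters imply a linear warmup to the peak learning rate over 6 steps followed by a linear decay to the minimum learning rate by the final step. The sketch below is one plausible realization; the exact step indexing Megatron-LM uses internally is an assumption here:

```python
# Hedged sketch of a linear warmup + linear decay LR schedule using the
# Table 4 values. Step indexing (0-based, decay ending at the last step)
# is an assumption, not confirmed by the paper.

MAX_LR, MIN_LR = 7e-6, 7e-7
WARMUP_STEPS, TOTAL_STEPS = 6, 32

def lr_at(step):
    """Learning rate at a 0-based training step."""
    if step < WARMUP_STEPS:
        # Ramp linearly from MAX_LR / WARMUP_STEPS up to MAX_LR.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    # Decay linearly from MAX_LR down to MIN_LR by the final step.
    frac = (step - WARMUP_STEPS) / (TOTAL_STEPS - 1 - WARMUP_STEPS)
    return MAX_LR - frac * (MAX_LR - MIN_LR)

print(lr_at(5))   # end of warmup: peak LR (7e-6)
print(lr_at(31))  # final step: minimum LR (7e-7)
```

With only 32 total steps, the schedule spends roughly a fifth of training in warmup, which is consistent with the short two-hour fine-tuning run reported above.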