TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark

Authors: Kush Jain, Gabriel Synnaeve, Baptiste Rozière

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate several popular models, with sizes ranging from 7B to 405B parameters. Our detailed analysis highlights TESTGENEVAL's contribution to a comprehensive evaluation of test generation performance. In particular, models struggle to generate high-coverage test suites, with the best model, GPT-4o, achieving an average coverage of only 35.2%.
Researcher Affiliation Collaboration Kush Jain¹,², Gabriel Synnaeve², Baptiste Rozière²; ¹Carnegie Mellon University, ²FAIR, Meta AI
Pseudocode No The paper includes figures describing a pipeline and examples of code and model generations, but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes We provide all the code for our benchmark at https://figshare.com/s/51171ae97cd21d233d4f, including detailed instructions on how to run our benchmark, and even extend it. We also provide a website with all model generations for TESTGENEVAL.
Open Datasets Yes We release a benchmark for partial and full test suite generation on a realistic set of 1,210 snippets in 11 repositories. We provide all the code for our benchmark at https://figshare.com/s/51171ae97cd21d233d4f, including detailed instructions on how to run our benchmark, and even extend it.
Dataset Splits Yes Our benchmark consists of real world projects, with each source file containing an average 1,157 lines of code (LOC) and each test file containing an average of 943 LOC. TESTGENEVAL consists of 68,647 tests from 1,210 unique code-tests file pairs. For fast iteration in low-compute settings, we also provide a smaller version of the benchmark TESTGENEVALLITE, which approximates all the metrics computed in TESTGENEVAL (see Appendix E for more details). TESTGENEVALLITE includes 160 code-tests file pairs, file unit test generation, and test completion tasks. It was sampled to be representative of the full TESTGENEVAL: the repositories and other statistics are similar in TESTGENEVALLITE and TESTGENEVAL (see Appendix A for more details and Appendix G for statistical significance tests).
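The lite split is described as sampled to mirror the full benchmark's repository distribution. A minimal sketch of such repository-stratified subsampling (illustrative only; the `repo` key, `sample_lite` name, and sizes are assumptions, not the authors' actual sampling script):

```python
import random
from collections import defaultdict

def sample_lite(pairs, n_lite=160, seed=0):
    """Draw a smaller benchmark whose per-repository proportions
    roughly mirror the full set of code-tests file pairs."""
    rng = random.Random(seed)
    by_repo = defaultdict(list)
    for pair in pairs:
        by_repo[pair["repo"]].append(pair)
    lite = []
    total = len(pairs)
    for repo, items in by_repo.items():
        # Allocate slots proportionally, keeping at least one per repo.
        k = max(1, round(n_lite * len(items) / total))
        lite.extend(rng.sample(items, min(k, len(items))))
    return lite[:n_lite]
```

Proportional allocation keeps repository-level statistics (file sizes, test counts) close between the lite and full sets, which is what makes the lite metrics good approximations of the full ones.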
Hardware Specification No The paper does not specify any particular hardware (GPU, CPU models, etc.) used for running the experiments.
Software Dependencies No We use cosmic-ray to generate mutants and use the default set of mutation operators. This default set of operators follows best practices defined by the mutation testing community (Derezinska & Halas, 2014; Offutt et al., 1996).
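Mutation score, as used throughout the paper, is the fraction of generated mutants "killed" (detected) by the test suite. A minimal sketch of that computation, assuming a hypothetical mapping from mutant id to a kill flag (the actual tooling is cosmic-ray, not this helper):

```python
def mutation_score(mutant_results):
    """Proportion of synthetic bugs (mutants) caught by the test suite.

    `mutant_results` maps mutant id -> True if at least one test
    failed when run against the mutated code (the mutant was killed).
    """
    if not mutant_results:
        return 0.0
    killed = sum(1 for caught in mutant_results.values() if caught)
    return killed / len(mutant_results)
```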
Experiment Setup Yes We prompt each model with the maximum context window size possible, otherwise truncating the starting tokens to fit the prompt in the context window. We report results for all models in both the full test generation (Section 3.1) and test completion tasks (Section 3.2) on TESTGENEVAL. For test suite generation we report any pass@1 (whether any of the tests in the generated test suite pass), all pass@1 (whether the entire generated test suite passes), coverage (coverage of passing tests), and mutation score (proportion of synthetic bugs introduced to code caught by the test suite). For test completion we report pass@1 and pass@5 (whether the generated test passes), along with coverage improvement from adding the generated test. More detailed descriptions and our full set of metrics can be found in Appendix D.2. Our full set of results for TESTGENEVAL and results for TESTGENEVALLITE can be found in Appendix E. We also perform statistical significance tests and report 95% confidence intervals in Appendix G. Tables 2 and 3 show all results for temperature=0.2; pass@5 is computed with temperature=0.8.
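The suite-level metrics and pass@k can be sketched briefly. The any/all helper follows the definitions above; the pass@k function is the commonly used unbiased estimator (Chen et al., 2021 style) over n samples with c passes, which is one plausible way to realize the paper's pass@1/pass@5, not necessarily its exact implementation:

```python
from math import comb

def any_all_pass(test_results):
    """Given per-test pass/fail booleans for one generated suite,
    return (any pass@1, all pass@1)."""
    return any(test_results), all(test_results)

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    draws from n samples (c of which pass) is a passing sample."""
    if n - c < k:
        return 1.0  # fewer failures than draws: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 passes, pass@1 is 0.5, matching the intuition that a single random draw passes half the time.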