TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark

Authors: Kush Jain, Gabriel Synnaeve, Baptiste Rozière

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate several popular models, with sizes ranging from 7B to 405B parameters. Our detailed analysis highlights TESTGENEVAL's contribution to a comprehensive evaluation of test generation performance. In particular, models struggle to generate high-coverage test suites, with the best model, GPT-4o, achieving an average coverage of only 35.2%.
Researcher Affiliation Collaboration Kush Jain¹,², Gabriel Synnaeve², Baptiste Rozière²; ¹Carnegie Mellon University, ²FAIR, Meta AI
Pseudocode No The paper includes figures describing a pipeline and examples of code and model generations, but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes We provide all the code for our benchmark at https://figshare.com/s/51171ae97cd21d233d4f, including detailed instructions on how to run our benchmark, and even extend it. We also provide a website with all model generations for TESTGENEVAL.
Open Datasets Yes We release a benchmark for partial and full test suite generation on a realistic set of 1,210 snippets in 11 repositories. We provide all the code for our benchmark at https://figshare.com/s/51171ae97cd21d233d4f, including detailed instructions on how to run our benchmark, and even extend it.
Dataset Splits Yes Our benchmark consists of real world projects, with each source file containing an average 1,157 lines of code (LOC) and each test file containing an average of 943 LOC. TESTGENEVAL consists of 68,647 tests from 1,210 unique code-tests file pairs. For fast iteration in low-compute settings, we also provide a smaller version of the benchmark TESTGENEVALLITE, which approximates all the metrics computed in TESTGENEVAL (see Appendix E for more details). TESTGENEVALLITE includes 160 code-tests file pairs, file unit test generation, and test completion tasks. It was sampled to be representative of the full TESTGENEVAL: the repositories and other statistics are similar in TESTGENEVALLITE and TESTGENEVAL (see Appendix A for more details and Appendix G for statistical significance tests).
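The lite split is described as sampled to mirror the full benchmark's repository distribution. A minimal sketch of such repository-stratified subsampling (illustrative only; the `repo` key, `sample_lite` name, and sizes are assumptions, not the authors' actual sampling script):

```python
import random
from collections import defaultdict

def sample_lite(pairs, n_lite=160, seed=0):
    """Draw a smaller benchmark whose per-repository proportions
    roughly mirror the full set of code-tests file pairs."""
    rng = random.Random(seed)
    by_repo = defaultdict(list)
    for pair in pairs:
        by_repo[pair["repo"]].append(pair)
    lite = []
    total = len(pairs)
    for repo, items in by_repo.items():
        # Allocate slots proportionally, keeping at least one per repo.
        k = max(1, round(n_lite * len(items) / total))
        lite.extend(rng.sample(items, min(k, len(items))))
    return lite[:n_lite]
```

Proportional allocation keeps repository-level statistics (file sizes, test counts) close between the lite and full sets, which is what makes the lite metrics good approximations of the full ones.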
Hardware Specification No The paper does not specify any particular hardware (GPU, CPU models, etc.) used for running the experiments.
Software Dependencies No We use cosmic-ray to generate mutants and use the default set of mutation operators. This default set of operators follows best practices defined by the mutation testing community (Derezinska & Halas, 2014; Offutt et al., 1996).
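Mutation score, as used throughout the paper, is the fraction of generated mutants "killed" (detected) by the test suite. A minimal sketch of that computation, assuming a hypothetical mapping from mutant id to a kill flag (the actual tooling is cosmic-ray, not this helper):

```python
def mutation_score(mutant_results):
    """Proportion of synthetic bugs (mutants) caught by the test suite.

    `mutant_results` maps mutant id -> True if at least one test
    failed when run against the mutated code (the mutant was killed).
    """
    if not mutant_results:
        return 0.0
    killed = sum(1 for caught in mutant_results.values() if caught)
    return killed / len(mutant_results)
```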
Experiment Setup Yes We prompt each model with the maximum context window size possible, otherwise truncating the starting tokens to fit the prompt in the context window. We report results for all models in both the full test generation (Section 3.1) and test completion tasks (Section 3.2) on TESTGENEVAL. For test suite generation we report any pass@1 (whether any of the tests in the generated test suite pass), all pass@1 (whether the entire generated test suite passes), coverage (coverage of passing tests), and mutation score (proportion of synthetic bugs introduced to code caught by the test suite). For test completion we report pass@1 and pass@5 (whether the generated test passes), along with coverage improvement from adding the generated test. More detailed descriptions and our full set of metrics can be found in Appendix D.2. Our full set of results for TESTGENEVAL and results for TESTGENEVALLITE can be found in Appendix E. We also perform statistical significance tests and report 95% confidence intervals in Appendix G. Tables 2 and 3 show all results for temperature=0.2; pass@5 is computed with temperature=0.8.
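The suite-level metrics and pass@k can be sketched briefly. The any/all helper follows the definitions above; the pass@k function is the commonly used unbiased estimator (Chen et al., 2021 style) over n samples with c passes, which is one plausible way to realize the paper's pass@1/pass@5, not necessarily its exact implementation:

```python
from math import comb

def any_all_pass(test_results):
    """Given per-test pass/fail booleans for one generated suite,
    return (any pass@1, all pass@1)."""
    return any(test_results), all(test_results)

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    draws from n samples (c of which pass) is a passing sample."""
    if n - c < k:
        return 1.0  # fewer failures than draws: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 passes, pass@1 is 0.5, matching the intuition that a single random draw passes half the time.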