Otter: Generating Tests from Issues to Validate SWE Patches
Authors: Toufique Ahmed, Jatin Ganhotra, Rangeet Pan, Avraham Shinnar, Saurabh Sinha, Martin Hirzel
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show Otter outperforming state-of-the-art systems for generating tests from issues, in addition to enhancing systems that generate patches from issues. We hope that Otter helps make developers more productive at resolving issues and leads to more robust, well-tested code. |
| Researcher Affiliation | Industry | Toufique Ahmed 1 Jatin Ganhotra 1 Rangeet Pan 1 Avraham Shinnar 1 Saurabh Sinha 1 Martin Hirzel 1 1IBM Research, Yorktown Heights, New York, USA. Correspondence to: Toufique Ahmed <EMAIL>, Martin Hirzel <EMAIL>. |
| Pseudocode | No | The paper describes the methodology of Otter in Section 4, broken down into Localizer, Self-Reflective Action Planner, and Test Generator components. It includes flow diagrams (Figure 1) and LLM prompts (Figures 14-21 in Appendix), but no explicit pseudocode or algorithm blocks are provided. |
| Open Source Code | Yes | TDD-Bench-Verified and Otter generated tests are at https://github.com/IBM/TDD-Bench-Verified. |
| Open Datasets | Yes | To evaluate solutions such as Otter that automatically generate tests from issues, we introduce a new benchmark, TDD-Bench-Verified. TDD-Bench-Verified evaluates tests by checking whether the tests a) fail on the old code before issue resolution, b) pass on the new code, and c) cover the code changes well. The fact that TDD-Bench-Verified is derived from SWE-bench Verified enables us to empirically study the effects of generated tests on SWE agents. ... TDD-Bench-Verified and Otter generated tests are at https://github.com/IBM/TDD-Bench-Verified. |
| Dataset Splits | No | TDD-Bench-Verified is a new benchmark that supports evaluation of techniques for generating tests from an issue description and an old code version... In the end, 449 high-quality instances remain across 12 repositories. (The paper describes a dataset and how it was curated but does not provide explicit train/test/validation splits for model training or evaluation splits beyond using the entire 449 instances for evaluation.) |
| Hardware Specification | No | The evaluation used the closed-source GPT-4o (gpt-4o-2024-08-06) and the open-source Mistral-large model (123 billion parameters). All experiments used greedy decoding. For each instance, Otter makes 7–11 LLM calls for T1. Otter++ makes one additional call for each of the other four tests (T2–T5) after the localization stage. To evaluate using the generated tests for SWE agents, we conducted a large-scale experiment with 22 systems from the SWE-Bench leaderboard. We ran 22 × 449 × 5 = 49,390 Docker containers or tests (one Docker container per test) to report the results. ... We do not discuss the cost for Mistral-large because the model was hosted locally. (The paper mentions using GPT-4o and Mistral-large, and that Mistral was hosted locally, but provides no specific details about the hardware specifications such as GPU or CPU models.) |
| Software Dependencies | No | We integrated the Python Coverage package into the 12 repositories and updated the test scripts to allow us to run specific test cases and compute coverage for them. ... Otter includes an import-fixing step in this phase, where it looks at model-generated imports and linting errors detected using Flake8 (a static analysis tool) to identify missing imports. (The paper mentions the Python Coverage package and Flake8 but does not provide specific version numbers for them.) |
| Experiment Setup | Yes | All experiments used greedy decoding. For each instance, Otter makes 7–11 LLM calls for T1. Otter++ makes one additional call for each of the other four tests (T2-T5) after the localization stage. |
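The fail-to-pass criterion quoted in the Open Datasets row (a generated test must fail on the old code before issue resolution and pass on the patched code) can be sketched as a simple predicate plus a runner. This is a minimal illustration, not the benchmark's harness: the `run_test` helper and its pytest invocation are assumptions, since TDD-Bench-Verified actually runs each test in its own Docker container with repository-specific test scripts.

```python
import subprocess

def fail_to_pass(old_result: str, new_result: str) -> bool:
    """A generated test satisfies the fail-to-pass criterion iff it
    fails on the pre-patch code and passes on the post-patch code."""
    return old_result == "fail" and new_result == "pass"

def run_test(repo_dir: str, test_path: str) -> str:
    """Run one pytest file in repo_dir and report 'pass' or 'fail'.
    (Hypothetical harness; the benchmark uses per-instance Docker
    containers and repo-specific test scripts instead.)"""
    proc = subprocess.run(
        ["python", "-m", "pytest", test_path, "-q"],
        cwd=repo_dir, capture_output=True,
    )
    return "pass" if proc.returncode == 0 else "fail"

def evaluate(old_repo: str, new_repo: str, test_path: str) -> bool:
    """Check a generated test against both code versions."""
    return fail_to_pass(
        run_test(old_repo, test_path),
        run_test(new_repo, test_path),
    )
```

Note that a test that passes on both versions (or fails on both) is rejected: it does not witness the behavioral change the issue describes.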
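The import-fixing step mentioned in the Software Dependencies row (detecting missing imports in model-generated tests via Flake8) can be sketched as follows. Flake8 reports undefined names with error code F821; the `KNOWN_IMPORTS` table below is a hypothetical stand-in, since the paper does not specify how Otter maps an undefined name to an import statement.

```python
import re
import subprocess

# Hypothetical mapping from undefined names to import lines; Otter
# would resolve these from the repository and model context instead.
KNOWN_IMPORTS = {
    "np": "import numpy as np",
    "Path": "from pathlib import Path",
}

def parse_f821(flake8_output: str) -> list[str]:
    """Extract the undefined names from Flake8 F821 diagnostics,
    e.g. "test.py:3:5: F821 undefined name 'np'" -> ["np"]."""
    return re.findall(r"F821 undefined name '([^']+)'", flake8_output)

def undefined_names(source_file: str) -> list[str]:
    """Run Flake8 on one file, selecting only F821 diagnostics."""
    proc = subprocess.run(
        ["flake8", "--select=F821", source_file],
        capture_output=True, text=True,
    )
    return parse_f821(proc.stdout)

def suggest_imports(names: list[str]) -> list[str]:
    """Propose import lines for the undefined names we recognize."""
    return [KNOWN_IMPORTS[n] for n in names if n in KNOWN_IMPORTS]
```

Prepending the suggested lines to the generated test and re-linting until no F821 diagnostics remain gives a simple repair loop in the spirit of the step the paper describes.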