EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities

Authors: Talor Abramovich, Meet Udeshi, Minghao Shao, Kilian Lieret, Haoran Xi, Kimberly Milner, Sofija Jancheska, John Yang, Carlos E Jimenez, Farshad Khorrami, Prashanth Krishnamurthy, Brendan Dolan-Gavitt, Muhammad Shafique, Karthik R Narasimhan, Ramesh Karri, Ofir Press

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Empirical analysis on 390 CTF challenges across four benchmarks demonstrates that these new tools and interfaces substantially improve our agent's performance, achieving state-of-the-art results on NYU CTF, InterCode-CTF, and CyBench.
Researcher Affiliation Academia 1 Tel Aviv University, 2 NYU Tandon School of Engineering, 3 Princeton Language and Intelligence, Princeton University, 4 Stanford University, 5 New York University Abu Dhabi.
Pseudocode No The paper describes the agent's architecture and tools, such as the thought-action-observation loop, and provides tables of commands (Table 8), but does not present structured pseudocode or algorithm blocks.
Open Source Code Yes Our code and development dataset are available at https://github.com/SWE-agent/SWE-agent/tree/v0.7 and https://github.com/NYU-LLM-CTF/NYU_CTF_Bench/tree/main/development respectively.
Open Datasets Yes Our code and development dataset are available at https://github.com/SWE-agent/SWE-agent/tree/v0.7 and https://github.com/NYU-LLM-CTF/NYU_CTF_Bench/tree/main/development respectively. We extensively evaluate EnIGMA on four benchmarks: NYU CTF (Shao et al., 2024b), InterCode-CTF (Yang et al., 2023b), CyBench (Zhang et al., 2024) and a Hack The Box (HTB) benchmark we collected.
Dataset Splits Yes We constructed a development set of 55 CTF challenges sourced from the CSAW competition... while the NYU CTF benchmark is sourced from competitions from 2017 to 2023, so there is no overlap. We extensively evaluate EnIGMA on four benchmarks: NYU CTF (Shao et al., 2024b), InterCode-CTF (Yang et al., 2023b), CyBench (Zhang et al., 2024) and a Hack The Box (HTB) benchmark we collected.
Hardware Specification No We use Microsoft Azure OpenAI (Microsoft Azure, 2024) for OpenAI models, the Anthropic inference API (Anthropic, 2024a) for Claude and the Together AI API for the LLaMA 3.1 model (Together AI, 2024). This describes the API services used, not the specific hardware.
Software Dependencies No The container comes with preinstalled software and Python packages that are useful for solving these challenges including: pwntools, radare2, wine, wine32, gmpy2, sagemath, pycryptodome, sympy, RsaCtfTool.py, tshark, sqlmap and nikto.
Experiment Setup Yes The temperature is set to T = 0, and we use nucleus sampling with p = 0.95 for all models. The budget per instance is limited to $3; if a run exceeds this budget, the instance is marked as unsolved due to cost constraints (exit_cost).