EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities

Authors: Talor Abramovich, Meet Udeshi, Minghao Shao, Kilian Lieret, Haoran Xi, Kimberly Milner, Sofija Jancheska, John Yang, Carlos E Jimenez, Farshad Khorrami, Prashanth Krishnamurthy, Brendan Dolan-Gavitt, Muhammad Shafique, Karthik R Narasimhan, Ramesh Karri, Ofir Press

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Empirical analysis on 390 CTF challenges across four benchmarks demonstrates that these new tools and interfaces substantially improve our agent's performance, achieving state-of-the-art results on NYU CTF, InterCode-CTF, and CyBench.
Researcher Affiliation Academia 1 Tel Aviv University, 2 NYU Tandon School of Engineering, 3 Princeton Language and Intelligence, Princeton University, 4 Stanford University, 5 New York University Abu Dhabi.
Pseudocode No The paper describes the agent's architecture and tools, such as the thought-action-observation loop, and provides tables of commands (Table 8), but does not present structured pseudocode or algorithm blocks.
Open Source Code Yes Our code and development dataset are available at https://github.com/SWE-agent/SWE-agent/tree/v0.7 and https://github.com/NYU-LLM-CTF/NYU_CTF_Bench/tree/main/development respectively.
Open Datasets Yes Our code and development dataset are available at https://github.com/SWE-agent/SWE-agent/tree/v0.7 and https://github.com/NYU-LLM-CTF/NYU_CTF_Bench/tree/main/development respectively. We extensively evaluate EnIGMA on four benchmarks: NYU CTF (Shao et al., 2024b), InterCode-CTF (Yang et al., 2023b), CyBench (Zhang et al., 2024) and a Hack The Box (HTB) benchmark we collected.
Dataset Splits Yes We constructed a development set of 55 CTF challenges sourced from the CSAW competition... while the NYU CTF benchmark is sourced from competitions from 2017 to 2023, so there is no overlap. We extensively evaluate EnIGMA on four benchmarks: NYU CTF (Shao et al., 2024b), InterCode-CTF (Yang et al., 2023b), CyBench (Zhang et al., 2024) and a Hack The Box (HTB) benchmark we collected.
Hardware Specification No We use Microsoft Azure OpenAI (Microsoft Azure, 2024) for OpenAI models, the Anthropic inference API (Anthropic, 2024a) for Claude and the Together AI API for the LLaMA 3.1 model (Together AI, 2024). This describes the API services used, not the specific hardware.
Software Dependencies No The container comes with preinstalled software and Python packages that are useful for solving these challenges including: pwntools, radare2, wine, wine32, gmpy2, sagemath, pycryptodome, sympy, RsaCtfTool.py, tshark, sqlmap and nikto.
Experiment Setup Yes The temperature is set to T = 0, and we use nucleus sampling with p = 0.95 for all models. The budget per instance is limited to $3; if a run exceeds this budget, the instance is marked as unsolved due to cost constraints (exit_cost).