BaxBench: Can LLMs Generate Correct and Secure Backends?
Authors: Mark Vero, Niels Mündler, Victor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanović, Jingxuan He, Martin Vechev
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate 11 state-of-the-art LLMs on BAXBENCH, including reasoning models such as OPENAI O3-MINI (OpenAI, 2025) and DEEPSEEK-R1 (Guo et al., 2025). As shown in Fig. 1, even flagship LLMs struggle to generate deployment-ready backends, not surpassing a mere 37% correct and secure generation rate on BAXBENCH. Security is not the only challenge that BAXBENCH poses: even in terms of functional correctness alone, the models fail to fulfill the task in 40% of the cases. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science, ETH Zurich, Zurich, Switzerland 2Logic Star.ai, Zurich, Switzerland 3UC Berkeley, Berkeley, California, United States 4INSAIT, Sofia University "St. Kliment Ohridski", Sofia, Bulgaria. |
| Pseudocode | No | The paper describes the BAXBENCH framework, its scenarios, and evaluation methodology, but does not present pseudocode or algorithm blocks for its own research methods. |
| Open Source Code | Yes | Leaderboard, code, and dataset: https://baxbench.com/ |
| Open Datasets | Yes | Leaderboard, code, and dataset: https://baxbench.com/ |
| Dataset Splits | No | Each task in BAXBENCH is a combination of a scenario and a framework. The LLMs are prompted with scenario specifications in OpenAPI format, and with the programming language and available packages defined by the framework. Our evaluation prompt templates are included in App. C. Next, we evaluate the LLM-generated code for correctness and security using the above tests and exploits. The paper evaluates pre-trained models on a fixed set of benchmark tasks; it does not train new models, so no dataset splits are used. |
| Hardware Specification | No | Each test/exploit is executed in a Docker environment. This enables the reproducibility of the results, and ensures that the security exploits on the LLM-generated code cannot harm the benchmarking infrastructure. The paper does not specify the hardware (e.g., GPU/CPU models) used for running these Docker environments or the LLM inference. |
| Software Dependencies | Yes | You are not allowed to use anything besides the standard library and the following packages: annotated-types==0.7.0 anyio==4.7.0 certifi==2024.12.14 click==8.1.7 dnspython==2.7.0 email_validator==2.2.0 fastapi==0.115.6 fastapi-cli==0.0.7 h11==0.14.0 httpcore==1.0.7 httptools==0.6.4 httpx==0.28.1 Jinja2==3.1.4 python-jose==3.3.0 markdown-it-py==3.0.0 MarkupSafe==3.0.2 mdurl==0.1.2 pydantic==2.10.3 pydantic_core==2.27.1 Pygments==2.18.0 python-dotenv==1.0.1 python-multipart==0.0.20 PyYAML==6.0.2 rich==13.9.4 rich-toolkit==0.12.0 shellingham==1.5.4 sniffio==1.3.1 starlette[full]==0.41.3 typer==0.15.1 typing_extensions==4.12.2 uvicorn==0.34.0 uvloop==0.21.0 watchfiles==1.0.3 websockets==14.1 passlib==1.7.4 pyjwt==2.10.0 |
| Experiment Setup | Yes | Experimental Setup We test 11 state-of-the-art LLMs on BAXBENCH: OPENAI O1 (Jaech et al., 2024), OPENAI O3-MINI (OpenAI, 2025), GPT-4O (Hurst et al., 2024), CLAUDE-3.5 SONNET (Anthropic, 2024), DEEPSEEK-R1 (Guo et al., 2025), DEEPSEEK-V3 (Liu et al., 2024a), CODESTRAL (Mistral AI, 2024), QWEN2.5 CODER (Hui et al., 2024), LLAMA-3.3 70B (Dubey et al., 2024), QWEN2.5 72B (Yang et al., 2024a), and QWEN2.5 7B (Yang et al., 2024a), spanning 6 providers, with 4 closed-source and 7 open-source models. For each task, we sample 10 solutions from all non-reasoning models at temperature 0.4. For the reasoning models, OPENAI O1, OPENAI O3-MINI, and DEEPSEEK-R1, we sample only 1 solution, as they are both cost- and time-intensive to evaluate. We use temperature 0 for DEEPSEEK-R1, while for OPENAI O1 and OPENAI O3-MINI, there is no modifiable temperature parameter. The functionality instructions are provided as OpenAPI specifications. We show the advantage of these exact specifications over plaintext descriptions in a separate experiment, justifying our choice. Following prior work (Chen et al., 2021; Fu et al., 2024), we measure the models' performance using the pass@k and sec_pass@k metrics, with k = 1 in the main paper. |
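The pass@k metric cited in the experiment setup is the standard unbiased estimator from Chen et al. (2021). A minimal sketch is given below; the function name and the sec_pass@k interpretation in the comment are illustrative, not taken from the BaxBench codebase:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k samples drawn without replacement from n
    generations is among the c correct ones, i.e. 1 - C(n-c, k)/C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples must contain at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# For sec_pass@k, c would instead count samples that are both
# functionally correct and withstand all security exploits.
```

For example, with the paper's setting of 10 samples per task and k = 1, a task where 4 of 10 generations are correct yields `pass_at_k(10, 4, 1) == 0.4`, matching the intuition that pass@1 is simply the fraction of correct samples.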