MathConstruct: Challenging LLM Reasoning with Constructive Proofs
Authors: Mislav Balunović, Jasper Dekoninck, Nikola Jovanović, Ivo Petrov, Martin Vechev
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Sec. 4, we evaluate state-of-the-art LLMs on MathConstruct, including GPT-4o and o3-mini (Jaech et al., 2024), the Gemini family (Reid et al., 2024), and the Claude family (Anthropic, 2024). By generating variations of each problem, we evaluate the models on 475 distinct problem instances. Even with access to code execution, these models struggle with MathConstruct. The best model achieves only 54% accuracy, as shown in Fig. 1; complete results are presented in Table 2. |
| Researcher Affiliation | Academia | 1Department of Computer Science, ETH Zurich 2INSAIT, Sofia University "St. Kliment Ohridski". |
| Pseudocode | Yes | Figure 3: A problem encoding consisting of a symbolic problem statement P, concrete parameters θ, and a verification function Vθ. We also implement a variation generator, and provide formatting instructions for the construction. ... def verify(construction): ... def generate_variation(): |
| Open Source Code | Yes | Our code is available at https://github.com/eth-sri/mathconstruct. |
| Open Datasets | Yes | We introduce MathConstruct, a new benchmark of 126 challenging problems sourced from olympiad-level mathematics competitions. ... Our code is available at https://github.com/eth-sri/mathconstruct. |
| Dataset Splits | No | No specific training/test/validation dataset splits are provided in the traditional machine learning sense. The paper mentions generating "475 distinct problem instances" from "126 unique problems" by varying parameters for evaluation, but this is a method of generating evaluation instances, not a dataset split for model training/testing. |
| Hardware Specification | Yes | To safely execute untrusted LLM code, we conducted all experiments within a lightweight Docker container. Any generated code was executed on a single core of an Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz with 1GB of RAM. |
| Software Dependencies | No | The coding agents operated in an isolated environment without network access, restricted to the standard Python libraries along with numpy, scipy, and sympy. |
| Experiment Setup | Yes | Each model was queried with a temperature of 1 and nucleus sampling with parameter top_p = 0.9. We use a maximum output length of 16000 tokens, except for the o1 model family and for Flash-Thinking, where we increased this to 32000 tokens and an unlimited number of tokens, respectively. ... To ensure correct parsing, models receive two rounds of feedback from the parser, allowing them to refine their answers. ... Specifically, each model can execute Python code up to four times per problem to generate solutions, verify reasoning steps, or perform calculations. |
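The problem encoding quoted in the Pseudocode row (symbolic statement P, concrete parameters θ, verifier Vθ, and a variation generator) can be sketched as a small Python class. This is an illustrative sketch only; the class name, the example problem, and the method signatures beyond `verify` and `generate_variation` are assumptions, not the benchmark's actual API.

```python
import random


# Hypothetical sketch of a MathConstruct-style problem encoding.
# Example problem P: "find n distinct positive integers whose sum
# is a perfect square", with n as the concrete parameter theta.
class SquareSumProblem:
    def __init__(self, n: int):
        self.n = n  # concrete parameter theta

    def verify(self, construction: list[int]) -> bool:
        # V_theta: check length, distinctness, positivity, and the
        # square-sum property of the proposed construction.
        if len(construction) != self.n:
            return False
        if len(set(construction)) != self.n or any(x <= 0 for x in construction):
            return False
        s = sum(construction)
        return int(s ** 0.5) ** 2 == s

    @classmethod
    def generate_variation(cls, rng: random.Random) -> "SquareSumProblem":
        # Sample a fresh parameter value to instantiate a problem variant,
        # as in the paper's 126-problems-to-475-instances expansion.
        return cls(n=rng.randint(3, 10))
```

For example, `SquareSumProblem(3).verify([1, 2, 6])` accepts the construction because 1 + 2 + 6 = 9 = 3².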
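The sandbox described in the Hardware Specification row (a lightweight Docker container pinned to a single CPU core with 1 GB of RAM, and per the Software Dependencies row, no network access) could be approximated with a `docker run` invocation like the following. This is a configuration sketch under those stated constraints; the image, mount path, and script name are illustrative assumptions.

```shell
# Hypothetical sandbox for untrusted LLM-generated code:
# one CPU core, 1 GB RAM, no network, read-only solution mount.
docker run --rm \
  --cpus=1 \
  --memory=1g \
  --network=none \
  -v "$PWD/solution.py:/work/solution.py:ro" \
  python:3.11-slim \
  python /work/solution.py
```

Pinning `--cpus` and `--memory` bounds resource use, while `--network=none` matches the paper's isolation requirement.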
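The query settings quoted in the Experiment Setup row can be collected into a small helper that returns per-model sampling parameters. The function name, the model-name prefixes, and the use of a plain dict are assumptions for illustration; the keys mirror common LLM API parameters, and the values are the ones reported in the paper.

```python
# Hypothetical sketch of the per-model sampling configuration:
# temperature 1, top_p 0.9, 16000 max tokens, with 32000 for the
# o1 family and no cap for Flash-Thinking.
def sampling_config(model: str) -> dict:
    cfg = {
        "temperature": 1.0,
        "top_p": 0.9,
        "max_tokens": 16000,
    }
    if model.startswith("o1"):
        cfg["max_tokens"] = 32000
    elif model == "flash-thinking":
        cfg["max_tokens"] = None  # unlimited output length
    return cfg
```

Keeping the exceptions in one place makes the two special cases (o1 family, Flash-Thinking) easy to audit against the reported setup.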