MathConstruct: Challenging LLM Reasoning with Constructive Proofs
Authors: Mislav Balunović, Jasper Dekoninck, Nikola Jovanović, Ivo Petrov, Martin Vechev
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Sec. 4, we evaluate state-of-the-art LLMs on MathConstruct, including GPT-4o and o3-mini (Jaech et al., 2024), the Gemini family (Reid et al., 2024), and the Claude family (Anthropic, 2024). By generating variations of each problem, we evaluate the models on 475 distinct problem instances. Even with access to code execution, these models struggle with MathConstruct. The best model achieves only 54% accuracy, as shown in Fig. 1; complete results are presented in Table 2. |
| Researcher Affiliation | Academia | 1Department of Computer Science, ETH Zurich 2INSAIT, Sofia University "St. Kliment Ohridski". |
| Pseudocode | Yes | Figure 3: A problem encoding consisting of a symbolic problem statement P, concrete parameters θ, and a verification function Vθ. We also implement a variation generator, and provide formatting instructions for the construction. ... def verify(construction): ... def generate_variation(): |
| Open Source Code | Yes | Our code is available at https://github.com/eth-sri/mathconstruct. |
| Open Datasets | Yes | We introduce MathConstruct, a new benchmark of 126 challenging problems sourced from olympiad-level mathematics competitions. ... Our code is available at https://github.com/eth-sri/mathconstruct. |
| Dataset Splits | No | No specific training/test/validation dataset splits are provided in the traditional machine learning sense. The paper mentions generating "475 distinct problem instances" from "126 unique problems" by varying parameters for evaluation, but this is a method of generating evaluation instances, not a dataset split for model training/testing. |
| Hardware Specification | Yes | To safely execute untrusted LLM code, we conducted all experiments within a lightweight Docker container. Any generated code was executed on a single core of an Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz with 1GB of RAM. |
| Software Dependencies | No | The coding agents operated in an isolated environment without network access, restricted to the standard Python libraries along with numpy, scipy, and sympy. |
| Experiment Setup | Yes | Each model was queried with a temperature of 1 and nucleus sampling with parameter top_p = 0.9. We use a maximum output length of 16000 tokens, except for the o1 model family and for Flash-Thinking, where we increased this to 32000 tokens and an unlimited number of tokens, respectively. ... To ensure correct parsing, models receive two rounds of feedback from the parser, allowing them to refine their answers. ... Specifically, each model can execute Python code up to four times per problem to generate solutions, verify reasoning steps, or perform calculations. |
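The problem encoding quoted in the Pseudocode row (symbolic statement P, concrete parameters θ, verifier Vθ, and a variation generator) can be sketched as a small Python class. This is an illustrative sketch only; the class name, the example problem, and the method signatures beyond `verify` and `generate_variation` are assumptions, not the benchmark's actual API.

```python
import random


# Hypothetical sketch of a MathConstruct-style problem encoding.
# Example problem P: "find n distinct positive integers whose sum
# is a perfect square", with n as the concrete parameter theta.
class SquareSumProblem:
    def __init__(self, n: int):
        self.n = n  # concrete parameter theta

    def verify(self, construction: list[int]) -> bool:
        # V_theta: check length, distinctness, positivity, and the
        # square-sum property of the proposed construction.
        if len(construction) != self.n:
            return False
        if len(set(construction)) != self.n or any(x <= 0 for x in construction):
            return False
        s = sum(construction)
        return int(s ** 0.5) ** 2 == s

    @classmethod
    def generate_variation(cls, rng: random.Random) -> "SquareSumProblem":
        # Sample a fresh parameter value to instantiate a problem variant,
        # as in the paper's 126-problems-to-475-instances expansion.
        return cls(n=rng.randint(3, 10))
```

For example, `SquareSumProblem(3).verify([1, 2, 6])` accepts the construction because 1 + 2 + 6 = 9 = 3².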
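The sandbox described in the Hardware Specification row (a lightweight Docker container pinned to a single CPU core with 1 GB of RAM, and per the Software Dependencies row, no network access) could be approximated with a `docker run` invocation like the following. This is a configuration sketch under those stated constraints; the image, mount path, and script name are illustrative assumptions.

```shell
# Hypothetical sandbox for untrusted LLM-generated code:
# one CPU core, 1 GB RAM, no network, read-only solution mount.
docker run --rm \
  --cpus=1 \
  --memory=1g \
  --network=none \
  -v "$PWD/solution.py:/work/solution.py:ro" \
  python:3.11-slim \
  python /work/solution.py
```

Pinning `--cpus` and `--memory` bounds resource use, while `--network=none` matches the paper's isolation requirement.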
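The query settings quoted in the Experiment Setup row can be collected into a small helper that returns per-model sampling parameters. The function name, the model-name prefixes, and the use of a plain dict are assumptions for illustration; the keys mirror common LLM API parameters, and the values are the ones reported in the paper.

```python
# Hypothetical sketch of the per-model sampling configuration:
# temperature 1, top_p 0.9, 16000 max tokens, with 32000 for the
# o1 family and no cap for Flash-Thinking.
def sampling_config(model: str) -> dict:
    cfg = {
        "temperature": 1.0,
        "top_p": 0.9,
        "max_tokens": 16000,
    }
    if model.startswith("o1"):
        cfg["max_tokens"] = 32000
    elif model == "flash-thinking":
        cfg["max_tokens"] = None  # unlimited output length
    return cfg
```

Keeping the exceptions in one place makes the two special cases (o1 family, Flash-Thinking) easy to audit against the reported setup.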