HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics

Authors: Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, Michael P. Brenner

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate both open- and closed-source LLMs on HARDMATH-MINI, a sub-sampled test set of 366 problems, as well as on 40 word problems formulated in applied science contexts. Even leading closed-source models like GPT-4 achieve only 43.8% overall accuracy with few-shot Chain-of-Thought prompting, and all models demonstrate significantly lower performance compared to results on existing mathematics benchmark datasets.
Researcher Affiliation Academia Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, Michael P. Brenner — School of Engineering and Applied Sciences, Harvard University
Pseudocode No The paper describes algorithms to automatically generate problems and their step-by-step solutions and presents a flowchart for the data generation procedure (Fig. 2), but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Dataset: https://github.com/sarahmart/HARDMath
Open Datasets Yes To address this gap, we introduce HARDMATH, a dataset specifically designed to focus on asymptotic reasoning in mathematics. This dataset captures a fundamentally different type of mathematical reasoning compared to other benchmarks and can be useful for evaluating LLMs' abilities to make research-relevant approximations.
Dataset Splits Yes The main HARDMATH dataset, which can be used for model development (e.g. novel prompting techniques or fine-tuning), contains 1,060 problems, and the evaluation dataset HARDMATH-MINI, which we use in this paper to benchmark LLM performance, contains 366 problems.
Hardware Specification Yes Evaluations of open-source models on HARDMATH are conducted on a high-performance compute cluster with a single Tesla V100 GPU (16 GB VRAM). Evaluation on one problem type typically takes less than 1 hour.
Software Dependencies No Code for data generation uses SymPy (Meurer et al., 2017), a library for symbolic mathematics, and SciPy (Virtanen et al., 2020), a library for scientific computing, to implement the mathematical procedures required for obtaining approximate, analytical solutions. However, specific version numbers for these libraries are not provided.
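To illustrate the kind of SymPy-based procedure the data-generation code relies on, the following is a minimal sketch (not the paper's actual implementation) of deriving an approximate analytical solution via a perturbation expansion — the flavor of asymptotic reasoning HARDMATH targets. The specific polynomial is an illustrative choice, not drawn from the dataset.

```python
import sympy as sp

# Small parameter for the perturbation expansion.
eps = sp.symbols('epsilon', positive=True)

# Exact positive root of x**2 + eps*x - 1 = 0, written in closed form.
root = (-eps + sp.sqrt(eps**2 + 4)) / 2

# Expand the exact root for small eps to obtain an approximate
# analytical solution: 1 - eps/2 + eps**2/8 + O(eps**3).
approx = sp.series(root, eps, 0, 3).removeO()
print(approx)
```

A generation pipeline of this shape can then emit both the problem statement and the step-by-step expansion as a worked solution.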
Experiment Setup Yes We compare the performance of several closed- and open-source models on HARDMATH in zero- and few-shot settings with Chain-of-Thought (CoT) (Wei et al., 2023) prompting. We provide the prompts and hyper-parameters for LLM evaluations in Appendix A.3.4, Table 7.
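As a rough sketch of how a few-shot CoT evaluation prompt is typically assembled, the snippet below concatenates worked exemplars before the test question. The exemplar and instruction text here are hypothetical placeholders, not the paper's actual prompts (those appear in Appendix A.3.4, Table 7).

```python
# Hypothetical few-shot CoT prompt assembly; exemplars are illustrative.
FEW_SHOT_EXAMPLES = [
    {
        "problem": "Find the leading-order root of x**3 + x - 1000 = 0.",
        "solution": ("Step 1: For a large constant term, balance x**3 "
                     "against 1000. Step 2: Then x**3 ~ 1000. "
                     "Final answer: x ~ 10."),
    },
]

def build_cot_prompt(question: str) -> str:
    """Prepend worked examples, then ask for a step-by-step solution."""
    parts = ["Solve each problem step by step, then state the final answer."]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Problem: {ex['problem']}\nSolution: {ex['solution']}")
    parts.append(f"Problem: {question}\nSolution:")
    return "\n\n".join(parts)

prompt = build_cot_prompt("Approximate the integral of exp(-x**2) on [0, 10].")
print(prompt)
```

In a zero-shot setting, the same function with an empty exemplar list reduces to the bare instruction plus the question.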