HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics
Authors: Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, Michael P. Brenner
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate both open- and closed-source LLMs on HARDMATH-MINI, a sub-sampled test set of 366 problems, as well as on 40 word problems formulated in applied science contexts. Even leading closed-source models like GPT-4 achieve only 43.8% overall accuracy with few-shot Chain-of-Thought prompting, and all models demonstrate significantly lower performance compared to results on existing mathematics benchmark datasets. |
| Researcher Affiliation | Academia | Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, Michael P. Brenner; School of Engineering and Applied Sciences, Harvard University |
| Pseudocode | No | The paper describes algorithms to automatically generate problems and their step-by-step solutions and presents a flowchart for the data generation procedure (Fig. 2), but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Dataset: https://github.com/sarahmart/HARDMath |
| Open Datasets | Yes | To address this gap, we introduce HARDMATH, a dataset specifically designed to focus on asymptotic reasoning in mathematics. This dataset captures a fundamentally different type of mathematical reasoning compared to other benchmarks and can be useful for evaluating LLMs' abilities to make research-relevant approximations. |
| Dataset Splits | Yes | The main HARDMATH dataset, which can be used for model development (e.g. novel prompting techniques or fine-tuning), contains 1,060 problems, and the evaluation dataset HARDMATH-MINI, which we use in this paper to benchmark LLM performance, contains 366 problems. |
| Hardware Specification | Yes | Evaluations of open-source models on HARDMATH are conducted on a high-performance compute cluster with a single Tesla V100 GPU (16GB VRAM). Evaluation on one problem type typically takes less than 1 hour. |
| Software Dependencies | No | Code for data generation uses SymPy (Meurer et al., 2017), a library for symbolic mathematics, and SciPy, a library for scientific computing (Virtanen et al., 2020), to implement the mathematical procedures required for obtaining approximate, analytical solutions. However, specific version numbers for these libraries are not provided. |
| Experiment Setup | Yes | We compare the performance of several closed- and open-source models on HARDMATH in zero- and few-shot settings with Chain-of-Thought (CoT) (Wei et al., 2023) prompting. We provide the prompts and hyper-parameters for LLM evaluations in Appendix A.3.4, Table 7. |