Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models
Authors: Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, Baobao Chang
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To bridge this gap, we propose a comprehensive and challenging benchmark specifically designed to assess LLMs' mathematical reasoning at the Olympiad level. ... Furthermore, we conducted an in-depth analysis based on this benchmark. Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, with 60.54% and 52.55% accuracy respectively, highlighting significant challenges in Olympiad-level mathematical reasoning. |
| Researcher Affiliation | Collaboration | 1Peking University 2University of Wisconsin Madison 3Alibaba Group 4Shanghai Jiao Tong University 5Engineering Research Center of Information Networks 6The Chinese University of Hong Kong, Shenzhen 7Institute of Software, Chinese Academy of Sciences 8University of Waterloo 9The University of Hong Kong 10Zhongguancun Laboratory |
| Pseudocode | No | The paper describes methods and processes (e.g., data collection, evaluation), but does not present them in structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | GitHub Repo [GitHub Page] Rule-based Repo [GitHub Page] Project Page [Project Page & Leaderboard] Dataset [Huggingface Dataset] Omni-Judge [Huggingface Model] |
| Open Datasets | Yes | Dataset [Huggingface Dataset] |
| Dataset Splits | Yes | Specifically, we first constructed a dataset for training (17,618), validation (2,200), and test (2,200) based on evaluation results from GPT-4o, which have no overlaps of questions with each other. |
| Hardware Specification | No | The paper mentions using the "vLLM framework" and setting token limits for API models, but it does not specify any particular CPU or GPU models, or detailed hardware specifications for running the experiments. |
| Software Dependencies | No | The paper mentions using the "vLLM framework" and "SymPy", but does not provide specific version numbers for these or any other software libraries or tools. |
| Experiment Setup | Yes | To mitigate randomness in the responses, we set the parameters as follows: temperature = 0, top_p = 1, and a maximum of 2048 tokens. For o1-preview and o1-mini, due to constraints of inference costs, we configured the maximum completion tokens to 4096. |
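The decoding settings quoted in the Experiment Setup row can be summarized in a small helper function (a minimal sketch for illustration; the function name and dict layout are my own, with parameter values taken from the paper — note that in practice the o1 API models may ignore or reject sampling parameters like temperature):

```python
def decoding_config(model: str) -> dict:
    """Return the decoding parameters the paper reports for evaluation:
    temperature = 0 and top_p = 1 to mitigate randomness, with a
    2048-token generation cap for most models and a 4096 maximum
    completion tokens budget for o1-preview / o1-mini."""
    cfg = {"model": model, "temperature": 0, "top_p": 1}
    if model in ("o1-preview", "o1-mini"):
        # Larger budget for the o1 models due to inference-cost constraints.
        cfg["max_completion_tokens"] = 4096
    else:
        cfg["max_tokens"] = 2048
    return cfg

print(decoding_config("gpt-4o"))
print(decoding_config("o1-mini"))
```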