Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models
Authors: Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, Baobao Chang
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To bridge this gap, we propose a comprehensive and challenging benchmark specifically designed to assess LLMs' mathematical reasoning at the Olympiad level. ... Furthermore, we conducted an in-depth analysis based on this benchmark. Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, with 60.54% and 52.55% accuracy respectively, highlighting significant challenges in Olympiad-level mathematical reasoning. |
| Researcher Affiliation | Collaboration | 1Peking University 2University of Wisconsin Madison 3Alibaba Group 4Shanghai Jiao Tong University 5Engineering Research Center of Information Networks 6The Chinese University of Hong Kong, Shenzhen 7Institute of Software, Chinese Academy of Sciences 8University of Waterloo 9The University of Hong Kong 10Zhongguancun Laboratory |
| Pseudocode | No | The paper describes methods and processes (e.g., data collection, evaluation), but does not present them in structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | GitHub Repo [GitHub Page] Rule-based Repo [GitHub Page] Project Page [Project Page & Leaderboard] Dataset [Huggingface Dataset] Omni-Judge [Huggingface Model] |
| Open Datasets | Yes | Dataset [Huggingface Dataset] |
| Dataset Splits | Yes | Specifically, we first constructed a dataset for training (17,618), validation (2,200), and test (2,200) based on evaluation results from GPT-4o, which have no overlaps of questions with each other. |
| Hardware Specification | No | The paper mentions using the "vLLM framework" and setting token limits for API models, but it does not specify any particular CPU or GPU models, or detailed hardware specifications for running the experiments. |
| Software Dependencies | No | The paper mentions using the "vLLM framework" and "SymPy", but does not provide specific version numbers for these or any other software libraries or tools. |
| Experiment Setup | Yes | To mitigate randomness in the responses, we set the parameters as follows: temperature = 0, top_p = 1, and a maximum of 2048 tokens. For o1-preview and o1-mini, due to constraints of inference costs, we configured the maximum completion tokens to 4096. |
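The decoding settings quoted in the Experiment Setup row can be summarized in a small helper function (a minimal sketch for illustration; the function name and dict layout are my own, with parameter values taken from the paper — note that in practice the o1 API models may ignore or reject sampling parameters like temperature):

```python
def decoding_config(model: str) -> dict:
    """Return the decoding parameters the paper reports for evaluation:
    temperature = 0 and top_p = 1 to mitigate randomness, with a
    2048-token generation cap for most models and a 4096 maximum
    completion tokens budget for o1-preview / o1-mini."""
    cfg = {"model": model, "temperature": 0, "top_p": 1}
    if model in ("o1-preview", "o1-mini"):
        # Larger budget for the o1 models due to inference-cost constraints.
        cfg["max_completion_tokens"] = 4096
    else:
        cfg["max_tokens"] = 2048
    return cfg

print(decoding_config("gpt-4o"))
print(decoding_config("o1-mini"))
```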