Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation

Authors: Sadegh Mahdavi, Muchen Li, Kaiwen Liu, Christos Thrampoulidis, Leonid Sigal, Renjie Liao

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments demonstrate that finetuning LLMs on AoPS-Instruct improves their reasoning abilities across various benchmarks. Moreover, we build an automatic pipeline that introduces LiveAoPSBench, an evolving evaluation set with timestamps, derived from the latest forum data, providing a contamination-resistant benchmark for assessing LLM performance. Notably, we observe a significant decline in LLM performance over time, suggesting their success on older examples may stem from pre-training exposure rather than true reasoning ability."
Researcher Affiliation | Academia | "1University of British Columbia, Vancouver, Canada; 2Vector Institute for AI, Toronto, Canada; 3Canada CIFAR AI Chair; 4NSERC CRC Chair. Correspondence to: Sadegh Mahdavi <EMAIL>, Muchen Li <EMAIL>."
Pseudocode | No | The paper describes methods and pipelines in prose and with diagrams (e.g., Figure 2), but does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Our benchmark and code is available at https://livemathbench.github.io/leaderboard.html"
Open Datasets | Yes | The paper states "Our benchmark and code is available at https://livemathbench.github.io/leaderboard.html" and mentions "AoPS-Instruct, a dataset of more than 600,000 high-quality QA pairs" and "LiveAoPSBench, an evolving evaluation set". It also cites other public datasets such as "GSM8K (Cobbe et al., 2021)" and "MATH (Hendrycks et al., 2021b)".
Dataset Splits | Yes | "Topics posted up until December 2023 are used as the training set, while those posted between January and August 2024 are reserved as the evaluation dataset." and "LiveAoPSBench, is sourced from the AoPS forum, with posts strictly between January 2023 and September 2024."
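The timestamp-based split quoted above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline; the record fields (`question`, `answer`, `timestamp`) and the cutoff constant are assumptions for the example.

```python
from datetime import datetime

# Hypothetical forum-post records; field names are assumptions, not the paper's schema.
posts = [
    {"question": "Evaluate 1+1", "answer": "2", "timestamp": "2022-05-10"},
    {"question": "Solve x^2 = 4", "answer": "x = 2 or x = -2", "timestamp": "2024-03-02"},
]

# Train on posts up to December 2023; reserve later posts for evaluation.
CUTOFF = datetime(2024, 1, 1)

def split_by_time(posts):
    train, evaluation = [], []
    for post in posts:
        ts = datetime.strptime(post["timestamp"], "%Y-%m-%d")
        (train if ts < CUTOFF else evaluation).append(post)
    return train, evaluation

train_set, eval_set = split_by_time(posts)
```

Because newer posts cannot have leaked into a model's pre-training corpus, evaluating on the post-cutoff slice is what makes the benchmark contamination-resistant.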
Hardware Specification | No | "Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through the Digital Research Alliance of Canada alliance.can.ca, and companies sponsoring the Vector Institute www.vectorinstitute.ai/#partners, and Advanced Research Computing at the University of British Columbia. Additional hardware support was provided by John R. Evans Leaders Fund CFI grant." This text describes general computing resources and funding bodies but lacks specific hardware details (e.g., GPU/CPU models, memory specifications).
Software Dependencies | No | The paper mentions using "SymPy-based" checks for symbolic equivalence, and various LLM models such as "Qwen 2.5 14B" and "Llama 3.1 70B" as part of its methodology. However, it does not provide specific version numbers for the underlying software frameworks, libraries, or programming languages used for implementation (e.g., PyTorch version, Python version, Transformers library version).
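A SymPy-based equivalence check of the kind referenced above can be sketched as follows. This is an assumed minimal implementation for illustration, not the authors' actual grading code; the function name and fallback behavior are hypothetical.

```python
import sympy as sp
from sympy.parsing.sympy_parser import parse_expr

def symbolically_equivalent(ans_a: str, ans_b: str) -> bool:
    """Return True if two answer strings are symbolically equal.

    Hypothetical sketch: parse both strings and check whether their
    difference simplifies to zero; fall back to exact string match
    when parsing fails (e.g., for non-expression answers).
    """
    try:
        diff = parse_expr(ans_a) - parse_expr(ans_b)
    except (sp.SympifyError, SyntaxError, TypeError):
        return ans_a.strip() == ans_b.strip()
    return sp.simplify(diff) == 0
```

Such a check accepts answers that differ only in form, e.g. `(x+1)**2` versus `x**2 + 2*x + 1`, which plain string comparison would reject.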
Experiment Setup | Yes | "Consistent with prior work, we train each model for three epochs (Shao et al., 2024; Yang et al., 2024b), as we observe additional epochs provide no further benefit (see Figure 10 in the Appendix for ablation studies on the number of epochs)."