Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation

Authors: Sadegh Mahdavi, Muchen Li, Kaiwen Liu, Christos Thrampoulidis, Leonid Sigal, Renjie Liao

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments demonstrate that finetuning LLMs on AoPS-Instruct improves their reasoning abilities across various benchmarks. Moreover, we build an automatic pipeline that introduces LiveAoPSBench, an evolving evaluation set with timestamps, derived from the latest forum data, providing a contamination-resistant benchmark for assessing LLM performance. Notably, we observe a significant decline in LLM performance over time, suggesting their success on older examples may stem from pre-training exposure rather than true reasoning ability."
Researcher Affiliation | Academia | "1University of British Columbia, Vancouver, Canada; 2Vector Institute for AI, Toronto, Canada; 3Canada CIFAR AI Chair; 4NSERC CRC Chair. Correspondence to: Sadegh Mahdavi <EMAIL>, Muchen Li <EMAIL>."
Pseudocode | No | The paper describes methods and pipelines in prose and with diagrams (e.g., Figure 2), but does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Our benchmark and code is available at https://livemathbench.github.io/leaderboard.html"
Open Datasets | Yes | The paper states "Our benchmark and code is available at https://livemathbench.github.io/leaderboard.html" and mentions "AoPS-Instruct, a dataset of more than 600,000 high-quality QA pairs" and "LiveAoPSBench, an evolving evaluation set". It also cites other public datasets such as "GSM8K (Cobbe et al., 2021)" and "MATH (Hendrycks et al., 2021b)".
Dataset Splits | Yes | "Topics posted up until December 2023 are used as the training set, while those posted between January and August 2024 are reserved as the evaluation dataset." and "LiveAoPSBench, is sourced from the AoPS forum, with posts strictly between January 2023 and September 2024."
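The timestamp-based split quoted above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline; the record fields (`question`, `answer`, `timestamp`) and the cutoff constant are assumptions for the example.

```python
from datetime import datetime

# Hypothetical forum-post records; field names are assumptions, not the paper's schema.
posts = [
    {"question": "Evaluate 1+1", "answer": "2", "timestamp": "2022-05-10"},
    {"question": "Solve x^2 = 4", "answer": "x = 2 or x = -2", "timestamp": "2024-03-02"},
]

# Train on posts up to December 2023; reserve later posts for evaluation.
CUTOFF = datetime(2024, 1, 1)

def split_by_time(posts):
    train, evaluation = [], []
    for post in posts:
        ts = datetime.strptime(post["timestamp"], "%Y-%m-%d")
        (train if ts < CUTOFF else evaluation).append(post)
    return train, evaluation

train_set, eval_set = split_by_time(posts)
```

Because newer posts cannot have leaked into a model's pre-training corpus, evaluating on the post-cutoff slice is what makes the benchmark contamination-resistant.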
Hardware Specification | No | "Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through the Digital Research Alliance of Canada alliance.can.ca, and companies sponsoring the Vector Institute www.vectorinstitute.ai/#partners, and Advanced Research Computing at the University of British Columbia. Additional hardware support was provided by John R. Evans Leaders Fund CFI grant." This text describes general computing resources and funding bodies but lacks specific hardware details (e.g., GPU/CPU models, memory specifications).
Software Dependencies | No | The paper mentions using "SymPy-based" checks for symbolic equivalence, and various LLM models such as "Qwen 2.5 14B" and "Llama 3.1 70B" as part of its methodology. However, it does not provide specific version numbers for the underlying software frameworks, libraries, or programming languages used for implementation (e.g., PyTorch version, Python version, Transformers library version).
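A SymPy-based equivalence check of the kind referenced above can be sketched as follows. This is an assumed minimal implementation for illustration, not the authors' actual grading code; the function name and fallback behavior are hypothetical.

```python
import sympy as sp
from sympy.parsing.sympy_parser import parse_expr

def symbolically_equivalent(ans_a: str, ans_b: str) -> bool:
    """Return True if two answer strings are symbolically equal.

    Hypothetical sketch: parse both strings and check whether their
    difference simplifies to zero; fall back to exact string match
    when parsing fails (e.g., for non-expression answers).
    """
    try:
        diff = parse_expr(ans_a) - parse_expr(ans_b)
    except (sp.SympifyError, SyntaxError, TypeError):
        return ans_a.strip() == ans_b.strip()
    return sp.simplify(diff) == 0
```

Such a check accepts answers that differ only in form, e.g. `(x+1)**2` versus `x**2 + 2*x + 1`, which plain string comparison would reject.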
Experiment Setup | Yes | "Consistent with prior work, we train each model for three epochs (Shao et al., 2024; Yang et al., 2024b), as we observe additional epochs provide no further benefit (see Figure 10 in the Appendix for ablation studies on the number of epochs)."