Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning

Authors: Zenan Li, Zhaoyu Li, Wen Tang, Xian Zhang, Yuan Yao, Xujie Si, Fan Yang, Kaiyu Yang, Xiaoxing Ma

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate our framework on 161 challenging inequalities from multiple mathematics competitions, achieving state-of-the-art performance and significantly outperforming existing LLM and symbolic approaches without requiring additional training data." "We evaluate LIPS on 161 challenging inequalities collected from three problem sets (Chen, 2014; Tung, 2012; Wei et al., 2024). The experimental results show that LIPS consistently achieves state-of-the-art performance and significantly outperforms existing neural and symbolic methods in both effectiveness and speed." "Table 1: Proof success rates (%) on three datasets. LIPS consistently outperforms baselines."
Researcher Affiliation | Collaboration | Zenan Li (1), Zhaoyu Li (2), Wen Tang (3), Xian Zhang (4), Yuan Yao (1), Xujie Si (2), Fan Yang (4), Kaiyu Yang (5), Xiaoxing Ma (1). Affiliations: (1) State Key Lab of Novel Software Technology, Nanjing University, China; (2) University of Toronto; (3) Peking University; (4) Microsoft Research; (5) Meta FAIR.
Pseudocode | Yes | "Algorithm 3 outlines the overall process of LIPS proof generation. The detailed steps for tactic generation and pruning are provided in Algorithm 1, while the specifics for goal filtering and ranking are described in Algorithm 2." (Algorithms 1, 2, and 3 are present in Appendix B.)
Open Source Code | Yes | "The code, together with the experimental data, is available at https://github.com/Lizn-zn/NeqLIPS."
Open Datasets | Yes | "We evaluate LIPS on three datasets: Chen NEQ, MO-INT-20, and 567NEQ. Chen NEQ consists of 41 Olympiad-level inequalities collected by Chen (2014); MO-INT-20 is a new competition-level inequality benchmark introduced in AIPS (Wei et al., 2024), featuring 20 problems sourced from IMO shortlists and various national mathematical Olympiads; 567NEQ consists of 567 hard inequalities created by Tung (2012), from which we randomly selected 100 problems as the testbed for our framework."
Dataset Splits | No | The paper describes the three evaluation datasets (Chen NEQ, MO-INT-20, and 567NEQ, quoted above) and the random selection of 100 of the 567NEQ problems, but it does not specify training, validation, or test splits for LIPS's own learning process or for reproducing specific experimental conditions beyond problem selection.
Hardware Specification | Yes | "The experiments were conducted on four Linux servers, each equipped with 4 Intel(R) Xeon(R) Platinum 8280L CPUs @ 2.80GHz. Each server ran Ubuntu 22.04 with GNU/Linux kernel 6.5.0-1015-azure. Each proving task was performed within a Docker sandbox, utilizing 192 assigned CPU cores."
Software Dependencies | Yes | "Each server ran Ubuntu 22.04 with GNU/Linux kernel 6.5.0-1015-azure. For the LLM involved in transformation tactic generation and proof goal ranking, we use GPT-4o (version Azure-0501). RC-CAD refers to the CylindricalAlgebraicDecompose function in the RegularChains package of Maple 2024; for MMA, we integrate two commands, i.e., Reduce and FindInstance in Wolfram Mathematica (version 13.0.1)."
Experiment Setup | Yes | "The time limit for searching counterexamples is set to 5 seconds. We use a low-temperature setting of GPT-4o in rewriting tactic generation and neural ranking (temperature=0.1, top_p=1.0, max_tokens=2048). In symbolic filtering, we fix the size of the filtered goal set to 10. For each proving task, a time limit of 90 minutes is imposed."
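To make the quoted budgets concrete, here is a minimal, self-contained sketch of the counterexample-search step. All names are hypothetical, and the random numeric search is only a stand-in for the symbolic tools the paper actually uses (Mathematica's Reduce/FindInstance, Maple's CAD); it is not the authors' implementation.

```python
import random
import time

# Budgets and sampling settings quoted from the experiment setup above.
GPT4O_SAMPLING = {"temperature": 0.1, "top_p": 1.0, "max_tokens": 2048}
COUNTEREXAMPLE_TIMEOUT_S = 5        # per counterexample search
FILTERED_GOAL_SET_SIZE = 10         # goals kept after symbolic filtering
TASK_TIME_LIMIT_S = 90 * 60         # wall-clock budget per proving task

def find_counterexample(goal, num_vars, timeout_s=COUNTEREXAMPLE_TIMEOUT_S):
    """Sample random positive reals until an assignment violates `goal`
    (a callable returning True when the inequality holds), or the time
    budget runs out. A numeric stand-in for symbolic disproving; the
    paper delegates this step to Mathematica/Maple.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        xs = [random.uniform(1e-3, 10.0) for _ in range(num_vars)]
        if not goal(xs):
            return xs  # counterexample found: goal is refuted
    return None        # no counterexample within the budget

# A true inequality (AM-GM with slack): x^2 + y^2 + 1 >= 2xy survives search.
true_goal = lambda v: v[0]**2 + v[1]**2 + 1 >= 2 * v[0] * v[1]
print(find_counterexample(true_goal, 2, timeout_s=0.2))   # None

# A false claim, x + y >= 20 over (0, 10]^2, is refuted almost immediately.
false_goal = lambda v: v[0] + v[1] >= 20
print(find_counterexample(false_goal, 2, timeout_s=0.2) is not None)  # True
```

The design point this illustrates is why a short (5-second) counterexample budget is useful: refuting a false subgoal is usually far cheaper than proving a true one, so a bounded disproving pass can prune bad proof goals before the expensive search continues.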