SBSC: Step-by-Step Coding for Improving Mathematical Olympiad Performance

Authors: Kunal Singh, Ankan Biswas, Sayandeep Bhowmick, Pradeep Moturi, Siva Gollapalli

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments highlight the effectiveness of SBSC in tackling competition- and Olympiad-level math problems. For Claude-3.5-Sonnet, we observe SBSC (greedy decoding) surpasses existing state-of-the-art (SOTA) program-generation-based reasoning strategies by an absolute 10.7% on AMC12, 8% on AIME and 12.6% on Math Odyssey. Given SBSC is multi-turn in nature, we also benchmark SBSC's greedy decoding against self-consistency decoding results of existing SOTA math reasoning strategies and observe performance gains of an absolute 6.2% on AMC, 6.7% on AIME and 7.4% on Math Odyssey. Scripts & data are uploaded at this link.
Researcher Affiliation | Industry | Kunal Singh, Ankan Biswas, Sayandeep Bhowmick, Pradeep Moturi, Siva Kishore Gollapalli; Fractal AI Research, Mumbai, India; {kunal.singh}@fractal.ai
Pseudocode | No | The paper describes the framework and provides mathematical generalizations (Equation 1) and extensive code examples in Appendix A.6, but it does not include a distinct block explicitly labeled 'Pseudocode' or 'Algorithm' that outlines the SBSC methodology in a structured, generalized format separate from the examples.
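Although the paper provides no labeled pseudocode block, the multi-turn, step-by-step code-generation loop it describes can be sketched roughly as follows. This is a minimal illustration only: the helpers `ask_llm` and `run_code` and the `FINAL ANSWER` stop marker are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of a multi-turn, step-wise code-generation loop
# (illustrative only; ask_llm / run_code are hypothetical stand-ins).

def sbsc_style_loop(problem, ask_llm, run_code, max_turns=15):
    """At each turn, request code for the next sub-step, execute it,
    and feed the output back until a final answer is produced."""
    history = [f"Problem: {problem}"]
    for turn in range(max_turns):
        step_code = ask_llm("\n".join(history))  # code for the next sub-step
        output = run_code(step_code)             # execute, capture the output
        history.append(f"Step {turn + 1} output: {output}")
        if "FINAL ANSWER" in output:             # assumed stop convention
            return output
    return None  # no answer within the turn budget
```

The `max_turns=15` default mirrors the maximum number of turns (n = 15) reported in the paper's experiment setup.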
Open Source Code | Yes | Scripts & data are uploaded at this link.
Open Datasets | Yes | We mainly use problems from 4 popular math competition datasets for benchmarking our performance: AIME, AMC, Math Odyssey (Fang et al., 2024) and OlympiadBench (He et al., 2024), covering multiple domains, mainly Algebra, Combinatorics, Number Theory and Geometry. We use problems from the last 11 years of AMC and AIME, obtaining questions and answers (Q&A) in LaTeX format from the AoPS Wiki website.
Dataset Splits | Yes | The final test set contains 330 AIME, 475 AMC-12, 158 Math Odyssey & 504 OlympiadBench problems. ... For both AIME and AMC, we select 90 questions each, drawn from problems of years other than those included in the evaluation datasets. ... For each dataset, we create a subset of 10 problems correctly solved by every method and finally select a combination of 4 exemplars among them.
Hardware Specification | No | The paper mentions using proprietary LLMs such as 'gpt-4o-2024-05-13 and Claude-3.5-Sonnet' and accessing 'DeepSeek Coder 2.5 via API', indicating that the authors did not manage or specify the hardware used for running their experiments. No specific CPU, GPU, or cloud instance types are mentioned.
Software Dependencies | No | The paper mentions using 'sympy-based python code' and shows imports for libraries such as 'sympy', 'numpy', 'collections.deque', 'fractions.Fraction', and 'math' in its code examples. However, no specific version numbers are provided for Python or any of these libraries.
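As an illustration of the kind of sympy-based step code the response refers to, here is a generic sketch (a toy sub-step, not an example taken from the paper's appendix):

```python
from sympy import symbols, solve

# One sympy-based sub-step of a competition-style problem:
# find the roots of x^2 - 5x + 6 = 0 and keep the larger one.
x = symbols("x")
roots = solve(x**2 - 5 * x + 6, x)  # exact symbolic roots: 2 and 3
answer = max(roots)
print(answer)  # 3
```

Because no library versions are pinned in the paper, reproducing such steps exactly may depend on the sympy version installed.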
Experiment Setup | Yes | For all datasets and all reasoning frameworks, we use a 4-shot setting. The maximum number of turns (n) for SBSC is set to 15. For greedy decoding inference, we use temperature=0 and max_tokens=1024; also, we run 3 times and report the average. For greedy decoding of TIR-ToRA, we keep n = 15 as well (Note: this is because although in the TIR-ToRA strategy the model attempts to solve the entire problem in a single turn, in case of execution error or readjustment it tries to re-attempt in subsequent turns). We also benchmark SBSC's greedy decoding results against self-consistency (SC) (Wang et al., 2022) decoding results (majority@7) of COT, PAL & TIR-ToRA. We do this primarily for two reasons: first, SBSC takes multiple turns before arriving at the final answer (on average 6-7 turns per problem, Table 3 in Appendix A.1) and secondly, to benchmark against the reliance of the current existing prompting strategies on majority voting for boosting accuracy. For SC decoding, we use temperature=0.7 and top_p=0.9. Note: we experimentally observe that for n > 4, there is an insignificant increase in accuracy for TIR-ToRA, so we set n=4 for TIR-ToRA during SC decoding.
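The self-consistency (majority@7) baseline aggregates seven sampled answers by majority vote; the aggregation step can be sketched as follows (an illustrative sketch of standard majority voting, not the authors' script):

```python
from collections import Counter

def majority_at_k(answers):
    """Return the most frequent final answer among k sampled runs,
    as in self-consistency (majority@k) decoding."""
    counts = Counter(a for a in answers if a is not None)  # drop failed runs
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Seven sampled answers (the paper samples with temperature=0.7, top_p=0.9):
samples = ["113", "113", "56", "113", None, "56", "113"]
print(majority_at_k(samples))  # 113
```

In contrast, SBSC is reported with greedy decoding only (temperature=0), so no such vote is taken over its runs.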