SBSC: Step-by-Step Coding for Improving Mathematical Olympiad Performance

Authors: Kunal Singh, Ankan Biswas, Sayandeep Bhowmick, Pradeep Moturi, Siva Gollapalli

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments highlight the effectiveness of SBSC in tackling competition- and Olympiad-level math problems. For Claude-3.5-Sonnet, we observe SBSC (greedy decoding) surpasses existing state-of-the-art (SOTA) program-generation-based reasoning strategies by an absolute 10.7% on AMC12, 8% on AIME and 12.6% on Math Odyssey. Given SBSC is multi-turn in nature, we also benchmark SBSC's greedy decoding against self-consistency decoding results of existing SOTA math reasoning strategies and observe performance gains of an absolute 6.2% on AMC, 6.7% on AIME and 7.4% on Math Odyssey. Scripts & data are uploaded at this link.
Researcher Affiliation | Industry | Kunal Singh, Ankan Biswas, Sayandeep Bhowmick, Pradeep Moturi, Siva Kishore Gollapalli; Fractal AI Research, Mumbai, India; {kunal.singh}@fractal.ai
Pseudocode | No | The paper describes the framework and provides mathematical generalizations (Equation 1) and extensive code examples in Appendix A.6, but it does not include a distinct block explicitly labeled 'Pseudocode' or 'Algorithm' that outlines the SBSC methodology in a structured, generalized format separate from the examples.
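Although the paper provides no labeled pseudocode block, the multi-turn, step-by-step code-generation loop it describes can be sketched roughly as follows. This is a minimal illustration only: the helpers `ask_llm` and `run_code` and the `FINAL ANSWER` stop marker are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of a multi-turn, step-wise code-generation loop
# (illustrative only; ask_llm / run_code are hypothetical stand-ins).

def sbsc_style_loop(problem, ask_llm, run_code, max_turns=15):
    """At each turn, request code for the next sub-step, execute it,
    and feed the output back until a final answer is produced."""
    history = [f"Problem: {problem}"]
    for turn in range(max_turns):
        step_code = ask_llm("\n".join(history))  # code for the next sub-step
        output = run_code(step_code)             # execute, capture the output
        history.append(f"Step {turn + 1} output: {output}")
        if "FINAL ANSWER" in output:             # assumed stop convention
            return output
    return None  # no answer within the turn budget
```

The `max_turns=15` default mirrors the maximum number of turns (n = 15) reported in the paper's experiment setup.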
Open Source Code | Yes | Scripts & data are uploaded at this link.
Open Datasets | Yes | We mainly use problems from 4 popular math competition datasets for benchmarking our performance: AIME, AMC, Math Odyssey (Fang et al., 2024) and OlympiadBench (He et al., 2024), covering multiple domains, mainly Algebra, Combinatorics, Number Theory and Geometry. We use problems from the last 11 years of AMC and AIME, obtaining questions and answers (Q&A) in LaTeX format from the AoPS Wiki website.
Dataset Splits | Yes | The final test set contains 330 AIME, 475 AMC-12, 158 Math Odyssey & 504 OlympiadBench problems. ... For both AIME and AMC, we select 90 questions each, drawn from problems of years other than those included in the evaluation datasets. ... For each dataset, we create a subset of 10 problems correctly solved by every method and finally select a combination of 4 exemplars among them.
Hardware Specification | No | The paper mentions using proprietary LLMs such as 'gpt-4o-2024-05-13 and Claude-3.5-Sonnet' and accessing 'DeepSeek Coder 2.5 via API', indicating that the authors did not manage or specify the hardware used for running their experiments. No specific CPU, GPU, or cloud instance types are mentioned.
Software Dependencies | No | The paper mentions using 'sympy-based python code' and shows imports for libraries such as 'sympy', 'numpy', 'collections.deque', 'fractions.Fraction', and 'math' in its code examples. However, no specific version numbers are provided for Python or any of these libraries.
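As an illustration of the kind of sympy-based step code the response refers to, here is a generic sketch (a toy sub-step, not an example taken from the paper's appendix):

```python
from sympy import symbols, solve

# One sympy-based sub-step of a competition-style problem:
# find the roots of x^2 - 5x + 6 = 0 and keep the larger one.
x = symbols("x")
roots = solve(x**2 - 5 * x + 6, x)  # exact symbolic roots: 2 and 3
answer = max(roots)
print(answer)  # 3
```

Because no library versions are pinned in the paper, reproducing such steps exactly may depend on the sympy version installed.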
Experiment Setup | Yes | For all datasets and all reasoning frameworks, we use a 4-shot setting. The maximum number of turns (n) for SBSC is set to 15. For greedy decoding inference, we use temperature=0 and max_tokens=1024; also, we run 3 times and report the average. For greedy decoding of TIR-ToRA, we keep n = 15 as well (Note: this is because although in the TIR-ToRA strategy the model attempts to solve the entire problem in a single turn, in case of execution error or readjustment it tries to re-attempt in subsequent turns). We also benchmark SBSC's greedy decoding results against self-consistency (SC) (Wang et al., 2022) decoding results (majority@7) of COT, PAL & TIR-ToRA. We do this primarily for two reasons: first, SBSC takes multiple turns before arriving at the final answer (on average 6-7 turns per problem, Table 3 in Appendix A.1) and secondly, to benchmark against the reliance of the current existing prompting strategies on majority voting for boosting accuracy. For SC decoding, we use temperature=0.7 and top_p=0.9. Note: we experimentally observe that for n > 4, there is an insignificant increase in accuracy for TIR-ToRA, so we set n=4 for TIR-ToRA during SC decoding.
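The self-consistency (majority@7) baseline aggregates seven sampled answers by majority vote; the aggregation step can be sketched as follows (an illustrative sketch of standard majority voting, not the authors' script):

```python
from collections import Counter

def majority_at_k(answers):
    """Return the most frequent final answer among k sampled runs,
    as in self-consistency (majority@k) decoding."""
    counts = Counter(a for a in answers if a is not None)  # drop failed runs
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Seven sampled answers (the paper samples with temperature=0.7, top_p=0.9):
samples = ["113", "113", "56", "113", None, "56", "113"]
print(majority_at_k(samples))  # 113
```

In contrast, SBSC is reported with greedy decoding only (temperature=0), so no such vote is taken over its runs.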