Offline Learning and Forgetting for Reasoning with Large Language Models

Authors: Tianwei Ni, Allen Nie, Sapana Chaudhary, Yao Liu, Huzefa Rangwala, Rasool Fakoor

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the challenging Game-of-24 and Countdown arithmetic puzzles show that replacing CoT-generated data with search-generated data for offline fine-tuning improves success rates by around 23% over inference-time search baselines, while reducing inference time by 180×.
Researcher Affiliation | Collaboration | 1Mila Quebec AI Institute & Université de Montréal, 2Amazon Web Services
Pseudocode | Yes | Algorithm 1: Fine-tuning on unpaired correct and failed paths from diverse reasoners
Open Source Code | Yes | Code is open-source at https://github.com/twni2016/llm-reasoning-uft.
Open Datasets | Yes | Game-of-24. ...For evaluation, we use the same test set from Yao et al. (2023a), consisting of 100 cases. ... Countdown. The Countdown game (Gandhi et al., 2024) extends Game-of-24...
Dataset Splits | Yes | Game-of-24. We randomly split the first 900 cases (rank #1-900 by human average performance in the dataset) into Xtrain (720 cases) and Xvalid (180 cases). The Xtest are the next 100 cases (rank #901-1000) following the setup in ToT (Yao et al., 2023a). Countdown. ...we randomly generate 500k training cases, 1k validation cases, and 1k test cases.
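The Game-of-24 split described above can be sketched as follows. This is a hedged illustration, not the authors' code; the case identifiers and the seed are assumptions.

```python
import random

def split_game24(cases, seed=0):
    """Split 1000 Game-of-24 cases, ordered by human-performance rank.

    The first 900 (ranks #1-900) are shuffled into train (720) and
    valid (180); the next 100 (ranks #901-1000) form the test set,
    matching the setup quoted from the paper.
    """
    first_900, test = cases[:900], cases[900:1000]
    shuffled = first_900[:]
    random.Random(seed).shuffle(shuffled)
    train, valid = shuffled[:720], shuffled[720:]
    return train, valid, test

# Placeholder case names stand in for the actual puzzles.
train, valid, test = split_game24([f"case_{i}" for i in range(1000)])
```

Shuffling only the first 900 cases keeps the test set fixed to ranks #901-1000, so results stay comparable with the ToT evaluation set.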
Hardware Specification | Yes | We follow the alignment handbook (Tunstall et al., 2024) to implement all fine-tuning methods on an instance with 8 A100 GPUs.
Software Dependencies | No | The paper mentions several libraries and models used, such as the 'TRL library' (von Werra et al., 2020), 'vLLM' (Kwon et al., 2023), and 'Qwen2.5-Math' models (Yang et al., 2024). However, it does not provide specific version numbers for these software components, only citations with years for libraries or model names for LLMs.
Experiment Setup | Yes | We fix the batch size B to 128 and train E = 10 epochs on each dataset for Game-of-24 and E = 2 epochs for Countdown. We use a cosine schedule and sweep the peak learning rate η over (1e-5, 5e-6, 2e-6) for Q1.5B and over (5e-6, 2e-6, 1e-6) for Q7B. We set α = 0 to have the SFT baseline and sweep α over (1e-3, 1e-4, 1e-5, 1e-6) for our UFT.
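The sweep quoted above can be enumerated as a small grid. This is a minimal sketch assuming a plain dictionary config; the key names (`peak_lr`, `alpha`, etc.) are illustrative, not the authors' actual configuration schema.

```python
from itertools import product

# Per-model peak-learning-rate grids quoted from the paper.
PEAK_LRS = {
    "Q1.5B": [1e-5, 5e-6, 2e-6],
    "Q7B": [5e-6, 2e-6, 1e-6],
}
# alpha = 0 recovers the SFT baseline; nonzero values are the UFT sweep.
ALPHAS = [0.0, 1e-3, 1e-4, 1e-5, 1e-6]

def sweep_configs(model, task):
    """Yield one config dict per (peak_lr, alpha) combination."""
    epochs = 10 if task == "game24" else 2  # E=10 Game-of-24, E=2 Countdown
    for lr, alpha in product(PEAK_LRS[model], ALPHAS):
        yield {
            "batch_size": 128,
            "epochs": epochs,
            "schedule": "cosine",
            "peak_lr": lr,
            "alpha": alpha,
        }

runs = list(sweep_configs("Q1.5B", "game24"))
```

For each model this yields 3 learning rates × 5 alpha values (including the α = 0 SFT baseline), i.e. 15 configurations per task.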