Offline Learning and Forgetting for Reasoning with Large Language Models

Authors: Tianwei Ni, Allen Nie, Sapana Chaudhary, Yao Liu, Huzefa Rangwala, Rasool Fakoor

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the challenging Game-of-24 and Countdown arithmetic puzzles show that replacing CoT-generated data with search-generated data for offline fine-tuning improves success rates by around 23% over inference-time search baselines, while reducing inference time by 180×.
Researcher Affiliation | Collaboration | 1Mila Quebec AI Institute & Université de Montréal, 2Amazon Web Services
Pseudocode | Yes | Algorithm 1: Fine-tuning on unpaired correct and failed paths from diverse reasoners
Open Source Code | Yes | Code is open-source at https://github.com/twni2016/llm-reasoning-uft.
Open Datasets | Yes | Game-of-24. ...For evaluation, we use the same test set from Yao et al. (2023a), consisting of 100 cases. ... Countdown. The Countdown game (Gandhi et al., 2024) extends Game-of-24...
Dataset Splits | Yes | Game-of-24. We randomly split the first 900 cases (rank #1-900 by human average performance in the dataset) into Xtrain (720 cases) and Xvalid (180 cases). The Xtest are the next 100 cases (rank #901-1000) following the setup in ToT (Yao et al., 2023a). Countdown. ...we randomly generate 500k training cases, 1k validation cases, and 1k test cases.
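The Game-of-24 split described above can be sketched as follows. This is a hedged illustration, not the authors' code; the case identifiers and the seed are assumptions.

```python
import random

def split_game24(cases, seed=0):
    """Split 1000 Game-of-24 cases, ordered by human-performance rank.

    The first 900 (ranks #1-900) are shuffled into train (720) and
    valid (180); the next 100 (ranks #901-1000) form the test set,
    matching the setup quoted from the paper.
    """
    first_900, test = cases[:900], cases[900:1000]
    shuffled = first_900[:]
    random.Random(seed).shuffle(shuffled)
    train, valid = shuffled[:720], shuffled[720:]
    return train, valid, test

# Placeholder case names stand in for the actual puzzles.
train, valid, test = split_game24([f"case_{i}" for i in range(1000)])
```

Shuffling only the first 900 cases keeps the test set fixed to ranks #901-1000, so results stay comparable with the ToT evaluation set.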
Hardware Specification | Yes | We follow the alignment handbook (Tunstall et al., 2024) to implement all fine-tuning methods on an instance with 8 A100 GPUs.
Software Dependencies | No | The paper mentions several libraries and models used, such as the 'TRL library' (von Werra et al., 2020), 'vLLM' (Kwon et al., 2023), and 'Qwen2.5-Math' models (Yang et al., 2024). However, it does not provide specific version numbers for these software components, only citations with years for libraries or model names for LLMs.
Experiment Setup | Yes | We fix the batch size B to 128 and train E = 10 epochs on each dataset for Game-of-24 and E = 2 epochs for Countdown. We use a cosine schedule and sweep the peak learning rate η over (1e-5, 5e-6, 2e-6) for Q1.5B and over (5e-6, 2e-6, 1e-6) for Q7B. We set α = 0 to have the SFT baseline and sweep α over (1e-3, 1e-4, 1e-5, 1e-6) for our UFT.
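The sweep quoted above can be enumerated as a small grid. This is a minimal sketch assuming a plain dictionary config; the key names (`peak_lr`, `alpha`, etc.) are illustrative, not the authors' actual configuration schema.

```python
from itertools import product

# Per-model peak-learning-rate grids quoted from the paper.
PEAK_LRS = {
    "Q1.5B": [1e-5, 5e-6, 2e-6],
    "Q7B": [5e-6, 2e-6, 1e-6],
}
# alpha = 0 recovers the SFT baseline; nonzero values are the UFT sweep.
ALPHAS = [0.0, 1e-3, 1e-4, 1e-5, 1e-6]

def sweep_configs(model, task):
    """Yield one config dict per (peak_lr, alpha) combination."""
    epochs = 10 if task == "game24" else 2  # E=10 Game-of-24, E=2 Countdown
    for lr, alpha in product(PEAK_LRS[model], ALPHAS):
        yield {
            "batch_size": 128,
            "epochs": epochs,
            "schedule": "cosine",
            "peak_lr": lr,
            "alpha": alpha,
        }

runs = list(sweep_configs("Q1.5B", "game24"))
```

For each model this yields 3 learning rates × 5 alpha values (including the α = 0 SFT baseline), i.e. 15 configurations per task.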