Offline Learning and Forgetting for Reasoning with Large Language Models
Authors: Tianwei Ni, Allen Nie, Sapana Chaudhary, Yao Liu, Huzefa Rangwala, Rasool Fakoor
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the challenging Game-of-24 and Countdown arithmetic puzzles show that replacing CoT-generated data with search-generated data for offline fine-tuning improves success rates by around 23% over inference-time search baselines, while reducing inference time by 180×. |
| Researcher Affiliation | Collaboration | 1Mila Quebec AI Institute & Université de Montréal, 2Amazon Web Services |
| Pseudocode | Yes | Algorithm 1 Fine-tuning on unpaired correct and failed paths from diverse reasoners |
| Open Source Code | Yes | Code is open-source at https://github.com/twni2016/llm-reasoning-uft. |
| Open Datasets | Yes | Game-of-24. ...For evaluation, we use the same test set from Yao et al. (2023a), consisting of 100 cases. ... Countdown. The Countdown game (Gandhi et al., 2024) extends Game-of-24... |
| Dataset Splits | Yes | Game-of-24. We randomly split the first 900 cases (rank #1-900 by human average performance in the dataset) into Xtrain (720 cases) and Xvalid (180 cases). The Xtest are the next 100 cases (rank #901-1000) following the setup in ToT (Yao et al., 2023a). Countdown. ...we randomly generate 500k training cases, 1k validation cases, and 1k test cases. |
| Hardware Specification | Yes | We follow the alignment handbook (Tunstall et al., 2024) to implement all fine-tuning methods on an instance with 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions several libraries and models used, such as the 'TRL library' (von Werra et al., 2020), 'vLLM' (Kwon et al., 2023), and 'Qwen2.5-Math' models (Yang et al., 2024). However, it does not provide specific version numbers for these software components, only citations with years for the libraries and model names for the LLMs. |
| Experiment Setup | Yes | We fix the batch size B to 128 and train E = 10 epochs on each dataset for Game-of-24 and E = 2 epochs for Countdown. We use a cosine schedule and sweep the peak learning rate η over (1e-5, 5e-6, 2e-6) for Q1.5B and over (5e-6, 2e-6, 1e-6) for Q7B. We set α = 0 to have the SFT baseline and sweep α over (1e-3, 1e-4, 1e-5, 1e-6) for our UFT. |
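The Game-of-24 split described in the table can be sketched in a few lines. This is a minimal illustration under assumptions not stated in the paper (the random seed, and that cases arrive as a rank-ordered list); the function name `split_game_of_24` is hypothetical.

```python
import random


def split_game_of_24(cases, seed=0):
    """Split 1000 Game-of-24 cases, pre-sorted by human average
    performance (rank #1 first), into train/valid/test.

    Ranks #1-900 are shuffled into 720 train / 180 valid cases;
    ranks #901-1000 form the test set, following the ToT setup.
    The seed is an assumption for reproducibility.
    """
    assert len(cases) == 1000
    pool = cases[:900]        # ranks #1-900
    test = cases[900:]        # ranks #901-1000
    shuffled = pool[:]
    random.Random(seed).shuffle(shuffled)
    train, valid = shuffled[:720], shuffled[720:]
    return train, valid, test
```

Keeping the test set as the fixed last 100 ranks (rather than sampling it) matches the table's note that the evaluation cases are the same 100 used by Yao et al. (2023a).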
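The hyperparameter sweep in the Experiment Setup row can be enumerated as a simple grid. The sketch below only reflects the values quoted in the table; the helper name `sweep_configs` and the config-dict layout are assumptions, and "Q1.5B"/"Q7B" follow the paper's shorthand for the Qwen2.5-Math models.

```python
from itertools import product

# Peak learning rates swept per model (cosine schedule), from the table.
LR_GRID = {
    "Q1.5B": (1e-5, 5e-6, 2e-6),
    "Q7B": (5e-6, 2e-6, 1e-6),
}
# Alpha values for UFT; alpha = 0 recovers the plain SFT baseline.
ALPHA_GRID = (1e-3, 1e-4, 1e-5, 1e-6)


def sweep_configs(model, batch_size=128, epochs=10):
    """Yield one fine-tuning config per (lr, alpha) pair for a model,
    including the alpha=0 SFT baseline (epochs=10 for Game-of-24,
    epochs=2 for Countdown)."""
    for lr, alpha in product(LR_GRID[model], (0.0,) + ALPHA_GRID):
        yield {
            "model": model,
            "lr": lr,
            "alpha": alpha,
            "batch_size": batch_size,
            "epochs": epochs,
            "schedule": "cosine",
        }
```

Each model thus gets 3 learning rates × 5 alpha values = 15 runs per game under this reading of the setup.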