Reasoning with Latent Thoughts: On the Power of Looped Transformers

Authors: Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, Sashank J. Reddi

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on synthetic reasoning tasks like addition, p-hop induction and GSM-style math word problems in Section 2. For these tasks, we surprisingly find that iso-flop looped models, despite having way fewer parameters, can nearly match or outperform a non-looped model. ... In Section 3, we train looped models on causal language modeling at 1B parameter scale. Here, we show that looped models have an inductive bias towards doing well on reasoning benchmarks, despite having much worse perplexity.
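For context, the looping described in this row amounts to weight sharing across depth: a small block of layers is applied repeatedly, so a looped model can match the FLOPs of a deeper non-looped model while storing far fewer parameters. A minimal sketch of this idea (not the authors' implementation; the transformer block is simplified to a single linear map with a nonlinearity):

```python
import numpy as np

def block(x, W):
    """Stand-in for one transformer block: a linear map plus nonlinearity."""
    return np.tanh(x @ W)

def looped_forward(x, W, loops):
    """Apply the same block `loops` times (weights shared across depth)."""
    for _ in range(loops):
        x = block(x, W)
    return x

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16)) * 0.1   # one block's parameters
x = rng.normal(size=(4, 16))          # 4 token embeddings of width 16
y = looped_forward(x, W, loops=12)    # depth-12 compute, 1 block of params
```

A block of k layers looped L times is iso-flop with a k*L-layer non-looped model, which is the comparison the quoted excerpt refers to.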
Researcher Affiliation | Industry | Nikunj Saunshi¹, Nishanth Dikkala¹, Zhiyuan Li¹,², Sanjiv Kumar¹, Sashank J. Reddi¹ (1: Google Research, 2: Toyota Technological Institute at Chicago)
Pseudocode | Yes | Algorithm 1: Causal Self-Attention, ATTN... Algorithm 2: Group Composition
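The paper's Algorithm 1 covers causal self-attention; a minimal single-head sketch of that standard operation is below (illustrative only, and may differ in detail from the paper's pseudocode):

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """Single-head causal self-attention: position t attends only to <= t."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    T = X.shape[0]
    # Causal mask: zero out attention to future positions.
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = causal_self_attention(X, Wq, Wk, Wv)
```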
Open Source Code | No | The paper does not provide an explicit statement about open-sourcing the code or a link to a code repository.
Open Datasets | Yes | We train models on 250B tokens of the Pile dataset (Gao et al., 2020)... We evaluate the model on 4 important slices: closed book QA, open book QA, math word problems and reasoning primitives. These comprise 19 different tasks in total. We defer details for the evaluation benchmarks to Appendix B.3... Closed book QA: This includes tasks like TriviaQA (Joshi et al., 2017), TyDi QA-NoContext (Clark et al., 2020), Natural Questions (Kwiatkowski et al., 2019) and Web Questions (Talmor & Berant, 2018)... Open book QA: This includes tasks like TyDi QA-GoldP (Clark et al., 2020), SQuAD v2 (Rajpurkar et al., 2018), DROP (Dua et al., 2019), QuAC (Choi et al., 2018), CoQA (Reddy et al., 2019)... Math word problems: This includes tasks like SVAMP (Patel et al., 2021), ASDiv (Miao et al., 2020), the MAWPS benchmark (Koncel-Kedziorski et al., 2016).
Dataset Splits | Yes | Our train set consists of 4M examples and our test and validation sets consist of 262k examples each. (p-hop induction, Appendix B.1)... Our train dataset consists of around 4 million examples and we test on around 50k examples. (i-GSM, Appendix B.1)
Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU, CPU models, or TPU versions) used for running the experiments.
Software Dependencies | No | We train using Adafactor (Shazeer & Stern, 2018) employing a linear warmup coupled with a cosine decay schedule for the learning rate.
Experiment Setup | Yes | All runs use a batch size of 1024, learning rate of 0.005 and run for 200k steps. (n-ary addition, Appendix B.1)... We train using Adafactor for 200,000 steps with a batch size of 256 using a base learning rate of 1e-3 and use a linear warmup coupled with a cosine decay schedule for the learning rate. (p-hop induction and i-GSM, Appendix B.1)... For all experiments, we use a batch size of 512 and sequence length of 1280. We use a cosine learning rate schedule decaying over 400k steps with a peak learning rate of 0.01... The base model is a 1.5B parameter decoder-only Transformer model, with 24 layers, model dimension of 2048, hidden dimension 5120 and 32 heads. (Language Modeling, Appendix B.2)
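The learning-rate schedule quoted above (linear warmup into a cosine decay over 400k steps with peak 0.01) can be sketched as follows. The peak rate and total steps follow the quoted setup; the warmup length is an assumed placeholder, since the excerpt does not state it:

```python
import math

def lr_schedule(step, peak_lr=0.01, warmup_steps=10_000, total_steps=400_000):
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps.

    warmup_steps=10_000 is an assumed value; the paper excerpt only
    specifies the peak rate (0.01) and the decay horizon (400k steps).
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The schedule rises linearly from 0 to the peak during warmup, then follows a half-cosine down to 0 at the final step, matching the "linear warmup coupled with a cosine decay" description.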