Reasoning with Latent Thoughts: On the Power of Looped Transformers

Authors: Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, Sashank J. Reddi

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on synthetic reasoning tasks like addition, p-hop induction and GSM-style math word problems in Section 2. For these tasks, we surprisingly find that iso-flop looped models, despite having way fewer parameters, can nearly match or outperform a non-looped model. ... In Section 3, we train looped models on causal language modeling at 1B parameter scale. Here, we show that looped models have an inductive bias towards doing well on reasoning benchmarks, despite having much worse perplexity.
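For context, the looping described in this row amounts to weight sharing across depth: a small block of layers is applied repeatedly, so a looped model can match the FLOPs of a deeper non-looped model while storing far fewer parameters. A minimal sketch of this idea (not the authors' implementation; the transformer block is simplified to a single linear map with a nonlinearity):

```python
import numpy as np

def block(x, W):
    """Stand-in for one transformer block: a linear map plus nonlinearity."""
    return np.tanh(x @ W)

def looped_forward(x, W, loops):
    """Apply the same block `loops` times (weights shared across depth)."""
    for _ in range(loops):
        x = block(x, W)
    return x

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16)) * 0.1   # one block's parameters
x = rng.normal(size=(4, 16))          # 4 token embeddings of width 16
y = looped_forward(x, W, loops=12)    # depth-12 compute, 1 block of params
```

A block of k layers looped L times is iso-flop with a k*L-layer non-looped model, which is the comparison the quoted excerpt refers to.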
Researcher Affiliation | Industry | Nikunj Saunshi¹, Nishanth Dikkala¹, Zhiyuan Li¹,², Sanjiv Kumar¹, Sashank J. Reddi¹ (1: Google Research, 2: Toyota Technological Institute at Chicago)
Pseudocode | Yes | Algorithm 1: Causal Self-Attention, ATTN... Algorithm 2: Group Composition
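The paper's Algorithm 1 covers causal self-attention; a minimal single-head sketch of that standard operation is below (illustrative only, and may differ in detail from the paper's pseudocode):

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """Single-head causal self-attention: position t attends only to <= t."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    T = X.shape[0]
    # Causal mask: zero out attention to future positions.
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = causal_self_attention(X, Wq, Wk, Wv)
```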
Open Source Code | No | The paper does not provide an explicit statement about open-sourcing the code or a link to a code repository.
Open Datasets | Yes | We train models on 250B tokens of the Pile dataset (Gao et al., 2020)... We evaluate the model on 4 important slices: closed book QA, open book QA, math word problems and reasoning primitives. These comprise 19 different tasks in total. We defer details for the evaluation benchmarks to Appendix B.3... Closed book QA: This includes tasks like TriviaQA (Joshi et al., 2017), TyDi QA-NoContext (Clark et al., 2020), Natural Questions (Kwiatkowski et al., 2019) and Web Questions (Talmor & Berant, 2018)... Open book QA: This includes tasks like TyDi QA-GoldP (Clark et al., 2020), SQuAD v2 (Rajpurkar et al., 2018), DROP (Dua et al., 2019), QuAC (Choi et al., 2018), CoQA (Reddy et al., 2019)... Math word problems: This includes tasks like SVAMP (Patel et al., 2021), ASDiv (Miao et al., 2020), the MAWPS benchmark (Koncel-Kedziorski et al., 2016).
Dataset Splits | Yes | Our train set consists of 4M examples and our test and validation sets consist of 262k examples each. (p-hop induction, Appendix B.1)... Our train dataset consists of around 4 million examples and we test on around 50k examples. (i-GSM, Appendix B.1)
Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU, CPU models, or TPU versions) used for running the experiments.
Software Dependencies | No | We train using Adafactor (Shazeer & Stern, 2018) employing a linear warmup coupled with a cosine decay schedule for the learning rate.
Experiment Setup | Yes | All runs use a batch size of 1024, learning rate of 0.005 and run for 200k steps. (n-ary addition, Appendix B.1)... We train using Adafactor for 200,000 steps with a batch size of 256 using a base learning rate of 1e-3 and use a linear warmup coupled with a cosine decay schedule for the learning rate. (p-hop induction and i-GSM, Appendix B.1)... For all experiments, we use a batch size of 512 and sequence length of 1280. We use a cosine learning rate schedule decaying over 400k steps with a peak learning rate of 0.01... The base model is a 1.5B parameter decoder-only Transformer model, with 24 layers, model dimension of 2048, hidden dimension 5120 and 32 heads. (Language Modeling, Appendix B.2)
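The learning-rate schedule quoted above (linear warmup into a cosine decay over 400k steps with peak 0.01) can be sketched as follows. The peak rate and total steps follow the quoted setup; the warmup length is an assumed placeholder, since the excerpt does not state it:

```python
import math

def lr_schedule(step, peak_lr=0.01, warmup_steps=10_000, total_steps=400_000):
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps.

    warmup_steps=10_000 is an assumed value; the paper excerpt only
    specifies the peak rate (0.01) and the decay horizon (400k steps).
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The schedule rises linearly from 0 to the peak during warmup, then follows a half-cosine down to 0 at the final step, matching the "linear warmup coupled with a cosine decay" description.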