RNNs are not Transformers (Yet): The Key Bottleneck on In-Context Retrieval
Authors: Kaiyue Wen, Xingyu Dang, Kaifeng Lyu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical analysis reveals that CoT improves RNNs but is insufficient to close the gap with Transformers. A key bottleneck lies in the inability of RNNs to perfectly retrieve information from the context, even with CoT... We validate our theory on synthetic and natural language experiments. |
| Researcher Affiliation | Academia | Kaiyue Wen (Stanford University), Xingyu Dang (Tsinghua University), Kaifeng Lyu (University of California, Berkeley) |
| Pseudocode | Yes | Algorithm 1: Depth-First Search Algorithm; Algorithm 2: Depth-First Search Algorithm with Retrieving |
| Open Source Code | No | The paper does not contain an explicit statement about releasing code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | We validate our theoretical findings through synthetic and natural language experiments on IsTree and HotpotQA... We use the HotpotQA (Yang et al., 2018) dataset. |
| Dataset Splits | No | The paper reports evaluation on a validation set but does not describe a full train/validation/test split: "The reported accuracy is calculated over a validation set of 5000 samples using the last iteration of the model... We only test on a subset of 350 samples of the validation set where all the models can answer correctly given the correct paragraphs." |
| Hardware Specification | Yes | We run all the experiments on a server with 8 A100s and the estimated time to reproduce the results is within 2 days. |
| Software Dependencies | No | The paper mentions models like LLaMA and Mamba architectures, and Python's `re` library, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We train three different architectures... We train every model with at least 1M samples to guarantee convergence using Adam with a cosine learning rate... we train all the Transformer models with learning rate 1e-3 and the rest of the models with learning rate 3e-4... three different model sizes (0.5M, 1M, 2M) on IsTree with three different graph sizes (16, 32, 64) under three different setups... We test our models under a 4-shot setting with Chain-of-Thought. |