Lambda-Skip Connections: The Architectural Component That Prevents Rank Collapse
Authors: Federico Arangath Joseph, Jerome Sieber, Melanie Zeilinger, Carmen Amo Alonso
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we aim both to explore the link between gating mechanisms and rank collapse and to validate our theoretical findings empirically. In Section 5.1, we empirically validate Theorem 4.1, demonstrating the importance of selecting the appropriate skip connection strength to mitigate rank collapse. In Section 5.2, we show that for the Mamba-2 architecture (Dao and Gu, 2024), gating mechanisms indeed play a crucial role in preventing rank collapse. Finally, we validate our findings with experiments demonstrating the crucial role of architectural components such as skip connections and gating mechanisms in preventing rank collapse. |
| Researcher Affiliation | Academia | Federico Arangath Joseph ETH Zurich EMAIL Jerome Sieber ETH Zurich EMAIL Melanie N. Zeilinger ETH Zurich EMAIL Carmen Amo Alonso Stanford University EMAIL |
| Pseudocode | No | The paper describes theoretical analysis and experimental validation but does not include any explicit pseudocode blocks or algorithms labeled as such. |
| Open Source Code | No | We use the standard code bases provided online: https://github.com/HazyResearch/zoology and https://github.com/google-research/long-range-arena. This refers to external codebases used by the authors for benchmarks, not the authors' own implementation code for the methodology described in this paper. |
| Open Datasets | Yes | We sample 32 text excerpts from Wikipedia using the Wikipedia API and tokenize them in sequences of at least 128 tokens using the BERT tokenizer (Devlin et al., 2019). We train two transformer-based architectures and two SSM-based architectures on the image task of the Long Range Arena (LRA) benchmark (Tay et al., 2020) and on a multi-query associative recall (MQAR) task (Arora et al., 2023). We train the S4D architecture with 32 layers and 1.6 million parameters on the CIFAR-10 dataset. |
| Dataset Splits | No | For the MQAR experiments we use the following training protocol: ... Each model is trained on 100,000 datapoints and evaluated on 3,000 datapoints. ... For the LRA image experiments we use the following training protocol: ... Each model is trained on 35,200 datapoints and evaluated on 7,850 datapoints. The paper specifies training and evaluation (test) sample counts but does not explicitly mention a separate validation split or detail a three-way split (train/validation/test). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using the 'BERT tokenizer' and the 'AdamW optimizer' but does not specify version numbers for these or other software libraries/frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | Optimizer and schedule: Weight decay of 0.1, linear warmup with duration of 10%, AdamW optimizer (Loshchilov and Hutter, 2019). For each run, we sweep the learning rates in np.logspace(-4, -2, 4) and train for 64 epochs. This is the same setup as in (Arora et al., 2023). Initialization: For all models we use their standard initialization and initialize λ = 1 in each layer. Training duration: We use a global batch size of 64. Width and depth: For all runs, we use two layers (each with a sequence model and an MLP, interleaved with layer normalization). The model dimensions d = 128, state dimension n = 64, sequence length L = 512, and number of KV pairs = 64 are kept constant for all four architectures. |
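The Experiment Setup row above can be collected into a runnable configuration. The sketch below is illustrative only, assuming the setup row is quoted accurately: the learning-rate sweep `np.logspace(-4, -2, 4)` is reimplemented without numpy so its four values are explicit, and `launch_training` is a hypothetical entry point, not a function from the paper's codebase.

```python
def logspace(start: float, stop: float, num: int) -> list:
    """Values spaced evenly on a log10 scale, matching np.logspace."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]

# Learning-rate sweep reported in the paper: np.logspace(-4, -2, 4),
# i.e. four rates from 1e-4 up to 1e-2.
learning_rates = logspace(-4, -2, 4)

# Hyperparameters quoted from the Experiment Setup row (all four
# architectures share these; names in this dict are illustrative).
config = {
    "optimizer": "AdamW",     # Loshchilov and Hutter (2019)
    "weight_decay": 0.1,
    "warmup_fraction": 0.10,  # linear warmup over 10% of training
    "epochs": 64,
    "global_batch_size": 64,
    "num_layers": 2,          # sequence model + MLP, with layer norm
    "model_dim": 128,         # d
    "state_dim": 64,          # n
    "sequence_length": 512,   # L
    "num_kv_pairs": 64,
    "lambda_init": 1.0,       # λ = 1 skip strength in each layer
}

for lr in learning_rates:
    run = {**config, "learning_rate": lr}
    # launch_training(run)  # hypothetical: one training run per rate
```

The endpoints of the sweep are exactly 1e-4 and 1e-2; the two interior rates fall at 10^(-10/3) and 10^(-8/3) on the log scale.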