Lambda-Skip Connections: The Architectural Component That Prevents Rank Collapse
Authors: Federico Arangath Joseph, Jerome Sieber, Melanie Zeilinger, Carmen Amo Alonso
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we aim both to explore the link between gating mechanisms and rank collapse and to validate our theoretical findings empirically. In Section 5.1, we empirically validate Theorem 4.1, demonstrating the importance of selecting the appropriate skip connection strength to mitigate rank collapse. In Section 5.2, we show that for the Mamba-2 architecture (Dao and Gu, 2024), gating mechanisms indeed play a crucial role in preventing rank collapse. Finally, we validate our findings with experiments demonstrating the crucial role of architectural components such as skip connections and gating mechanisms in preventing rank collapse. |
| Researcher Affiliation | Academia | Federico Arangath Joseph ETH Zurich EMAIL Jerome Sieber ETH Zurich EMAIL Melanie N. Zeilinger ETH Zurich EMAIL Carmen Amo Alonso Stanford University EMAIL |
| Pseudocode | No | The paper describes theoretical analysis and experimental validation but does not include any explicit pseudocode blocks or algorithms labeled as such. |
| Open Source Code | No | We use the standard code bases provided online: https://github.com/HazyResearch/zoology and https://github.com/google-research/long-range-arena. This refers to external codebases used by the authors for benchmarks, not the authors' own implementation code for the methodology described in this paper. |
| Open Datasets | Yes | We sample 32 text excerpts from Wikipedia using the Wikipedia API and tokenize them in sequences of at least 128 tokens using the BERT tokenizer (Devlin et al., 2019). We train two transformer-based architectures and two SSM-based architectures on the image task of the Long Range Arena (LRA) benchmark (Tay et al., 2020) and on a multi-query associative recall (MQAR) task (Arora et al., 2023). We train the S4D architecture with 32 layers and 1.6 million parameters on the CIFAR-10 dataset. |
| Dataset Splits | No | For the MQAR experiments we use the following training protocol: ... Each model is trained on 100,000 datapoints and evaluated on 3,000 datapoints. ... For the LRA image experiments we use the following training protocol: ... Each model is trained on 35,200 datapoints and evaluated on 7,850 datapoints. The paper specifies training and evaluation (test) sample counts but does not explicitly mention a separate validation split or detail a three-way split (train/validation/test). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using the 'BERT tokenizer' and the 'AdamW optimizer' but does not specify version numbers for these or other software libraries/frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | Optimizer and schedule: Weight decay of 0.1, linear warmup with duration of 10%, AdamW optimizer (Loshchilov and Hutter, 2019). For each run, we sweep the learning rates in np.logspace(-4, -2, 4) and train for 64 epochs. This is the same setup as in (Arora et al., 2023). Initialization: For all models we use their standard initialization and initialize λ = 1 in each layer. Training duration: We use a global batch size of 64. Width and depth: For all runs, we use two layers (each with a sequence model and an MLP, interleaved with layer normalization). The model dimensions d = 128, state dimension n = 64, sequence length L = 512, and number of KV pairs = 64 are kept constant for all four architectures. |
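The Experiment Setup row above can be collected into a runnable configuration. The sketch below is illustrative only, assuming the setup row is quoted accurately: the learning-rate sweep `np.logspace(-4, -2, 4)` is reimplemented without numpy so its four values are explicit, and `launch_training` is a hypothetical entry point, not a function from the paper's codebase.

```python
def logspace(start: float, stop: float, num: int) -> list:
    """Values spaced evenly on a log10 scale, matching np.logspace."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]

# Learning-rate sweep reported in the paper: np.logspace(-4, -2, 4),
# i.e. four rates from 1e-4 up to 1e-2.
learning_rates = logspace(-4, -2, 4)

# Hyperparameters quoted from the Experiment Setup row (all four
# architectures share these; names in this dict are illustrative).
config = {
    "optimizer": "AdamW",     # Loshchilov and Hutter (2019)
    "weight_decay": 0.1,
    "warmup_fraction": 0.10,  # linear warmup over 10% of training
    "epochs": 64,
    "global_batch_size": 64,
    "num_layers": 2,          # sequence model + MLP, with layer norm
    "model_dim": 128,         # d
    "state_dim": 64,          # n
    "sequence_length": 512,   # L
    "num_kv_pairs": 64,
    "lambda_init": 1.0,       # λ = 1 skip strength in each layer
}

for lr in learning_rates:
    run = {**config, "learning_rate": lr}
    # launch_training(run)  # hypothetical: one training run per rate
```

The endpoints of the sweep are exactly 1e-4 and 1e-2; the two interior rates fall at 10^(-10/3) and 10^(-8/3) on the log scale.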