DeepCrossAttention: Supercharging Transformer Residual Connections

Authors: Mike Heddes, Adel Javanmard, Kyriakos Axiotis, Gang Fu, Mohammadhossein Bateni, Vahab Mirrokni

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We additionally provide empirical results to support the theoretical findings and demonstrate the effectiveness of DCA. Experiments on language modeling and image classification tasks demonstrate that DCA consistently outperforms the standard transformer architecture in terms of perplexity, accuracy, and training efficiency. DCA achieves lower perplexity for a given parameter budget and training time.
Researcher Affiliation | Collaboration | 1Department of Computer Science, University of California, Irvine, USA; 2Marshall School of Business, University of Southern California, Los Angeles, USA; 3Google Research, New York, USA. Correspondence to: Mike Heddes <EMAIL>, Gang Fu <EMAIL>.
Pseudocode | No | The paper includes computation diagrams (Figures 3 and 4) to illustrate the network architecture but does not provide structured pseudocode or algorithm blocks describing a method or procedure.
Open Source Code | No | The paper does not contain any explicit statements about releasing code or provide a link to a code repository.
Open Datasets | Yes | For the language modeling tasks, the performance of DCA is compared against the standard transformer (Vaswani, 2017) on the LM1B (Chelba et al., 2013) and C4 (Raffel et al., 2020a) datasets. We also experiment with image classification using the ImageNet dataset and the vision transformer (ViT) model (Dosovitskiy et al., 2021).
Dataset Splits | Yes | For the language modeling tasks, the performance of DCA is compared against the standard transformer (Vaswani, 2017) on the LM1B (Chelba et al., 2013) and C4 (Raffel et al., 2020a) datasets. Unless stated otherwise, each model has an embedding dimension of 512 and an MLP dimension of four times the embedding dimension. By default, DCA uses a stack of all the previous layer outputs as input to the GRNs. When DCA includes only the first and last-k layer outputs explicitly in the input stack (see Section 3.1), this is denoted as k-DCA. Each model is trained with a sequence length of 128 and a batch size of 2048 over 64 TPUs for 500k steps, totaling 131B tokens. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with β1 = 0.9, β2 = 0.98, a weight decay of 0.1, and a learning rate of 0.0016 with 1000 warmup steps and an inverse square root schedule (Raffel et al., 2020b).
Hardware Specification | Yes | Each model is trained with a sequence length of 128 and a batch size of 2048 over 64 TPUs for 500k steps, totaling 131B tokens.
Software Dependencies | No | The paper mentions using the AdamW optimizer (Loshchilov & Hutter, 2017) but does not provide specific version numbers for any software libraries or frameworks used in the implementation.
Experiment Setup | Yes | Each model is trained with a sequence length of 128 and a batch size of 2048 over 64 TPUs for 500k steps, totaling 131B tokens. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with β1 = 0.9, β2 = 0.98, a weight decay of 0.1, and a learning rate of 0.0016 with 1000 warmup steps and an inverse square root schedule (Raffel et al., 2020b).
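The quoted setup distinguishes plain DCA, which feeds the GRNs a stack of all previous layer outputs, from k-DCA, which keeps only the first and the last k outputs explicitly. A minimal sketch of that selection step follows; the helper name `kdca_input_stack` and the interpretation of "first" as the initial (index-0) output are assumptions for illustration, not the paper's implementation:

```python
def kdca_input_stack(layer_outputs, k=None):
    """Select which prior layer outputs enter the GRN input stack.

    Assumed semantics: with k=None (plain DCA) every previous output is
    stacked; with an integer k (k-DCA) only the first output and the last
    k outputs are kept, per the paper's Section 3.1 as quoted above.
    """
    if k is None or len(layer_outputs) <= k + 1:
        # Plain DCA, or too few outputs for truncation to matter.
        return list(layer_outputs)
    # k-DCA: first output plus the most recent k outputs.
    return [layer_outputs[0]] + list(layer_outputs[-k:])
```

For a 7-layer stack of outputs `[h0, ..., h7]`, `k=3` would keep `h0` plus `h5, h6, h7`, shrinking the GRN input from eight tensors to four.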
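The reported optimizer settings (peak learning rate 0.0016, 1000 warmup steps, inverse square root decay) can be sketched as a schedule function. This assumes the common parameterization following Raffel et al. (2020b), lr(step) = peak * min(step/warmup, sqrt(warmup/step)); the paper's exact formulation may differ:

```python
import math

# Hyperparameters quoted in the paper's experiment setup.
PEAK_LR = 0.0016
WARMUP_STEPS = 1000

def inverse_sqrt_schedule(step: int) -> float:
    """Linear warmup to PEAK_LR, then inverse-square-root decay.

    Assumed form: PEAK_LR * min(step / WARMUP_STEPS,
    sqrt(WARMUP_STEPS / step)), so the two branches meet exactly
    at step = WARMUP_STEPS.
    """
    step = max(step, 1)  # guard against division by zero at step 0
    warmup_factor = step / WARMUP_STEPS
    decay_factor = math.sqrt(WARMUP_STEPS / step)
    return PEAK_LR * min(warmup_factor, decay_factor)
```

Under this parameterization the rate peaks at 0.0016 at step 1000 and decays to half the peak by step 4000, since sqrt(1000/4000) = 0.5.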