DeepCrossAttention: Supercharging Transformer Residual Connections

Authors: Mike Heddes, Adel Javanmard, Kyriakos Axiotis, Gang Fu, Mohammadhossein Bateni, Vahab Mirrokni

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We additionally provide empirical results to support the theoretical findings and demonstrate the effectiveness of DCA. Experiments on language modeling and image classification tasks demonstrate that DCA consistently outperforms the standard transformer architecture in terms of perplexity, accuracy, and training efficiency. DCA achieves lower perplexity for a given parameter budget and training time.
Researcher Affiliation | Collaboration | 1Department of Computer Science, University of California, Irvine, USA; 2Marshall School of Business, University of Southern California, Los Angeles, USA; 3Google Research, New York, USA. Correspondence to: Mike Heddes <EMAIL>, Gang Fu <EMAIL>.
Pseudocode | No | The paper includes computation diagrams (Figures 3 and 4) to illustrate the network architecture but does not provide structured pseudocode or algorithm blocks describing a method or procedure.
Open Source Code | No | The paper does not contain any explicit statements about releasing code or provide a link to a code repository.
Open Datasets | Yes | For the language modeling tasks, the performance of DCA is compared against the standard transformer (Vaswani, 2017) on the LM1B (Chelba et al., 2013) and C4 (Raffel et al., 2020a) datasets. We also experiment with image classification using the ImageNet dataset and the vision transformer (ViT) model (Dosovitskiy et al., 2021).
Dataset Splits | Yes | For the language modeling tasks, the performance of DCA is compared against the standard transformer (Vaswani, 2017) on the LM1B (Chelba et al., 2013) and C4 (Raffel et al., 2020a) datasets. Unless stated otherwise, each model has an embedding dimension of 512 and an MLP dimension of four times the embedding dimension. By default, DCA uses a stack of all the previous layer outputs as input to the GRNs. When DCA includes only the first and last-k layer outputs explicitly in the input stack (see Section 3.1), this is denoted as k-DCA. Each model is trained with a sequence length of 128 and a batch size of 2048 over 64 TPUs for 500k steps, totaling 131B tokens. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with β1 = 0.9, β2 = 0.98, a weight decay of 0.1, and a learning rate of 0.0016 with 1000 warmup steps and an inverse square root schedule (Raffel et al., 2020b).
Hardware Specification | Yes | Each model is trained with a sequence length of 128 and a batch size of 2048 over 64 TPUs for 500k steps, totaling 131B tokens.
Software Dependencies | No | The paper mentions using the AdamW optimizer (Loshchilov & Hutter, 2017) but does not provide specific version numbers for any software libraries or frameworks used in the implementation.
Experiment Setup | Yes | Each model is trained with a sequence length of 128 and a batch size of 2048 over 64 TPUs for 500k steps, totaling 131B tokens. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with β1 = 0.9, β2 = 0.98, a weight decay of 0.1, and a learning rate of 0.0016 with 1000 warmup steps and an inverse square root schedule (Raffel et al., 2020b).
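The quoted setup distinguishes plain DCA, which feeds the GRNs a stack of all previous layer outputs, from k-DCA, which keeps only the first and the last k outputs explicitly. A minimal sketch of that selection step follows; the helper name `kdca_input_stack` and the interpretation of "first" as the initial (index-0) output are assumptions for illustration, not the paper's implementation:

```python
def kdca_input_stack(layer_outputs, k=None):
    """Select which prior layer outputs enter the GRN input stack.

    Assumed semantics: with k=None (plain DCA) every previous output is
    stacked; with an integer k (k-DCA) only the first output and the last
    k outputs are kept, per the paper's Section 3.1 as quoted above.
    """
    if k is None or len(layer_outputs) <= k + 1:
        # Plain DCA, or too few outputs for truncation to matter.
        return list(layer_outputs)
    # k-DCA: first output plus the most recent k outputs.
    return [layer_outputs[0]] + list(layer_outputs[-k:])
```

For a 7-layer stack of outputs `[h0, ..., h7]`, `k=3` would keep `h0` plus `h5, h6, h7`, shrinking the GRN input from eight tensors to four.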
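The reported optimizer settings (peak learning rate 0.0016, 1000 warmup steps, inverse square root decay) can be sketched as a schedule function. This assumes the common parameterization following Raffel et al. (2020b), lr(step) = peak * min(step/warmup, sqrt(warmup/step)); the paper's exact formulation may differ:

```python
import math

# Hyperparameters quoted in the paper's experiment setup.
PEAK_LR = 0.0016
WARMUP_STEPS = 1000

def inverse_sqrt_schedule(step: int) -> float:
    """Linear warmup to PEAK_LR, then inverse-square-root decay.

    Assumed form: PEAK_LR * min(step / WARMUP_STEPS,
    sqrt(WARMUP_STEPS / step)), so the two branches meet exactly
    at step = WARMUP_STEPS.
    """
    step = max(step, 1)  # guard against division by zero at step 0
    warmup_factor = step / WARMUP_STEPS
    decay_factor = math.sqrt(WARMUP_STEPS / step)
    return PEAK_LR * min(warmup_factor, decay_factor)
```

Under this parameterization the rate peaks at 0.0016 at step 1000 and decays to half the peak by step 4000, since sqrt(1000/4000) = 0.5.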