Attention-Only Transformers via Unrolled Subspace Denoising
Authors: Peng Wang, Yifu Lu, Yaodong Yu, Druv Pai, Qing Qu, Yi Ma
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we conduct extensive experiments on both vision and language tasks, including supervised image classification, causal language modeling, and in-context learning, to complement our theory and demonstrate the potential of our proposed transformer architecture. We emphasize that the goal of our experiments is not to strive for state-of-the-art performance for these tasks. Instead, they are intended to validate our theory about the components of the transformer. |
| Researcher Affiliation | Academia | 1Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, USA 2Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA 3Institute of Data Science & School of Computing and Data Science, University of Hong Kong, Hong Kong, China. |
| Pseudocode | No | The paper describes the model architecture and iterative updates using mathematical equations (3), (4), and (5), and provides architectural diagrams (Figure 1, 3). However, it does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like steps. |
| Open Source Code | No | Our implementation of the GPT-2 type transformers and training pipeline is based on the framework outlined in (Karpathy, 2022): https://github.com/karpathy/nanoGPT.git. The paper refers to a third-party code repository (nanoGPT) as a basis for their implementation but does not provide a specific link or statement about releasing their own source code for the methodology described. |
| Open Datasets | Yes | We evaluate the performance of AoT as a backbone architecture for supervised image classification on ImageNet and compare it against several state-of-the-art models. We pre-train the AoT-MSSA-L and AoT-MHSA-L models of different sizes, along with GPT-2 (see Table 3 for model sizes), on OpenWebText (Gokaslan & Cohen, 2019). Using the above pre-trained models, we compute the cross-entropy validation loss without training on datasets WikiText (Merity et al., 2017), LAMBADA (Paperno et al., 2016), and PTB (Marcus et al., 1993) in Table 3. In addition, we report zero-shot accuracy in Table 3 on LAMBADA for predicting the final word of sentences, as well as on the Children's Book Test (CBT) (Hill et al., 2015). |
| Dataset Splits | Yes | We employ Lion optimizer (Chen et al., 2024) to pre-train the AoT-MSSA-V transformer on ImageNet-21K for 90 epochs and to fine-tune it on ImageNet-1K (Deng et al., 2009) for 50 epochs by minimizing the cross-entropy (CE) loss. We next train AoT-MHSA-V from scratch on ImageNet-1K for 150 epochs. We pre-train the AoT-MSSA-L and AoT-MHSA-L models of different sizes... on OpenWebText... Zero-shot evaluation. Using the above pre-trained models, we compute the cross-entropy validation loss without training on datasets WikiText, LAMBADA, and PTB... as well as on the Children's Book Test (CBT). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | We employ Lion optimizer (Chen et al., 2024) to pre-train... Our implementation... is based on the framework outlined in (Karpathy, 2022): https://github.com/karpathy/nanoGPT.git. We use the AdamW optimizer (Loshchilov & Hutter, 2019)... using Adam optimizer (Kingma & Ba, 2014). The paper mentions software components such as the Lion optimizer, nanoGPT, AdamW, and Adam, but does not specify version numbers for any of them. |
| Experiment Setup | Yes | During pre-training, we use a learning rate of 2e-4, weight decay of 0.7, label smoothing with a parameter of 0.2, and a batch size of 4096. For fine-tuning, the corresponding values are 5e-4, 0.3, 0.1, and 2048, respectively. We use the Lion optimizer with a learning rate of 5e-4, a weight decay of 0.1, label smoothing with a smoothing parameter of 0.1, and a batch size of 2048. We train these models over a 1024-token context using the AdamW optimizer. For all experiments, we set the number of heads to 8 and the embedding size to 128. ... train the models for 50,000 iterations using the Adam optimizer (Kingma & Ba, 2014). |
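The Experiment Setup row can be restated more precisely as structured configuration. The sketch below is a minimal illustration, not code from the paper: the dataclass, its field names, and the variable names are assumptions introduced here, while the numeric values are exactly those quoted above.

```python
# Illustrative sketch only: collects the hyperparameters quoted in the
# Experiment Setup row into config objects. The TrainConfig structure and
# names are hypothetical; only the values come from the paper's quotes.
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    optimizer: str          # optimizer name as reported (Lion / AdamW / Adam)
    learning_rate: float
    weight_decay: float
    label_smoothing: float
    batch_size: int

# Vision-model pre-training (reported: lr 2e-4, wd 0.7, smoothing 0.2, batch 4096).
vision_pretrain = TrainConfig("Lion", 2e-4, 0.7, 0.2, 4096)

# Vision-model fine-tuning (reported: 5e-4, 0.3, 0.1, 2048).
vision_finetune = TrainConfig("Lion", 5e-4, 0.3, 0.1, 2048)

# Language-model pre-training (reported: Lion, lr 5e-4, wd 0.1, smoothing 0.1, batch 2048).
language_pretrain = TrainConfig("Lion", 5e-4, 0.1, 0.1, 2048)
```

Recording hyperparameters this way makes the reproducibility claim checkable field by field, though the paper itself provides them only in prose.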