Attention-Only Transformers via Unrolled Subspace Denoising
Authors: Peng Wang, Yifu Lu, Yaodong Yu, Druv Pai, Qing Qu, Yi Ma
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we conduct extensive experiments on both vision and language tasks, including supervised image classification, causal language modeling, and in-context learning, to complement our theory and demonstrate the potential of our proposed transformer architecture. We emphasize that the goal of our experiments is not to strive for state-of-the-art performance for these tasks. Instead, they are intended to validate our theory about the components of the transformer. |
| Researcher Affiliation | Academia | 1Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, USA 2Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA 3Institute of Data Science & School of Computing and Data Science, University of Hong Kong, Hong Kong, China. |
| Pseudocode | No | The paper describes the model architecture and iterative updates using mathematical equations (3), (4), and (5), and provides architectural diagrams (Figure 1, 3). However, it does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like steps. |
| Open Source Code | No | Our implementation of the GPT-2 type transformers and training pipeline is based on the framework outlined in (Karpathy, 2022): https://github.com/karpathy/nanoGPT.git. The paper refers to a third-party code repository (nanoGPT) as a basis for their implementation but does not provide a specific link or statement about releasing their own source code for the methodology described. |
| Open Datasets | Yes | We evaluate the performance of AoT as a backbone architecture for supervised image classification on ImageNet and compare it against several state-of-the-art models. We pre-train the AoT-MSSA-L and AoT-MHSA-L models of different sizes, along with GPT-2 (see Table 3 for model sizes), on OpenWebText (Gokaslan & Cohen, 2019). Using the above pre-trained models, we compute the cross-entropy validation loss without training on datasets WikiText (Merity et al., 2017), LAMBADA (Paperno et al., 2016), and PTB (Marcus et al., 1993) in Table 3. In addition, we report zero-shot accuracy in Table 3 on LAMBADA for predicting the final word of sentences, as well as on the Children's Book Test (CBT) (Hill et al., 2015). |
| Dataset Splits | Yes | We employ Lion optimizer (Chen et al., 2024) to pre-train the AoT-MSSA-V transformer on ImageNet-21K for 90 epochs and to fine-tune it on ImageNet-1K (Deng et al., 2009) for 50 epochs by minimizing the cross-entropy (CE) loss. We next train AoT-MHSA-V from scratch on ImageNet-1K for 150 epochs. We pre-train the AoT-MSSA-L and AoT-MHSA-L models of different sizes... on OpenWebText... Zero-shot evaluation. Using the above pre-trained models, we compute the cross-entropy validation loss without training on datasets WikiText, LAMBADA, and PTB... as well as on the Children's Book Test (CBT). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | We employ Lion optimizer (Chen et al., 2024) to pre-train... Our implementation... is based on the framework outlined in (Karpathy, 2022): https://github.com/karpathy/nanoGPT.git. We use the AdamW optimizer (Loshchilov & Hutter, 2019)... using Adam optimizer (Kingma & Ba, 2014). The paper mentions software components such as the Lion optimizer, nanoGPT, AdamW, and Adam, but does not specify version numbers for any of them. |
| Experiment Setup | Yes | During pre-training, we use a learning rate of 2e-4, weight decay of 0.7, label smoothing with a parameter of 0.2, and a batch size of 4096. For fine-tuning, the corresponding values are 5e-4, 0.3, 0.1, and 2048, respectively. We use the Lion optimizer with a learning rate of 5e-4, a weight decay of 0.1, label smoothing with a smoothing parameter of 0.1, and a batch size of 2048. We train these models over a 1024-token context using the AdamW optimizer. For all experiments, we set the number of heads to 8 and the embedding size to 128. ... train the models for 50,000 iterations using the Adam optimizer (Kingma & Ba, 2014). |
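The Experiment Setup row can be restated more precisely as structured configuration. The sketch below is a minimal illustration, not code from the paper: the dataclass, its field names, and the variable names are assumptions introduced here, while the numeric values are exactly those quoted above.

```python
# Illustrative sketch only: collects the hyperparameters quoted in the
# Experiment Setup row into config objects. The TrainConfig structure and
# names are hypothetical; only the values come from the paper's quotes.
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    optimizer: str          # optimizer name as reported (Lion / AdamW / Adam)
    learning_rate: float
    weight_decay: float
    label_smoothing: float
    batch_size: int

# Vision-model pre-training (reported: lr 2e-4, wd 0.7, smoothing 0.2, batch 4096).
vision_pretrain = TrainConfig("Lion", 2e-4, 0.7, 0.2, 4096)

# Vision-model fine-tuning (reported: 5e-4, 0.3, 0.1, 2048).
vision_finetune = TrainConfig("Lion", 5e-4, 0.3, 0.1, 2048)

# Language-model pre-training (reported: Lion, lr 5e-4, wd 0.1, smoothing 0.1, batch 2048).
language_pretrain = TrainConfig("Lion", 5e-4, 0.1, 0.1, 2048)
```

Recording hyperparameters this way makes the reproducibility claim checkable field by field, though the paper itself provides them only in prose.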