Transformer Layers as Painters

Authors: Qi Sun, Marc Pickett, Aakash Kumar Nain, Llion Jones

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present a series of empirical studies on frozen models that show that the lower and final layers of pretrained transformers differ from middle layers, but that middle layers have a surprising amount of uniformity. We further show that some classes of problems have robustness to skipping layers, running the layers in an order different from how they were trained, or running the layers in parallel. Our observations suggest that even frozen pretrained models may gracefully trade accuracy for latency by skipping layers or running layers in parallel.
Researcher Affiliation | Collaboration | (1) Emergence AI; (2) Sakana AI, Japan; (3) Institute of Science Tokyo, Japan. EMAIL, EMAIL
Pseudocode | No | The paper describes methods and experiments in prose and uses diagrams (e.g., Figure 1) to illustrate execution strategies, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/floatingbigcat/transformer_layers_as_painters
Open Datasets | Yes | For Llama2, we use ARC (science exam questions) (Clark et al. 2018), HellaSwag (commonsense) (Zellers et al. 2019), GSM8K (Math Word Problems) (Cobbe et al. 2021), WinoGrande (Winograd Schema Challenge) (Sakaguchi et al. 2019), and LAMBADA (word prediction) (Paperno et al. 2016). This last, LAMBADA, measures perplexity and is closest to the raw token-prediction used during training. ... For BERT, we used tasks from the GLUE benchmark (Wang et al. 2018) and followed their evaluation protocol, including reporting the unnormalized average of the benchmarks.
Dataset Splits | Yes | For Llama2, we use ARC (science exam questions) (Clark et al. 2018), HellaSwag (commonsense) (Zellers et al. 2019), GSM8K (Math Word Problems) (Cobbe et al. 2021), WinoGrande (Winograd Schema Challenge) (Sakaguchi et al. 2019), and LAMBADA (word prediction) (Paperno et al. 2016). ... For BERT, we used tasks from the GLUE benchmark (Wang et al. 2018) and followed their evaluation protocol, including reporting the unnormalized average of the benchmarks.
Hardware Specification | No | The paper discusses the models (Llama2-7B, BERT-Large, Mistral-7B, Pythia-6.9B) and their parameters, but does not provide specific details about the hardware used to run the experiments.
Software Dependencies | No | The paper does not explicitly state any specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks) used for conducting the experiments.
Experiment Setup | Yes | Our experiments are primarily on two transformer models: Llama2 (Touvron et al. 2023), and on BERT-Large (Devlin et al. 2019). ... We used the standard pretrained checkpoints for these models. In all our experiments the models are frozen: we never modified the parameters of these models through fine-tuning or other methods, with the exception of the BERT evaluation, which includes a standard fine-tuning step.
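The execution strategies named in the abstract (skipping layers, reordering them, and running middle layers "in parallel") can be sketched on toy layer functions. This is a minimal illustrative sketch, not the authors' code: the toy numeric layers, the function names, and the choice to realize "parallel" execution by averaging the middle layers' outputs are all assumptions made here for clarity.

```python
# Illustrative sketch (NOT the paper's implementation): each "layer" is a
# toy function standing in for a frozen transformer block.

def run_sequential(layers, x):
    """Baseline: apply every layer in its trained order."""
    for layer in layers:
        x = layer(x)
    return x

def run_skipping(layers, x, skip):
    """Skip the layers whose indices are in `skip` (one of the paper's strategies)."""
    for i, layer in enumerate(layers):
        if i not in skip:
            x = layer(x)
    return x

def run_middle_parallel(layers, x, first=1, last=-1):
    """Run outer layers normally; feed the same input to every middle layer
    and average their outputs (an assumed realization of "in parallel")."""
    for layer in layers[:first]:
        x = layer(x)
    middle = layers[first:last]
    if middle:
        x = sum(layer(x) for layer in middle) / len(middle)
    for layer in layers[last:]:
        x = layer(x)
    return x

# Toy "blocks" that just add a constant, so the arithmetic is easy to check.
layers = [lambda x, k=k: x + k for k in range(4)]
print(run_sequential(layers, 0.0))        # 0+0+1+2+3 -> 6.0
print(run_skipping(layers, 0.0, {1, 2}))  # 0+0+3     -> 3.0
print(run_middle_parallel(layers, 0.0))   # avg(1, 2) then +3 -> 4.5
```

The point of the toy is only the control flow: all three strategies reuse the same frozen layers, so trading accuracy for latency (as the abstract suggests) requires no retraining, only a different execution path.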