Transformer Layers as Painters
Authors: Qi Sun, Marc Pickett, Aakash Kumar Nain, Llion Jones
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a series of empirical studies on frozen models that show that the lower and final layers of pretrained transformers differ from middle layers, but that middle layers have a surprising amount of uniformity. We further show that some classes of problems have robustness to skipping layers, running the layers in an order different from how they were trained, or running the layers in parallel. Our observations suggest that even frozen pretrained models may gracefully trade accuracy for latency by skipping layers or running layers in parallel. |
| Researcher Affiliation | Collaboration | ¹Emergence AI, ²Sakana AI, Japan, ³Institute of Science Tokyo, Japan |
| Pseudocode | No | The paper describes methods and experiments in prose and uses diagrams (e.g., Figure 1) to illustrate execution strategies, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/floatingbigcat/transformer_layers_as_painters |
| Open Datasets | Yes | For Llama2, we use ARC (science exam questions) (Clark et al. 2018), HellaSwag (commonsense) (Zellers et al. 2019), GSM8K (Math Word Problems) (Cobbe et al. 2021), WinoGrande (Winograd Schema Challenge) (Sakaguchi et al. 2019), and LAMBADA (word prediction) (Paperno et al. 2016). This last, LAMBADA, measures perplexity and is closest to the raw token-prediction used during training. ... For BERT, we used tasks from the GLUE benchmark (Wang et al. 2018) and followed their evaluation protocol, including reporting the unnormalized average of the benchmarks. |
| Dataset Splits | Yes | For Llama2, we use ARC (science exam questions) (Clark et al. 2018), HellaSwag (commonsense) (Zellers et al. 2019), GSM8K (Math Word Problems) (Cobbe et al. 2021), WinoGrande (Winograd Schema Challenge) (Sakaguchi et al. 2019), and LAMBADA (word prediction) (Paperno et al. 2016). ... For BERT, we used tasks from the GLUE benchmark (Wang et al. 2018) and followed their evaluation protocol, including reporting the unnormalized average of the benchmarks. |
| Hardware Specification | No | The paper discusses the models (Llama2-7B, BERT-Large, Mistral-7B, Pythia-6.9B) and their parameters, but does not provide specific details about the hardware used to run the experiments. |
| Software Dependencies | No | The paper does not explicitly state any specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks) used for conducting the experiments. |
| Experiment Setup | Yes | Our experiments are primarily on two transformer models: Llama2 (Touvron et al. 2023), and on BERT-Large (Devlin et al. 2019). ... We used the standard pretrained checkpoints for these models. In all our experiments the models are frozen: we never modified the parameters of these models through fine-tuning or other methods, with the exception of the BERT evaluation, which includes a standard fine-tuning step. |
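The execution strategies the paper evaluates (skipping middle layers, or running them in parallel and combining their outputs) can be sketched generically. Below is a minimal toy sketch using random NumPy residual blocks as stand-ins for frozen transformer layers; the layer functions, the chosen layer ranges, and the output-averaging rule for the parallel variant are illustrative assumptions, not the paper's actual Llama2/BERT implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
N_LAYERS = 6

# Toy frozen "layers": each is a residual block h -> h + f(h),
# mimicking the structure (not the scale) of transformer layers.
weights = [rng.normal(scale=0.1, size=(DIM, DIM)) for _ in range(N_LAYERS)]
layers = [lambda h, W=W: h + np.tanh(h @ W) for W in weights]

def run_full(h):
    """Baseline: apply all layers in their training order."""
    for layer in layers:
        h = layer(h)
    return h

def run_skip(h, skip=range(2, 4)):
    """Skip a contiguous block of middle layers entirely."""
    for i, layer in enumerate(layers):
        if i not in skip:
            h = layer(h)
    return h

def run_parallel(h, middle=range(1, 5)):
    """Feed the same input to each middle layer and average their
    outputs, keeping the first and last layers sequential."""
    for i in range(min(middle)):          # leading layers, in order
        h = layers[i](h)
    h = np.mean([layers[i](h) for i in middle], axis=0)
    for i in range(max(middle) + 1, N_LAYERS):  # trailing layers
        h = layers[i](h)
    return h

x = rng.normal(size=DIM)
print(run_full(x).shape, run_skip(x).shape, run_parallel(x).shape)
```

All three variants preserve the hidden-state shape, which is what makes these interventions possible on a frozen model: because every layer maps a representation to a representation in the same space, layers can be dropped or reordered without any retraining.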