Transformer Layers as Painters

Authors: Qi Sun, Marc Pickett, Aakash Kumar Nain, Llion Jones

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present a series of empirical studies on frozen models that show that the lower and final layers of pretrained transformers differ from middle layers, but that middle layers have a surprising amount of uniformity. We further show that some classes of problems have robustness to skipping layers, running the layers in an order different from how they were trained, or running the layers in parallel. Our observations suggest that even frozen pretrained models may gracefully trade accuracy for latency by skipping layers or running layers in parallel.
Researcher Affiliation | Collaboration | (1) Emergence AI; (2) Sakana AI, Japan; (3) Institute of Science Tokyo, Japan. EMAIL, EMAIL
Pseudocode | No | The paper describes methods and experiments in prose and uses diagrams (e.g., Figure 1) to illustrate execution strategies, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/floatingbigcat/transformer_layers_as_painters
Open Datasets | Yes | For Llama2, we use ARC (science exam questions) (Clark et al. 2018), HellaSwag (commonsense) (Zellers et al. 2019), GSM8K (Math Word Problems) (Cobbe et al. 2021), WinoGrande (Winograd Schema Challenge) (Sakaguchi et al. 2019), and LAMBADA (word prediction) (Paperno et al. 2016). This last, LAMBADA, measures perplexity and is closest to the raw token-prediction used during training. ... For BERT, we used tasks from the GLUE benchmark (Wang et al. 2018) and followed their evaluation protocol, including reporting the unnormalized average of the benchmarks.
Dataset Splits | Yes | For Llama2, we use ARC (science exam questions) (Clark et al. 2018), HellaSwag (commonsense) (Zellers et al. 2019), GSM8K (Math Word Problems) (Cobbe et al. 2021), WinoGrande (Winograd Schema Challenge) (Sakaguchi et al. 2019), and LAMBADA (word prediction) (Paperno et al. 2016). ... For BERT, we used tasks from the GLUE benchmark (Wang et al. 2018) and followed their evaluation protocol, including reporting the unnormalized average of the benchmarks.
Hardware Specification | No | The paper discusses the models (Llama2-7B, BERT-Large, Mistral-7B, Pythia-6.9B) and their parameters, but does not provide specific details about the hardware used to run the experiments.
Software Dependencies | No | The paper does not explicitly state any specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks) used for conducting the experiments.
Experiment Setup | Yes | Our experiments are primarily on two transformer models: Llama2 (Touvron et al. 2023), and on BERT-Large (Devlin et al. 2019). ... We used the standard pretrained checkpoints for these models. In all our experiments the models are frozen: we never modified the parameters of these models through fine-tuning or other methods, with the exception of the BERT evaluation, which includes a standard fine-tuning step.
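The execution strategies named in the abstract (skipping layers, reordering them, and running middle layers "in parallel") can be sketched on toy layer functions. This is a minimal illustrative sketch, not the authors' code: the toy numeric layers, the function names, and the choice to realize "parallel" execution by averaging the middle layers' outputs are all assumptions made here for clarity.

```python
# Illustrative sketch (NOT the paper's implementation): each "layer" is a
# toy function standing in for a frozen transformer block.

def run_sequential(layers, x):
    """Baseline: apply every layer in its trained order."""
    for layer in layers:
        x = layer(x)
    return x

def run_skipping(layers, x, skip):
    """Skip the layers whose indices are in `skip` (one of the paper's strategies)."""
    for i, layer in enumerate(layers):
        if i not in skip:
            x = layer(x)
    return x

def run_middle_parallel(layers, x, first=1, last=-1):
    """Run outer layers normally; feed the same input to every middle layer
    and average their outputs (an assumed realization of "in parallel")."""
    for layer in layers[:first]:
        x = layer(x)
    middle = layers[first:last]
    if middle:
        x = sum(layer(x) for layer in middle) / len(middle)
    for layer in layers[last:]:
        x = layer(x)
    return x

# Toy "blocks" that just add a constant, so the arithmetic is easy to check.
layers = [lambda x, k=k: x + k for k in range(4)]
print(run_sequential(layers, 0.0))        # 0+0+1+2+3 -> 6.0
print(run_skipping(layers, 0.0, {1, 2}))  # 0+0+3     -> 3.0
print(run_middle_parallel(layers, 0.0))   # avg(1, 2) then +3 -> 4.5
```

The point of the toy is only the control flow: all three strategies reuse the same frozen layers, so trading accuracy for latency (as the abstract suggests) requires no retraining, only a different execution path.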