The Unreasonable Ineffectiveness of the Deeper Layers
Authors: Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our first set of results are shown in Figure 2, where we plot 5-shot MMLU accuracy as a function of the fraction of layers removed: in the left panel we present the Llama-2 family, in the middle panel we present models from the Qwen family, and in the right panel we show Mistral-7B and Phi-2. In order to better compare models with different total numbers of layers, in these plots we opted to normalize the x-axis by the fraction of layers removed (rather than the absolute number of layers removed). Note that since MMLU contains multiple-choice questions with four possible responses, the expected accuracy of random guessing is 25%. |
| Researcher Affiliation | Collaboration | Andrey Gromov Meta FAIR & UMD Kushal Tirumala Meta FAIR Hassan Shapourian Cisco Paolo Glorioso Zyphra Daniel A. Roberts MIT & Sequoia Capital. Co-first authors; please direct correspondence to the union of {EMAIL, EMAIL, EMAIL}. |
| Pseudocode | Yes | Our principal layer pruning algorithm is very simple: 0. Pick a number of layers to prune, n. 1. Compute the angular distance d(x(ℓ), x(ℓ+n)), cf. (7) below, between the input to layer ℓ and the input to layer ℓ+n on a neutral pretraining dataset or on a dataset representative of a downstream task of interest. 2. Find the layer, ℓ⋆, that minimizes that distance: ℓ⋆(n) ≡ arg min_ℓ d(x(ℓ), x(ℓ+n)). (6) 3. Drop layers ℓ⋆ to ℓ⋆+n−1; connect the old input to layer ℓ⋆ to the old (ℓ⋆+n)-th layer block. 4. (Optionally) heal the mismatch at layer ℓ⋆+n with a small amount of fine-tuning on a neutral pretraining dataset or particular dataset of interest. If fewer words inside of a figure are more helpful to you than the text in an enumerated list, then note that this algorithm is also depicted in panels (a)-(b) of Figure 1. |
| Open Source Code | No | No explicit statement about releasing code for the methodology described in this paper or a direct link to a code repository is provided. The paper references third-party tools like Eleuther AI and PEFT but does not provide source code for its own implementation. |
| Open Datasets | Yes | For these models, we executed the healing step using QLoRA (Dettmers et al., 2023): our models were quantized to 4-bit precision and then finetuned, using QLoRA for efficient training, on either 164M or 328M tokens from the Colossal Clean Crawled Corpus (C4) (Raffel et al., 2020), a common pretraining dataset. For our QA evals, we used Massive Multitask Language Understanding (MMLU) (Hendrycks et al., 2020), a common world-knowledge and problem-solving benchmark, and BoolQ (Clark et al., 2019), a common yes/no reading comprehension benchmark where the answer has to be inferred from the text itself. [...] For GSM8K (Cobbe et al., 2021), a grade-school math benchmark, and HellaSwag (Zellers et al., 2019), a multiple-choice common-sense reasoning benchmark. |
| Dataset Splits | Yes | For our QA evals, we used Massive Multitask Language Understanding (MMLU) (Hendrycks et al., 2020) [...] and BoolQ (Clark et al., 2019) [...]. The autoregressive loss on a subset of the C4 validation set. [...] MMLU accuracy (5-shot) vs. fraction of layers dropped [...]. For CoT-MMLU, we followed the flan_cot_fewshot evaluation in Eleuther AI (Gao et al., 2023) [...]. For GSM8K, we used the gsm8k_cot evaluation in Eleuther AI (Gao et al., 2023) and measured pass@1; for each problem we extracted an answer from a single generation (with CoT) [...]. For HellaSwag, we used the hellaswag evaluation in Eleuther AI (Gao et al., 2023). |
| Hardware Specification | Yes | For our study, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single 40GB A100 GPU. |
| Software Dependencies | No | The paper mentions PyTorch and the PEFT library (specifically QLoRA is used, which is implemented in PEFT), and the Eleuther AI evaluation framework, but it does not specify explicit version numbers for these software components. |
| Experiment Setup | Yes | For our study, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single 40GB A100 GPU. For these models, we executed the healing step using QLoRA (Dettmers et al., 2023): our models were quantized to 4-bit precision and then finetuned, using QLoRA for efficient training, on either 164M or 328M tokens from the Colossal Clean Crawled Corpus (C4). For our QA evals, we used Massive Multitask Language Understanding (MMLU) (Hendrycks et al., 2020), a common world-knowledge and problem-solving benchmark, and BoolQ (Clark et al., 2019), a common yes/no reading comprehension benchmark where the answer has to be inferred from the text itself. For CoT-MMLU, we followed the flan_cot_fewshot evaluation in Eleuther AI (Gao et al., 2023), in which models produce a chain of thought before generating their answer. For GSM8K, we used the gsm8k_cot evaluation in Eleuther AI (Gao et al., 2023) and measured pass@1; for each problem we extracted an answer from a single generation (with CoT) and checked for correctness against the ground-truth answer. |
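The pruning criterion quoted in the Pseudocode row (steps 1–2: pick the block of n layers whose input and output representations are most similar in angular distance) can be sketched as follows. The paper releases no code, so the function names, the NumPy implementation, and the toy per-layer hidden states are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def angular_distance(x, y):
    """Angular distance d(x, y) = (1/pi) * arccos(cosine similarity),
    normalized to lie in [0, 1]."""
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    # Clip to guard against floating-point values slightly outside [-1, 1].
    return float(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi)

def best_block_to_prune(hidden_states, n):
    """Given hidden_states[l] = input to layer l (averaged over a
    calibration dataset), return l* = argmin_l d(x(l), x(l+n)),
    i.e. the start of the n-layer block whose removal changes the
    representation least, plus the full list of candidate distances."""
    num_candidates = len(hidden_states) - n
    dists = [angular_distance(hidden_states[l], hidden_states[l + n])
             for l in range(num_candidates)]
    return int(np.argmin(dists)), dists
```

Per step 3 of the algorithm, one would then delete layers `l*` through `l* + n - 1` from the model and feed the old input of layer `l*` directly into layer `l* + n`, optionally healing with a small QLoRA finetune (step 4). In practice the hidden states would come from forward hooks on a real transformer over a calibration set, not the toy vectors above.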