The Unreasonable Ineffectiveness of the Deeper Layers
Authors: Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our first set of results are shown in Figure 2, where we plot 5-shot MMLU accuracy as a function of the fraction of layers removed: in the left panel we present the Llama-2 family, in the middle panel we present models from the Qwen family, and in the right panel we show Mistral-7B and Phi-2. In order to better compare models with different total numbers of layers, in these plots we opted to normalize the x-axis by the fraction of layers removed (rather than the absolute number of layers removed). Note that since MMLU contains multiple-choice questions with four possible responses, the expected accuracy of random guessing is 25%. |
| Researcher Affiliation | Collaboration | Andrey Gromov Meta FAIR & UMD Kushal Tirumala Meta FAIR Hassan Shapourian Cisco Paolo Glorioso Zyphra Daniel A. Roberts MIT & Sequoia Capital. Co-first authors; please direct correspondence to the union of {EMAIL, EMAIL, EMAIL}. |
| Pseudocode | Yes | Our principal layer pruning algorithm is very simple: 0. Pick a number of layers to prune, n. 1. Compute the angular distance d(x(ℓ), x(ℓ+n)), cf. (7) below, between the input to layer ℓ and the input to layer ℓ+n on a neutral pretraining dataset or on a dataset representative of a downstream task of interest. 2. Find the layer, ℓ⋆, that minimizes that distance: ℓ⋆(n) ≡ arg min_ℓ d(x(ℓ), x(ℓ+n)). (6) 3. Drop layers ℓ⋆ to ℓ⋆+n−1; connect the old input to layer ℓ⋆ to the old (ℓ⋆+n)-th layer block. 4. (Optionally) heal the mismatch at layer ℓ⋆+n with a small amount of fine-tuning on a neutral pretraining dataset or particular dataset of interest. If fewer words inside of a figure are more helpful to you than the text in an enumerated list, then note that this algorithm is also depicted in panels (a)-(b) of Figure 1. |
| Open Source Code | No | No explicit statement about releasing code for the methodology described in this paper or a direct link to a code repository is provided. The paper references third-party tools like Eleuther AI and PEFT but does not provide source code for its own implementation. |
| Open Datasets | Yes | For these models, we executed the healing step using QLoRA (Dettmers et al., 2023): our models were quantized to 4-bit precision and then finetuned, using QLoRA for efficient training, on either 164M or 328M tokens from the Colossal Clean Crawled Corpus (C4) (Raffel et al., 2020), a common pretraining dataset. For our QA evals, we used Massive Multitask Language Understanding (MMLU) (Hendrycks et al., 2020), a common world-knowledge and problem-solving benchmark, and BoolQ (Clark et al., 2019), a common yes/no reading comprehension benchmark where the answer has to be inferred from the text itself. [...] For GSM8K (Cobbe et al., 2021), a grade-school math benchmark, and HellaSwag (Zellers et al., 2019), a multiple-choice common-sense reasoning benchmark. |
| Dataset Splits | Yes | For our QA evals, we used Massive Multitask Language Understanding (MMLU) (Hendrycks et al., 2020) [...] and BoolQ (Clark et al., 2019) [...]. The autoregressive loss on a subset of the C4 validation set. [...] MMLU accuracy (5-shot) vs. fraction of layers dropped [...]. For CoT-MMLU, we followed the flan_cot_fewshot evaluation in Eleuther AI (Gao et al., 2023) [...]. For GSM8K, we used the gsm8k_cot evaluation in Eleuther AI (Gao et al., 2023) and measured pass@1; for each problem we extracted an answer from a single generation (with CoT) [...]. For HellaSwag, we used the hellaswag evaluation in Eleuther AI (Gao et al., 2023). |
| Hardware Specification | Yes | For our study, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single 40GB A100 GPU. |
| Software Dependencies | No | The paper mentions PyTorch and the PEFT library (specifically QLoRA is used, which is implemented in PEFT), and the Eleuther AI evaluation framework, but it does not specify explicit version numbers for these software components. |
| Experiment Setup | Yes | For our study, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single 40GB A100 GPU. For these models, we executed the healing step using QLoRA (Dettmers et al., 2023): our models were quantized to 4-bit precision and then finetuned, using QLoRA for efficient training, on either 164M or 328M tokens from the Colossal Clean Crawled Corpus (C4). For our QA evals, we used Massive Multitask Language Understanding (MMLU) (Hendrycks et al., 2020), a common world-knowledge and problem-solving benchmark, and BoolQ (Clark et al., 2019), a common yes/no reading comprehension benchmark where the answer has to be inferred from the text itself. For CoT-MMLU, we followed the flan_cot_fewshot evaluation in Eleuther AI (Gao et al., 2023), in which models produce a chain of thought before generating their answer. For GSM8K, we used the gsm8k_cot evaluation in Eleuther AI (Gao et al., 2023) and measured pass@1; for each problem we extracted an answer from a single generation (with CoT) and checked for correctness against the ground-truth answer. |
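The pruning criterion quoted in the Pseudocode row (steps 1–2: pick the block of n layers whose input and output representations are most similar in angular distance) can be sketched as follows. The paper releases no code, so the function names, the NumPy implementation, and the toy per-layer hidden states are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def angular_distance(x, y):
    """Angular distance d(x, y) = (1/pi) * arccos(cosine similarity),
    normalized to lie in [0, 1]."""
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    # Clip to guard against floating-point values slightly outside [-1, 1].
    return float(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi)

def best_block_to_prune(hidden_states, n):
    """Given hidden_states[l] = input to layer l (averaged over a
    calibration dataset), return l* = argmin_l d(x(l), x(l+n)),
    i.e. the start of the n-layer block whose removal changes the
    representation least, plus the full list of candidate distances."""
    num_candidates = len(hidden_states) - n
    dists = [angular_distance(hidden_states[l], hidden_states[l + n])
             for l in range(num_candidates)]
    return int(np.argmin(dists)), dists
```

Per step 3 of the algorithm, one would then delete layers `l*` through `l* + n - 1` from the model and feed the old input of layer `l*` directly into layer `l* + n`, optionally healing with a small QLoRA finetune (step 4). In practice the hidden states would come from forward hooks on a real transformer over a calibration set, not the toy vectors above.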