Exploring Activation Patterns of Parameters in Language Models

Authors: Yudong Wang, Damai Dai, Zhe Yang, Jingyuan Ma, Zhifang Sui

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental To explain the internal representations of LLMs, we utilize a gradient-based metric to assess the activation level of model parameters. Based on this metric, we obtain three preliminary findings. ... Further, we develop three validation experiments to solidify these findings. (1) Firstly, starting from the first finding, we attempt to configure different sparsities for different layers and find this method can benefit model pruning. ... (2) Secondly, we find that a pruned model based on one calibration set can better handle tasks related to the calibration task than those not related, which validates the second finding. (3) Thirdly, Based on the STS-B and SICK benchmarks, we find that two sentences with consistent semantics tend to share similar parameter activation patterns in deep layers, which aligns with our third finding. Our work sheds light on the behavior of parameter activation in LLMs, and we hope these findings will have the potential to inspire more practical applications.
Researcher Affiliation Academia Yudong Wang, Damai Dai, Zhe Yang, Jingyuan Ma, Zhifang Sui, School of Computer Science, State Key Laboratory of Multimedia Information Processing, Peking University
Pseudocode No The paper describes its methodology through descriptive text and mathematical formulas, such as the definition of activation A(s, wi) and the calculation of sparsity Si, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code https://github.com/Qian2333/Exploring-Activation-Patterns-of-Parameters-in-Language-Models
Open Datasets Yes For each dataset, we select 64 samples and average the activation status of each parameter across the different samples. When calculating the activation status of data in different layers, we collectively consider the parameters within a single layer. Specifically, this includes parameters from seven parts: the fully connected linear layers of Q, K, V, and O, and the three fully connected linear layers of the MLP. For domain-specific tasks, the statistical results in Figure 1 show that fewer parameters are activated in the first layer, meaning that only a small portion of parameters has a significant impact on the results. In the shallow layers, the number of parameters with a greater influence (A(wi) > 0.00002) on the results gradually increases. In the relatively deeper layers, the number of parameters with a significant impact decreases, concentrating on specific parts. This phenomenon is consistent across the five datasets representing different abilities. This leads us to speculate that for a single task, apart from the first layer, many parameters in the shallow layers are involved in producing the results, whereas in the deep layers and the first layer, fewer parameters have a significant impact on the results. For general corpora, the results remain similar to those of domain-specific tasks in the shallow layers but differ significantly in the deep layers: the average activation of parameters in the deep layers is higher than on specific tasks. This is particularly evident in C4, which shows less specialization than Wikitext2.
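The averaging-and-thresholding step described above can be sketched in a few lines: each calibration sample yields a per-parameter score A(s, wi), the scores are averaged over the 64 samples, and a parameter counts as activated when its mean score exceeds the threshold γ = 0.00002. The function name and input layout below are illustrative assumptions, not the authors' code.

```python
def fraction_activated(scores_per_sample, gamma=2e-5):
    """Fraction of a layer's parameters counted as activated.

    scores_per_sample: list of per-sample score lists, one inner list per
    calibration sample, each holding A(s, w_i) for every parameter w_i.
    A parameter is activated when its mean score over samples exceeds gamma.
    """
    n_samples = len(scores_per_sample)
    n_params = len(scores_per_sample[0])
    # Average each parameter's score across the calibration samples
    avg = [sum(sample[i] for sample in scores_per_sample) / n_samples
           for i in range(n_params)]
    # Count the share of parameters above the activation threshold
    return sum(a > gamma for a in avg) / n_params
```

Running this per layer over the 64-sample calibration sets would reproduce the layer-wise activation profiles that Figure 1 of the paper reports.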
Dataset Splits No The paper uses well-known datasets like Boolq, MMLU, Human Eval, SIQA, hellaswag, C4, Wikitext2, STS-B, SICK, GSM8K, and PIQA for evaluation and calibration. However, it only specifies sample sizes used for analysis (e.g., "64 samples", "16 samples", "256 examples") and how calibration data was processed ("concatenated different MMLU and SIQA data into long sentences and cut them to a specific length"), but it does not provide explicit training, validation, or test dataset splits or a detailed methodology for splitting data to reproduce the experimental partitioning.
Hardware Specification No The paper mentions "GPU memory limitations prevent us from verifying the pruning results of Llama2-70B." but does not specify the particular GPU models, CPUs, or any other hardware used for running the reported experiments.
Software Dependencies No The paper mentions using specific Large Language Models (LLMs) like Llama2-7B, Llama3-8B, Qwen-7B and refers to pruning methods from other works (e.g., Sun et al. 2023). However, it does not specify any software versions for libraries, frameworks (like PyTorch or TensorFlow), or other programming dependencies.
Experiment Setup Yes To validate this conclusion, we prune the model so that layers 3-17 of the network are less sparse, while layers 1-2 and 18-32 have a higher degree of sparsity. We employ the unstructured pruning method proposed by (Sun et al. 2023) without retraining, keeping all other settings constant, on Llama-7B, Llama2-7B, Llama2-13B, and Llama3-8B (AI@Meta 2024). For a language model with L layers, given the total sparsity S and the number of activated parameters (A(wi) > γ) Ni for the ith layer, we set the sparsity Si of the ith layer as

N̂i = sigmoid((Ni − mean(N)) / std(N) + α)
Si = S + βS(1 − N̂i · L / Σ_{j=1..L} N̂j)

We set α = 2, β = 0.2 for all experiments, and γ = 0.00002 for all 32-layer models, while γ = 0.000005 for the 40-layer Llama-2-13B. The heuristic term α balances the large gap between layers 1-2 and the deeper layers, which would otherwise lead to excessive sparsity in the deeper layers, and the sigmoid keeps a certain amount of parameters in every layer during pruning. While the pruning method (Wanda) cannot precisely determine which parameters serve a specific utility, it is essential to maintain a moderate level of sparsity. To compare the pruned networks, we report two metrics on six datasets under the original settings. The calibration dataset for all experiments is C4 (Raffel et al. 2020). The evaluation tasks include Wikitext2 (Merity et al. 2016), Boolq, SIQA (Sap et al. 2019), PIQA (Bisk et al. 2020), Hellaswag (Zellers et al. 2019), and MMLU. As shown in Table 1, our method consistently improves perplexity on Wikitext2 across all models, suggesting that our approach can generally enhance the models' language-modeling capability. In the zero-shot results, it is worth noting that the improvements on Hellaswag are universal, while Boolq generally experiences a decline.
In conjunction with Figure 6, we find that Hellaswag has the highest correlation with C4, while Boolq's is relatively lower. Therefore, the results of using C4 as the calibration set are less satisfactory on Boolq.
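The per-layer sparsity allocation quoted in the experiment setup can be sketched as follows: the activated-parameter counts Ni are z-scored, shifted by α, squashed with a sigmoid, and the global sparsity S is then redistributed so that layers with more activated parameters are pruned less. This is a minimal reading of the formulas, not the authors' released implementation.

```python
import math

def layer_sparsities(counts, S=0.5, alpha=2.0, beta=0.2):
    """Per-layer sparsity S_i from activated-parameter counts N_i.

    Sketch of: N^_i = sigmoid(z_i + alpha),
               S_i  = S + beta*S*(1 - N^_i * L / sum_j N^_j)
    with z_i the z-score of N_i.
    """
    L = len(counts)
    mean = sum(counts) / L
    std = (sum((n - mean) ** 2 for n in counts) / L) ** 0.5
    # z-score the counts, shift by alpha, squash with a sigmoid
    n_hat = [1.0 / (1.0 + math.exp(-((n - mean) / std + alpha)))
             for n in counts]
    total = sum(n_hat)
    # Redistribute sparsity around the global target S:
    # layers with more activated parameters get pruned less.
    return [S + beta * S * (1.0 - h * L / total) for h in n_hat]
```

By construction the mean of the Si equals the global target S, so the overall parameter budget of the pruned model is preserved; β only controls how strongly sparsity is shifted between layers.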