DOCS: Quantifying Weight Similarity for Deeper Insights into Large Language Models
Authors: Zeping Min, Xinshang Wang
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce a novel index, the Distribution of Cosine Similarity (DOCS), for quantitatively assessing the similarity between weight matrices in Large Language Models (LLMs), aiming to facilitate the analysis of their complex architectures. Leveraging DOCS, our analysis uncovers intriguing patterns in the latest open-source LLMs: adjacent layers frequently exhibit high weight similarity and tend to form clusters, suggesting depth-wise functional specialization. Additionally, we prove that DOCS is theoretically effective in quantifying similarity for orthogonal matrices, a crucial aspect given the prevalence of orthogonal initializations in LLMs. This research contributes to a deeper understanding of LLM architecture and behavior, offering tools with potential implications for developing more efficient and interpretable models. In this work, we extend the application of similarity analysis by directly examining the weight matrices of various LLMs, instead of focusing on representations. By analyzing the weights themselves, we aim to uncover deeper insights into the model's structure and functionality that are not apparent from representations alone. [...] We conduct experiments to demonstrate the capabilities of DOCS and to gain insights into the internal structure of LLMs. |
| Researcher Affiliation | Industry | Zeping Min Alibaba Group Hupan Laboratory AMSS, Chinese Academy of Sciences EMAIL Xinshang Wang Alibaba Group EMAIL |
| Pseudocode | Yes | Algorithm 1: Computation of the DOCS Similarity Index S_DOCS. 1: Input: matrices X = [X_1, X_2, ..., X_m] ∈ R^(n×m) and Y = [Y_1, Y_2, ..., Y_m] ∈ R^(n×m). 2: Output: similarity index S_DOCS. 3: function MAXCOSSIM(A, B): 4: compute the cosine similarity matrix C ∈ R^(m×m), where C_jk = (A_j · B_k) / (‖A_j‖ ‖B_k‖); 5: for each column A_j, find s_{A_j} = max_k \|C_jk\|; 6: return s_A = [s_{A_1}, s_{A_2}, ..., s_{A_m}]; 7: end function. 8: Compute s_X = MAXCOSSIM(X, Y). 9: Compute s_Y = MAXCOSSIM(Y, X). 10: Fit a Gumbel distribution to s_X to estimate the location parameter u_X using maximum likelihood estimation. 11: Fit a Gumbel distribution to s_Y to estimate the location parameter u_Y using maximum likelihood estimation. 12: Compute the similarity index S_DOCS = u_X + u_Y. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-sourcing of its own methodology (DOCS). It mentions 'https://github.com/huggingface/transformers' in a footnote, but this refers to third-party LLM implementations the authors used, not their own code. |
| Open Datasets | Yes | Our analysis uncovers intriguing patterns in the latest open-source LLMs: [...] We conduct experiments to demonstrate the capabilities of DOCS and to gain insights into the internal structure of LLMs. In LLM implementations, the rows of a weight matrix correspond to output dimensions, and the columns correspond to input dimensions. [...] Figure 2 provides a visual comparison of eight different similarity indices applied to the MLP-UP layers of the Meta-Llama-3.1-8B-Instruct model. [...] We investigated the similarity patterns between neighboring transformer layers by analyzing various weight matrices (Wv, Wk, Wq, Wo, MLP-UP, MLP-DOWN) in various LLMs. We employed DOCS to compute and visualize these similarities. Figure 3 illustrates the results for Wk, Wq, and MLP-DOWN on gemma-2-27b-it. [...] LLMs, including GPT-2 (Radford et al., 2019), Llama (Touvron et al., 2023), Mistral (Jiang et al., 2023), Llama 3 (Dubey et al., 2024), GPT-NeoX-20B (Black et al., 2022), OPT (Zhang et al., 2022), CodeGeeX (Zheng et al., 2023), GLM-130B (Zeng et al., 2022), and FLM (Li et al., 2023), adopt architectures where all layers have the same size. |
| Dataset Splits | No | The paper focuses on analyzing the weights of existing Large Language Models (LLMs) rather than training new models or performing evaluations that require dataset splits (training, validation, test). Therefore, it does not provide information on dataset splits. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types, memory) used for running the experiments or analyses. |
| Software Dependencies | No | The paper mentions 'https://github.com/huggingface/transformers' as a source for LLM implementations, but it does not specify any software dependencies (libraries, frameworks, or operating systems) with version numbers that were used to implement or run the DOCS methodology. |
| Experiment Setup | No | The paper describes its proposed methodology (DOCS) and analyses performed on existing LLMs. It does not provide details on experimental setup such as hyperparameters, training configurations, learning rates, batch sizes, or optimization settings, as it is not involved in training new models. |
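The pseudocode in the table can be sketched directly in Python. This is a minimal reading of Algorithm 1, not the authors' code (which is not released): it assumes `scipy.stats.gumbel_r.fit` for the maximum-likelihood Gumbel fit, and it takes the final index to be the plain sum u_X + u_Y as written in the extracted pseudocode.

```python
import numpy as np
from scipy.stats import gumbel_r


def max_cos_sim(A, B):
    """For each column A_j, the largest |cosine similarity| against any column of B."""
    An = A / np.linalg.norm(A, axis=0, keepdims=True)
    Bn = B / np.linalg.norm(B, axis=0, keepdims=True)
    C = An.T @ Bn  # C[j, k] = cosine similarity between A_j and B_k
    return np.max(np.abs(C), axis=1)


def docs_similarity(X, Y):
    """DOCS-style index: Gumbel location parameters of the two max-cos-sim vectors, summed.

    The sum u_X + u_Y follows the extracted pseudocode; it makes the index
    symmetric in X and Y by construction.
    """
    s_X = max_cos_sim(X, Y)
    s_Y = max_cos_sim(Y, X)
    u_X, _ = gumbel_r.fit(s_X)  # fit returns (loc, scale); loc is the parameter u
    u_Y, _ = gumbel_r.fit(s_Y)
    return u_X + u_Y
```

Since each max-|cosine| value lies in (0, 1], the fitted location parameters, and hence the index, stay bounded, and swapping the arguments leaves the result unchanged.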
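To approximate the paper's adjacent-layer analysis, one would pull the same named weight matrix (e.g. MLP-UP) from every transformer layer and evaluate a similarity index on each pair of layers, yielding the heatmaps of Figures 2 and 3. The sketch below uses random stand-in matrices instead of a real checkpoint, and a simple mean-max-|cosine| index rather than the full DOCS (which adds the Gumbel fit on top); with a Llama-style model in Hugging Face transformers, the matrices would typically come from attribute paths like `model.model.layers[i].mlp.up_proj.weight`, though the exact path varies by architecture.

```python
import numpy as np


def mean_max_abs_cos(X, Y):
    """Symmetrized mean of max |cosine similarity| between columns (a DOCS-like stand-in)."""
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=0, keepdims=True)
    C = np.abs(Xn.T @ Yn)
    return 0.5 * (C.max(axis=1).mean() + C.max(axis=0).mean())


def layer_similarity_matrix(weights, sim_fn):
    """Pairwise similarity between per-layer weight matrices (the heatmap data)."""
    L = len(weights)
    S = np.zeros((L, L))
    for i in range(L):
        for j in range(L):
            S[i, j] = sim_fn(weights[i], weights[j])
    return S


# Random stand-ins for, e.g., the MLP-UP matrix of each layer of a 6-layer model;
# with a real checkpoint these would be read from the loaded model's state dict.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 32)) for _ in range(6)]
S = layer_similarity_matrix(layers, mean_max_abs_cos)
```

High values just off the main diagonal of `S` would correspond to the paper's observation that adjacent layers exhibit high weight similarity and form clusters.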