Wasserstein Distances, Neuronal Entanglement, and Sparsity
Authors: Shashata Sawmya, Linghao Kong, Ilia Markov, Dan Alistarh, Nir Shavit
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To analyze the phenomenon of neuronal superposition under sparsity in greater detail, we create an experimental framework, which we dub Sparse Expansion. It expands a model into a mixture of sparse experts by clustering input embeddings layer-wise. Based on this clustering, Sparse Expansion utilizes the input-aware nature of the SparseGPT (Frantar & Alistarh, 2023) pruning algorithm to specialize different sparse experts to different sets of inputs, starting from the same base weights. Through Sparse Expansion, we are able to analyze the entangled neurons in much more detail, since now different subgroups of the inputs are being computed with different edges (Figure 1f, A8f). We find that as a neuron loses edges, its output distribution tends to shift toward a Gaussian distribution (Figure A9). However, through Sparse Expansion, the original output distribution can be better preserved under sparse computation (Figure 1e, A8e). We relate our findings to recent theoretical work on the bounds of neural computation under superposition (Hänni et al., 2024; Adler & Shavit, 2024). |
| Researcher Affiliation | Collaboration | Shashata Sawmya¹, Linghao Kong¹, Ilia Markov², Dan Alistarh²,³,⁴, & Nir Shavit¹,³,⁴ — ¹MIT, ²IST Austria, ³Neural Magic, ⁴Red Hat. EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm A1 describes the sparsification process of Sparse Expansion. The sparse experts are created in a layer-wise sequential fashion for each linear layer of every FFN transformer block to create the sparse model. Algorithm A2 refers to the inference procedure of Sparse Expansion once the model is pruned following the methods described in Algorithm A1 and Section 3.1. |
| Open Source Code | Yes | Code available at https://github.com/Shavit-Lab/Sparse-Expansion. |
| Open Datasets | Yes | For reading comprehension, we use the 1-shot variant of the SQuAD 2.0 dataset (Rajpurkar et al., 2018). To assess knowledge reasoning and mathematical capabilities, we evaluate the model on the 5-shot TriviaQA-Wiki (Joshi et al., 2017) and 5-shot GSM8K (Cobbe et al., 2021) datasets, respectively. Finally, to evaluate general reasoning, we test the model on two benchmarks: an easy task, 5-shot MMLU (Hendrycks et al., 2020), and a more challenging task, 3-shot Chain-of-Thought (CoT) BIG-Bench Hard (BBH) (Suzgun et al., 2022). |
| Dataset Splits | No | The paper states that it uses "a subset of the Wikitext-2 train dataset as calibration data for input-aware pruning" and evaluates "using the corresponding test set through the perplexity metric", and it specifies N-shot settings for the evaluation benchmarks. However, it does not give specific percentages, sample counts, or explicit references to predefined splits, which would be needed to reproduce the data partitioning. |
| Hardware Specification | Yes | We have run the layer-wise benchmarks for the typical layer sizes from Llama models on a single RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions several software components like "SciPy", "RAPIDS library", "PyTorch", "Sparse Marlin", and the "SparseGPT GitHub repository", but consistently omits specific version numbers for these, which are necessary for reproducible dependency management. |
| Experiment Setup | Yes | For our performance benchmarks, we use 16 clusters at each level of routing in Sparse Expansion. We evaluate the performance of Sparse Expansion against other one-shot pruning techniques across a range of model sizes in Pythia and sparsities in Llama-2-7B (Figure 9). Across all model sizes of Pythia, Sparse Expansion outperforms all other pruning techniques at 50% unstructured sparsity, approaching dense performance as model size increases. Moreover, for Llama-2-7B, across all levels of sparsity, Sparse Expansion outperforms all other techniques. |
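To make the Sparse Expansion idea quoted above concrete, here is a minimal sketch of the two-phase scheme it describes: cluster a layer's input embeddings, specialize one sparse expert per cluster, and route each input at inference to the expert of its nearest centroid. This is an illustration under stated simplifications, not the authors' implementation: plain magnitude pruning stands in for input-aware SparseGPT pruning, the k-means loop is hand-rolled, and all function names (`sparse_expansion_layer`, `route_and_apply`) are hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Hand-rolled k-means over calibration inputs (stand-in for the
    layer-wise clustering step of Sparse Expansion)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        # Assign each input to its nearest centroid.
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(0)
    return centroids, labels

def magnitude_prune(W, sparsity):
    """Zero the smallest-magnitude fraction of weights. This is a crude
    stand-in for SparseGPT's input-aware pruning."""
    k = int(W.size * sparsity)
    if k == 0:
        return W.copy()
    thresh = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    return np.where(np.abs(W) > thresh, W, 0.0)

def sparse_expansion_layer(W, calib_X, n_experts=4, sparsity=0.5):
    """Expand one linear layer into n_experts sparse experts, one per
    input cluster, all starting from the same base weights W."""
    centroids, labels = kmeans(calib_X, n_experts)
    # In the paper, each expert is pruned with its own cluster's inputs;
    # magnitude pruning ignores the inputs, so the experts here coincide.
    experts = [magnitude_prune(W, sparsity) for _ in range(n_experts)]
    return centroids, experts

def route_and_apply(x, centroids, experts):
    """Inference: route x to the expert of the nearest centroid."""
    j = int(np.argmin(((centroids - x) ** 2).sum(-1)))
    return experts[j] @ x

# Toy demonstration on random data.
rng = np.random.default_rng(1)
W = rng.standard_normal((8, 16))        # base layer weights
calib = rng.standard_normal((64, 16))   # calibration inputs
centroids, experts = sparse_expansion_layer(W, calib, n_experts=4, sparsity=0.5)
y = route_and_apply(calib[0], centroids, experts)
```

The performance row above mentions 16 clusters per routing level; the toy uses 4 only to keep the demonstration small. Swapping `magnitude_prune` for a per-cluster input-aware pruner is exactly where the experts would diverge and specialize.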