Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
Authors: Gouki Minegishi, Hiroki Furuta, Yusuke Iwasawa, Yutaka Matsuo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we propose a suite of evaluations for SAEs to analyze the quality of monosemantic features by focusing on polysemous words. Our findings reveal that SAEs developed to improve the MSE-L0 Pareto frontier may confuse interpretability, which does not necessarily enhance the extraction of monosemantic features. The analysis of SAEs with polysemous words can also figure out the internal mechanism of LLMs; deeper layers and the Attention module contribute to distinguishing polysemy in a word. |
| Researcher Affiliation | Academia | Gouki Minegishi, Hiroki Furuta, Yusuke Iwasawa, Yutaka Matsuo — The University of Tokyo |
| Pseudocode | No | The paper includes mathematical formulations like equations (1) to (4) but does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps. |
| Open Source Code | Yes | Code: https://github.com/gouki510/PS-Eval |
| Open Datasets | Yes | Dataset: https://huggingface.co/datasets/gouki510/Wic_data_for_SAE-Eval |
| Dataset Splits | No | The paper states that the PS-Eval dataset consists of "1112 (label 0: 556, label 1: 556)" samples for evaluation (Table 1) and describes the WiC and Red Pajama datasets for training data. However, it does not provide specific training/test/validation splits (e.g., percentages or exact counts) for their own SAE training experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions "GPT2-small" as the base LLM used, but does not list any specific software dependencies (e.g., libraries, frameworks) with version numbers. |
| Experiment Setup | Yes | Following prior work (Templeton et al., 2024), we use an expand ratio of R = 32 and a sparsity regularization factor of λ = 0.05 by default for training SAE. The base LLM used as activations for the SAE is GPT-2 small (Radford et al., 2019). Unless specified otherwise, activations are extracted from the 4th layer. ... (Table 4) Batch Size 8192, Total Training Steps 200,000, Learning Rate 2e-4, Context Size 256. |
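The setup row above fully specifies the SAE training objective: an expand ratio of R = 32 over GPT-2 small's 768-dimensional activations, with an L1 sparsity penalty weighted by λ = 0.05 added to the MSE reconstruction loss. A minimal numpy sketch of that objective follows; it is an illustration of the stated hyperparameters, not the authors' released code, and the weight initialization and batch here are placeholder assumptions.

```python
import numpy as np

# Hedged sketch of the SAE objective implied by the setup row (not the paper's code).
D_MODEL = 768                     # GPT-2 small hidden size
EXPAND_RATIO = 32                 # R = 32 (from the paper's setup)
LAMBDA_SPARSE = 0.05              # λ = 0.05 (from the paper's setup)
D_HIDDEN = D_MODEL * EXPAND_RATIO

rng = np.random.default_rng(0)
# Placeholder initialization; the paper does not specify an init scheme.
W_enc = rng.normal(0.0, 0.02, (D_MODEL, D_HIDDEN))
b_enc = np.zeros(D_HIDDEN)
W_dec = rng.normal(0.0, 0.02, (D_HIDDEN, D_MODEL))
b_dec = np.zeros(D_MODEL)

def sae_loss(x):
    """MSE reconstruction loss plus λ-weighted L1 penalty on feature activations."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)        # ReLU feature activations
    x_hat = f @ W_dec + b_dec                     # reconstruction
    mse = np.mean((x - x_hat) ** 2)
    l1 = np.mean(np.sum(np.abs(f), axis=-1))      # per-sample L1, averaged
    return mse + LAMBDA_SPARSE * l1, f

# Stand-in batch for layer-4 residual-stream activations.
x = rng.normal(0.0, 1.0, (8, D_MODEL))
loss, features = sae_loss(x)
print(loss, features.shape)
```

The L1 term drives most feature activations to zero, which is what the paper's MSE-L0 trade-off discussion refers to: lowering λ improves reconstruction (MSE) at the cost of more active features (higher L0).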