Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness
Authors: Qi Zhang, Yifei Wang, Jingyi Cui, Xiang Pan, Qi Lei, Stefanie Jegelka, Yisen Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across multiple robust learning scenarios, including input and label noise, few-shot learning, and out-of-domain generalization, our results show that models leveraging monosemantic features significantly outperform those relying on polysemantic features. Furthermore, we provide empirical and theoretical understandings on the robustness gains of feature monosemanticity. |
| Researcher Affiliation | Academia | 1 State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University 2 MIT CSAIL 3 New York University 4 TUM CIT, MCML, MDSI 5 MIT EECS, CSAIL 6 Institute for Artificial Intelligence, Peking University |
| Pseudocode | No | The paper describes methods using mathematical formulas (Eq. 1, Eq. 2) and provides code listings for prompts (Listing 1, Listing 2), but does not contain any structured pseudocode or algorithm blocks describing a procedure. |
| Open Source Code | Yes | Code is available at https://github.com/PKU-ML/Monosemanticity-Robustness. |
| Open Datasets | Yes | We pretrain a ResNet-18 (He et al., 2016) backbone with the widely-used contrastive framework SimCLR (Chen et al., 2020) on CIFAR-100 and ImageNet-100. ... further finetune it with SST2 (Socher et al., 2013) as the review sentiment classification task and Dolly (Conover et al., 2023) datasets as the dialogue generation task. ... evaluate the alignment of model responses based on the response on Beavertails datasets (Ji et al., 2024). |
| Dataset Splits | Yes | For few-shot finetuning, we randomly draw 10%, 20%, 50%, and 100% of the training samples, respectively, from the original ImageNet-100 training set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions various models and frameworks (e.g., ResNet-18, SimCLR, Llama-2-7B-Chat) but does not provide specific version numbers for any ancillary software dependencies or libraries. |
| Experiment Setup | Yes | We pretrain the model for 200 epochs. The projector is a two-layer MLP with a hidden dimension 16384 and an output dimension 2048. We train the models with batch size 256 and weight decay 0.0001. ... We finetune the Llama-2-7B-Chat model on SST2 for 20 epochs with batch size 16 and learning rate 1e-4. We use LoRA with rank r = 8, scaling factor α = 4, and dropout rate 0.1 as default. |
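The Experiment Setup row reports LoRA finetuning with rank r = 8 and scaling factor α = 4. As a minimal sketch of what those two hyperparameters control, the standard LoRA parameterization updates a frozen weight matrix as W' = W + (α/r)·B·A. The sketch below uses pure Python with illustrative 16×16 dimensions; the matrix sizes and initialization scale are assumptions for illustration, not the paper's actual layer shapes.

```python
# Minimal LoRA update sketch: W' = W + (alpha/r) * B @ A.
# r=8 and alpha=4 follow the defaults reported in the table above;
# the 16x16 weight size is illustrative only.
import random


def matmul(X, Y):
    """Naive matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def lora_update(W, A, B, r=8, alpha=4):
    """Apply the low-rank adapter update: W + (alpha/r) * (B @ A)."""
    scale = alpha / r
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


random.seed(0)
d_out, d_in, r = 16, 16, 8
W = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # B starts at zero, as in standard LoRA

W_new = lora_update(W, A, B)
# With zero-initialized B, the adapted weights equal the original weights,
# so finetuning starts exactly from the pretrained model.
```

The zero initialization of B is the usual LoRA design choice: the adapter contributes nothing at step 0, and only the rank-r factors A and B (here 2·16·8 values instead of 16·16) are trained.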