Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness
Authors: Qi Zhang, Yifei Wang, Jingyi Cui, Xiang Pan, Qi Lei, Stefanie Jegelka, Yisen Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across multiple robust learning scenarios, including input and label noise, few-shot learning, and out-of-domain generalization, our results show that models leveraging monosemantic features significantly outperform those relying on polysemantic features. Furthermore, we provide empirical and theoretical understandings on the robustness gains of feature monosemanticity. |
| Researcher Affiliation | Academia | 1 State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University 2 MIT CSAIL 3 New York University 4 TUM CIT, MCML, MDSI 5 MIT EECS, CSAIL 6 Institute for Artificial Intelligence, Peking University |
| Pseudocode | No | The paper describes methods using mathematical formulas (Eq. 1, Eq. 2) and provides code listings for prompts (Listing 1, Listing 2), but does not contain any structured pseudocode or algorithm blocks describing a procedure. |
| Open Source Code | Yes | Code is available at https://github.com/PKU-ML/Monosemanticity-Robustness. |
| Open Datasets | Yes | We pretrain a ResNet-18 (He et al., 2016) backbone with the widely-used contrastive framework SimCLR (Chen et al., 2020) on CIFAR-100 and ImageNet-100. ... further finetune it with SST2 (Socher et al., 2013) as the review sentiment classification task and Dolly (Conover et al., 2023) datasets as the dialogue generation task. ... evaluate the alignment of model responses based on the response on Beavertails datasets (Ji et al., 2024). |
| Dataset Splits | Yes | For few-shot finetuning, we randomly draw 10%, 20%, 50%, and 100% of the training samples, respectively, from the original ImageNet-100 training set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions various models and frameworks (e.g., ResNet-18, SimCLR, Llama-2-7B-Chat) but does not provide specific version numbers for any ancillary software dependencies or libraries. |
| Experiment Setup | Yes | We pretrain the model for 200 epochs. The projector is a two-layer MLP with a hidden dimension 16384 and an output dimension 2048. We train the models with batch size 256 and weight decay 0.0001. ... We finetune the Llama-2-7B-Chat model on SST2 for 20 epochs with batch size 16 and learning rate 1e-4. We use LoRA with rank r = 8, scaling factor α = 4, and dropout rate 0.1 as default. |
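The Experiment Setup row reports LoRA finetuning with rank r = 8 and scaling factor α = 4. As a minimal sketch of what those two hyperparameters control, the standard LoRA parameterization updates a frozen weight matrix as W' = W + (α/r)·B·A. The sketch below uses pure Python with illustrative 16×16 dimensions; the matrix sizes and initialization scale are assumptions for illustration, not the paper's actual layer shapes.

```python
# Minimal LoRA update sketch: W' = W + (alpha/r) * B @ A.
# r=8 and alpha=4 follow the defaults reported in the table above;
# the 16x16 weight size is illustrative only.
import random


def matmul(X, Y):
    """Naive matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def lora_update(W, A, B, r=8, alpha=4):
    """Apply the low-rank adapter update: W + (alpha/r) * (B @ A)."""
    scale = alpha / r
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


random.seed(0)
d_out, d_in, r = 16, 16, 8
W = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # B starts at zero, as in standard LoRA

W_new = lora_update(W, A, B)
# With zero-initialized B, the adapted weights equal the original weights,
# so finetuning starts exactly from the pretrained model.
```

The zero initialization of B is the usual LoRA design choice: the adapter contributes nothing at step 0, and only the rank-r factors A and B (here 2·16·8 values instead of 16·16) are trained.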