SAE-V: Interpreting Multimodal Models for Enhanced Alignment

Authors: Hantao Lou, Changye Li, Jiaming Ji, Yaodong Yang

ICML 2025

Reproducibility

Variable Result LLM Response
Research Type Experimental In this work, we developed SAE-V, a mechanistic interpretability framework that extends the SAE paradigm to multimodal large language models (MLLMs)... Experiments demonstrate that our filtering tool achieves more than 110% performance relative to the full dataset while using 50% less data, underscoring the efficiency and effectiveness of SAE-V.
Researcher Affiliation Academia Institute for AI, Peking University, Beijing, China; State Key Laboratory of General Artificial Intelligence, Institute for AI, Peking University, Beijing, China. Correspondence to: Hantao Lou <EMAIL>, Yaodong Yang <EMAIL>.
Pseudocode Yes Algorithm 1 Cosine similarity score ranking... Algorithm 2 L0-based ranking... Algorithm 3 Co-occurring L0-based ranking... Algorithm 4 L0 patch filter... Algorithm 5 L1 patch filter... Algorithm 6 Co-occurring L0 patch filter... Algorithm 7 Cosine similarity score patch filter...
Open Source Code Yes Our codebase and model are released on GitHub and Hugging Face. The source code and checkpoints of SAE-V mentioned in this paper will be released under the CC BY-NC 4.0 license.
Open Datasets Yes For text-only and multimodal settings, we selected the Pile (Gao et al., 2020) and Obelics (Laurençon et al., 2023) datasets, respectively... ImageNet dataset (Russakovsky et al., 2015)... Align-Anything (Ji et al., 2024) text-image-to-text dataset... RLAIF-V (Yu et al., 2024) and MMInstruct (Liu et al., 2024b) datasets...
Dataset Splits Yes Specifically, we sampled 100K data points from each dataset as the train set and 10K as the test set... The filtered datasets were then used to fine-tune MLLMs... Table 6, hyperparameters of SFT and DPO training: val size 0.1 (for both SFT and DPO).
Hardware Specification Yes All SAE and SAE-V training is performed on 8 A800 GPUs and each training typically takes around 21 hours.
Software Dependencies No The paper does not explicitly list specific software components with their version numbers (e.g., Python, PyTorch versions) used in the experiments.
Experiment Setup Yes Table 4 (hyperparameters for training SAE and SAE-V models). Training parameters: total training steps 30000; batch size 4096; LR 5e-5; LR warmup steps 1500; LR decay steps 6000; adam beta1 0.9; adam beta2 0.999; LR scheduler constant; LR coefficient 5; seed 42; dtype float32; buffer batches num 32; store batch size prompts 4; feature sampling window 1000; dead feature window 1000; dead feature threshold 1e-4. SAE and SAE-V parameters: hook layer 16; input dimension 4096; expansion factor 16; feature number 65536; context size 4096. Table 6 (hyperparameters of SFT and DPO training): max length 4096; per-device train batch size 8; per-device eval batch size 8; gradient accumulation steps 4; LR scheduler cosine; LR 1e-6; warmup steps 10; eval steps 50; epochs 3; val size 0.1; bf16 True.
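The ranking algorithms in the Pseudocode row are only listed by name here. As a rough illustration of the first one, a cosine-similarity-score ranking for data filtering might look like the sketch below: score each sample by the cosine similarity between two SAE activation vectors, then keep the top fraction. This is our own minimal reconstruction, not the authors' released code; the function names, the mean-pooled activation inputs, and the 50% keep ratio are all assumptions.

```python
# Hedged sketch (not the paper's implementation) of a cosine-similarity-score
# ranking: rank samples by the cosine similarity of two SAE activation
# vectors per sample (e.g., image-side vs. text-side), keep the top fraction.
import numpy as np


def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two SAE activation vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def rank_by_cosine(acts_a: np.ndarray, acts_b: np.ndarray, keep: float = 0.5):
    """Return indices of the top `keep` fraction of samples by cosine score.

    acts_a, acts_b: arrays of shape (n_samples, n_features), one SAE
    activation vector per sample from each of the two views.
    """
    scores = np.array([cosine_score(a, b) for a, b in zip(acts_a, acts_b)])
    n_keep = max(1, int(len(scores) * keep))
    # argsort is ascending; reverse for highest-scoring samples first.
    return np.argsort(scores)[::-1][:n_keep]


# Toy usage: 6 samples with 8 SAE features each; keep the best-aligned half.
rng = np.random.default_rng(42)
acts_img = rng.random((6, 8))
acts_txt = rng.random((6, 8))
top = rank_by_cosine(acts_img, acts_txt, keep=0.5)
print(top)  # indices of the 3 highest-scoring samples
```

Filtering to the top half of samples by such a score matches the paper's reported setting of using 50% less data; the other listed algorithms (L0-based, co-occurring L0, patch filters) would replace only the scoring function while keeping the same rank-and-truncate skeleton.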