SAE-V: Interpreting Multimodal Models for Enhanced Alignment

Authors: Hantao Lou, Changye Li, Jiaming Ji, Yaodong Yang

ICML 2025

Reproducibility

Variable Result LLM Response
Research Type Experimental In this work, we developed SAE-V, a mechanistic interpretability framework that extends the SAE paradigm to multimodal large language models (MLLMs)... Experiments demonstrate that our filtering tool achieves more than 110% performance relative to the full dataset while using 50% less data, underscoring the efficiency and effectiveness of SAE-V.
Researcher Affiliation Academia Institute for AI, Peking University, Beijing, China; State Key Laboratory of General Artificial Intelligence, Institute for AI, Peking University, Beijing, China. Correspondence to: Hantao Lou <EMAIL>, Yaodong Yang <EMAIL>.
Pseudocode Yes Algorithm 1 Cosine similarity score ranking... Algorithm 2 L0-based ranking... Algorithm 3 Co-occurring L0-based ranking... Algorithm 4 L0 patch filter... Algorithm 5 L1 patch filter... Algorithm 6 Co-occurring L0 patch filter... Algorithm 7 Cosine similarity score patch filter...
Open Source Code Yes Our codebase and model are released on GitHub and Hugging Face. The source code and checkpoints of SAE-V mentioned in this paper will be released under the CC BY-NC 4.0 license.
Open Datasets Yes For text-only and multimodal settings, we selected the Pile (Gao et al., 2020) and Obelics (Laurençon et al., 2023) datasets, respectively... ImageNet dataset (Russakovsky et al., 2015)... Align-Anything (Ji et al., 2024) text-image-to-text dataset... RLAIF-V (Yu et al., 2024) and MMInstruct (Liu et al., 2024b) datasets...
Dataset Splits Yes Specifically, we sampled 100K data points from each dataset as the train set and 10K as the test set... The filtered datasets were then used to fine-tune MLLMs... Table 6, hyperparameters of SFT and DPO training: val size 0.1 (for both SFT and DPO).
Hardware Specification Yes All SAE and SAE-V training is performed on 8 A800 GPUs and each training typically takes around 21 hours.
Software Dependencies No The paper does not explicitly list specific software components with their version numbers (e.g., Python, PyTorch versions) used in the experiments.
Experiment Setup Yes Table 4 (hyperparameters for training SAE and SAE-V models). Training parameters: total training steps 30000; batch size 4096; LR 5e-5; LR warmup steps 1500; LR decay steps 6000; adam beta1 0.9; adam beta2 0.999; LR scheduler constant; LR coefficient 5; seed 42; dtype float32; buffer batches num 32; store batch size prompts 4; feature sampling window 1000; dead feature window 1000; dead feature threshold 1e-4. SAE and SAE-V parameters: hook layer 16; input dimension 4096; expansion factor 16; feature number 65536; context size 4096. Table 6 (hyperparameters of SFT and DPO training): max length 4096; per-device train batch size 8; per-device eval batch size 8; gradient accumulation steps 4; LR scheduler cosine; LR 1e-6; warmup steps 10; eval steps 50; epochs 3; val size 0.1; bf16 True.
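The ranking algorithms in the Pseudocode row are only listed by name here. As a rough illustration of the first one, a cosine-similarity-score ranking for data filtering might look like the sketch below: score each sample by the cosine similarity between two SAE activation vectors, then keep the top fraction. This is our own minimal reconstruction, not the authors' released code; the function names, the mean-pooled activation inputs, and the 50% keep ratio are all assumptions.

```python
# Hedged sketch (not the paper's implementation) of a cosine-similarity-score
# ranking: rank samples by the cosine similarity of two SAE activation
# vectors per sample (e.g., image-side vs. text-side), keep the top fraction.
import numpy as np


def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two SAE activation vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def rank_by_cosine(acts_a: np.ndarray, acts_b: np.ndarray, keep: float = 0.5):
    """Return indices of the top `keep` fraction of samples by cosine score.

    acts_a, acts_b: arrays of shape (n_samples, n_features), one SAE
    activation vector per sample from each of the two views.
    """
    scores = np.array([cosine_score(a, b) for a, b in zip(acts_a, acts_b)])
    n_keep = max(1, int(len(scores) * keep))
    # argsort is ascending; reverse for highest-scoring samples first.
    return np.argsort(scores)[::-1][:n_keep]


# Toy usage: 6 samples with 8 SAE features each; keep the best-aligned half.
rng = np.random.default_rng(42)
acts_img = rng.random((6, 8))
acts_txt = rng.random((6, 8))
top = rank_by_cosine(acts_img, acts_txt, keep=0.5)
print(top)  # indices of the 3 highest-scoring samples
```

Filtering to the top half of samples by such a score matches the paper's reported setting of using 50% less data; the other listed algorithms (L0-based, co-occurring L0, patch filters) would replace only the scoring function while keeping the same rank-and-truncate skeleton.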