Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment
Authors: Harrish Thasarathan, Julian Forsyth, Thomas Fel, Matthew Kowal, Konstantinos G. Derpanis
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through qualitative and quantitative evaluations, we show that the resulting concept space captures interpretable features shared across all models. ... This section is split into six parts. We first provide experimental implementation details. Then, we qualitatively analyze universal concepts discovered by USAEs (Sec. 4.1). Next, we provide a quantitative analysis of USAEs through the validation of activation reconstruction (Sec. 4.2), measuring the universality and importance of concepts (Secs. 4.3), and investigating the consistency between concepts in USAEs and individually trained SAE counterparts (Sec. 4.4). Finally, we provide a finer-grained analysis via the application of USAEs to coordinated activation maximization (Sec. 4.5). |
| Researcher Affiliation | Collaboration | 1York University, Toronto, Canada 2Vector Institute, Toronto, Canada 3Kempner Institute, Harvard University, Boston, USA 4FAR.AI 5Trajectory Labs, Toronto 6University of Toronto, Toronto, Canada 7Samsung AI Centre, Toronto. |
| Pseudocode | Yes | def train_usae(Ψ_θ, D, A, T, Optimizers): M = len(Ψ_θ); for t in range(T): i = random(M); Z = Ψ_θ^(i)(A^(i)); L = 0.0; for j in range(M): Â^(j) = Z @ D^(j); L += (A^(j) − Â^(j)).norm(p='fro'); L.backward(); Optimizers[i].step(); return Ψ_θ, D. (Figure 3. Training Universal Sparse Autoencoder.) |
| Open Source Code | Yes | Code: github.com/YorkUCVIL/UniversalSAE. |
| Open Datasets | Yes | We train a USAE on the final layer activations of three popular vision models: DINOv2 (Oquab et al., 2023; Darcet et al., 2024), SigLIP (Zhai et al., 2023), and ViT (Dosovitskiy et al., 2020) (trained on ImageNet (Deng et al., 2009)). ... We use DTD (Cimpoi et al., 2014) and CelebA (Liu et al., 2015) as the validation dataset... |
| Dataset Splits | Yes | For all experiments, we train the USAE on the ImageNet training set, while the validation set is reserved for qualitative visualizations and quantitative evaluations. |
| Hardware Specification | Yes | We train all USAEs on a single Nvidia RTX 6000 GPU, with training completing in approximately three days (see Appendix A.1 for more implementation details). |
| Software Dependencies | No | The models were sourced from the timm library (Wightman, 2019). All SAE encoder-decoder pairs have independent Adam optimizers (Kingma & Ba, 2015). The encoder consists of a single linear layer followed by batch normalization (Ioffe & Szegedy, 2015). |
| Experiment Setup | Yes | For all experiments, we use a dictionary of size 6144. All SAE encoder-decoder pairs have independent Adam optimizers (Kingma & Ba, 2015), each with an initial learning rate of 3e-4, which decays to 1e-6 following a cosine schedule with linear warmup. To account for variations in activation scales caused by architectural differences, we standardize each model's activations using 1000 random samples from the training set. Since SigLIP does not incorporate a class token, we remove class tokens from DINOv2 and ViT to ensure consistency across models. Additionally, we interpolate the DINOv2 token count to match a patch size of 16×16 pixels, aligning it with SigLIP and ViT. |
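The training procedure in Figure 3 can be sketched as a toy PyTorch loop. This is a minimal illustration of the algorithm as quoted above, not the authors' implementation: the dimensions, initialization, ReLU code nonlinearity, and optimizer settings here are assumptions, and only model *i*'s optimizer steps per iteration, matching the pseudocode.

```python
import torch

torch.manual_seed(0)

def train_usae(encoders, decoders, activations, T, optimizers):
    """Toy version of the USAE loop in Figure 3: sample a model, encode its
    activations into the shared concept space, decode into every model's
    activation space, and sum the Frobenius reconstruction losses."""
    M = len(encoders)
    loss = torch.tensor(0.0)
    for _ in range(T):
        i = torch.randint(M, (1,)).item()            # sample a source model
        Z = torch.relu(encoders[i](activations[i]))  # shared nonneg. codes
        loss = torch.tensor(0.0)
        for j in range(M):                           # decode into all M spaces
            A_hat = Z @ decoders[j]
            loss = loss + torch.linalg.norm(activations[j] - A_hat, ord='fro')
        optimizers[i].zero_grad()
        loss.backward()
        optimizers[i].step()                         # update pair i, per Fig. 3
    return loss.item()

# Toy setup: M=2 "models" with activation dims 8 and 12, dictionary size k=16.
N, k, dims = 32, 16, [8, 12]
activations = [torch.randn(N, d) for d in dims]
encoders = [torch.nn.Linear(d, k) for d in dims]
decoders = [torch.nn.Parameter(0.01 * torch.randn(k, d)) for d in dims]
optimizers = [torch.optim.Adam(list(e.parameters()) + [w], lr=1e-2)
              for e, w in zip(encoders, decoders)]

loss_start = train_usae(encoders, decoders, activations, 1, optimizers)
loss_end = train_usae(encoders, decoders, activations, 200, optimizers)
```

Because the codes Z are shared across decoders, each update pushes every model's dictionary toward a common concept space, which is the mechanism the paper relies on for cross-model alignment.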
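The preprocessing and schedule described in the setup row can be made concrete with a short sketch. The function names (`fit_standardizer`, `lr_at`) and the exact warmup formula are illustrative assumptions; only the endpoints (3e-4 decaying to 1e-6, statistics from 1000 random training samples) come from the report.

```python
import math
import numpy as np

def fit_standardizer(acts, n=1000, seed=0):
    """Estimate per-dimension mean/std from n random training activations,
    used to put each model's activations on a comparable scale."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(acts), size=min(n, len(acts)), replace=False)
    sample = acts[idx]
    return sample.mean(axis=0), sample.std(axis=0) + 1e-6  # eps avoids /0

def lr_at(step, total_steps, warmup_steps, lr_max=3e-4, lr_min=1e-6):
    """Cosine decay from lr_max to lr_min after a linear warmup."""
    if step < warmup_steps:
        return lr_max * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Standardize toy "activations" for one model using a 1000-sample estimate.
acts = np.random.default_rng(1).normal(loc=3.0, scale=5.0, size=(5000, 8))
mu, sigma = fit_standardizer(acts)
acts_std = (acts - mu) / sigma
```

Standardizing per model before encoding keeps the shared reconstruction loss from being dominated by whichever backbone happens to produce the largest-magnitude activations.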