Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment

Authors: Harrish Thasarathan, Julian Forsyth, Thomas Fel, Matthew Kowal, Konstantinos G. Derpanis

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Through qualitative and quantitative evaluations, we show that the resulting concept space captures interpretable features shared across all models. ... This section is split into six parts. We first provide experimental implementation details. Then, we qualitatively analyze universal concepts discovered by USAEs (Sec. 4.1). Next, we provide a quantitative analysis of USAEs through the validation of activation reconstruction (Sec. 4.2), measuring the universality and importance of concepts (Sec. 4.3), and investigating the consistency between concepts in USAEs and individually trained SAE counterparts (Sec. 4.4). Finally, we provide a finer-grained analysis via the application of USAEs to coordinated activation maximization (Sec. 4.5).
Researcher Affiliation Collaboration York University, Toronto, Canada; Vector Institute, Toronto, Canada; Kempner Institute, Harvard University, Boston, USA; FAR.AI; Trajectory Labs, Toronto; University of Toronto, Toronto, Canada; Samsung AI Centre, Toronto.
Pseudocode Yes
    def train_usae(Ψθ, D, A, T, Optimizers):
        M = len(Ψθ)                          # number of models
        for t in range(T):
            i = random(M)                    # sample a source model
            Z = Ψθ[i](A[i])                  # encode model i's activations into shared codes
            L = 0.0
            for j in range(M):
                A_hat_j = Z @ D[j]           # decode into model j's activation space
                L += (A[j] - A_hat_j).norm(p='fro')
            L.backward()
            Optimizers[i].step()
        return Ψθ, D
Figure 3. Training Universal Sparse Autoencoder.
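The training step above can also be sketched numerically. The following NumPy forward pass is a minimal illustration of the universal reconstruction loss, not the authors' implementation: the linear encoder with a ReLU stand-in for the sparsity mechanism, the function name, and all shapes are assumptions made for the example.

```python
import numpy as np

def usae_loss(encoders, decoders, acts, i):
    """Universal reconstruction loss for one sampled model i (illustrative sketch).

    encoders[m]: (d_m, c) linear encoder for model m (hypothetical parameterization)
    decoders[m]: (c, d_m) concept dictionary for model m
    acts[m]:     (n, d_m) batch of activations from model m
    """
    # Encode model i's activations into the shared concept space;
    # ReLU stands in for the paper's sparsity mechanism (an assumption here).
    z = np.maximum(acts[i] @ encoders[i], 0.0)        # (n, c)
    # Decode the shared codes into every model's activation space
    # and accumulate Frobenius-norm reconstruction errors.
    loss = 0.0
    for m in range(len(decoders)):
        recon = z @ decoders[m]                       # (n, d_m)
        loss += np.linalg.norm(acts[m] - recon, ord="fro")
    return loss

rng = np.random.default_rng(0)
dims, c, n = [8, 12, 16], 32, 5                       # three toy models, toy dictionary
encoders = [rng.normal(size=(d, c)) for d in dims]
decoders = [rng.normal(size=(c, d)) for d in dims]
acts = [rng.normal(size=(n, d)) for d in dims]
loss = usae_loss(encoders, decoders, acts, i=0)
print(loss)
```

In the full algorithm this loss would be backpropagated and only model i's optimizer stepped, as in the pseudocode above.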
Open Source Code Yes Code: github.com/YorkUCVIL/UniversalSAE.
Open Datasets Yes We train a USAE on the final layer activations of three popular vision models: DINOv2 (Oquab et al., 2023; Darcet et al., 2024), SigLIP (Zhai et al., 2023), and ViT (Dosovitskiy et al., 2020) (trained on ImageNet (Deng et al., 2009)). ... We use DTD (Cimpoi et al., 2014) and CelebA (Liu et al., 2015) as the validation dataset...
Dataset Splits Yes For all experiments, we train the USAE on the ImageNet training set, while the validation set is reserved for qualitative visualizations and quantitative evaluations.
Hardware Specification Yes We train all USAEs on a single Nvidia RTX 6000 GPU, with training completing in approximately three days (see Appendix A.1 for more implementation details).
Software Dependencies No The models were sourced from the timm library (Wightman, 2019). All SAE encoder-decoder pairs have independent Adam optimizers (Kingma & Ba, 2015). The encoder consists of a single linear layer followed by batch normalization (Ioffe & Szegedy, 2015).
Experiment Setup Yes For all experiments, we use a dictionary of size 6144. All SAE encoder-decoder pairs have independent Adam optimizers (Kingma & Ba, 2015), each with an initial learning rate of 3e-4, which decays to 1e-6 following a cosine schedule with linear warmup. To account for variations in activation scales caused by architectural differences, we standardize each model's activations using 1000 random samples from the training set. Since SigLIP does not incorporate a class token, we remove class tokens from DINOv2 and ViT to ensure consistency across models. Additionally, we interpolate the DINOv2 token count to match a patch size of 16×16 pixels, aligning it with SigLIP and ViT.
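The quoted learning-rate schedule (initial rate 3e-4 decaying to 1e-6 on a cosine schedule with linear warmup) can be sketched as below. The warmup length and total step count are illustrative assumptions; the excerpt does not state them.

```python
import math

def lr_at(step, total_steps, warmup_steps=1000, lr_max=3e-4, lr_min=1e-6):
    """Cosine decay with linear warmup.

    warmup_steps and total_steps are assumed values, not from the paper.
    """
    if step < warmup_steps:
        # Linear warmup from ~0 up to lr_max.
        return lr_max * (step + 1) / warmup_steps
    # Cosine decay from lr_max down to lr_min over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

schedule = [lr_at(s, total_steps=100_000) for s in range(100_000)]
```

The schedule rises linearly to the peak rate during warmup, then follows a half-cosine down to the floor.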