Interpreting CLIP with Hierarchical Sparse Autoencoders

Authors: Vladimir Zaigrajew, Hubert Baniecki, Przemyslaw Biecek

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct extensive experiments to evaluate MSAE against ReLU and TopK SAEs. We compare the sparsity-fidelity trade-off (Section 4.2) at multiple granularity levels (Section 4.3). We follow with evaluating the semantic quality of learned representations beyond traditional distance metrics (Section 4.4), analyzing decoder orthogonality (Section 4.5), and examining the statistical properties of SAE activation magnitudes (Section 4.6). To verify that MSAE successfully learns hierarchical features, we conduct experiments on the progressive recovery task (Section 4.7).
Researcher Affiliation | Academia | ¹Warsaw University of Technology, Warsaw, Poland; ²University of Warsaw, Warsaw, Poland. Correspondence to: Vladimir Zaigrajew <EMAIL>.
Pseudocode | No | The paper describes the model architecture using mathematical equations (e.g., Equation 1) and prose, but does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | We make the codebase available at https://github.com/WolodjaZ/MSAE.
Open Datasets | Yes | All SAE models are trained on the CC3M (Sharma et al., 2018) training set with features (post-pooled) from the CLIP ViT-L/14 or ViT-B/16 model. Image modality is evaluated on the ImageNet-1k training set (Russakovsky et al., 2015), while text modality is evaluated on the CC3M validation set.
Dataset Splits | Yes | All SAE models are trained on the CC3M (Sharma et al., 2018) training set with features (post-pooled) from the CLIP ViT-L/14 or ViT-B/16 model. Image modality is evaluated on the ImageNet-1k training set (Russakovsky et al., 2015), while text modality is evaluated on the CC3M validation set.
Hardware Specification | Yes | All models were trained for 30 epochs on a single NVIDIA A100 GPU with batch size 4096, except for the model with an expansion rate of 32, which was trained for 20 epochs.
Software Dependencies | No | The paper mentions software components like the 'AdamW optimizer' and 'ReduceLROnPlateau scheduler' but does not provide specific version numbers for any key software dependencies or libraries.
Experiment Setup | Yes | All models were trained for 30 epochs on a single NVIDIA A100 GPU with batch size 4096, except for the model with an expansion rate of 32, which was trained for 20 epochs. For ViT-L/14, we explored parameters near RN50-optimal values to ensure cross-architecture consistency. With expansion factor 8 (768 → 6144), we explore: Learning rates per method: 1·10⁻⁵, 5·10⁻⁵, 1·10⁻⁴, 5·10⁻⁴, 1·10⁻³. ReLU L1 coefficients (λ): 1·10⁻⁴, 3·10⁻³, 1·10⁻³, 3·10⁻². TopK values: k ∈ {32, 64, 128, 256}. Matryoshka K-lists: {32...6144} and {64...6144}. α coefficients: uniform weighting (UW) {1,1,1,1,1,1,1} and reverse weighting (RW) {7,6,5,4,3,2,1}.
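The sweep reported in the experiment setup can be sketched as a simple grid enumeration. This is a hypothetical reconstruction, not code from the paper: all function and variable names are assumptions, and the Matryoshka variants are represented only by their two α weightings since the full K-lists are elided above.

```python
# Hypothetical sketch of the reported hyperparameter sweep; names are
# assumptions, not identifiers from the MSAE codebase.
from itertools import product

learning_rates = [1e-5, 5e-5, 1e-4, 5e-4, 1e-3]   # shared across methods

relu_l1_coeffs = [1e-4, 3e-3, 1e-3, 3e-2]          # ReLU SAE L1 penalties (lambda)
topk_values = [32, 64, 128, 256]                    # TopK SAE active units (k)
msae_alphas = {
    "UW": [1, 1, 1, 1, 1, 1, 1],                   # uniform weighting
    "RW": [7, 6, 5, 4, 3, 2, 1],                   # reverse weighting
}

def build_grid():
    """Enumerate (method, learning_rate, method_setting) run configurations."""
    runs = []
    runs += [("relu", lr, lam) for lr, lam in product(learning_rates, relu_l1_coeffs)]
    runs += [("topk", lr, k) for lr, k in product(learning_rates, topk_values)]
    runs += [("msae", lr, name) for lr, name in product(learning_rates, msae_alphas)]
    return runs

runs = build_grid()
# 5*4 (ReLU) + 5*4 (TopK) + 5*2 (MSAE weightings) = 50 configurations
```

Each tuple would then be dispatched to a trainer (30 epochs, batch size 4096, per the setup above); the grid itself is method-specific rather than a full cross-product, since λ, k, and α only apply to their respective SAE variants.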