Sparse Autoencoders for Hypothesis Generation
Authors: Rajiv Movva, Kenny Peng, Nikhil Garg, Jon Kleinberg, Emma Pierson
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate HYPOTHESAES against recent state-of-the-art methods. We consider three synthetic tasks, as well as three real-world tasks of practical interest: hypothesizing the relationship between headlines and engagement, speech text and political party, and review text and rating. ... HYPOTHESAES produces many more significant hypotheses than three baseline methods. On three real-world tasks, 45/60 hypotheses generated by our method are significant, compared to at most 24 for the baselines. ... In Table 3, we report the runtimes, LLM inference token counts, and costs (at current OpenAI API pricing) for all methods on CONGRESS. |
| Researcher Affiliation | Academia | ¹UC Berkeley, ²Cornell University, ³Cornell Tech. Correspondence to: Rajiv Movva <EMAIL>, Kenny Peng <EMAIL>. |
| Pseudocode | No | The paper describes the HYPOTHESAES method in three steps (Feature generation, Feature selection, Feature interpretation) and provides a visual diagram in Figure 1. However, it does not include any structured pseudocode or algorithm blocks with code-like formatting. |
| Open Source Code | Yes | Code is available on GitHub and can be installed via pip with `pip install hypothesaes`. A notebook to reproduce experimental results in the paper is available in the repository. |
| Open Datasets | Yes | Our synthetic evaluation is motivated by real-world settings in which there are multiple disjoint hypotheses we would like to discover. We therefore use two datasets from prior work on interpretable clustering (Pham et al., 2024; Zhong et al., 2024): WIKI and BILLS. ... HEADLINES (Matias et al., 2021): Which features of digital news headlines predict user engagement? ... YELP (Yelp, 2024): Which features of Yelp restaurant reviews predict users' 1-5 star ratings? ... CONGRESS (Gentzkow & Shapiro, 2010): Which features of U.S. congressional speeches predict party affiliation? ... All data used in the paper are available on Hugging Face. |
| Dataset Splits | Yes | For both datasets, we reserve 2,000 items for validation (i.e., SAE hyperparameter selection) and 2,000 heldout items to evaluate hypotheses; we use the remaining items for SAE training and feature selection. ... HEADLINES: our split sizes are 8.8K training, 1K validation, and 4.4K heldout. ... YELP: We use 200K reviews for training, 10K for validation, and 10K for heldout eval. ... CONGRESS: Our split sizes are 114K training, 16K validation, and 12K heldout. |
| Hardware Specification | Yes | Runtimes include training the SAEs on one NVIDIA A6000 GPU. |
| Software Dependencies | Yes | We use GPT-4o with temperature 0.7 to generate concepts and GPT-4o-mini with temperature 0 for concept annotation. ... Interpretation model: GPT-4o (version 2024-11-20). |
| Experiment Setup | Yes | Batch size: 512; learning rate: 5e-4; gradient clipping threshold: 1.0. Epochs: Up to 200, with early stopping after 5 epochs of validation loss not decreasing. ... Temperature: 0.7. Number of highly-activating examples: 10. Number of weakly-activating examples: 10. Maximum word count per example: 256 (examples longer than this are truncated). Number of candidate interpretations: 3 (we choose the highest-fidelity interpretation out of 3 candidates). Number of samples to evaluate fidelity: 200. |
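Although the paper itself contains no pseudocode (see the Pseudocode row above), its three-step method could be sketched as follows. This is a minimal illustrative skeleton: every function name and body here is a hypothetical stand-in (the SAE and LLM calls are stubbed out), not the actual `hypothesaes` API.

```python
# Illustrative skeleton of the HYPOTHESAES pipeline described in the paper:
# (1) feature generation via a sparse autoencoder, (2) feature selection,
# (3) feature interpretation via an LLM. All bodies are toy stubs.

def generate_features(embeddings):
    # Step 1: train a sparse autoencoder on text embeddings and return its
    # sparse feature activations. Stubbed as the identity mapping here.
    return embeddings

def select_features(features, labels, k=2):
    # Step 2: pick the k features most predictive of the target label.
    # Stubbed as ranking columns by absolute Pearson correlation.
    def corr(col):
        n = len(col)
        mx, my = sum(col) / n, sum(labels) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(col, labels))
        vx = sum((x - mx) ** 2 for x in col) ** 0.5
        vy = sum((y - my) ** 2 for y in labels) ** 0.5
        return abs(cov / (vx * vy)) if vx and vy else 0.0
    cols = list(zip(*features))
    ranked = sorted(range(len(cols)), key=lambda j: corr(cols[j]), reverse=True)
    return ranked[:k]

def interpret_features(feature_ids):
    # Step 3: ask an LLM to describe what distinguishes highly-activating
    # from weakly-activating examples for each selected feature.
    # Stubbed as placeholder hypothesis strings.
    return [f"natural-language hypothesis for feature {j}" for j in feature_ids]
```

A driver would chain the three steps: `interpret_features(select_features(generate_features(X), y))`, then test each resulting hypothesis for significance on held-out data.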
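The early-stopping rule quoted in the Experiment Setup row (up to 200 epochs, stop after 5 epochs without validation-loss improvement) can be made concrete with a short sketch. The constants mirror the reported values; the function name and loop structure are illustrative assumptions, not the library's implementation.

```python
# Reported SAE training hyperparameters (from the paper's setup description).
BATCH_SIZE = 512
LEARNING_RATE = 5e-4
GRAD_CLIP = 1.0
MAX_EPOCHS = 200
PATIENCE = 5

def train_with_early_stopping(val_losses_per_epoch):
    """Return the epoch (1-indexed) at which training stops, given the
    sequence of per-epoch validation losses."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses_per_epoch[:MAX_EPOCHS]):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        # Stop once the validation loss has failed to improve for
        # PATIENCE consecutive epochs.
        if epochs_without_improvement >= PATIENCE:
            return epoch + 1
    return min(len(val_losses_per_epoch), MAX_EPOCHS)
```

In a real training loop, each "epoch" would run minibatches of size 512 with learning rate 5e-4 and gradients clipped to norm 1.0 before computing the validation loss.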