Sparse Autoencoders for Hypothesis Generation
Authors: Rajiv Movva, Kenny Peng, Nikhil Garg, Jon Kleinberg, Emma Pierson
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate HYPOTHESAES against recent state-of-the-art methods. We consider three synthetic tasks, as well as three real-world tasks of practical interest: hypothesizing the relationship between headlines and engagement, speech text and political party, and review text and rating. ... HYPOTHESAES produces many more significant hypotheses than three baseline methods. On three real-world tasks, 45/60 hypotheses generated by our method are significant, compared to at most 24 for the baselines. ... In Table 3, we report the runtimes, LLM inference token counts, and costs (at current OpenAI API pricing) for all methods on CONGRESS. |
| Researcher Affiliation | Academia | ¹UC Berkeley, ²Cornell University, ³Cornell Tech. Correspondence to: Rajiv Movva <EMAIL>, Kenny Peng <EMAIL>. |
| Pseudocode | No | The paper describes the HYPOTHESAES method in three steps (Feature generation, Feature selection, Feature interpretation) and provides a visual diagram in Figure 1. However, it does not include any structured pseudocode or algorithm blocks with code-like formatting. |
| Open Source Code | Yes | Code is available on GitHub and can be installed via pip with `pip install hypothesaes`. A notebook to reproduce experimental results in the paper is available in the repository. |
| Open Datasets | Yes | Our synthetic evaluation is motivated by real-world settings in which there are multiple disjoint hypotheses we would like to discover. We therefore use two datasets from prior work on interpretable clustering (Pham et al., 2024; Zhong et al., 2024): WIKI and BILLS. ... HEADLINES (Matias et al., 2021): Which features of digital news headlines predict user engagement? ... YELP (Yelp, 2024): Which features of Yelp restaurant reviews predict users' 1-5 star ratings? ... CONGRESS (Gentzkow & Shapiro, 2010): Which features of U.S. congressional speeches predict party affiliation? ... All data used in the paper are available on Hugging Face. |
| Dataset Splits | Yes | For both datasets, we reserve 2,000 items for validation (i.e., SAE hyperparameter selection) and 2,000 heldout items to evaluate hypotheses; we use the remaining items for SAE training and feature selection. ... HEADLINES: our split sizes are 8.8K training, 1K validation, and 4.4K heldout. ... YELP: We use 200K reviews for training, 10K for validation, and 10K for heldout eval. ... CONGRESS: Our split sizes are 114K training, 16K validation, and 12K heldout. |
| Hardware Specification | Yes | Runtimes include training the SAEs on one NVIDIA A6000 GPU. |
| Software Dependencies | Yes | We use GPT-4o with temperature 0.7 to generate concepts and GPT-4o-mini with temperature 0 for concept annotation. ... Interpretation model: GPT-4o (version 2024-11-20). |
| Experiment Setup | Yes | Batch size: 512; learning rate: 5e-4; gradient clipping threshold: 1.0. Epochs: Up to 200, with early stopping after 5 epochs of validation loss not decreasing. ... Temperature: 0.7. Number of highly-activating examples: 10. Number of weakly-activating examples: 10. Maximum word count per example: 256 (examples longer than this are truncated). Number of candidate interpretations: 3 (we choose the highest-fidelity interpretation out of 3 candidates). Number of samples to evaluate fidelity: 200. |
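Although the paper itself contains no pseudocode (see the Pseudocode row above), its three-step method could be sketched as follows. This is a minimal illustrative skeleton: every function name and body here is a hypothetical stand-in (the SAE and LLM calls are stubbed out), not the actual `hypothesaes` API.

```python
# Illustrative skeleton of the HYPOTHESAES pipeline described in the paper:
# (1) feature generation via a sparse autoencoder, (2) feature selection,
# (3) feature interpretation via an LLM. All bodies are toy stubs.

def generate_features(embeddings):
    # Step 1: train a sparse autoencoder on text embeddings and return its
    # sparse feature activations. Stubbed as the identity mapping here.
    return embeddings

def select_features(features, labels, k=2):
    # Step 2: pick the k features most predictive of the target label.
    # Stubbed as ranking columns by absolute Pearson correlation.
    def corr(col):
        n = len(col)
        mx, my = sum(col) / n, sum(labels) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(col, labels))
        vx = sum((x - mx) ** 2 for x in col) ** 0.5
        vy = sum((y - my) ** 2 for y in labels) ** 0.5
        return abs(cov / (vx * vy)) if vx and vy else 0.0
    cols = list(zip(*features))
    ranked = sorted(range(len(cols)), key=lambda j: corr(cols[j]), reverse=True)
    return ranked[:k]

def interpret_features(feature_ids):
    # Step 3: ask an LLM to describe what distinguishes highly-activating
    # from weakly-activating examples for each selected feature.
    # Stubbed as placeholder hypothesis strings.
    return [f"natural-language hypothesis for feature {j}" for j in feature_ids]
```

A driver would chain the three steps: `interpret_features(select_features(generate_features(X), y))`, then test each resulting hypothesis for significance on held-out data.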
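The early-stopping rule quoted in the Experiment Setup row (up to 200 epochs, stop after 5 epochs without validation-loss improvement) can be made concrete with a short sketch. The constants mirror the reported values; the function name and loop structure are illustrative assumptions, not the library's implementation.

```python
# Reported SAE training hyperparameters (from the paper's setup description).
BATCH_SIZE = 512
LEARNING_RATE = 5e-4
GRAD_CLIP = 1.0
MAX_EPOCHS = 200
PATIENCE = 5

def train_with_early_stopping(val_losses_per_epoch):
    """Return the epoch (1-indexed) at which training stops, given the
    sequence of per-epoch validation losses."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses_per_epoch[:MAX_EPOCHS]):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        # Stop once the validation loss has failed to improve for
        # PATIENCE consecutive epochs.
        if epochs_without_improvement >= PATIENCE:
            return epoch + 1
    return min(len(val_losses_per_epoch), MAX_EPOCHS)
```

In a real training loop, each "epoch" would run minibatches of size 512 with learning rate 5e-4 and gradients clipped to norm 1.0 before computing the validation loss.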