Automated Hypothesis Validation with Agentic Sequential Falsifications

Authors: Kexin Huang, Ying Jin, Ryan Li, Michael Y. Li, Emmanuel Candes, Jure Leskovec

ICML 2025

Reproducibility assessment. Each entry below lists the variable, the assessed result, and the supporting LLM response.
Research Type: Experimental. We demonstrate POPPER on six domains including biology, economics, and sociology. POPPER delivers robust error control, high power, and scalability. Furthermore, compared to human scientists, POPPER achieved comparable performance in validating complex biological hypotheses while reducing time ten-fold, providing a scalable, rigorous solution for hypothesis validation. We instantiated POPPER across six diverse domains, including biology, sociology, and economics. In our implementation, POPPER designs falsification experiments by leveraging large-scale, hypothesis-free datasets and executes them in a Python code environment. Our results demonstrate that POPPER effectively controls the Type-I error rate while achieving significant power improvements over existing methods.
Researcher Affiliation: Academia. 1 Department of Computer Science, Stanford University; 2 Data Science Initiative & Department of Health Care Policy, Harvard University; 3 Department of Statistics, Stanford University; 4 Department of Mathematics, Stanford University.
Pseudocode: Yes. Table 2 (Experiment execution example) gives the execution steps for the experiment "Test if variants in the MAK16 locus region show overrepresentation of immune-trait GWAS associations"; the paper provides summarized pseudo-code for illustration purposes. Algorithm 1: Sequential Falsification with Hypothesis Agent.
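Algorithm 1 itself is not reproduced in this review, but the sequential-falsification idea can be sketched in a few lines. The sketch below assumes a standard p-to-e calibrator (e = κ·p^(κ−1)) and an e-value product test with Ville's inequality for the stopping rule; the paper's exact aggregation rule may differ.

```python
def p_to_e(p, kappa=0.5):
    """Calibrate a p-value into an e-value via e = kappa * p**(kappa - 1).
    Under the null, E[e] = 1, so e-values from independent tests can be
    multiplied and the product remains a valid e-value."""
    return kappa * p ** (kappa - 1)

def sequential_falsification(p_values, alpha=0.1, max_tests=5):
    """Run falsification tests sequentially, multiplying their e-values.
    Declare the hypothesis supported (reject the null) as soon as the
    running product reaches 1/alpha; by Ville's inequality this controls
    the Type-I error at level alpha."""
    e_product = 1.0
    for p in p_values[:max_tests]:
        e_product *= p_to_e(p)
        if e_product >= 1.0 / alpha:
            return True, e_product
    return False, e_product

rejected, evidence = sequential_falsification([0.01, 0.02, 0.2])
```

With α = 0.1 and these illustrative p-values, the running product (5.0, then about 17.7) crosses the 1/α = 10 threshold after the second test, so the procedure stops early.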
Open Source Code: Yes. POPPER is freely available at https://github.com/snap-stanford/POPPER.
Open Datasets: Yes. Our demonstration uses two collections. The first, Target Validation (TargetVal), addresses genotype-phenotype hypotheses in biology; it aggregates 22 tables (totaling 85 million records) from sources such as GTEx (Consortium, 2020), GWAS Catalog (MacArthur et al., 2017), and BioGRID (Oughtred et al., 2019). The second, DiscoveryBench (Majumder et al., 2024), spans six domains (sociology, biology, humanities, economics, engineering, and meta-science), yielding 86 non-null hypotheses (after deduplication) that are grounded in peer-reviewed research.
Dataset Splits: Yes. We assess Type-I error by creating negative examples through random column-wise permutations in each dataset, ensuring the null hypothesis holds. For DiscoveryBench, we generate as many negative examples as positive ones. For the target validation benchmark (with only 20 positives), we create 50 negatives.
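The permutation construction can be illustrated with a short pandas sketch. Permuting every column independently is an assumption made here for illustration; any independent column-wise shuffle preserves each column's marginal distribution while breaking cross-column associations, which is what enforces the null.

```python
import numpy as np
import pandas as pd

def make_negative_example(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Build a null (negative) dataset by independently permuting each
    column. Marginal distributions are preserved, but any association
    between columns -- the substance of a hypothesis -- is destroyed."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {col: rng.permutation(df[col].to_numpy()) for col in df.columns},
        index=df.index,
    )

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]})
null_df = make_negative_example(df)
```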
Hardware Specification: No. No specific hardware details (such as GPU models, CPU types, or cloud instance specifications) are mentioned in the paper. The paper discusses LLM backbones but does not describe the hardware environment for them or for the local execution environment.
Software Dependencies: No. Section 3 states, 'we provide a coding environment where it can write and run Python scripts using essential libraries including pandas, statsmodels, and scipy.' While these libraries are named, specific version numbers are not provided, which a reproducible description of ancillary software requires.
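As an illustration of the kind of falsification test such an environment supports (the data and column names below are synthetic stand-ins, not taken from the paper's datasets):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for a hypothesis-free dataset: does expression
# differ between variant carriers and non-carriers? (Columns are
# illustrative only.)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "carrier": rng.integers(0, 2, size=200),
    "expression": rng.normal(size=200),
})

# Welch's t-test between the two groups yields the p-value a
# falsification agent would feed into its sequential procedure.
groups = [g["expression"] for _, g in df.groupby("carrier")]
t_stat, p_value = stats.ttest_ind(groups[0], groups[1], equal_var=False)
```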
Experiment Setup: Yes. We set a nominal Type-I error level α = 0.1. Unless noted otherwise, we use Claude 3.5 Sonnet as our LLM, with a maximum of 3 tests on DiscoveryBench and 5 on target validation (due to the more complex hypotheses in the latter setting).