Automated Hypothesis Validation with Agentic Sequential Falsifications
Authors: Kexin Huang, Ying Jin, Ryan Li, Michael Y. Li, Emmanuel Candès, Jure Leskovec
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate POPPER on six diverse domains, including biology, sociology, and economics. POPPER delivers robust error control, high power, and scalability. Compared to human scientists, POPPER achieved comparable performance in validating complex biological hypotheses while reducing time by 10 folds, providing a scalable, rigorous solution for hypothesis validation. In our implementation, POPPER designs falsification experiments by leveraging large-scale, hypothesis-free datasets and executes them with a Python code environment. Our results demonstrate that POPPER effectively controls the Type-I error rate while achieving significant power improvements over existing methods. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Stanford University 2Data Science Initiative & Department of Health Care Policy, Harvard University 3Department of Statistics, Stanford University 4Department of Mathematics, Stanford University. |
| Pseudocode | Yes | Table 2: Experiment execution example. Execution steps for the experiment "Test if variants in the MAK16 locus region show overrepresentation of immune-trait GWAS associations." We provide a summarized pseudo-code here for illustration purposes. Algorithm 1: Sequential Falsification with Hypothesis Agent. |
| Open Source Code | Yes | POPPER is freely available at https://github.com/snap-stanford/POPPER. |
| Open Datasets | Yes | Our demonstration uses two collections. The first, Target Validation (TargetVal), addresses genotype-phenotype hypotheses in biology; it aggregates 22 tables (totaling 85 million records) from sources such as GTEx (Consortium, 2020), GWAS Catalog (MacArthur et al., 2017), and BioGRID (Oughtred et al., 2019). The second, DiscoveryBench (Majumder et al., 2024), spans six domains (sociology, biology, humanities, economics, engineering, and meta-science), yielding 86 non-null hypotheses (after deduplication) that are grounded in peer-reviewed research. |
| Dataset Splits | Yes | We assess Type-I error by creating negative examples through random column-wise permutations in each dataset, ensuring the null hypothesis holds. For DiscoveryBench, we generate as many negative examples as positive ones. For the target validation benchmark (with only 20 positives), we create 50 negatives. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or cloud instance specifications) are mentioned in the paper for running the experiments. The paper discusses using LLM backbones but does not detail the hardware environment for these or the local execution environment. |
| Software Dependencies | No | Section 3 states, 'we provide a coding environment where it can write and run Python scripts using essential libraries including pandas, statsmodels, and scipy.' While these libraries are mentioned, specific version numbers are not provided, which is required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | We set a nominal Type-I error level α = 0.1. Unless noted otherwise, we use Claude-Sonnet-3.5 as our LLM, with a maximum of 3 tests on DiscoveryBench and 5 on target validation (due to more complex hypotheses in the latter scenario). |
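The sequential-falsification loop summarized in Algorithm 1 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each falsification experiment yields a valid p-value, converts it to an e-value with the standard calibrator e(p) = 1/(2√p), multiplies e-values across tests, and rejects the null once the running product reaches 1/α (justified by Ville's inequality). The function name and the stopping rule details are our assumptions; the paper's aggregation scheme may differ.

```python
import math

def p_to_e(p: float) -> float:
    # Standard p-to-e calibrator: e(p) = 1 / (2 * sqrt(p)).
    # Any p-value calibrated this way is a valid e-value under the null.
    return 1.0 / (2.0 * math.sqrt(p))

def sequential_falsification(p_values, alpha=0.1, max_tests=5):
    """Illustrative sketch of sequential falsification (not the paper's code).

    Multiplies e-values from successive falsification experiments; by
    Ville's inequality, rejecting when the running product first exceeds
    1/alpha keeps the Type-I error at or below alpha at any stopping time.
    """
    e_running = 1.0
    for i, p in enumerate(p_values[:max_tests], start=1):
        e_running *= p_to_e(p)
        if e_running >= 1.0 / alpha:
            return ("reject H0", i, e_running)
    return ("fail to reject", min(len(p_values), max_tests), e_running)
```

With α = 0.1 (as in the paper's setup), a single very small p-value such as 0.001 already pushes the e-product past the 1/α = 10 threshold, while a run of uninformative p-values near 0.5 never rejects.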
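The negative-example construction described under "Dataset Splits" (random column-wise permutations that force the null to hold) can be sketched as follows. This is an assumed implementation for illustration: independently shuffling each column preserves every marginal distribution while destroying cross-column associations, so any hypothesis about a relationship between columns is null by construction. The helper name `make_null_dataset` is ours.

```python
import numpy as np
import pandas as pd

def make_null_dataset(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Illustrative sketch: build a negative (null) example from a dataset.

    Each column is permuted independently, so marginals are unchanged but
    any association between columns is broken, ensuring H0 holds.
    """
    rng = np.random.default_rng(seed)
    null_df = df.copy()
    for col in null_df.columns:
        null_df[col] = rng.permutation(null_df[col].to_numpy())
    return null_df
```

Because the shuffle is per-column, summary statistics of each variable (means, value counts) are identical to the original, which keeps the negative examples superficially realistic while guaranteeing the hypothesized effect is absent.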