Sanity Checking Causal Representation Learning on a Simple Real-World System

Authors: Juan L. Gamella, Simon Bing, Jakob Runge

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate methods for causal representation learning (CRL) on a simple, real-world system where these methods are expected to work. The system consists of a controlled optical experiment specifically built for this purpose, which satisfies the core assumptions of CRL and where the underlying causal factors (the inputs to the experiment) are known, providing a ground truth. We select methods representative of different approaches to CRL and find that they all fail to recover the underlying causal factors. To understand the failure modes of the evaluated algorithms, we perform an ablation on the data by substituting the real data-generating process with a simpler synthetic equivalent. The results reveal a reproducibility problem, as most methods already fail on this synthetic ablation despite its simple data-generating process."
Researcher Affiliation | Academia | "Seminar for Statistics, ETH Zurich; Technische Universität Berlin; Department of Computer Science, University of Potsdam; ScaDS.AI Dresden/Leipzig, TU Dresden."
Pseudocode | No | The paper describes methods and implementations through textual descriptions and equations, but does not contain a clearly labeled pseudocode or algorithm block, nor structured steps formatted like code.
Open Source Code | Yes | "The code to reproduce the results of this paper can be found at github.com/simonbing/CRLSanityCheck."
Open Datasets | Yes | "We make the novel datasets and their data-collection procedures publicly available in the lt_crl_benchmark_v1 dataset at github.com/juangamella/causal-chamber."
Dataset Splits | Yes | "The full dataset is then split into train, validation, and test subsets according to the ratios (80/10/10) while ensuring that each subset contains the same fraction of samples from each environment."
Hardware Specification | Yes | "All experiments were run on a high-performance cluster with NVIDIA A100 GPUs."
Software Dependencies | No | "All implementations use the PyTorch machine learning library (Paszke et al., 2019)." The library is named, but no version numbers or full dependency list are specified.
Experiment Setup | Yes | "We report the hyperparameters used during training in Table 2."
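The Dataset Splits row describes an 80/10/10 split in which each subset preserves the per-environment sample fractions, i.e., a split stratified by environment. A minimal sketch of such a procedure is shown below; the function name, signature, and seed are assumptions for illustration, not the authors' implementation.

```python
import random
from collections import defaultdict

def stratified_split(env_labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split sample indices into train/val/test subsets so that each
    subset contains the same fraction of samples from every environment.
    Hypothetical sketch; not the paper's actual code."""
    # Group sample indices by their environment label.
    by_env = defaultdict(list)
    for i, env in enumerate(env_labels):
        by_env[env].append(i)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    # Shuffle and split each environment's samples independently,
    # so the global ratios hold within every environment.
    for idx in by_env.values():
        rng.shuffle(idx)
        n_train = int(ratios[0] * len(idx))
        n_val = int(ratios[1] * len(idx))
        splits["train"].extend(idx[:n_train])
        splits["val"].extend(idx[n_train:n_train + n_val])
        splits["test"].extend(idx[n_train + n_val:])
    return splits
```

Splitting within each environment (rather than shuffling the pooled data) is what guarantees the stated property: every subset inherits the same environment proportions as the full dataset.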