Standardizing Structural Causal Models

Authors: Weronika Ormaniec, Scott Sussex, Lars Lorch, Bernhard Schölkopf, Andreas Krause

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we propose a simple modification of SCMs that stabilizes the data-generating process and thereby removes exploitable covariance artifacts. Our models, denoted internally-standardized SCMs (iSCMs), introduce a standardization operation at each variable during the generative process (Figure 1b). In Section 4, we provide a theoretical motivation for this idea by studying linear iSCMs. We prove that, contrary to SCMs, the causal dependencies of iSCMs under mild assumptions never collapse to deterministic mechanisms as the graph size becomes large. Moreover, we formalize the correlation artifact commonly observed in benchmarks by proving that linear SCM structures in a Markov equivalence class (MEC) are partially identifiable for certain graph classes, given weak prior knowledge of the weight distribution of the ground-truth SCM. Most importantly, we show that this is not the case for the corresponding iSCMs. In Section 5, we empirically demonstrate that the baselines proposed in Reisach et al. (2021; 2024) are unable to exploit covariance artifacts in iSCMs, while practical classes of causal discovery algorithms are still able to learn causal structures in both linear and nonlinear systems. Our findings reveal that SCM artifacts affect structure learning both positively and negatively, making iSCMs a practical tool, alongside SCMs, for disentangling the drivers of the causal discovery performance of different algorithms in practice.
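The standardization step described in the abstract, applied to each variable in turn during generation, can be illustrated with a minimal sketch for the linear additive-noise case. All names below are illustrative; the paper's Algorithm 1 is the authoritative procedure, and this sketch only assumes a topologically ordered weight matrix and Gaussian noise.

```python
import numpy as np

def sample_iscm(weights, n_samples, rng=None):
    """Sample from a linear internally-standardized SCM (iSCM) -- a sketch.

    Each variable is generated in topological order and immediately
    standardized to zero mean and unit variance across the batch, so no
    variable's marginal variance grows with graph depth (the covariance
    artifact the paper aims to remove).

    weights: (d, d) weight matrix with columns in topological order;
             weights[i, j] is the edge weight from variable i to j.
    """
    rng = np.random.default_rng(rng)
    d = weights.shape[0]
    x = np.zeros((n_samples, d))
    for j in range(d):
        noise = rng.normal(size=n_samples)
        raw = x @ weights[:, j] + noise
        # Internal standardization: the defining difference from a plain SCM,
        # where `raw` would be stored directly.
        x[:, j] = (raw - raw.mean()) / raw.std()
    return x
```

In a plain linear SCM the variance of downstream variables compounds along paths, which is what variance-sorting baselines exploit; here every column of the returned sample matrix has zero empirical mean and unit empirical variance by construction.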
Researcher Affiliation | Academia | Weronika Ormaniec, ETH Zürich, Switzerland (EMAIL); Scott Sussex, ETH Zürich, Switzerland (EMAIL); Lars Lorch, ETH Zürich, Switzerland (EMAIL); Bernhard Schölkopf, MPI for Intelligent Systems, Tübingen, Germany (EMAIL); Andreas Krause, ETH Zürich, Switzerland (EMAIL)
Pseudocode | Yes | Algorithm 1: Sampling from an iSCM; Algorithm 2: Computing the Implied Model Parameters of Linear iSCMs
Open Source Code | Yes | Our code is publicly available at: https://github.com/werkaaa/iscm.
Open Datasets | Yes | To facilitate reproducibility, we provide code, configuration files, and the commands used to obtain all the experimental results in this manuscript as supplementary material. They are also available at: https://github.com/werkaaa/iscm. In Appendix E, we describe the experimental setup, including the computational resources and wall time used to produce the results.
Dataset Splits | No | The paper describes generating synthetic data and evaluates performance across different numbers of systems and samples per system (e.g., 'For every model, we evaluate 100 systems and n = 1000 samples each.'). It also mentions 'held-out instances' for hyperparameter tuning. However, it does not provide specific training/validation/test splits of a single dataset, as data is generated afresh for each evaluation.
Hardware Specification | No | Our experiments were run on an internal cluster. All experiments in this work were computed using CPUs with 3GB of memory per CPU, with the exception of the AVICI runs on graphs with 100 vertices, which used 12GB per CPU.
Software Dependencies | No | The paper mentions using several software packages and libraries, such as 'NOTEARS (Zheng et al., 2018)', 'AVICI (Lorch et al., 2022)', the 'Causal Disco library', 'GOLEM (Ng et al., 2020)', the 'Causal Discovery Toolbox (Kalainathan et al., 2020)', the 'dodiscover library', and 'LINGAM (Shimizu et al., 2006)'. However, it does not specify version numbers for any of these software dependencies.
Experiment Setup | Yes | Before benchmarking NOTEARS, we run a hyperparameter search to calibrate the weight penalty (λ) and the threshold on held-out instances of each data-generation method. The hyperparameters can be found in Appendix E.4. Table 1 presents all final hyperparameter configurations for NOTEARS.