Invariant Structure Learning for Better Generalization and Causal Explainability

Authors: Yunhao Ge, Sercan O Arik, Jinsung Yoon, Ao Xu, Laurent Itti, Tomas Pfister

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type | Experimental | We demonstrate the effectiveness of ISL on various synthetic and real-world datasets. ISL yields state-of-the-art SCM discovery (clearly outperforming alternatives on real-world data), with a particularly prominent improvement for complex graph structures. In addition, ISL improves the test prediction accuracy throughout, with especially large improvements in cases with significant data drifts (up to 80% MSE reduction compared to alternatives). Section 4, Experiments: In this section, we evaluate the proposed ISL framework for causal explainability and better generalization. We conduct extensive experiments in two settings based on the availability of target labels: supervised learning tasks in Sec. 4.1 and self-supervised learning tasks in Sec. 4.2. Details and more results are provided in Appendix D. Baselines: For causal explainability, we choose NOTEARS-MLP (Zheng et al., 2020), GOLEM (Ng et al., 2020), and No Fear (Wei et al., 2020) as the baselines for learning the SCM, which is represented as a DAG. For target prediction, we choose a standard MLP and CASTLE (Kyono et al., 2020) as the baseline methods. Metrics: We evaluate the estimated Y-related DAG and whole DAG structure using Structural Hamming Distance (SHD), the number of missing, falsely detected, or reversed edges (the lower, the better). We evaluate the target (Y) prediction accuracy in Mean Squared Error (MSE). We compute the SHD and the errors multiple times and report the mean value.
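The SHD metric quoted above can be sketched in a few lines (a minimal illustration, not the authors' implementation; the convention that a reversed edge counts as a single error is an assumption):

```python
import numpy as np

def shd(est: np.ndarray, true: np.ndarray) -> int:
    """Structural Hamming Distance between binary DAG adjacency matrices:
    the number of missing, falsely detected, or reversed edges.
    A reversed edge is counted once (an assumed convention)."""
    diff = np.abs(est - true)
    # A reversed edge produces a disagreement at both (i, j) and (j, i);
    # symmetrize and count each unordered pair at most once.
    disagreement = (diff + diff.T) > 0
    return int(np.triu(disagreement, k=1).sum())
```

For example, reversing a single edge of the true graph yields an SHD of 1, as does deleting or adding one edge.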
Researcher Affiliation | Collaboration | Yunhao Ge, Sercan Ö. Arık, Jinsung Yoon, Ao Xu, Laurent Itti, and Tomas Pfister. Google Cloud AI, Sunnyvale, CA, USA; University of Southern California, Los Angeles, CA, USA.
Pseudocode | Yes | Algorithm 1: Supervised Invariant Structure Learning. Input: Dataset D. Output: DAG, Y predictor f(X) = h_{θ_Y}^1(X) ... Algorithm 2: Self-Supervised Invariant Structure Learning. Input: Dataset D. Output: DAG.
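Both algorithms output a DAG, which in NOTEARS-style structure learning (the family the paper's baselines belong to) is typically enforced through a differentiable acyclicity penalty. A sketch of that penalty, using a truncated matrix-exponential series; ISL's exact loss may differ:

```python
import numpy as np

def notears_h(W: np.ndarray) -> float:
    """Acyclicity penalty h(W) = tr(exp(W ∘ W)) - d, which is zero iff the
    weighted graph W is a DAG. The exponential series is truncated after d
    terms, which is exact for DAGs since W ∘ W is then nilpotent.
    (Sketch of the constraint family used by the NOTEARS-style baselines.)"""
    d = W.shape[0]
    M = W * W  # Hadamard square keeps the penalty differentiable in W
    acc, term = np.eye(d), np.eye(d)
    for k in range(1, d + 1):
        term = term @ M / k
        acc = acc + term
    return float(np.trace(acc) - d)
```

The penalty is zero for an acyclic weight matrix and strictly positive whenever a cycle is present, so it can be driven to zero with an augmented-Lagrangian scheme during training.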
Open Source Code | Yes | We open-source our code at https://github.com/AaronXu9/ISL.git.
Open Datasets | Yes | We perform supervised learning experiments on real-world datasets with GT causal structure: the Boston Housing (Binder et al., 1997; bos) and Insurance (Binder et al., 1997; ins) datasets. ... The Sachs dataset is for the discovery of the protein signaling network on expression levels of different proteins and phospholipids in human cells (Sachs et al., 2005), and is a popular benchmark for causal graph discovery, containing both observational and interventional data. Dataset links: http://lib.stat.cmu.edu/datasets/boston, https://link.springer.com/article/10.1023/A:1007421730016, https://www.science.org/doi/full/10.1126/science.1105809.
Dataset Splits | Yes | We perform supervised learning experiments on real-world datasets with GT causal structure: the Boston Housing (Binder et al., 1997; bos) and Insurance (Binder et al., 1997; ins) datasets. For each, we randomly split the data into train/validation/test with the proportions 0.8/0.1/0.1.
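The quoted 0.8/0.1/0.1 random split can be reproduced in a few lines (a sketch only; the paper does not state its random seed or splitting code, so the seed here is an assumption):

```python
import numpy as np

def split_indices(n: int, seed: int = 0):
    """Random train/validation/test index split with proportions 0.8/0.1/0.1."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (perm[:n_train],
            perm[n_train:n_train + n_val],
            perm[n_train + n_val:])
```

The three index arrays partition range(n), so every sample lands in exactly one split.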
Hardware Specification | Yes | The time measurements were obtained on an Apple M1 Pro chip with 16 GB of memory.
Software Dependencies | No | The paper mentions mathematical optimization methods such as L-BFGS-B (Zhu et al., 1997) and clustering algorithms such as K-means (Lloyd, 1982), but it does not specify the software libraries or frameworks used (e.g., PyTorch, TensorFlow, scikit-learn) or their version numbers, which are necessary for reproducible software dependencies.
Experiment Setup | Yes | We set a minimum edge number Emin and a maximum edge number Emax based on the dataset information. Usually, Emin is half of the number of nodes, |E|/2, and Emax is 5|E|. We also set a range of thresholds t ∈ [tmin, tmax] and a step size ts based on the value range of W. Usually we use tmin = min(W) and tmax = max(W). ... We choose the values of γ and βi that achieve the smallest target Y reconstruction error on the validation set. We find the parameters γ = 1; β1 = 0.001; β2 = 0.01; β3 = 0.01; β4 = 0.01 to be reasonable choices across many different settings, although they are not extensively optimized.
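The threshold-sweep pruning described in the quote can be sketched as follows (an assumed reading: sweep t from tmin to tmax in steps of ts and keep the first binarized graph whose edge count falls in [Emin, Emax]; ISL's exact procedure, e.g. tie-breaking or acyclicity checks, may differ):

```python
import numpy as np

def prune_by_threshold(W: np.ndarray, e_min: int, e_max: int, t_step: float):
    """Sweep a threshold t over [min|W|, max|W|] in steps of t_step and
    return the first binarized adjacency matrix whose edge count lies
    in [e_min, e_max], together with the threshold used."""
    abs_w = np.abs(W)
    t, t_max = abs_w.min(), abs_w.max()
    while t <= t_max:
        adjacency = (abs_w > t).astype(int)
        if e_min <= adjacency.sum() <= e_max:
            return adjacency, t
        t += t_step
    # Fallback: no threshold met the edge budget; return the sparsest graph.
    return (abs_w > t_max).astype(int), t_max
```

Smaller thresholds keep more edges, so the sweep naturally trades off graph density against the edge budget derived from the dataset.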