Distinguishing Cause and Effect in Bivariate Structural Causal Models: A Systematic Investigation

Authors: Christoph Käding, Jakob Runge

JMLR 2023

Reproducibility

Variable Result LLM Response
Research Type Experimental We provide a detailed and systematic comparison of a range of methods on current real-world data and a novel collection of data sets that systematically models individual data challenges. Our evaluation also covers more recent methods missing in previous studies. The aim is to assist users in finding the most suitable methods for their problem setting and for method developers to identify weaknesses of current methods to improve them or develop new methods.
Researcher Affiliation Academia Christoph Käding EMAIL Institute of Data Science, German Aerospace Center (DLR), 07745 Jena, Germany; Jakob Runge EMAIL Institute of Data Science, German Aerospace Center (DLR), 07745 Jena, Germany, and Technische Universität Berlin, 10623 Berlin, Germany
Pseudocode No The paper conceptually describes various methods and the experimental setup. While Appendix D provides an 'in-depth evaluation of individual methods' and outlines their logic, it does so in narrative text rather than structured pseudocode blocks or algorithms.
Open Source Code No The paper states: 'The novel suite of data sets will be contributed to the causeme.net benchmark platform to provide a continuously updated and searchable causal discovery method intercomparison database.' and 'the data collection and results will be published and further extended on causeme.net.' This indicates that the data and results are available. However, the paper refers only to *other authors'* publicly available code for the methods being evaluated (e.g., 'We use the publicly available code contained in the causal discovery toolbox (Kalainathan and Goudet, 2019)' for ANM, and 'We obtained our results using the code provided at bitbucket.org/dhernand/gr_causal_inference' for EMD). It provides no repository link or explicit release statement for the authors' *own* code implementing the systematic investigation and benchmarking framework developed in *this paper*.
Open Datasets Yes The novel suite of data sets will be contributed to the causeme.net benchmark platform to provide a continuously updated and searchable causal discovery method intercomparison database. Valuable work in this direction was already done, for example, by Mooij et al. (2016) with the Tübingen Cause Effect Pairs data set (TCEP, detailed introduction in Section 3.1.1) and by Guyon (2013) with the Cause Effect Pairs Challenge (CEC13, detailed introduction in Section 3.1.2). Data locations quoted in the paper: webdav.tuebingen.mpg.de/cause-effect/; www.causality.inf.ethz.ch/cause-effect.php; dx.doi.org/10.7910/DVN/3757KX.
Dataset Splits No The paper states: 'Then, each of these 828 valid setups is sampled with a sample size of 100 and 1 000. Finally, to reliably estimate the accuracy for a given configuration and sample size, we generate 100 realizations each.' This describes how their novel synthetic data was generated and the number of realizations, but it does not specify explicit training, validation, or test dataset splits for machine learning models within their evaluation framework. For existing datasets, it mentions using 'final test data' for CEC13 but does not detail how these splits were defined or whether they used any custom splits for their analysis.
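The quoted generation protocol (828 valid setups, sampled at two sample sizes, with 100 realizations each to stabilize accuracy estimates) can be illustrated with a toy sketch. The additive-noise mechanism below is hypothetical and merely stands in for the paper's actual data-generating models:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pair(n, rng):
    # Hypothetical bivariate additive-noise setup X -> Y;
    # the paper's real setups vary mechanism, noise, etc.
    x = rng.normal(size=n)
    y = np.tanh(x) + 0.2 * rng.normal(size=n)
    return x, y

# Mirror the protocol's shape: for each sample size, draw many
# realizations so per-configuration accuracy can be estimated reliably.
realizations = {
    n: [sample_pair(n, rng) for _ in range(100)]
    for n in (100, 1000)
}
print(len(realizations[100]), realizations[1000][0][0].shape)
```

Note that this only reproduces the sampling structure (sample sizes and realization counts); it says nothing about train/validation/test splits, which is exactly the information the paper leaves unspecified.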
Hardware Specification No The paper states: 'However, we are not able to give a fair comparison w.r.t. runtimes. We obtained our results over a long period of time using a diverse set of hardware. Further, the code obtained from the authors is written in different languages and may or may not be optimized for multi-threading or usage of GPUs.'
Software Dependencies No The paper mentions several software components, including Python libraries such as 'SHAP', 'XGBoost', 'scipy.stats.mannwhitneyu', 'Tigramite', 'sklearn.neighbors.LocalOutlierFactor', 'scipy.stats.gaussian_kde', and 'matplotlib'. However, it does not provide version numbers for any of these libraries or tools, which is required for a reproducible description of ancillary software.
Experiment Setup No The paper states: 'Our approach is to evaluate the methods as is, i.e., as they are provided in the method papers, either by the code supplied by the authors or by established publicly available implementations. We only ensure that the data is preprocessed as suggested by the authors, i.e., centered, standardized, etc.' For the meta-analysis: 'We realize this hidden model by gradient boosting regression trees that are known to be a reasonable choice for tabular data (Grinsztajn et al., 2022). In particular, we translate each characteristic into a one-hot encoding (even mutual information and sample size for consistency reasons) and train an auxiliary regressor to estimate the obtained real-valued accuracy from the individual data configuration and the used bivariate causal discovery method as categorical inputs.' The evaluated methods were run 'as is' with default parameters, implying that the authors of this paper set no method-specific hyperparameters. While the meta-analysis uses a regressor, no hyperparameters for this regressor (e.g., learning rate or number of trees for XGBoost) are provided.
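The quoted meta-analysis setup can be sketched as follows. All feature names, values, and accuracies below are illustrative, and scikit-learn's GradientBoostingRegressor stands in for the XGBoost model the paper mentions; only the overall pattern (one-hot encoding of categorical inputs, boosted-tree regression onto accuracy) follows the paper's description:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import OneHotEncoder

# Toy meta-data: each row is (data characteristic, noise type, method)
# -> observed accuracy. Values are invented for illustration.
rows = [
    ("linear",    "gaussian", "ANM",  0.62),
    ("nonlinear", "gaussian", "ANM",  0.81),
    ("linear",    "uniform",  "IGCI", 0.55),
    ("nonlinear", "uniform",  "IGCI", 0.74),
] * 10  # repeat to give the ensemble enough samples

X_raw = [[r[0], r[1], r[2]] for r in rows]
y = np.array([r[3] for r in rows])

# One-hot encode every input, mirroring the paper's choice to encode all
# characteristics (even ordinal ones like sample size) categorically.
enc = OneHotEncoder()
X = enc.fit_transform(X_raw).toarray()

model = GradientBoostingRegressor(n_estimators=100)
model.fit(X, y)

# Query the fitted meta-model for a (configuration, method) combination.
query = enc.transform([["nonlinear", "gaussian", "ANM"]]).toarray()
print(round(float(model.predict(query)[0]), 2))  # close to 0.81 on this toy data
```

This also makes the reviewer's point concrete: the regressor's own hyperparameters (here left at library defaults) would need to be reported for the meta-analysis to be reproducible.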