reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Estimating Causal Structure Using Conditional DAG Models

Authors: Chris. J. Oates, Jim Q. Smith, Sach Mukherjee

JMLR 2016

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical results demonstrate gains compared with formulations that treat all variables on an equal footing, or that ignore secondary variables. The methodology is motivated by applications in biology that involve multiple data types and is illustrated here using simulated data and in an analysis of molecular data from the Cancer Genome Atlas.
Researcher Affiliation	Academia	Chris. J. Oates, EMAIL, School of Mathematical and Physical Sciences, University of Technology Sydney, NSW 2007, Australia; Jim Q. Smith, EMAIL, Department of Statistics, University of Warwick, Coventry, CV4 7AL, UK; Sach Mukherjee, EMAIL, German Center for Neurodegenerative Diseases (DZNE), 53175 Bonn, Germany
Pseudocode	No	The paper describes the Integer Linear Programming (ILP) approach in detail with mathematical formulations, constraints, and propositions in Section 2.6, but it does not present a distinct, clearly labeled pseudocode or algorithm block.
Open Source Code	No	For the applications in this paper, all ILP instances were solved using the GOBNILP software that is freely available to download from http://www.cs.york.ac.uk/aig/sw/gobnilp/. This refers to a third-party software used, not open-source code for the specific methodology developed by the authors.
Open Datasets	Yes	The methodology is motivated by applications in biology that involve multiple data types and is illustrated here using simulated data and in an analysis of molecular data from the Cancer Genome Atlas. The data we analyse are from the TCGA pan-cancer project (Akbani et al., 2014)
Dataset Splits	No	For simulated data, the paper states: "we report the mean SHD as computed over 10 independent realisations of the data." For molecular data: "We focus on p = 24 proteins... The data span eight different cancer types... with a total sample size of n = 3,467 patients." Neither provides specific training/test/validation splits or their percentages, counts, or explicit methodology for data partitioning.
Hardware Specification	No	The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies	No	The paper mentions "all ILP instances were solved using the GOBNILP software" but does not specify a version number for this software.
Experiment Setup	Yes	We construct a linear model for the observations $Y^l_j = [1 \; X^l_j]\beta_0 + Y^l_\pi \beta_\pi + \epsilon^l_j$, $\epsilon^l_j \sim N(0, \sigma^2)$ (...). For the parameter prior $p_{j,\pi}(\beta_\pi \mid \beta_0, \sigma)$ we use the g-prior (Zellner, 1986) $\beta_\pi \mid \beta_0, j, \pi \sim N(0, g\sigma^2 (M_\pi^T M_\pi)^{-1})$ where $g$ is a positive constant to be specified. (...). Let $g = n$. (...). For all estimators we considered only models of size $\lvert\pi\rvert \le 5$.
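
To give a sense of scale for the size restriction quoted in the Experiment Setup row, the sketch below enumerates candidate parent sets for one node, reading the restriction as an upper bound of five parents per node. It assumes parents are drawn from the other p - 1 = 23 proteins; the function names `candidate_parent_sets` and `count_parent_sets` are hypothetical and are not taken from the paper or from GOBNILP.

```python
from itertools import combinations
from math import comb


def candidate_parent_sets(node, num_nodes, max_size=5):
    """Yield every parent set for `node` with at most `max_size` members.

    Assumption: parents are chosen among the other `num_nodes - 1` nodes.
    """
    others = [v for v in range(num_nodes) if v != node]
    for size in range(max_size + 1):
        for parents in combinations(others, size):
            yield parents


def count_parent_sets(num_nodes, max_size=5):
    """Closed-form count of candidate parent sets per node under the cap."""
    return sum(comb(num_nodes - 1, k) for k in range(max_size + 1))


p = 24  # number of proteins quoted in the Dataset Splits row
print(count_parent_sets(p))  # 44552 candidate parent sets per node
```

Under these assumptions each node has 44,552 candidate parent sets, so a full set of local scores for p = 24 would cover roughly 1.07 million (node, parent set) pairs.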