Bayesian Network Learning via Topological Order

Authors: Young Woong Park, Diego Klabjan

JMLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | A computational experiment is presented for the Gaussian Bayesian network learning problem, an optimization problem that minimizes the sum of squared errors of regression models with an L1 penalty over a feature network, with application to gene network inference in bioinformatics. ... In the computational experiment, we compare the performance of the proposed MIP model and algorithms against the algorithm in Han et al. (2016) and other available MIP models on synthetic and real instances.
Researcher Affiliation | Academia | Young Woong Park (EMAIL), College of Business, Iowa State University, Ames, IA 50011, USA; Diego Klabjan (EMAIL), Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL 60208, USA.
Pseudocode | Yes | Algorithm 1: TOSA (Topological Order Swapping Algorithm); Algorithm 2: IR (Iterative Reordering); Algorithm 3: Greedy; Algorithm 4: GD (Gradient Descent).
Open Source Code | No | The paper mentions that "the R script of the original algorithm is available on the journal website," referring to the benchmark algorithm from Han et al. (2016), but it provides no statement or link for open-sourcing the code of the methodology described in this paper.
Open Datasets | Yes | We first test all algorithms with synthetic instances generated using the R package pcalg (Kalisch et al., 2012). ... Finally, in Section 5.4, we solve a popular real instance from the literature, Sachs et al. (2005). The data set has been studied in many works, including Friedman et al. (2008), Shojaie and Michailidis (2010), Fu and Zhou (2013), and Aragam and Zhou (2015).
Dataset Splits | No | The paper describes generating synthetic data and using a real-world flow cytometry data set. For the synthetic data, it details the generation process but gives no explicit train/test/validation splits. For the real data, it states "The data set has ... n = 7466 cells obtained from multiple experiments with m = 11 measurements" but does not specify how the data was split for training, validation, or testing in the experiments.
Hardware Specification | Yes | For all computational experiments, a server with two Xeon 2.70GHz CPUs and 24GB RAM is used.
Software Dependencies | Yes | The MIP models MIPcp, MIPin, and MIPto are implemented with CPLEX 12.6 in C#. For GD and IR, the algorithms are written in R (R Core Team, 2016). We use the glmnet function of the glmnet package (Friedman et al., 2010) for solving the LASSO linear regression problems in (16). The randomDAG function is used to generate a DAG, and the rmvDAG function is used to generate multivariate data with the standard normal error distribution. First, a DAG is generated by randomDAG. Next, the generated DAG and random coefficients are used to create each column (with standard normal error added) by rmvDAG, which uses linear regression as the underlying model. After obtaining the data matrix from the package, we standardize each column to have zero mean and standard deviation equal to one. The DAG used to generate the multivariate data is considered the true structure or true arc set, although it may not be the optimal solution for the score function. The random instances are generated for the various parameters described in the following.
Experiment Setup | Yes | For IR, we use parameters α = 0.01, t = 10, νlb = 0.8, and νub = 1.2. For GD, we use parameters α = 0.01, t1 = 10, and t2 = 5. ... We use four λ values, defined differently for each data set, in order to cover the expected number of arcs. For each sparse instance, we solve (14) with λ ∈ {1, 0.5, 0.1, 0.05}. For dense data sets, a wide range of λ values is needed to obtain selected arc sets whose cardinalities are similar to those of the true arc sets. Hence, for each dense instance, instead of fixed values over all instances in the set, we use λ values based on the expected density d: λ = λ0 · 10^(10d−1), where λ0 ∈ {1, 0.1, 0.01, 0.001}. For each high-dimensional instance, we use λ ∈ {1, 0.8, 0.6, 0.4}.
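The synthetic-data pipeline quoted under Software Dependencies (randomDAG to draw a DAG, rmvDAG to simulate columns with standard normal errors, then column standardization) can be sketched as follows. This is a minimal Python stand-in for the R pcalg calls, not the authors' code; the function name and the edge-weight range are assumptions.

```python
import numpy as np

def random_dag_data(m, n, density, rng):
    """Generate a random DAG and simulate standardized Gaussian data from it.

    Illustrative stand-in for pcalg's randomDAG/rmvDAG (the paper uses R);
    details such as the uniform(0.5, 1.5) weights are assumptions.
    """
    # Random DAG: arcs only from lower to higher index, so the identity
    # ordering 0, 1, ..., m-1 is a valid topological order by construction.
    A = np.triu(rng.random((m, m)) < density, k=1)   # boolean adjacency matrix
    B = A * rng.uniform(0.5, 1.5, size=(m, m))       # random arc coefficients
    # Simulate each column in topological order with N(0, 1) noise,
    # mirroring rmvDAG's linear-regression generating model.
    X = np.empty((n, m))
    for j in range(m):
        X[:, j] = X[:, :j] @ B[:j, j] + rng.standard_normal(n)
    # Standardize each column: zero mean, standard deviation equal to one.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    return A, X

A, X = random_dag_data(m=11, n=500, density=0.2, rng=np.random.default_rng(0))
```

The returned adjacency A plays the role of the true arc set against which the selected arcs are compared in the experiments.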
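Given a fixed topological order, the score described in the summary (sum of squared regression errors plus an L1 penalty) decomposes into one LASSO regression per variable on its predecessors in the order; the paper solves these subproblems with glmnet in R. Below is a hedged Python sketch, with scikit-learn's Lasso standing in for glmnet; the exact objective scaling (hence the alpha rescaling) is my assumption, not the paper's formulation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def order_score(X, order, lam):
    """Evaluate sum_j ||x_j - X_pa(j) b_j||^2 + lam * ||b_j||_1 for a fixed
    topological order (illustrative sketch, not the authors' implementation).

    scikit-learn's Lasso minimizes (1/(2n)) * ||y - Xb||^2 + alpha * ||b||_1,
    so alpha = lam / (2n) matches the unnormalized objective above.
    """
    n, _ = X.shape
    total = 0.0
    for pos, j in enumerate(order):
        parents = order[:pos]              # only earlier variables may be parents
        if not parents:
            total += float(np.sum(X[:, j] ** 2))   # root node: no regression
            continue
        fit = Lasso(alpha=lam / (2 * n), fit_intercept=False)
        fit.fit(X[:, parents], X[:, j])
        resid = X[:, j] - fit.predict(X[:, parents])
        total += float(np.sum(resid ** 2) + lam * np.sum(np.abs(fit.coef_)))
        # nonzero entries of fit.coef_ give the selected arcs parents -> j
    return total

rng = np.random.default_rng(0)
Xs = rng.standard_normal((200, 4))
score = order_score(Xs, [0, 1, 2, 3], lam=0.5)
```

Under this reading, TOSA and IR act as searches over topological orders, re-evaluating a score of this kind after reordering, while the MIP models optimize order and arcs jointly.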