Objective Bayesian Nets for Integrating Consistent Datasets

Authors: Juergen Landes, Jon Williamson

JAIR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In order to compare OBN-c DS with the brute-force approach, we implemented both approaches in Matlab. In this section we describe this proof of concept implementation. The primary aim of this implementation was to test the extent to which OBN-c DS reduces the size of the optimisation problem, in comparison to the brute-force approach. Results confirm that OBN-c DS does indeed enable OBN construction in situations in which the brute-force method is not feasible due to its computational complexity. While efficiency of implementation was not a primary goal, some results relating to run times are also reported below.
Researcher Affiliation | Academia | Jürgen Landes (EMAIL), Munich Centre for Mathematical Philosophy, Open Science Center, Ludwig-Maximilians-Universität, Munich, Germany; Jon Williamson (EMAIL), Department of Philosophy and Centre for Reasoning, University of Kent, Canterbury, United Kingdom.
Pseudocode | Yes | Pseudocode of OBN-c DS.
Input: consistent datasets DS1, . . . , DSh.
1) For each i, learn a Markov network structure Gi from DSi representing the independences of Pi.
2) Set the overarching undirected graph G as the union of the Gi.
3) Compute a minimal triangulation GT of G.
4) Orientate GT to give a DAG H.
5) For each vertex in H, determine its probability distribution conditional on its parents:
   a) For each vertex for which there exists a dataset measuring the vertex and all its parents, determine the conditional probabilities as described in Section 3.3.3.
   b) For all other vertices, determine the conditional probabilities by solving the optimisation problem specified below.
Output: objective Bayesian net with DAG H and the conditional probabilities determined in Step 5.
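The graph-manipulation core of the pseudocode (Steps 2-4: take the union of the learned Markov networks, triangulate it, and orient the result into a DAG) can be sketched in Python. This is an illustrative re-implementation only: it uses a greedy min-fill elimination order in place of Berry's (1999) triangulation algorithm used in the paper, and the function name and toy graph are ours.

```python
from itertools import combinations

def triangulate_and_orient(nodes, edges):
    """Union graph -> chordal graph -> DAG.

    Greedy min-fill triangulation (illustrative stand-in for Berry's
    algorithm), then each edge is directed from the later-eliminated
    vertex to the earlier-eliminated one, which is always acyclic.
    """
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    work = {v: set(ns) for v, ns in adj.items()}   # copy we eliminate from
    order = []
    while work:
        # pick the vertex whose neighbourhood needs the fewest fill-in edges
        v = min(work, key=lambda w: sum(1 for a, b in combinations(work[w], 2)
                                        if b not in work[a]))
        order.append(v)
        for a, b in combinations(work[v], 2):      # make neighbourhood a clique
            if b not in work[a]:
                work[a].add(b); work[b].add(a)
                adj[a].add(b); adj[b].add(a)       # record the fill-in edge
        for n in work[v]:
            work[n].discard(v)
        del work[v]
    pos = {v: i for i, v in enumerate(order)}
    # orient each chordal edge against the elimination order
    return sorted((u, v) if pos[u] > pos[v] else (v, u)
                  for u in adj for v in adj[u] if u < v)

# a 4-cycle is not chordal: one fill-in chord is added, then all
# five edges are oriented into a DAG
dag = triangulate_and_orient(['A', 'B', 'C', 'D'],
                             [('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'A')])
```

On the 4-cycle, eliminating A first adds the chord B-D; the resulting five undirected edges are then oriented from later- to earlier-eliminated endpoints, so no directed cycle can arise.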
Open Source Code | Yes | Appendix E. Matlab Code for the Examples. We here give the Matlab code for our examples. First we provide the code for the m-file:

function [x,fval,exitflag,output,lambda,grad,hessian] = maxent(N,Aineqinput,bineqinput,Aeqinput,beqinput)
% Returns [P m] where P achieves maximum entropy m
% Input: dimension N of the probability function,
% optional inequality constraints (Aineq, bineq)
% and optional equality constraints (Aeq, beq)
tic
if nargin <= 4
    Aeqinput = []; beqinput = [];
end
if nargin <= 2
    Aineqinput = []; bineqinput = [];
end
x0 = ones(1,N)/N;
Ai = eye(N).*-1;
bi = zeros(1,N);
Aineq = [Ai; Aineqinput];
bineq = [bi bineqinput];
Ae = ones(1,N);
be = 1;
Aeq = [Ae; Aeqinput];
beq = [be beqinput];
lb = zeros(1,N);
options = optimset;
options = optimset(options,'Display','off');
options = optimset(options,'FunValCheck','on');
options = optimset(options,'Algorithm','interior-point','MaxIter',1000);
[x,fval,exitflag,output,lambda,grad,hessian] = ...
    fmincon(@NegEntropy,x0,Aineq,bineq,Aeq,beq,lb,[],[],options);
toc
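What fmincon computes here, a maximum-entropy distribution subject to linear constraints, can be illustrated in pure Python for the special case of a single expectation constraint. In that case the optimum has the exponential form p_i ∝ exp(λ·a_i), and λ can be found by bisection because the expectation is increasing in λ. The function name, bracket, and iteration count below are our own choices, not taken from the paper.

```python
import math

def maxent_one_constraint(a, b, lo=-50.0, hi=50.0, iters=200):
    """Maximise entropy over p on {0..N-1} subject to sum_i a[i]*p[i] = b.

    The maximiser is p_i = exp(lam*a_i)/Z; lam is found by bisection,
    since the expectation E_lam[a] is monotonically increasing in lam.
    """
    def expectation(lam):
        w = [math.exp(lam * ai) for ai in a]
        z = sum(w)
        return sum(ai * wi for ai, wi in zip(a, w)) / z

    assert min(a) < b < max(a), "constraint must be strictly feasible"
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if expectation(mid) < b:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2.0
    w = [math.exp(lam * ai) for ai in a]
    z = sum(w)
    return [wi / z for wi in w]

# a constraint the uniform distribution already satisfies: maximum
# entropy recovers the uniform distribution
p = maxent_one_constraint([0, 1, 0, 1], 0.5)
# -> approximately [0.25, 0.25, 0.25, 0.25]
```

The general problem the paper solves (many equality constraints from several datasets) needs a full constrained optimiser such as fmincon; this one-constraint case only shows the shape of the solution.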
Open Datasets | No | We thus created three datasets as follows. We first specified a Bayesian network on a set of binary variables A1, A2, . . . , An. We used a density parameter d representing the number of arrows in the DAG. For fixed d, we inserted arrows uniformly at random between variables. The orientation was fixed by directing each arrow from the variable enumerated first to that with the greater index. This ensured that the directed graph is acyclic. Unconditional and conditional probabilities of all variables were assigned uniformly at random in the interval [0, 1]. A single dataset was created by sampling from this Bayesian network. This Bayesian network was hidden for the remainder of computations. Next, we assigned every variable in V to a non-empty set of datasets in which it was measured. To obtain computationally interesting problems we fixed A1 to be measured by DS1 and DS2; A2 to be measured by DS1 and DS3; and A3 to be measured by DS2 and DS3. The other variables were assigned uniformly at random to the seven non-empty subsets of the 3 datasets. We then created three clones of the sampled dataset. In all three clones we deleted the columns corresponding to measurements of variables not measured by this dataset. In this way we arrived at our collection of three consistent datasets. The sampled (complete) dataset was then hidden for the remainder of computations.
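The data-generation procedure described above can be sketched in Python. This is a hypothetical re-implementation: the function names and the parameters n, d, sample_size and seed are ours, and variables A1..An are represented by indices 0..n-1.

```python
import random

def _configs(k):
    """All 0/1 tuples of length k (possible parent configurations)."""
    return [tuple((i >> b) & 1 for b in range(k)) for i in range(2 ** k)]

def make_consistent_datasets(n=6, d=8, sample_size=100, seed=0):
    """Sample a random BN over binary A1..An (indices 0..n-1), draw one
    complete dataset, then clone it into 3 datasets that each keep only
    the columns of the variables they measure."""
    rng = random.Random(seed)
    # random DAG with d arrows, each directed from the lower to the
    # higher index, which guarantees acyclicity
    possible = [(i, j) for i in range(n) for j in range(i + 1, n)]
    arrows = rng.sample(possible, min(d, len(possible)))
    parents = {j: sorted(i for i, k in arrows if k == j) for j in range(n)}
    # P(A_j = 1 | parent configuration), drawn uniformly from [0, 1]
    cpt = {j: {cfg: rng.random() for cfg in _configs(len(parents[j]))}
           for j in range(n)}
    # sample one complete dataset; index order is a topological order
    rows = []
    for _ in range(sample_size):
        row = {}
        for j in range(n):
            cfg = tuple(row[i] for i in parents[j])
            row[j] = 1 if rng.random() < cpt[j][cfg] else 0
        rows.append(row)
    # assign each variable to a non-empty subset of the 3 datasets;
    # A1, A2, A3 get the paper's fixed assignments, the rest are random
    fixed = {0: (0, 1), 1: (0, 2), 2: (1, 2)}
    nonempty = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
    measured = {j: fixed[j] if j in fixed else rng.choice(nonempty)
                for j in range(n)}
    # three clones, each keeping only the columns it measures
    return [[{j: r[j] for j in r if ds in measured[j]} for r in rows]
            for ds in range(3)]
```

Because all three datasets are projections of the same sampled rows, any variable measured by two datasets has identical values in both, which is exactly the consistency property the paper requires.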
Dataset Splits | No | The paper describes generating synthetic datasets for a proof of concept. There is no mention of splitting these datasets into training, validation, or test sets; the focus is on integrating them.
Hardware Specification | No | No hardware details are reported. The paper notes only that implementation and testing in Matlab was chosen over R due to a number of helpful routines in Matlab's Causal Explorer Toolkit.
Software Dependencies | No | Step 1 of OBN-c DS was carried out using the Matlab implementation of Tsamardinos et al. (2006). No other source code was taken off-the-shelf. The triangulation (Step 3) was implemented by writing code for the simple P-time triangulation algorithm presented by Berry (1999). Implementation of the orientation (Step 4) was achieved by applying Williamson (2005a, Theorem 5.1) as explained in Section 3.3.1. Step 5a was carried out by directly computing conditional probabilities from the datasets as explained in Section 3.3.3. Optimisation (Step 5b) was achieved by calling Matlab's optimisation routine fmincon. While Matlab and specific toolkits are mentioned, no version numbers are provided for reproducibility.
Experiment Setup | No | The paper describes creating synthetic datasets by specifying a Bayesian network with a density parameter and then sampling from it, and details the assignment of variables to datasets. However, it does not provide specific experimental setup details such as hyperparameters, optimisation settings (beyond Matlab's optimisation routine fmincon), or learning rates, because the core methodology is constructing Bayesian networks rather than training a machine learning model in the traditional sense. The implementation details focus on the steps of the algorithm rather than typical ML experiment setup parameters.