reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Learning Graphical Models With Hubs

Authors: Kean Ming Tan, Palma London, Karthik Mohan, Su-In Lee, Maryam Fazel, Daniela Witten

JMLR 2014

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	On synthetic data, we demonstrate that our proposed framework outperforms competitors that do not explicitly model hub nodes. We illustrate our proposal on a webpage data set and a gene expression data set. ... In this section, we compare HGL to two sets of proposals: proposals that learn an Erdős-Rényi Gaussian graphical model, and proposals that learn a Gaussian graphical model in which some nodes are highly-connected. ... In this section, we present the results for the simulation study described in Section 4.2 with n = 100, p = 200, and |H| = 4. We calculate the proportion of correctly estimated hub nodes with r = 40. The results are shown in Figure 10.
Researcher Affiliation	Academia	Kean Ming Tan EMAIL Department of Biostatistics University of Washington Seattle WA, 98195 Palma London EMAIL Karthik Mohan EMAIL Department of Electrical Engineering University of Washington Seattle WA, 98195 Su-In Lee EMAIL Department of Computer Science and Engineering, Genome Sciences University of Washington Seattle WA, 98195 Maryam Fazel EMAIL Department of Electrical Engineering University of Washington Seattle WA, 98195 Daniela Witten EMAIL Department of Biostatistics University of Washington Seattle, WA 98195
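Pseudocode	Yes	Algorithm 1 ADMM Algorithm for Solving (3). 1. Initialize the parameters: (a) primal variables $\Theta$, $V$, $Z$, $\tilde{\Theta}$, $\tilde{V}$, and $\tilde{Z}$ to the $p \times p$ identity matrix. (b) dual variables $W_1$, $W_2$, and $W_3$ to the $p \times p$ zero matrix. (c) constants $\rho > 0$ and $\tau > 0$. 2. Iterate until the stopping criterion $\|\Theta_t - \Theta_{t-1}\|_F^2 / \|\Theta_{t-1}\|_F^2 \le \tau$ is met, where $\Theta_t$ is the value of $\Theta$ obtained at the $t$th iteration: (a) Update $\Theta$, $V$, $Z$: i. $\Theta = \arg\min_{\Theta \in \mathcal{S}} \{ \ell(X, \Theta) + \frac{\rho}{2} \|\Theta - \tilde{\Theta} + W_1\|_F^2 \}$. ii. $Z = S(\tilde{Z} - W_3, \lambda_1/\rho)$, $\mathrm{diag}(Z) = \mathrm{diag}(\tilde{Z} - W_3)$. Here $S$ denotes the soft-thresholding operator, applied element-wise to a matrix: $S(A_{ij}, b) = \mathrm{sign}(A_{ij}) \max(|A_{ij}| - b, 0)$. iii. $C = \tilde{V} - W_2 - \mathrm{diag}(\tilde{V} - W_2)$. iv. $V_j = \max\left(1 - \frac{\lambda_3}{\rho \|S(C_j, \lambda_2/\rho)\|_2}, 0\right) \cdot S(C_j, \lambda_2/\rho)$ for $j = 1, \ldots, p$. v. $\mathrm{diag}(V) = \mathrm{diag}(\tilde{V} - W_2)$. (b) Update $\tilde{\Theta}$, $\tilde{V}$, $\tilde{Z}$: i. $\Gamma = \frac{\rho}{6} \left[ (\Theta + W_1) - (V + W_2) - (V + W_2)^T - (Z + W_3) \right]$; ii. $\tilde{\Theta} = \Theta + W_1 - \frac{1}{\rho}\Gamma$; iii. $\tilde{V} = \frac{1}{\rho}(\Gamma + \Gamma^T) + V + W_2$; iv. $\tilde{Z} = \frac{1}{\rho}\Gamma + Z + W_3$. (c) Update $W_1$, $W_2$, $W_3$: i. $W_1 = W_1 + \Theta - \tilde{\Theta}$; ii. $W_2 = W_2 + V - \tilde{V}$; iii. $W_3 = W_3 + Z - \tilde{Z}$.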
Open Source Code	Yes	An R package hglasso is publicly available on the authors' websites and on CRAN.
Open Datasets	Yes	We illustrate our proposal on a webpage data set and a gene expression data set. ... We applied HGL to the university webpage data set from the World Wide Knowledge Base project at Carnegie Mellon University. This data set was pre-processed by Cardoso-Cachopo (2009). ... We applied HGL to a publicly available cancer gene expression data set (Verhaak et al., 2010).
Dataset Splits	No	The paper describes the generation of synthetic data and the characteristics of real-world datasets used (e.g., number of variables p, number of observations n) but does not provide details on specific training/test/validation splits for these datasets. For synthetic data, it refers to averaging results over '100 simulated data sets', which relates to repetitions rather than train/test splits within a single experiment.
Hardware Specification	Yes	On a 1.86 GHz Intel Core 2 Duo machine, the interior point method takes 3 minutes, while ADMM takes only 1 second, on a data set with p = 30. ... We ran experiments with p = 100, 200, 300 and with n = p/2 on a 2.26GHz Intel Core 2 Duo machine.
Software Dependencies	No	The graphical lasso (5), implemented using the R package glasso. ... The neighborhood selection approach of Meinshausen and Bühlmann (2006), implemented using the R package glasso. ... Sparse partial correlation estimation procedure of Peng et al. (2009), implemented using the R package space. ... We compare the performance of HBN to the proposal of Höfling and Tibshirani (2009), implemented using the R package BMN. ... The paper mentions specific R packages used for implementing various methods (glasso, spcov, space, BMN) but does not provide specific version numbers for these packages or the R environment itself.
Experiment Setup	Yes	To obtain the curves shown in Figure 3, we fixed λ1 = 0.4, considered three values of λ3 (each shown in a different color in Figure 3), and used a fine grid of values of λ2. ... We fixed λ1 = 0.2, considered three values of λ3 (each shown in a different color), and varied λ2 in order to obtain the curves shown in Figure 6. ... For HBN, we fixed λ1 = 5, considered λ3 = {15, 25, 30}, and used a fine grid of values of λ2. ... we fix the tuning parameter that controls the sparsity of Z at λ1 = 0.45 ... we fix λ3 = 1.5 ... select a value of λ2 ranging from 0.1 to 0.5 ... We performed HGL with the selected tuning parameters λ1 = 0.45, λ2 = 0.25, and λ3 = 1.5. ... Since we are interested in identifying hub genes, and not as interested in identifying edges between non-hub nodes, we fix λ1 = 0.6 ... We fix λ3 = 6.5 ... select λ2 ranging from 0.1 to 0.7 ... We applied HGL with this set of tuning parameters ... λ1 = 0.6, λ2 = 0.4, λ3 = 6.5.
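
Illustrative sketch (not from the paper or the hglasso package): the two thresholding updates in step 2(a) of Algorithm 1, quoted in the Pseudocode row, can be written in a few lines of plain Python. The function names soft_threshold, update_Z, and update_V are hypothetical. The sketch assumes that C_j denotes the j-th column of C, and that a column whose thresholded norm is zero is simply set to zero. It covers only steps 2(a)ii through 2(a)v; the Θ update and steps 2(b) and 2(c) are omitted.

```python
import math


def soft_threshold(a, b):
    """Soft-thresholding S(a, b) = sign(a) * max(|a| - b, 0) for a scalar a."""
    return math.copysign(max(abs(a) - b, 0.0), a)


def update_Z(Z_tilde, W3, lam1, rho):
    """Step 2(a)ii: soft-threshold Z_tilde - W3 off the diagonal, copy its diagonal."""
    p = len(Z_tilde)
    Z = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            d = Z_tilde[i][j] - W3[i][j]
            Z[i][j] = d if i == j else soft_threshold(d, lam1 / rho)
    return Z


def update_V(V_tilde, W2, lam2, lam3, rho):
    """Steps 2(a)iii-v: element-wise, then column-wise (group) shrinkage of V_tilde - W2."""
    p = len(V_tilde)
    D = [[V_tilde[i][j] - W2[i][j] for j in range(p)] for i in range(p)]
    V = [[0.0] * p for _ in range(p)]
    for j in range(p):
        # Column j of C = D - diag(D), soft-thresholded at lam2 / rho.
        s = [0.0 if i == j else soft_threshold(D[i][j], lam2 / rho) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in s))
        # Assumption: a column with zero norm is set to zero (avoids dividing by zero).
        scale = max(1.0 - lam3 / (rho * norm), 0.0) if norm > 0 else 0.0
        for i in range(p):
            V[i][j] = scale * s[i]
        V[j][j] = D[j][j]  # step v: diagonal copied from V_tilde - W2
    return V
```

Both functions take square matrices as lists of lists. In a full ADMM loop they would be called once per iteration with the current values of the auxiliary and dual variables, using tuning parameters such as those listed in the Experiment Setup row.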