Bayesian group factor analysis with structured sparsity
Authors: Shiwen Zhao, Chuan Gao, Sayan Mukherjee, Barbara E Engelhardt
JMLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our method on simulated data with substantial structure. We show results of our method applied to three high-dimensional data sets, comparing results against a number of state-of-the-art approaches. In Section 6, we show the behavior of our model for recovering simulated sparse signals among m observation matrices and compare the results from BASS with state-of-the-art methods. In Section 7, we present results that illustrate the performance of BASS on three high-dimensional data sets. |
| Researcher Affiliation | Academia | Shiwen Zhao EMAIL Computational Biology and Bioinformatics Program Department of Statistical Science Duke University Durham, NC 27708, USA; Chuan Gao EMAIL Department of Statistical Science Duke University Durham, NC 27708, USA; Sayan Mukherjee EMAIL Departments of Statistical Science, Computer Science, Mathematics Duke University Durham, NC 27708, USA; Barbara E Engelhardt EMAIL Department of Computer Science Center for Statistics and Machine Learning Princeton University Princeton, NJ 08540, USA |
| Pseudocode | No | The paper includes appendices that describe the MCMC, EM, and PX-EM algorithms with mathematical derivations and explanations of conditional distributions and parameter updates, but these are not presented in a structured pseudocode or algorithm block format with numbered steps. |
| Open Source Code | Yes | All code and data are publicly available. The software for BASS is available at https://github.com/judyboon/BASS. |
| Open Datasets | Yes | The Mulan Library consists of multiple data sets collected for the purpose of evaluating multi-label predictions (Tsoumakas et al., 2011). We applied our BASS model to gene expression data from the Cholesterol and Pharmacogenomic (CAP) study, consisting of expression measurements for 10,195 genes in 480 lymphoblastoid cell lines (LCLs) after 24-hour exposure to either a control buffer (Y(1)) or 2µM simvastatin acid (Y(2)) (Mangravite et al., 2013; Brown et al., 2013). The gene expression data were acquired through Gene Expression Omnibus (GEO) Accession number GSE36868. In this application, we used BASS and related methods for multiclass classification on the 20 Newsgroups data (Joachims, 1997). |
| Dataset Splits | Yes | For the six simulations, we used the simulated data as training data for training sample sizes nt = {30, 50}, and, additionally, simulated data sets with training sample sizes nt = {10, 100, 200}. Then, we generated ns = 200 samples as test data using the true model parameters, simulating the corresponding test data factors X ∼ N(0, 1). We held out 10 documents at random from each newsgroup as test data (Table S14). |
| Hardware Specification | No | The paper discusses the computational complexity of the algorithms but does not specify any particular hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions several software packages and libraries, such as R packages (CCA, PMA, PresenceAbsence), Matlab packages (vargplvm, GPLVM), and the scikit-learn Python package. However, it does not provide specific version numbers for any of these software components, which is required for reproducibility. |
| Experiment Setup | Yes | In Sim1 and Sim3, we set the initial number of factors to k = 10. In Sim2, Sim4, Sim5, and Sim6, we set the initial number of factors to 15. We performed 20 runs for each version of inference in BASS: EM, MCMC-EM, and PX-EM. The hyperparameters of the global-factor-local TPB prior were set to a = b = c = d = e = f = 0.5, which recapitulates the horseshoe prior at all three levels of the hierarchy. The hyperparameters for the error variances, aσ and bσ, were set to 1 and 0.3 respectively. For sGFA, the ARD prior was set to Ga(10⁻³, 10⁻³), the prior on inclusion probabilities to beta(1, 1), and total MCMC iterations to 10⁵ with 1,000 sampling iterations and a thinning step of 5. For GFA, the ARD prior for both loadings and error variance was set to Ga(10⁻¹⁴, 10⁻¹⁴), maximum iterations to 10⁵, and L-BFGS optimization was used. For RCCA, regularization parameters were chosen using leave-one-out cross-validation on an 11 × 11 grid from 0.0001 to 0.01. For SCCA, the ℓ1 bound of the projection vector was set to 0.3√pw for w = 1, 2. For JFA, ARD priors were Ga(10⁻⁵, 10⁻⁵), beta process prior parameters were α = 0.1 and c = 10⁴, and MCMC ran for 1,000 iterations with a burn-in of 200. For MRD, the linard2 kernel was chosen for all observations. |
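The dataset-splits row describes simulating training data of size nt and test data of size ns = 200 from the same "true" factor model, with test factors drawn from N(0, 1). A minimal sketch of that generation procedure is below; the dimensions `p`, `k`, the loading sparsity level, and the error variance are illustrative assumptions, not the paper's exact Sim1–Sim6 settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the paper's Sim1-Sim6 configurations differ.
p, k = 50, 10        # observed features, latent factors
nt, ns = 30, 200     # training size (nt in {30, 50, ...}), test size

# "True" model parameters shared by train and test: sparse loadings
# (roughly 30% nonzero here, an assumed sparsity level) and isotropic noise.
Lambda = rng.normal(size=(p, k)) * (rng.random((p, k)) < 0.3)
sigma2 = 0.3         # error variance

def simulate(n):
    """Draw n samples from the linear factor model Y = Lambda @ X + eps,
    with factors X ~ N(0, 1) as stated in the report."""
    X = rng.normal(size=(k, n))
    eps = rng.normal(scale=np.sqrt(sigma2), size=(p, n))
    return Lambda @ X + eps

Y_train = simulate(nt)   # training data
Y_test = simulate(ns)    # held-out test data from the same true parameters
```

The key point the sketch illustrates is that the loadings and error variance are fixed once, and both splits are drawn from the same generative model, so test performance measures recovery of the true structure rather than refitting.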