Bayesian group factor analysis with structured sparsity
Authors: Shiwen Zhao, Chuan Gao, Sayan Mukherjee, Barbara E Engelhardt
JMLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our method on simulated data with substantial structure. We show results of our method applied to three high-dimensional data sets, comparing results against a number of state-of-the-art approaches. In Section 6, we show the behavior of our model for recovering simulated sparse signals among m observation matrices and compare the results from BASS with state-of-the-art methods. In Section 7, we present results that illustrate the performance of BASS on three high-dimensional data sets. |
| Researcher Affiliation | Academia | Shiwen Zhao EMAIL Computational Biology and Bioinformatics Program Department of Statistical Science Duke University Durham, NC 27708, USA; Chuan Gao EMAIL Department of Statistical Science Duke University Durham, NC 27708, USA; Sayan Mukherjee EMAIL Departments of Statistical Science, Computer Science, Mathematics Duke University Durham, NC 27708, USA; Barbara E Engelhardt EMAIL Department of Computer Science Center for Statistics and Machine Learning Princeton University Princeton, NJ 08540, USA |
| Pseudocode | No | The paper includes appendices that describe the MCMC, EM, and PX-EM algorithms with mathematical derivations and explanations of conditional distributions and parameter updates, but these are not presented in a structured pseudocode or algorithm block format with numbered steps. |
| Open Source Code | Yes | All code and data are publicly available. The software for BASS is available at https://github.com/judyboon/BASS. |
| Open Datasets | Yes | The Mulan Library consists of multiple data sets collected for the purpose of evaluating multi-label predictions (Tsoumakas et al., 2011). We applied our BASS model to gene expression data from the Cholesterol and Pharmacogenomic (CAP) study, consisting of expression measurements for 10,195 genes in 480 lymphoblastoid cell lines (LCLs) after 24-hour exposure to either a control buffer (Y(1)) or 2µM simvastatin acid (Y(2)) (Mangravite et al., 2013; Brown et al., 2013). The gene expression data were acquired through Gene Expression Omnibus (GEO) Accession number GSE36868. In this application, we used BASS and related methods for multiclass classification on the 20 Newsgroups data (Joachims, 1997). |
| Dataset Splits | Yes | For the six simulations, we used the simulated data as training data for training sample sizes nt = {30, 50}, and, additionally, simulated data sets with training sample sizes nt = {10, 100, 200}. Then, we generated ns = 200 samples as test data using the true model parameters, simulating the corresponding test data factors X ∼ N(0, 1). We held out 10 documents at random from each newsgroup as test data (Table S14). |
| Hardware Specification | No | The paper discusses the computational complexity of the algorithms but does not specify any particular hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions several software packages and libraries, such as R packages (CCA, PMA, PresenceAbsence), Matlab packages (vargplvm, GPLVM), and the scikit-learn Python package. However, it does not provide specific version numbers for any of these software components, which is required for reproducibility. |
| Experiment Setup | Yes | In Sim1 and Sim3, we set the initial number of factors to k = 10. In Sim2, Sim4, Sim5, and Sim6, we set the initial number of factors to 15. We performed 20 runs for each version of inference in BASS: EM, MCMC-EM, and PX-EM. The hyperparameters of the global-factor-local TPB prior were set to a = b = c = d = e = f = 0.5, which recapitulates the horseshoe prior at all three levels of the hierarchy. The hyperparameters for the error variances, aσ and bσ, were set to 1 and 0.3 respectively. For sGFA, the ARD prior was set to Ga(10⁻³, 10⁻³), the prior on inclusion probabilities to beta(1, 1), and total MCMC iterations to 10⁵ with 1,000 sampling iterations and a thinning step of 5. For GFA, the ARD prior for both loadings and error variance was set to Ga(10⁻¹⁴, 10⁻¹⁴), maximum iterations to 10⁵, and L-BFGS optimization was used. For RCCA, regularization parameters were chosen using leave-one-out cross-validation on an 11 × 11 grid from 0.0001 to 0.01. For SCCA, the ℓ1 bound of the projection vector was set to 0.3√pw for w = 1, 2. For JFA, ARD priors were Ga(10⁻⁵, 10⁻⁵), beta process prior parameters were α = 0.1 and c = 10⁴, and MCMC ran for 1,000 iterations with a burn-in of 200. For MRD, the linard2 kernel was chosen for all observations. |
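The dataset-splits row describes simulating training data of size nt and test data of size ns = 200 from the same "true" factor model, with test factors drawn from N(0, 1). A minimal sketch of that generation procedure is below; the dimensions `p`, `k`, the loading sparsity level, and the error variance are illustrative assumptions, not the paper's exact Sim1–Sim6 settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the paper's Sim1-Sim6 configurations differ.
p, k = 50, 10        # observed features, latent factors
nt, ns = 30, 200     # training size (nt in {30, 50, ...}), test size

# "True" model parameters shared by train and test: sparse loadings
# (roughly 30% nonzero here, an assumed sparsity level) and isotropic noise.
Lambda = rng.normal(size=(p, k)) * (rng.random((p, k)) < 0.3)
sigma2 = 0.3         # error variance

def simulate(n):
    """Draw n samples from the linear factor model Y = Lambda @ X + eps,
    with factors X ~ N(0, 1) as stated in the report."""
    X = rng.normal(size=(k, n))
    eps = rng.normal(scale=np.sqrt(sigma2), size=(p, n))
    return Lambda @ X + eps

Y_train = simulate(nt)   # training data
Y_test = simulate(ns)    # held-out test data from the same true parameters
```

The key point the sketch illustrates is that the loadings and error variance are fixed once, and both splits are drawn from the same generative model, so test performance measures recovery of the true structure rather than refitting.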