Bayesian Multi-Group Gaussian Process Models for Heterogeneous Group-Structured Data
Authors: Didong Li, Andrew Jones, Sudipto Banerjee, Barbara E. Engelhardt
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate inference in MGGPs through simulation experiments, and we apply our proposed MGGP regression framework to gene expression data to illustrate the behavior and enhanced inferential capabilities of multi-group Gaussian processes by jointly modeling continuous and categorical variables. |
| Researcher Affiliation | Academia | Didong Li, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA; Andrew Jones, Department of Computer Science, Princeton University, Princeton, NJ 08540, USA; Sudipto Banerjee, Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA 90095, USA; Barbara Engelhardt, Gladstone Institutes, San Francisco, CA 94158, USA, and Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA |
| Pseudocode | No | The paper describes mathematical models and inference procedures in detail but does not present them in a structured pseudocode or algorithm block format. |
| Open Source Code | Yes | Our GitHub repository is https://github.com/andrewcharlesjones/multi-group-GP. This repository contains downloadable code for the models and experiments to reproduce the analysis in the paper. We provide a Python package for model fitting, computing covariance functions, and carrying out estimation and prediction. |
| Open Datasets | Yes | We applied the MGGP to a large gene expression data set collected by the Genotype-Tissue Expression (GTEx) project (Consortium et al., 2020). The GTEx data can be downloaded from the GTEx portal: https://gtexportal.org/home/datasets. |
| Dataset Splits | Yes | We fit these models to each of the data sets using 50% of the data for training, and we test our predictions over the remaining data. |
| Hardware Specification | Yes | Experiments were run on an internal computing cluster using 320 NVIDIA P100 Graphics Processing Units. |
| Software Dependencies | No | The paper mentions using Python, JAX software framework (Bradbury et al., 2018), and the Stan programming environment (Stan Development Team, 2020; Riddell et al., 2021) but does not provide specific version numbers for JAX or Stan, which are key components for reproducibility. |
| Experiment Setup | Yes | With θ = {a, b, σ²}, the prior distribution in Equation (3) is specified as p({τ²_j}, θ, β) = IG(a \| α_a, α_a) IG(b \| α_b, α_b) IG(σ² \| α_σ, α_σ) ∏_{j=1}^{k} IG(τ²_j \| α_{τ_j}, α_{τ_j}) N(β \| μ_β, V_β), (10) where we set α_a = α_b = α_{τ_1} = α_{τ_2} = 5, α_σ = 1, μ_β = 0, and V_β⁻¹ = I. ... We ran four chains with dispersed initial values for 1,200 iterations each. Convergence was diagnosed after 200 iterations using visual inspection of autocorrelation plots (Figure 10) and computation of Gelman-Rubin R-hat statistics and Monte Carlo standard errors. The subsequent 4,000 samples were retained for posterior inference. |