Bayesian Multi-Group Gaussian Process Models for Heterogeneous Group-Structured Data

Authors: Didong Li, Andrew Jones, Sudipto Banerjee, Barbara E. Engelhardt

JMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate inference in MGGPs through simulation experiments, and we apply our proposed MGGP regression framework to gene expression data to illustrate the behavior and enhanced inferential capabilities of multi-group Gaussian processes by jointly modeling continuous and categorical variables.
Researcher Affiliation | Academia | Didong Li, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA; Andrew Jones, Department of Computer Science, Princeton University, Princeton, NJ 08540, USA; Sudipto Banerjee, Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA 90095, USA; Barbara Engelhardt, Gladstone Institutes, San Francisco, CA 94158, USA, and Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
Pseudocode | No | The paper describes mathematical models and inference procedures in detail but does not present them in a structured pseudocode or algorithm block format.
Open Source Code | Yes | Our GitHub repository is https://github.com/andrewcharlesjones/multi-group-GP. This repository contains downloadable code for the models and experiments to reproduce the analysis in the paper. We provide a Python package for model fitting, computing covariance functions, and carrying out estimation and prediction.
Open Datasets | Yes | We applied the MGGP to a large gene expression data set collected by the Genotype-Tissue Expression (GTEx) project (GTEx Consortium, 2020). The GTEx data can be downloaded from the GTEx portal: https://gtexportal.org/home/datasets.
Dataset Splits | Yes | We fit these models to each of the data sets using 50% of the data for training, and we test our predictions over the remaining data.
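The 50/50 evaluation protocol quoted above can be sketched as a simple random split. This is a generic illustration, not code from the paper's repository; the function and variable names are hypothetical:

```python
import numpy as np

def train_test_split_half(X, y, seed=0):
    """Randomly assign 50% of the samples to training and the rest to testing."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    perm = rng.permutation(n)        # random reordering of sample indices
    n_train = n // 2
    train_idx, test_idx = perm[:n_train], perm[n_train:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Example with synthetic data standing in for a real expression matrix
X = np.random.default_rng(0).normal(size=(100, 3))
y = np.random.default_rng(1).normal(size=100)
X_tr, y_tr, X_te, y_te = train_test_split_half(X, y)
print(X_tr.shape, X_te.shape)  # (50, 3) (50, 3)
```

A fixed seed makes the split itself reproducible across runs, which matters when comparing models on the same held-out half.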
Hardware Specification | Yes | Experiments were run on an internal computing cluster using 320 NVIDIA P100 Graphics Processing Units.
Software Dependencies | No | The paper mentions using Python, the JAX software framework (Bradbury et al., 2018), and the Stan programming environment (Stan Development Team, 2020; Riddell et al., 2021), but it does not provide specific version numbers for JAX or Stan, which are key components for reproducibility.
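One lightweight way to close this gap when rerunning the code is to record whatever dependency versions are actually installed. This is a generic sketch, not part of the paper's repository; the package names are simply the ones the report mentions:

```python
# Record installed versions of the paper's stated dependencies so a rerun
# can be documented precisely. Prints "<pkg> not installed" if absent.
from importlib import metadata

for pkg in ["jax", "pystan"]:
    try:
        print(f"{pkg}=={metadata.version(pkg)}")
    except metadata.PackageNotFoundError:
        print(f"{pkg} not installed")
```

Writing this output into a log next to the results would let readers match a run to an exact software environment.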
Experiment Setup | Yes | With θ = {a, b, σ^2}, the prior distribution in Equation (3) is specified as p({τ_j^2}, θ, β) = IG(a | α_a, α_a) IG(b | α_b, α_b) IG(σ^2 | α_σ, α_σ) ∏_{j=1}^{k} IG(τ_j^2 | α_{τ_j}, α_{τ_j}) N(β | μ_β, V_β) (Equation 10), where we set α_a = α_b = α_{τ_1} = α_{τ_2} = 5, α_σ = 1, μ_β = 0, and V_β^{-1} = I. ... We ran four chains with dispersed initial values for 1,200 iterations each. Convergence was diagnosed after 200 iterations using visual inspection of autocorrelation plots (Figure 10) and computation of Gelman-Rubin R-hat statistics and Monte Carlo standard errors. The subsequent 4,000 samples were retained for posterior inference.
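As a sketch of the Gelman-Rubin check quoted above, the classical R-hat statistic can be computed directly from the retained draws. This is the non-split textbook formulation (modern samplers typically report a split-chain, rank-normalized variant), and the variable names are illustrative rather than taken from the paper's code:

```python
import numpy as np

def gelman_rubin_rhat(chains):
    """Classical Gelman-Rubin R-hat for an array of shape (m_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_plus = (n - 1) / n * W + B / n       # pooled posterior variance estimate
    return np.sqrt(var_plus / W)

# Four chains of 1,000 post-burn-in draws each, matching the setup above
rng = np.random.default_rng(1)
chains = rng.normal(size=(4, 1000))
print(round(gelman_rubin_rhat(chains), 3))  # close to 1 for well-mixed chains
```

Values near 1 indicate the dispersed chains have mixed into the same distribution; values well above 1 (e.g., > 1.1) flag non-convergence.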