Bayesian Multi-Group Gaussian Process Models for Heterogeneous Group-Structured Data
Authors: Didong Li, Andrew Jones, Sudipto Banerjee, Barbara E. Engelhardt
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate inference in MGGPs through simulation experiments, and we apply our proposed MGGP regression framework to gene expression data to illustrate the behavior and enhanced inferential capabilities of multi-group Gaussian processes by jointly modeling continuous and categorical variables. |
| Researcher Affiliation | Academia | Didong Li, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA; Andrew Jones, Department of Computer Science, Princeton University, Princeton, NJ 08540, USA; Sudipto Banerjee, Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA 90095, USA; Barbara Engelhardt, Gladstone Institutes, San Francisco, CA 94158, USA, and Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA |
| Pseudocode | No | The paper describes mathematical models and inference procedures in detail but does not present them in a structured pseudocode or algorithm block format. |
| Open Source Code | Yes | Our GitHub repository is https://github.com/andrewcharlesjones/multi-group-GP. This repository contains downloadable code for the models and experiments to reproduce the analysis in the paper. We provide a Python package for model fitting, computing covariance functions, and carrying out estimation and prediction. |
| Open Datasets | Yes | We applied the MGGP to a large gene expression data set collected by the Genotype-Tissue Expression (GTEx) project (Consortium et al., 2020). The GTEx data can be downloaded from the GTEx portal: https://gtexportal.org/home/datasets. |
| Dataset Splits | Yes | We fit these models to each of the data sets using 50% of the data for training, and we test our predictions over the remaining data. |
| Hardware Specification | Yes | Experiments were run on an internal computing cluster using 320 NVIDIA P100 Graphics Processing Units. |
| Software Dependencies | No | The paper mentions using Python, JAX software framework (Bradbury et al., 2018), and the Stan programming environment (Stan Development Team, 2020; Riddell et al., 2021) but does not provide specific version numbers for JAX or Stan, which are key components for reproducibility. |
| Experiment Setup | Yes | With θ = {a, b, σ²}, the prior distribution in Equation (3) is specified as p({τ²_j}, θ, β) = IG(a \| α_a, α_a) IG(b \| α_b, α_b) IG(σ² \| α_σ, α_σ) ∏_{j=1}^{k} IG(τ²_j \| α_{τ_j}, α_{τ_j}) N(β \| μ_β, V_β), (10) where we set α_a = α_b = α_{τ_1} = α_{τ_2} = 5, α_σ = 1, μ_β = 0, and V_β⁻¹ = I. ... We ran four chains with dispersed initial values for 1,200 iterations each. Convergence was diagnosed after 200 iterations using visual inspection of autocorrelation plots (Figure 10) and computation of Gelman-Rubin R-hat statistics and Monte Carlo standard errors. The subsequent 4,000 samples were retained for posterior inference. |