Data Thinning for Convolution-Closed Distributions

Authors: Anna Neufeld, Ameer Dharamshi, Lucy L. Gao, Daniela Witten

JMLR 2024

Reproducibility Variable — Result — LLM Response
Research Type — Experimental — "In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis, for which traditional sample splitting is unattractive or unavailable."
Researcher Affiliation — Academia — Anna Neufeld, Public Health Sciences Division, Fred Hutchinson Cancer Center, Seattle, WA 98109, USA; Ameer Dharamshi, Department of Biostatistics, University of Washington, Seattle, WA 98195, USA; Lucy L. Gao, Department of Statistics, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada; Daniela Witten, Departments of Statistics and Biostatistics, University of Washington, Seattle, WA 98195, USA
Pseudocode — Yes — Algorithm 1: Data thinning; Algorithm 2: Multifold data thinning; Algorithm 3: Evaluating binomial principal components with negative log-likelihood loss; Algorithm 4: Evaluating gamma clusters with negative log-likelihood loss
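As a concrete illustration of the idea behind Algorithm 2 (multifold data thinning), here is a minimal Python sketch for the Poisson case, where the folds are drawn from a multinomial distribution conditional on the observed counts. The function name, fold proportions, and data dimensions are illustrative choices, not taken from the paper's R package.

```python
import numpy as np

rng = np.random.default_rng(1)

def multifold_thin_poisson(X, epsilons, rng):
    """Split Poisson counts into M mutually independent folds.

    Given X_ij ~ Poisson(lambda_ij), drawing
      (X^(1), ..., X^(M)) | X ~ Multinomial(X, (eps_1, ..., eps_M))
    yields independent folds with X^(m)_ij ~ Poisson(eps_m * lambda_ij).
    """
    epsilons = np.asarray(epsilons, dtype=float)
    assert np.isclose(epsilons.sum(), 1.0), "fold proportions must sum to 1"
    # One multinomial draw per entry of X; result has shape (X.size, M).
    draws = rng.multinomial(X.ravel(), epsilons)
    return [draws[:, m].reshape(X.shape) for m in range(len(epsilons))]

# Illustrative dimensions only (not the paper's simulation settings).
X = rng.poisson(lam=5.0, size=(250, 100))
folds = multifold_thin_poisson(X, [0.25, 0.25, 0.5], rng)
```

By construction the folds sum back to the original matrix entrywise, which is the convolution-closure property that data thinning exploits.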
Open Source Code — Yes — "An R package implementing data thinning and scripts to reproduce the results in this paper are available at https://anna-neufeld.github.io/datathin/."
Open Datasets — Yes — "The data set is freely available from 10X Genomics, and was previously analyzed in the Guided Clustering Tutorial vignette (Hoffman et al., 2022) for the popular R package Seurat (Hao et al., 2021; Stuart et al., 2019; Satija et al., 2015)."
Dataset Splits — Yes — "Step 1: Split the data into a training set and a test set. Sample splitting: randomly generate a set train ⊂ {1, ..., n} with |train| = εn. Data thinning: apply Algorithm 1 to each X_i for i = 1, ..., n. Let {(Z_i, X_i^(1)) : i ∈ {1, ..., n}} be the training set and let {(Z_i, X_i^(2)) : i ∈ {1, ..., n}} be the test set. We carry out each method using ε = 0.2, ε = 0.5, and ε = 0.8." Algorithm 3 (Evaluating binomial principal components with negative log-likelihood loss) — Input: a positive integer K; a matrix X ∈ Z_{[0,r]}^{n×d}, where X_ij ~ind. Binomial(r, p_ij); and positive scalars ε^(train) and ε^(test) = 1 − ε^(train) such that ε^(train)·r, ε^(test)·r ∈ Z_{>0}. Step 1: apply data thinning to X to obtain X^(train) and X^(test), where X_ij^(train) ~ind. Binomial(ε^(train)·r, p_ij) and X_ij^(test) ~ind. Binomial(ε^(test)·r, p_ij).
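The binomial thinning step quoted above can be sketched in a few lines: conditional on X = x, the training fold is hypergeometric, which marginally yields the two independent binomial folds. This is a hedged Python sketch, not the paper's implementation (that is the datathin R package); the dimensions and probabilities below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def thin_binomial(X, r, eps_train, rng):
    """Thin X_ij ~ Binomial(r, p_ij) into independent train/test folds.

    Conditional draw: X^(train) | X = x ~ Hypergeometric(eps*r, (1-eps)*r, x).
    Marginally, X^(train) ~ Binomial(eps*r, p) and
    X^(test) = X - X^(train) ~ Binomial((1-eps)*r, p), independently.
    Requires eps_train * r to be a positive integer.
    """
    ngood = int(eps_train * r)
    assert ngood == eps_train * r, "eps_train * r must be an integer"
    X_train = rng.hypergeometric(ngood, r - ngood, X)
    return X_train, X - X_train

# Illustrative setting: n = 250, d = 100, r = 100 (probabilities are made up).
n, d, r = 250, 100, 100
p = rng.uniform(0.2, 0.8, size=(n, d))
X = rng.binomial(r, p)
X_train, X_test = thin_binomial(X, r, eps_train=0.5, rng=rng)
```

With ε^(train) = 0.5 and r = 100, each fold behaves like an independent Binomial(50, p_ij) sample, so the folds sum exactly to X.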
Hardware Specification — No — No specific hardware details (such as CPU/GPU models, memory, or cloud instance types) are provided for running the experiments. The paper focuses on methodology and results.
Software Dependencies — No — The paper mentions an R package implementing data thinning and refers to the R packages Seurat and sctransform, but it does not specify version numbers for R or for any of the packages. Thus, no specific ancillary software details with versions are provided.
Experiment Setup — Yes — "In this section, we focus on the application of data thinning to cross-validation in two settings. ... Example 11 (Choosing the number of principal components on binomial data): We generate data with n = 250 observations and d = 100 dimensions. Specifically, for i = 1, ..., n and j = 1, ..., d, we generate X_ij ~ind. Binomial(r, p_ij), where r = 100 and p is an unknown n × d matrix of probabilities. We construct logit(p) as a rank-K = 10 matrix with singular values 5, 6, ..., 14. ... Example 12 (Choosing the number of clusters on gamma data): We generate data sets X ∈ R^{n×d} such that there are 100 observations in each of K clusters, for a total of n = 100K observations. ... We let X_ij ~ind. Gamma(λ, θ_{c_i, j}) for i = 1, ..., n and j = 1, ..., d, where c_i ∈ {1, 2, ..., K} indexes the true cluster membership of the ith observation. The shape parameter λ is a known constant common across all clusters and all dimensions, whereas the rate parameter θ is an unknown K × d matrix such that each cluster has its own d-dimensional rate parameter. We generate data under two regimes: (1) a small-d, small-K regime in which d = 2 and K = 4, and (2) a large-d, large-K regime in which d = 100 and K = 10. The values of λ and θ are provided in Section D." Algorithms 3 and 4 specify parameters such as ε^(train) and ε^(test), the use of k-means, and specific loss functions. Algorithm 3, Step 2: "Pseudo-counts prevent taking the logit of 0 or 1." Step 2: "We use the R function step with its default settings."
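The gamma-thinning step underlying Example 12 can be sketched with a beta draw, using the fact that the shape parameter λ is known. This is a minimal Python sketch under a toy version of the small-d, small-K regime (d = 2, K = 4); the θ values below are illustrative placeholders, not the values from Section D of the paper.

```python
import numpy as np

rng = np.random.default_rng(7)

def thin_gamma(X, shape_lam, eps_train, rng):
    """Thin X_ij ~ Gamma(shape = lam, rate = theta_ij) with known shape lam.

    Drawing B_ij ~ Beta(eps*lam, (1-eps)*lam) independently and setting
      X^(train) = B * X,   X^(test) = (1 - B) * X
    yields independent folds X^(train)_ij ~ Gamma(eps*lam, theta_ij) and
    X^(test)_ij ~ Gamma((1-eps)*lam, theta_ij).
    """
    B = rng.beta(eps_train * shape_lam, (1.0 - eps_train) * shape_lam,
                 size=X.shape)
    return B * X, (1.0 - B) * X

# Toy regime: K = 4 clusters, d = 2 dimensions, 100 observations per cluster.
# These theta values are made up for illustration.
lam, K, d = 2.0, 4, 2
theta = rng.uniform(0.5, 2.0, size=(K, d))       # one rate vector per cluster
clusters = np.repeat(np.arange(K), 100)          # true cluster memberships c_i
X = rng.gamma(shape=lam, scale=1.0 / theta[clusters])
X_train, X_test = thin_gamma(X, lam, eps_train=0.5, rng=rng)
```

Unlike the count-valued cases, the gamma folds are continuous, but they still sum exactly to the original data, so the train fold can drive clustering while the test fold evaluates the negative log-likelihood loss.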