Data Thinning for Convolution-Closed Distributions

Authors: Anna Neufeld, Ameer Dharamshi, Lucy L. Gao, Daniela Witten

JMLR 2024

Reproducibility Variable — Result — LLM Response
Research Type — Experimental — "In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis, for which traditional sample splitting is unattractive or unavailable."
Researcher Affiliation — Academia — Anna Neufeld, Public Health Sciences Division, Fred Hutchinson Cancer Center, Seattle, WA 98109, USA; Ameer Dharamshi, Department of Biostatistics, University of Washington, Seattle, WA 98195, USA; Lucy L. Gao, Department of Statistics, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada; Daniela Witten, Departments of Statistics and Biostatistics, University of Washington, Seattle, WA 98195, USA
Pseudocode — Yes — Algorithm 1: Data thinning; Algorithm 2: Multifold data thinning; Algorithm 3: Evaluating binomial principal components with negative log-likelihood loss; Algorithm 4: Evaluating gamma clusters with negative log-likelihood loss
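As a concrete illustration of the idea behind Algorithm 2 (multifold data thinning), here is a minimal Python sketch for the Poisson case, where the folds are drawn from a multinomial distribution conditional on the observed counts. The function name, fold proportions, and data dimensions are illustrative choices, not taken from the paper's R package.

```python
import numpy as np

rng = np.random.default_rng(1)

def multifold_thin_poisson(X, epsilons, rng):
    """Split Poisson counts into M mutually independent folds.

    Given X_ij ~ Poisson(lambda_ij), drawing
      (X^(1), ..., X^(M)) | X ~ Multinomial(X, (eps_1, ..., eps_M))
    yields independent folds with X^(m)_ij ~ Poisson(eps_m * lambda_ij).
    """
    epsilons = np.asarray(epsilons, dtype=float)
    assert np.isclose(epsilons.sum(), 1.0), "fold proportions must sum to 1"
    # One multinomial draw per entry of X; result has shape (X.size, M).
    draws = rng.multinomial(X.ravel(), epsilons)
    return [draws[:, m].reshape(X.shape) for m in range(len(epsilons))]

# Illustrative dimensions only (not the paper's simulation settings).
X = rng.poisson(lam=5.0, size=(250, 100))
folds = multifold_thin_poisson(X, [0.25, 0.25, 0.5], rng)
```

By construction the folds sum back to the original matrix entrywise, which is the convolution-closure property that data thinning exploits.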
Open Source Code — Yes — "An R package implementing data thinning and scripts to reproduce the results in this paper are available at https://anna-neufeld.github.io/datathin/."
Open Datasets — Yes — "The data set is freely available from 10X Genomics, and was previously analyzed in the Guided Clustering Tutorial vignette (Hoffman et al., 2022) for the popular R package Seurat (Hao et al., 2021; Stuart et al., 2019; Satija et al., 2015)."
Dataset Splits — Yes — "Step 1: Split the data into a training set and a test set. Sample splitting: randomly generate a set train ⊂ {1, ..., n} with |train| = εn. Data thinning: apply Algorithm 1 to each X_i for i = 1, ..., n. Let {(Z_i, X_i^(1)) : i ∈ {1, ..., n}} be the training set and let {(Z_i, X_i^(2)) : i ∈ {1, ..., n}} be the test set. We carry out each method using ε = 0.2, ε = 0.5, and ε = 0.8." Algorithm 3 (Evaluating binomial principal components with negative log-likelihood loss) — Input: a positive integer K; a matrix X ∈ Z_{[0,r]}^{n×d}, where X_ij ~ind. Binomial(r, p_ij); and positive scalars ε^(train) and ε^(test) = 1 − ε^(train) such that ε^(train)·r, ε^(test)·r ∈ Z_{>0}. Step 1: apply data thinning to X to obtain X^(train) and X^(test), where X_ij^(train) ~ind. Binomial(ε^(train)·r, p_ij) and X_ij^(test) ~ind. Binomial(ε^(test)·r, p_ij).
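The binomial thinning step quoted above can be sketched in a few lines: conditional on X = x, the training fold is hypergeometric, which marginally yields the two independent binomial folds. This is a hedged Python sketch, not the paper's implementation (that is the datathin R package); the dimensions and probabilities below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def thin_binomial(X, r, eps_train, rng):
    """Thin X_ij ~ Binomial(r, p_ij) into independent train/test folds.

    Conditional draw: X^(train) | X = x ~ Hypergeometric(eps*r, (1-eps)*r, x).
    Marginally, X^(train) ~ Binomial(eps*r, p) and
    X^(test) = X - X^(train) ~ Binomial((1-eps)*r, p), independently.
    Requires eps_train * r to be a positive integer.
    """
    ngood = int(eps_train * r)
    assert ngood == eps_train * r, "eps_train * r must be an integer"
    X_train = rng.hypergeometric(ngood, r - ngood, X)
    return X_train, X - X_train

# Illustrative setting: n = 250, d = 100, r = 100 (probabilities are made up).
n, d, r = 250, 100, 100
p = rng.uniform(0.2, 0.8, size=(n, d))
X = rng.binomial(r, p)
X_train, X_test = thin_binomial(X, r, eps_train=0.5, rng=rng)
```

With ε^(train) = 0.5 and r = 100, each fold behaves like an independent Binomial(50, p_ij) sample, so the folds sum exactly to X.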
Hardware Specification — No — No specific hardware details (such as CPU/GPU models, memory, or cloud instance types) are provided for running the experiments. The paper focuses on methodology and results.
Software Dependencies — No — The paper mentions an R package implementing data thinning and refers to the R packages Seurat and sctransform, but it does not specify version numbers for R or for any of the packages. Thus, no specific ancillary software details with versions are provided.
Experiment Setup — Yes — "In this section, we focus on the application of data thinning to cross-validation in two settings. ... Example 11 (Choosing the number of principal components on binomial data): We generate data with n = 250 observations and d = 100 dimensions. Specifically, for i = 1, ..., n and j = 1, ..., d, we generate X_ij ~ind. Binomial(r, p_ij), where r = 100 and p is an unknown n × d matrix of probabilities. We construct logit(p) as a rank-K = 10 matrix with singular values 5, 6, ..., 14. ... Example 12 (Choosing the number of clusters on gamma data): We generate data sets X ∈ R^{n×d} such that there are 100 observations in each of K clusters, for a total of n = 100K observations. ... We let X_ij ~ind. Gamma(λ, θ_{c_i, j}) for i = 1, ..., n and j = 1, ..., d, where c_i ∈ {1, 2, ..., K} indexes the true cluster membership of the ith observation. The shape parameter λ is a known constant common across all clusters and all dimensions, whereas the rate parameter θ is an unknown K × d matrix such that each cluster has its own d-dimensional rate parameter. We generate data under two regimes: (1) a small-d, small-K regime in which d = 2 and K = 4, and (2) a large-d, large-K regime in which d = 100 and K = 10. The values of λ and θ are provided in Section D." Algorithms 3 and 4 specify parameters such as ε^(train) and ε^(test), the use of k-means, and specific loss functions. Algorithm 3, Step 2: "Pseudo-counts prevent taking the logit of 0 or 1." Step 2: "We use the R function step with its default settings."
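The gamma-thinning step underlying Example 12 can be sketched with a beta draw, using the fact that the shape parameter λ is known. This is a minimal Python sketch under a toy version of the small-d, small-K regime (d = 2, K = 4); the θ values below are illustrative placeholders, not the values from Section D of the paper.

```python
import numpy as np

rng = np.random.default_rng(7)

def thin_gamma(X, shape_lam, eps_train, rng):
    """Thin X_ij ~ Gamma(shape = lam, rate = theta_ij) with known shape lam.

    Drawing B_ij ~ Beta(eps*lam, (1-eps)*lam) independently and setting
      X^(train) = B * X,   X^(test) = (1 - B) * X
    yields independent folds X^(train)_ij ~ Gamma(eps*lam, theta_ij) and
    X^(test)_ij ~ Gamma((1-eps)*lam, theta_ij).
    """
    B = rng.beta(eps_train * shape_lam, (1.0 - eps_train) * shape_lam,
                 size=X.shape)
    return B * X, (1.0 - B) * X

# Toy regime: K = 4 clusters, d = 2 dimensions, 100 observations per cluster.
# These theta values are made up for illustration.
lam, K, d = 2.0, 4, 2
theta = rng.uniform(0.5, 2.0, size=(K, d))       # one rate vector per cluster
clusters = np.repeat(np.arange(K), 100)          # true cluster memberships c_i
X = rng.gamma(shape=lam, scale=1.0 / theta[clusters])
X_train, X_test = thin_gamma(X, lam, eps_train=0.5, rng=rng)
```

Unlike the count-valued cases, the gamma folds are continuous, but they still sum exactly to the original data, so the train fold can drive clustering while the test fold evaluates the negative log-likelihood loss.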