Consistent Model-based Clustering using the Quasi-Bernoulli Stick-breaking Process
Authors: Cheng Zeng, Jeffrey W Miller, Leo L Duan
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In simulations and a data application of clustering brain networks, our proposed method recovers the ground-truth number of clusters, and leads to a small number of clusters. Keywords: Consistent Clustering, Exchangeable Partition Probability Function, Sparse Simplex. |
| Researcher Affiliation | Academia | Cheng Zeng, Department of Statistics, University of Florida, Gainesville, FL 32611, USA; Jeffrey W Miller, Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Leo L Duan, Department of Statistics, University of Florida, Gainesville, FL 32611, USA |
| Pseudocode | Yes | 3. Posterior Sampling Algorithm. Since the quasi-Bernoulli mixture model involves a small modification to the classic stick-breaking construction, we can use an efficient slice sampling algorithm [Kalli et al. (2011), the improved version of Walker (2007)] for posterior inference. We use a sequence of decreasing positive constants ξ1, ξ2, . . . that converges to zero; in this article, we choose ξi = 0.5^i for i ≥ 1, as suggested in Kalli et al. (2011). Given ci (the component assignment), consider a latent uniform ui ~ Uniform(0, ξ_{ci}); then we have a joint likelihood proportional to ∏_{i=1}^n 1(ui < ξ_{ci}) (w_{ci}/ξ_{ci}) f_{θ_{ci}}(yi). We define the state of the Markov chain to be (c, θ, w, u), and the target distribution is the posterior p(c, θ, w, u \| y), where y = y_{1:n}, c = c_{1:n}, θ = θ_{1:∞}, and w = w_{1:∞}. The slice sampler iterates the following steps: 1. Sample c from its full conditional. For i = 1, . . . , n: sample ci ~ Categorical(w̃_k), where w̃_k = (w_k/ξ_k) f_{θ_k}(yi) / Σ_{l: ξ_l > ui} (w_l/ξ_l) f_{θ_l}(yi), for k ∈ {l : ξ_l > ui}. Since the sequence ξ1, ξ2, . . . converges to zero, the index set {l : ξ_l > ui} is finite. Compute n_k := Σ_i 1(ci = k) and m_k := Σ_i 1(ci > k). 2. Sample u from its full conditional. For i = 1, . . . , n: sample ui from the uniform distribution on the interval (0, ξ_{ci}). 3. Sample w from its full conditional. For k ∈ ∪_{i=1}^n {l : ξ_l > ui}: sample b_k ~ q δ_1(·) + (1 − q) δ_ε(·), where q = p / (p + (1 − p) ε^α I_ε(m_k + α, n_k + 1)); sample β_k by drawing X ~ Beta_{(0,b_k)}(m_k + α, n_k + 1) and setting β_k = X/b_k, where Beta_{(0,ε)} denotes a Beta distribution truncated to the interval (0, ε); compute w_k from b_{1:k} and β_{1:k} using Equation (2). 4. Sample θ from its full conditional. For k ∈ ∪_{i=1}^n {l : ξ_l > ui}: sample θ_k from the distribution proportional to g(θ_k) ∏_{i: ci = k} f_{θ_k}(yi), where g is the density of the base measure G. |
| Open Source Code | Yes | A software implementation and the steps needed to replicate the results in this paper are provided on https://github.com/zengcheng/quasi-bernoulli-stick-breaking. |
| Open Datasets | Yes | To demonstrate the ease of using our model in an advanced data analysis, we apply it to cluster multiple brain networks, collected from n = 812 subjects in the human connectome project (Marcus et al., 2011). |
| Dataset Splits | No | We first generate data with sample sizes n ∈ {50, 100, 250, 1000, 2500} from a three-component univariate Gaussian mixture distribution: 0.3 N(−4, 1^2) + 0.3 N(0, 1^2) + 0.4 N(5, 1^2). (Section 4.1) To demonstrate the ease of using our model in an advanced data analysis, we apply it to cluster multiple brain networks, collected from n = 812 subjects in the human connectome project (Marcus et al., 2011). (Section 5) |
| Hardware Specification | Yes | The algorithms are implemented in R, and run on a 4.0 GHz processor. |
| Software Dependencies | No | The algorithms are implemented in R, and run on a 4.0 GHz processor. |
| Experiment Setup | Yes | We set p = 0.9 for the quasi-Bernoulli probability in Equation (2), yielding a prior mean of no more than 1/(1 − p) = 10 components with mixture weights larger than ϵ. ... We set α = 1 and ϵ = 1/n^2.1, which satisfies the theoretical condition that ϵ = o(1/n^2) in Theorem 5. ... For each experiment, we run the Markov chain for 50,000 iterations, discard the first 20,000 as burn-in, and use thinning by keeping only every 50th iteration. (Section 4) For the quasi-Bernoulli prior on w, we use p = 0.9, α = 1 and ϵ = 1/n^2.1. We run the MCMC sampler from Section 3 for 30,000 iterations and discard the first 10,000 as burn-in. (Section 5) |
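The paper's implementation is in R; as a rough illustration of the prior construction underlying step 3 of the sampler, here is a Python sketch. It assumes (hypothetically; the exact parameterization is given by Equation (2) in the paper) that the quasi-Bernoulli weights follow a standard stick-breaking map w_k = v_k ∏_{l<k}(1 − v_l), with each stick v_k = b_k β_k either "full" (b_k = 1, with probability p) or shrunk below ϵ (b_k = ϵ), and β_k ~ Beta(1, α):

```python
import random

def qb_stick_weights(K, p=0.9, alpha=1.0, eps=1e-4, rng=random):
    """Sketch of quasi-Bernoulli stick-breaking weights (hypothetical
    reading of Equation (2)): v_k = b_k * beta_k, where b_k = 1 with
    probability p and b_k = eps otherwise, and beta_k ~ Beta(1, alpha)."""
    weights, remaining = [], 1.0
    for _ in range(K):
        b = 1.0 if rng.random() < p else eps
        beta = rng.betavariate(1.0, alpha)
        v = b * beta          # stick proportion, either full or at most eps
        weights.append(remaining * v)
        remaining *= (1.0 - v)  # length of the unbroken remainder
    return weights
```

With p = 0.9, on average only about 1/(1 − p) = 10 of the first sticks are "full", matching the prior-mean interpretation quoted in the Experiment Setup row.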
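Step 1 of the slice sampler (sampling each assignment c_i over the finite active set {k : ξ_k > u_i}) can be sketched as follows; the Gaussian density f and the 0-based indexing are illustrative choices, not the paper's code:

```python
import math
import random

def sample_assignment(y_i, u_i, weights, thetas):
    """Slice-sampler step 1: draw c_i from Categorical with probabilities
    proportional to (w_k / xi_k) * f_theta_k(y_i), restricted to the
    finite set {k : xi_k > u_i}, with xi_k = 0.5**k as in the paper."""
    def f(y, theta):
        # Illustrative unit-variance Gaussian kernel density
        return math.exp(-0.5 * (y - theta) ** 2) / math.sqrt(2 * math.pi)
    xi = lambda k: 0.5 ** k  # decreasing slice constants, xi_k -> 0
    active = [k for k in range(len(weights)) if xi(k + 1) > u_i]
    probs = [weights[k] / xi(k + 1) * f(y_i, thetas[k]) for k in active]
    total = sum(probs)
    return random.choices(active, weights=[q / total for q in probs])[0]
```

Because ξ_k = 0.5^k decreases to zero, the active set is always finite, which is what makes the infinite mixture tractable to sample.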
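The Section 4.1 simulation data (quoted in the Dataset Splits row) can be regenerated with a few lines; this is a generic Python sketch of the stated three-component mixture, not the authors' R script:

```python
import random

def simulate_mixture(n, seed=None):
    """Draw n points from the Section 4.1 mixture:
    0.3 N(-4, 1^2) + 0.3 N(0, 1^2) + 0.4 N(5, 1^2)."""
    rng = random.Random(seed)
    means, probs = [-4.0, 0.0, 5.0], [0.3, 0.3, 0.4]
    data = []
    for _ in range(n):
        k = rng.choices(range(3), weights=probs)[0]  # pick a component
        data.append(rng.gauss(means[k], 1.0))        # unit-sd Gaussian draw
    return data
```

The paper runs this for each sample size n ∈ {50, 100, 250, 1000, 2500}.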