reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Inconsistency of Pitman-Yor Process Mixtures for the Number of Components

Authors: Jeffrey W. Miller, Matthew T. Harrison

JMLR 2014

Reproducibility Variable	Result	LLM Response
Research Type	Theoretical	In this manuscript, we prove that under fairly general conditions, when using a Dirichlet process mixture, the posterior on the number of clusters will not concentrate at any finite value, and therefore will not be consistent for the number of components in a finite mixture. In fact, our results apply to a large class of nonparametric mixtures including DPMs, and Pitman-Yor process mixtures (PYMs) more generally, over a wide variety of families of component distributions.
Researcher Affiliation	Academia	Jeffrey W. Miller EMAIL Matthew T. Harrison EMAIL Division of Applied Mathematics Brown University Providence, RI 02912, USA
Pseudocode	No	The paper describes theoretical proofs and mathematical formulations but does not contain any clearly labeled pseudocode or algorithm blocks. The methods are explained in textual mathematical notation.
Open Source Code	No	The paper does not provide any explicit statements about releasing source code, nor does it include links to a code repository.
Open Datasets	Yes	We begin with a motivating example. In population genetics, determining the population structure is an important step in the analysis of sampled data. To illustrate, consider the impala, a species of antelope in southern Africa. ... Lorenzen et al. (2006) collected samples from 216 impalas, and analyzed the genetic variation between/within the two subspecies.
Dataset Splits	No	The paper mentions data from Lorenzen et al. (2006) and simulated data, but it does not provide specific details about how these datasets are split into training, validation, or test sets for experimental reproduction.
Hardware Specification	No	The paper does not provide specific hardware details (e.g., CPU, GPU models, memory) used for running the described simulations or analyses.
Software Dependencies	No	The paper mentions using 'Gibbs sampling (MacEachern, 1994; Neal, 2000)' for estimates, and 'Structure (Pritchard et al., 2000)' and 'Structurama (Huelsenbeck and Andolfatto, 2007)' as tools used in related works, but it does not specify version numbers for any software dependencies directly used in their own work or simulations.
Experiment Setup	Yes	For both (a) and (b), estimates were made via Gibbs sampling (MacEachern, 1994; Neal, 2000), with 10^5 burn-in sweeps and 2 × 10^5 sample sweeps.
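
The experiment setup above can be illustrated with a minimal sketch of a collapsed Gibbs sampler for a Dirichlet process mixture that discards burn-in sweeps and then tallies the number of occupied clusters on each sample sweep. This is not the authors' code. The component family (univariate normal with unit variance), the N(0, 1) prior on cluster means, the concentration parameter `alpha`, and the function names are all assumptions made for illustration, and the default sweep counts are far smaller than the 10^5 and 2 × 10^5 reported in the paper.

```python
import math
import random
from collections import Counter


def log_predictive(x, n_k, sum_k, prior_var=1.0, like_var=1.0):
    """Log posterior-predictive density of x for a cluster holding n_k points
    with sum sum_k (assumed model: Normal likelihood with known variance,
    Normal(0, prior_var) prior on the cluster mean)."""
    post_var = 1.0 / (1.0 / prior_var + n_k / like_var)
    post_mean = post_var * (sum_k / like_var)
    var = post_var + like_var
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - post_mean) ** 2 / var


def gibbs_num_clusters(data, alpha=1.0, burn_in=100, samples=200, seed=0):
    """Estimate the posterior on the number of occupied clusters.
    Returns a dict mapping cluster count -> fraction of sample sweeps."""
    rng = random.Random(seed)
    z = [0] * len(data)            # every point starts in cluster 0
    counts = {0: len(data)}        # cluster label -> number of points
    sums = {0: sum(data)}          # cluster label -> sum of its points
    next_label = 1
    tally = Counter()

    for sweep in range(burn_in + samples):
        for i, x in enumerate(data):
            # Remove point i from its current cluster.
            k = z[i]
            counts[k] -= 1
            sums[k] -= x
            if counts[k] == 0:
                del counts[k]
                del sums[k]

            # Unnormalized log weights: existing clusters, then a new one.
            labels = list(counts)
            logw = [math.log(counts[c]) + log_predictive(x, counts[c], sums[c])
                    for c in labels]
            logw.append(math.log(alpha) + log_predictive(x, 0, 0.0))

            # Subtract the max before exponentiating for numerical stability.
            m = max(logw)
            weights = [math.exp(v - m) for v in logw]
            choice = rng.choices(range(len(weights)), weights=weights)[0]

            if choice == len(labels):      # open a new cluster
                k_new = next_label
                next_label += 1
                counts[k_new] = 0
                sums[k_new] = 0.0
            else:
                k_new = labels[choice]
            z[i] = k_new
            counts[k_new] += 1
            sums[k_new] += x

        # Record the number of occupied clusters only after burn-in.
        if sweep >= burn_in:
            tally[len(counts)] += 1

    return {t: c / samples for t, c in sorted(tally.items())}
```

Calling `gibbs_num_clusters(data)` on a list of floats returns the estimated posterior over the number of clusters; in practice `burn_in` and `samples` would be raised toward the values quoted above.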