reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

The Search Problem in Mixture Models

Authors: Avik Ray, Joe Neeman, Sujay Sanghavi, Sanjay Shakkottai

JMLR 2017

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We simulate our algorithm on both real and synthetic data sets for the Gaussian mixture model, topic model, and subspace clustering applications. Our experiments on real data sets (NY Times, Yelp, BSDS500) further demonstrate the practicality of our algorithms showing significant improvement in runtime and accuracy. In this section we present the empirical performance of our Whitening, Cancellation, and Subspace clustering algorithms. We consider three of the settings: the Gaussian Mixture Model (GMM), and Latent Dirichlet Allocation (LDA), and Subspace clustering, and validate our algorithms on both real and synthetic data sets.
Researcher Affiliation	Academia	Avik Ray EMAIL Department of Electrical and Computer Engineering University of Texas at Austin Austin, TX 78701, USA Joe Neeman EMAIL Department of Mathematics Rheinische Friedrich-Wilhelms-Universität Bonn D-53115 Bonn, Germany Sujay Sanghavi EMAIL Department of Electrical and Computer Engineering University of Texas at Austin Austin, TX 78701, USA Sanjay Shakkottai EMAIL Department of Electrical and Computer Engineering University of Texas at Austin Austin, TX 78701, USA
Pseudocode	Yes	Algorithm 1 Extracting a mixture component from side information: the whitening method. Algorithm 2 Extracting a mixture component from side information: the cancellation method. Algorithm 3 Subspace clustering algorithm
Open Source Code	No	The paper does not contain any explicit statements about open-sourcing their code, nor does it provide any links to code repositories. It discusses the algorithms implemented and their performance, but not their public availability.
Open Datasets	Yes	Our experiments on real data sets (NY Times, Yelp, BSDS500) further demonstrate the practicality of our algorithms... NY Times data set [UCI 2008] (300,000 articles) (b) Yelp data set of business reviews [Yelp 2014] (335,022 reviews)... BSDS500 data set introduced in Arbelaez et al. (2011)
Dataset Splits	No	The paper mentions generating synthetic data with various parameters and for real datasets, it states: "For this we consider the BSDS500 data set introduced in Arbelaez et al. (2011) and choose a subset of 70 images having less than 4 segments in the ground truth. Note that this data set has up to six ground truth segmentation by human users for each image. We randomly choose one pixel from each segment in ground truth as side-information v." This describes how a subset was chosen and side-information was derived, but not explicit train/test/validation splits for model training or evaluation as typically required for reproduction.
Hardware Specification	No	The authors also acknowledge the Texas Advanced Computing Center [TACC 2018] at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper. While TACC is an HPC resource, no specific hardware models (e.g., GPU, CPU models) or configurations used for the experiments are provided.
Software Dependencies	No	We implement all algorithms for our synthetic data experiments using MATLAB. All three algorithms were implemented in Python. The paper mentions the programming languages used (MATLAB, Python) but does not provide specific version numbers for these, nor for any libraries or frameworks used within them.
Experiment Setup	Yes	We generate synthetic data sets for GMM with different k, d, αi, σ, and v. Figure 1 shows the percentage relative error gains of the Whitening, Cancellation, and Fast-TPM algorithms over the TPM algorithm in a GMM with various values of k, d, αi, σ, and n. The µi were generated randomly over the sphere of norm r = 10. We define αmin := min_i αi... In Figure 4 we plot the percentage relative error gain... mean document length L ∈ {2000, 3000}, and number of documents (a) n = 4000 (b) n = 6000 (c) n = 8000. For Subspace Clustering: we generate synthetic data for the subspace clustering model described in section 3.4 using parameters d = 500, k = 5, m = 10, and αi ∈ [.1, .3]... add white Gaussian perturbations with σ ∈ {.1, .2}.
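
For readers attempting to reproduce the synthetic GMM setup quoted in the Experiment Setup row, the following is a minimal sketch of one way such data could be generated. It is not the authors' code (none was released). The function name `sample_gmm`, the spherical covariance σ²I, the uniform default for the mixing weights αi, and the seed handling are all assumptions made for illustration. Only the mean radius r = 10 and the parameter names k, d, αi, σ, n come from the quoted text.

```python
import numpy as np

def sample_gmm(n, k, d, sigma, r=10.0, alpha=None, seed=0):
    """Draw n points from a k-component GMM in d dimensions (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Means on the sphere of norm r: normalize Gaussian vectors, then rescale.
    mu = rng.standard_normal((k, d))
    mu *= r / np.linalg.norm(mu, axis=1, keepdims=True)
    # Mixing weights alpha_i; uniform unless supplied (assumption).
    if alpha is None:
        alpha = np.full(k, 1.0 / k)
    else:
        alpha = np.asarray(alpha, dtype=float)
    # Pick a component per sample, then add spherical noise (assumed sigma^2 * I).
    labels = rng.choice(k, size=n, p=alpha)
    x = mu[labels] + sigma * rng.standard_normal((n, d))
    return x, labels, mu, alpha

x, labels, mu, alpha = sample_gmm(n=1000, k=3, d=20, sigma=0.5)
```

Here `x` has one sample per row and `labels` records the generating component. The side information v used by the Whitening and Cancellation algorithms is not modeled in this sketch.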