Distributed Learning of Finite Gaussian Mixtures
Authors: Qiong Zhang, Jiahua Chen
JMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments based on simulated and real-world datasets show that the proposed estimator has comparable statistical performance with the global estimator based on the full dataset, if the latter is feasible. It can even outperform the global estimator for the purpose of clustering if the model assumption does not fully match the real-world data. It also has better statistical and computational performance than some existing split-and-conquer approaches. Numerical experiments on simulated, public, and real datasets are presented in Section 7. |
| Researcher Affiliation | Academia | Qiong Zhang EMAIL Department of Statistics University of British Columbia Vancouver, BC V6T 1Z4, Canada Jiahua Chen EMAIL Department of Statistics University of British Columbia Vancouver, BC V6T 1Z4, Canada |
| Pseudocode | Yes | Algorithm 1 (MM algorithm for GMR estimator with KL-divergence cost function). Initialization: Φγ, γ ∈ [K]. Repeat: for each γ ∈ [K] and i ∈ [MK], set πiγ = wi if γ = argmin_γ′ DKL(Φi, Φγ′) and πiγ = 0 otherwise; then let π̄γ = Σᵢ πiγ, µγ = Σᵢ (πiγ/π̄γ) µi, and Σγ = Σᵢ (πiγ/π̄γ){Σi + (µi − µγ)(µi − µγ)ᵀ}; until the change in the value of the objective function Σ_{i,γ} πiγ DKL(Φi, Φγ) is below some threshold ϵ > 0. Let vγ = Σᵢ πiγ for γ ∈ [K]. Output: {(vγ, µγ, Σγ) : γ ∈ [K]} |
| Open Source Code | Yes | The codes are written in Python and are publicly available at https://github.com/SarahQiong/SCGMM. |
| Open Datasets | Yes | 1. MAGIC04. This is a simulated dataset for classifying gamma particles in the upper atmosphere. It contains 19,020 observations with 10 real-valued features and is publicly available at the UCI machine learning repository. 2. MiniBooNE. The dataset is taken from the MiniBooNE experiment... publicly available at the UCI machine learning repository. 3. KDD. This dataset is used in Lucic et al. (2017). It contains 145,751 observations with 74 real-valued features... available at https://kdd.org/kdd-cup/view/kdd-cup-2004/Data. 4. MSYP. The dataset is used to predict the release year of a song from audio features... publicly available at the UCI machine learning repository. For the first three datasets, we divide the dataset onto M = 4 local machines completely at random. Since MSYP is very big and the order of the mixture to be fitted is high, we divide the dataset onto M = 16 local machines. The random partition of the dataset is repeated R = 100 times. We apply the proposed GMR approach to fit a finite GMM to an atmospheric dataset named CCSM run cam5.1.amip.2d.001 following Chen et al. (2013). These data are computer simulated based on Community Atmosphere Model version 5 (CAM5). Available at https://www.earthsystemgrid.org/dataset. We use the second edition of the dataset, named by_class.zip. It consists of approximately 4M images of handwritten digits and characters (0–9, A–Z, and a–z) by different writers. Available at https://www.nist.gov/srd/nist-special-database-19. |
| Dataset Splits | Yes | We randomly select R = 100 datasets of size N = 50K from the training set. Each dataset is then randomly partitioned into M = 10 subsets. For the first three datasets, we divide the dataset onto M = 4 local machines completely at random. Since MSYP is very big and the order of the mixture to be fitted is high, we divide the dataset onto M = 16 local machines. The random partition of the dataset is repeated R = 100 times. The simulated data are divided evenly over the local machines. |
| Hardware Specification | Yes | All experiments are conducted on the Compute Canada (Baldwin, 2012) Cedar cluster with Intel E5 Broadwell CPUs with 64G memory. |
| Software Dependencies | Yes | The codes are written in Python and are publicly available at https://github.com/SarahQiong/SCGMM. We use kmeans++ with default arguments in the scikit-learn package (Pedregosa et al., 2011) to generate 10 initial values for the EM algorithm. We implement the CNN in PyTorch 1.5.0 (Paszke et al., 2019) and train it for 10 epochs on the NIST training dataset. |
| Experiment Setup | Yes | We use the EM algorithm to compute the pMLE and declare convergence when the change in the per-observation penalized log-likelihood function is less than 10⁻⁶. For real-world data, we use kmeans++ with default arguments in the scikit-learn package (Pedregosa et al., 2011) to generate 10 initial values for the EM algorithm. We run the EM algorithm with these 10 initial values for 20 iterations and pick the one with the highest penalized log-likelihood value. We use this output as the initial value to run the EM algorithm further until convergence, and treat the result as the pMLE. We declare convergence of the MM algorithm for the GMR estimator when the change in the objective function is less than 10⁻⁶. We use the SGD optimizer with learning rate 0.01, momentum 0.9, and batch size 64. |
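The Algorithm 1 cell above can be sketched in code. The following is a minimal NumPy sketch of the MM iteration for Gaussian mixture reduction with the KL-divergence cost: hard-assign each of the MK input components to its KL-nearest reduced component, then update each reduced component by moment matching. All function and variable names here are illustrative, not taken from the authors' SCGMM repository, and numerical conveniences from the paper (e.g. regularized covariance inverses) are omitted.

```python
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    """KL divergence N(mu0, S0) || N(mu1, S1) between two Gaussians."""
    mu0, mu1 = np.asarray(mu0, float), np.asarray(mu1, float)
    d = mu0.size
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def gmr_mm(w, mus, Sigmas, mus0, Sigmas0, tol=1e-6, max_iter=100):
    """Reduce an MK-component mixture (w, mus, Sigmas) to K components,
    starting from initial components (mus0, Sigmas0). A sketch of the
    MM scheme in Algorithm 1, not the authors' implementation."""
    w = np.asarray(w, float)
    mus = np.asarray(mus, float)
    mu_g = np.asarray(mus0, float).copy()
    Sig_g = [np.asarray(S, float).copy() for S in Sigmas0]
    K, prev_obj = len(mu_g), np.inf
    for _ in range(max_iter):
        # Majorization: hard-assign each input component to the
        # KL-nearest reduced component, carrying its weight w_i.
        D = np.array([[kl_gauss(mus[i], Sigmas[i], mu_g[g], Sig_g[g])
                       for g in range(K)] for i in range(len(mus))])
        pi = np.zeros_like(D)
        pi[np.arange(len(mus)), D.argmin(axis=1)] = w
        obj = (pi * D).sum()
        # Minimization: moment-matching update of each reduced component.
        for g in range(K):
            pg = pi[:, g].sum()
            if pg == 0:
                continue
            mu_g[g] = pi[:, g] @ mus / pg
            Sig_g[g] = sum((pi[i, g] / pg) * (Sigmas[i]
                           + np.outer(mus[i] - mu_g[g], mus[i] - mu_g[g]))
                           for i in range(len(mus)))
        # Stop when the objective sum_{i,g} pi_ig D_KL(Phi_i, Phi_g)
        # has essentially stopped decreasing.
        if prev_obj - obj < tol:
            break
        prev_obj = obj
    return pi.sum(axis=0), mu_g, Sig_g
```

On a toy 4-component mixture forming two well-separated pairs, the sketch collapses the pairs into two components whose weights are the pooled input weights.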
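The initialization protocol in the Experiment Setup row (10 kmeans-based starts, 20-iteration short EM runs, refine the best run to convergence) can likewise be sketched with scikit-learn. This is a hedged stand-in, not the authors' code: `fit_short_then_long` and its parameter names are invented for illustration, and plain EM replaces the paper's *penalized* MLE since `GaussianMixture` does not maximize a penalized likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_short_then_long(X, K, n_starts=10, short_iters=20, tol=1e-6, seed=0):
    """Sketch of the reported protocol: run EM from n_starts kmeans-based
    initial values for short_iters iterations each, keep the run with the
    highest log-likelihood, then refine that run until convergence."""
    rng = np.random.RandomState(seed)
    best, best_ll = None, -np.inf
    for _ in range(n_starts):
        # Short run: kmeans initialization, only short_iters EM iterations.
        gm = GaussianMixture(n_components=K, init_params='kmeans',
                             max_iter=short_iters, n_init=1,
                             random_state=rng.randint(2**31 - 1))
        gm.fit(X)
        ll = gm.score(X)  # average per-observation log-likelihood
        if ll > best_ll:
            best, best_ll = gm, ll
    # Long run: restart EM from the winning parameters and iterate
    # until the improvement falls below tol.
    refined = GaussianMixture(n_components=K, max_iter=1000, tol=tol,
                              means_init=best.means_,
                              weights_init=best.weights_,
                              precisions_init=np.linalg.inv(best.covariances_))
    return refined.fit(X)
```

Short runs with `max_iter=20` may emit scikit-learn convergence warnings; that is expected here, since only the final refinement is run to the 10⁻⁶ tolerance.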