Clustering with Hidden Markov Model on Variable Blocks
Authors: Lin Lin, Jia Li
JMLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on simulated and real data show that our proposed method outperforms other widely used methods. Keywords: Gaussian mixture model, hidden Markov model, modal Baum-Welch algorithm, modal clustering... In Section 5, experimental results are reported for both simulated and real data including mass cytometry, single-cell genomics, and image data. Comparisons are made with some competing models and popular methods. |
| Researcher Affiliation | Academia | Lin Lin EMAIL Jia Li EMAIL Department of Statistics Pennsylvania State University University Park, PA 16802, USA |
| Pseudocode | Yes | Our step-wise selection algorithm under a given raw ordering Q is as follows: 1. Input data matrix X and the ordering structure {Q(1), ..., Q(d)}. 2. Set j = 1, g = 1, G(Q(1)) = 1. 3. For j = 2, ..., d: (a) For each k = 1, ..., g, g+1, obtain the maximum likelihood estimate of HMM-VB for the partial data composed of X_Q(1), X_Q(2), ..., X_Q(j) under structure Q and (G(Q(1)), ..., G(Q(j-1)), G(Q(j)) = k). Let the estimated parameter at k be θ*_k. (b) Compute G*(Q(j)) = argmin_{k ∈ {1,...,g,g+1}} BIC(X_{Q(1),...,Q(j)}, G(Q(1)), ..., G(Q(j)) = k, θ*_k). (c) Set G(Q(j)) ← G*(Q(j)). (d) If G*(Q(j)) = g+1, set g ← g+1. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It references third-party R packages such as Mclust, pdfCluster, kernlab, and MeanShift, but not their own implementation. |
| Open Datasets | Yes | We study the performance of HMM-VB on a data set obtained from a CyTOF experiment (Becher et al., 2014)... The data set is obtained from the GitHub repository: https://github.com/JustinaZ/pcaReduce provided by Žurauskienė and Yau (2016). |
| Dataset Splits | No | The paper uses simulated data and real-world datasets for clustering, and evaluates the performance against ground truth or other methods. However, it does not specify explicit training/test/validation splits. For example, for simulated data, it states 'a sample of size 10,000 with dimension d = 8 is drawn from a hierarchical mixture model', and for real data, 'total contains 46,204 single cells with 39 measured cell markers'. These are total dataset sizes, not split information for reproduction. |
| Hardware Specification | Yes | The CPU time per model fitting (based on the best model specification) on an iMac with Intel Core i7 3.0GHz/8GB memory is recorded, given in the last row of Table 1... On a Mac with Intel Core i5 3.5GHz/16GB memory, the time for training is respectively 263, 326, 2118 seconds for the three data sets... |
| Software Dependencies | Yes | For K-means, hierarchical clustering, and Mclust, we used the three R functions kmeans (with 20 starting points), hclust and Mclust. In this analysis, we treat the 16 clusters found by MBW applied to the true mixture density as the ground truth. Hence, for K-means and hierarchical clustering, we manually set 16 as the number of clusters. The number of normal mixture components and the covariance structure are determined by BIC in the Mclust package (the package version is 5.2.3). |
| Experiment Setup | Yes | To initialize the model, we design several schemes. In our experiments, models from different initializations are estimated and the one with the maximum likelihood is chosen. In our baseline initialization scheme, k-means clustering is applied individually to each variable block using all the data instances... The transition probabilities are always initialized to be uniform... After extensive numerical experiments, we set M_t = 10 if d_t <= 5, M_t = 15 if 6 <= d_t <= 10, and otherwise M_t = d_t + 10, for t being any variable block. |
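The quoted pseudocode and the block-size heuristic above can be sketched in Python. This is a minimal illustration, not the authors' implementation: `fit_and_bic` is a hypothetical callback standing in for "fit HMM-VB on the partial data X_Q(1..j) with the candidate component counts and return its BIC", and `num_components` encodes only the M_t rule quoted in the Experiment Setup row.

```python
def stepwise_block_selection(d, fit_and_bic):
    """Step-wise selection of component counts G(Q(j)) under a fixed
    ordering Q, as in the paper's pseudocode: for each new block j,
    try k = 1, ..., g, g+1 components, keep the k with minimum BIC,
    and grow the running maximum g whenever k = g+1 wins.

    `fit_and_bic(j, counts)` is a hypothetical stand-in for fitting
    HMM-VB on the first j blocks with component counts `counts` and
    returning the resulting BIC (lower is better)."""
    G = [1]  # step 2: G(Q(1)) = 1
    g = 1    # current maximum number of components seen so far
    for j in range(2, d + 1):                       # step 3
        scores = {k: fit_and_bic(j, G + [k])        # step 3(a)
                  for k in range(1, g + 2)}
        best_k = min(scores, key=scores.get)        # step 3(b)
        G.append(best_k)                            # step 3(c)
        if best_k == g + 1:                         # step 3(d)
            g += 1
    return G


def num_components(dt):
    """Heuristic from the Experiment Setup row for the number of
    mixture components M_t of a variable block of dimension d_t:
    M_t = 10 if d_t <= 5, M_t = 15 if 6 <= d_t <= 10,
    otherwise M_t = d_t + 10."""
    if dt <= 5:
        return 10
    if dt <= 10:
        return 15
    return dt + 10
```

With a toy scorer that favors two components per block, e.g. `stepwise_block_selection(3, lambda j, counts: abs(counts[-1] - 2))`, the loop returns `[1, 2, 2]`: the first block is fixed at one component, and each later block settles on two.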