Clustering with Hidden Markov Model on Variable Blocks
Authors: Lin Lin, Jia Li
JMLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on simulated and real data show that our proposed method outperforms other widely used methods. Keywords: Gaussian mixture model, hidden Markov model, modal Baum-Welch algorithm, modal clustering... In Section 5, experimental results are reported for both simulated and real data including mass cytometry, single-cell genomics, and image data. Comparisons are made with some competing models and popular methods. |
| Researcher Affiliation | Academia | Lin Lin EMAIL Jia Li EMAIL Department of Statistics Pennsylvania State University University Park, PA 16802, USA |
| Pseudocode | Yes | Our step-wise selection algorithm under a given raw ordering Q is as follows: 1. Input data matrix X and the ordering structure {Q(1), ..., Q(d)}. 2. Set j = 1, g = 1, G(Q(1)) = 1. 3. For j = 2, ..., d: (a) For each k = 1, ..., g, g+1, obtain the maximum likelihood estimate of HMM-VB for the partial data composed of X_Q(1), X_Q(2), ..., X_Q(j) under structure Q and (G(Q(1)), ..., G(Q(j-1)), G(Q(j)) = k). Let the estimated parameter at k be θ*_k. (b) Compute G*(Q(j)) = argmin_{k ∈ {1,...,g,g+1}} BIC(X_{Q(1),...,Q(j)}, G(Q(1)), ..., G(Q(j)) = k, θ*_k). (c) Set G(Q(j)) ← G*(Q(j)). (d) If G*(Q(j)) = g+1, set g ← g+1. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It references third-party R packages such as Mclust, pdfCluster, kernlab, and MeanShift, but not their own implementation. |
| Open Datasets | Yes | We study the performance of HMM-VB on a data set obtained from a CyTOF experiment (Becher et al., 2014)... The data set is obtained from the GitHub repository: https://github.com/JustinaZ/pcaReduce provided by Žurauskienė and Yau (2016). |
| Dataset Splits | No | The paper uses simulated data and real-world datasets for clustering, and evaluates the performance against ground truth or other methods. However, it does not specify explicit training/test/validation splits. For example, for simulated data, it states 'a sample of size 10,000 with dimension d = 8 is drawn from a hierarchical mixture model', and for real data, 'total contains 46,204 single cells with 39 measured cell markers'. These are total dataset sizes, not split information for reproduction. |
| Hardware Specification | Yes | The CPU time per model fitting (based on the best model specification) on an iMac with Intel Core i7 3.0GHz/8GB memory is recorded, given in the last row of Table 1... On a Mac with Intel Core i5 3.5GHz/16GB memory, the time for training is respectively 263, 326, 2118 seconds for the three data sets... |
| Software Dependencies | Yes | For K-means, hierarchical clustering, and Mclust, we used the three R functions kmeans (with 20 starting points), hclust and Mclust. In this analysis, we treat the 16 clusters found by MBW applied to the true mixture density as the ground truth. Hence, for K-means and hierarchical clustering, we manually set 16 as the number of clusters. The number of normal mixture components and the covariance structure are determined by BIC in the Mclust package (the package version is 5.2.3). |
| Experiment Setup | Yes | To initialize the model, we design several schemes. In our experiments, models from different initializations are estimated and the one with the maximum likelihood is chosen. In our baseline initialization scheme, k-means clustering is applied individually to each variable block using all the data instances... The transition probabilities are always initialized to be uniform... After extensive numerical experiments, we set M_t = 10 if d_t <= 5, M_t = 15 if 6 <= d_t <= 10, and otherwise M_t = d_t + 10, for t being any variable block. |
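The quoted pseudocode and the block-size heuristic above can be sketched in Python. This is a minimal illustration, not the authors' implementation: `fit_and_bic` is a hypothetical callback standing in for "fit HMM-VB on the partial data X_Q(1..j) with the candidate component counts and return its BIC", and `num_components` encodes only the M_t rule quoted in the Experiment Setup row.

```python
def stepwise_block_selection(d, fit_and_bic):
    """Step-wise selection of component counts G(Q(j)) under a fixed
    ordering Q, as in the paper's pseudocode: for each new block j,
    try k = 1, ..., g, g+1 components, keep the k with minimum BIC,
    and grow the running maximum g whenever k = g+1 wins.

    `fit_and_bic(j, counts)` is a hypothetical stand-in for fitting
    HMM-VB on the first j blocks with component counts `counts` and
    returning the resulting BIC (lower is better)."""
    G = [1]  # step 2: G(Q(1)) = 1
    g = 1    # current maximum number of components seen so far
    for j in range(2, d + 1):                       # step 3
        scores = {k: fit_and_bic(j, G + [k])        # step 3(a)
                  for k in range(1, g + 2)}
        best_k = min(scores, key=scores.get)        # step 3(b)
        G.append(best_k)                            # step 3(c)
        if best_k == g + 1:                         # step 3(d)
            g += 1
    return G


def num_components(dt):
    """Heuristic from the Experiment Setup row for the number of
    mixture components M_t of a variable block of dimension d_t:
    M_t = 10 if d_t <= 5, M_t = 15 if 6 <= d_t <= 10,
    otherwise M_t = d_t + 10."""
    if dt <= 5:
        return 10
    if dt <= 10:
        return 15
    return dt + 10
```

With a toy scorer that favors two components per block, e.g. `stepwise_block_selection(3, lambda j, counts: abs(counts[-1] - 2))`, the loop returns `[1, 2, 2]`: the first block is fixed at one component, and each later block settles on two.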