Determine the Number of States in Hidden Markov Models via Marginal Likelihood
Authors: Yang Chen, Cheng-Der Fuh, Chu-Lan Michael Kao
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we propose a consistent method for determining the number of hidden states of HMM based on the marginal likelihood, which is obtained by integrating out both the parameters and hidden states. Moreover, we show that the model selection problem of HMM includes the order selection problem of finite mixture models as a special case. We give rigorous proof of the consistency of the proposed marginal likelihood method and provide an efficient computation method for practical implementation. We numerically compare the proposed method with the Bayesian information criterion (BIC), demonstrating the effectiveness of the proposed marginal likelihood method. Keywords: hidden Markov models, model selection, marginal likelihood, consistency, normalizing constant |
| Researcher Affiliation | Academia | Yang Chen EMAIL Department of Statistics University of Michigan 1085 South University Ann Arbor, MI 48109-1107, USA Cheng-Der Fuh EMAIL Graduate Institute of Statistics National Central University No. 300, Zhongda Rd., Zhongli District, Taoyuan City 320317, Taiwan Chu-Lan Michael Kao EMAIL Institute of Statistics National Yang-Ming Chiao-Tung University 1001 University Road, Hsinchu 30010, Taiwan |
| Pseudocode | Yes | We now give our procedure for estimating the marginal likelihood p_K(y_1:n) for each candidate K. 1. Obtain posterior samples. Sample from p(φ_K ∣ y_1:n) using a preferred MCMC algorithm, and denote the samples by {φ_K^(i)}_{i=1}^N (where N is often a few thousand). 2. Find a good importance function. Fit a mixture model to the samples {φ_K^(i)}_{i=1}^N, where the number of mixing components is given by either (a) any clustering algorithm, or (b) a pre-fixed number that is large enough. Construct the importance function g(·) by fitting a Gaussian mixture, or use a heavier-tailed density as the mixture component; for example, a Student-t distribution with a small degree of freedom, such as 2 or 3, with the same location and scale parameters as the fitted Gaussian mixture components. 3. Choose a finite region. Choose Ω_K to be a bounded subset of the parameter space such that 1/2 < ∫_{Ω_K} g(φ_K) dφ_K < 1. This can be achieved by finding an appropriate finite region for each mixing component of g(·), avoiding the tail parts. 4. Estimate p_K(y_1:n) in either of two ways. Reciprocal importance sampling: approximate p_K(y_1:n) by p̂_K^(RIS)(y_1:n) = [∫_{Ω_K} g(φ_K) dφ_K] / [(1/N) Σ_{i=1}^N (g(φ_K^(i)) / p(y_1:n, φ_K^(i))) 1_{Ω_K}(φ_K^(i))], where 1_Ω(x) is 1 if x ∈ Ω, and zero otherwise. Importance sampling: (a) draw M independent samples from g(·), denoted by {ψ_K^(j)}_{1≤j≤M}; (b) approximate p_K(y_1:n) by p̂_K^(IS)(y_1:n) = (1/(M·P_Ω)) Σ_{j=1}^M (p(y_1:n, ψ_K^(j)) / g(ψ_K^(j))) 1_{Ω_K}(ψ_K^(j)) (Eq. 17), where P_Ω = #S/N with S = {i : φ_K^(i) ∈ Ω_K, 1 ≤ i ≤ N}. |
| Open Source Code | Yes | (4) An easy-to-use R package, HMMmlselect, which implements our algorithm, is provided and publicly available at CRAN (https://cran.r-project.org/package=HMMmlselect). |
| Open Datasets | Yes | We apply the proposed marginal likelihood method to a set of single-molecule experimental data studied in Chen et al. (2016b), where single-molecule experiments are conducted to study the co-translational protein targeting process |
| Dataset Splits | No | The paper describes simulation studies where data is generated (e.g., "Simulate n observations from the HMM") and real-world data analysis from a prior study, but it does not specify any training/test/validation dataset splits for reproducibility. The simulation setup describes how data is created, not partitioned. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to conduct the numerical experiments or simulations. |
| Software Dependencies | No | The paper mentions an "R package, HMMmlselect" and refers to the "Baum-Welch (EM) algorithm" but does not specify version numbers for R or any other software dependencies. |
| Experiment Setup | Yes | In the numerical experiments, we fix the mean parameters of a K-state HMM to be µ = (1, 2, . . . , K) and vary the variances σ² = (σ², . . . , σ²) in the first set of simulation studies. The equal-variances assumption is adopted here for simplicity of presentation, but it is not part of the model assumptions; we relax this assumption in the second set of simulations. We consider four kinds of transition matrices, corresponding to flat (P_K^(1)), moderately and strongly diagonal (P_K^(2), P_K^(3)), and strongly off-diagonal (P_K^(4)) cases: ... The number of observations, n, varies from 200 to 2000, and the true number of hidden states, K, ranges from 3 to 5. ... We conduct m = 200 repeated simulations, each of which compares the marginal likelihood method with the BIC as follows. ... The prior means for the mean parameters of the K states are set as the K equally spaced quantile levels between 0.05 and 0.95 of the observed trajectory, and the corresponding prior variances are set to 100². The priors for each row of the transition matrix are flat, that is, Dirichlet with hyperparameters all equal to 1. For the scaled-inverse chi-square prior on the variance parameters, we set the degrees-of-freedom parameter ν = 3 and the scale parameter s² via s = (quantile(y, 0.75) − quantile(y, 0.25))/(2K) |
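The four-step estimation procedure quoted in the Pseudocode row can be sketched numerically. The block below is a minimal Python illustration, not the HMMmlselect R implementation: it replaces the HMM joint p(y_1:n, φ_K) with a toy one-dimensional density whose normalizing constant is known (`Z_TRUE`), so both estimators can be sanity-checked; all variable and function names here are invented for the example.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy stand-in for log p(y_1:n, phi_K): Z_TRUE * N(phi; 0, 1),
# so the "marginal likelihood" both estimators should recover is Z_TRUE.
Z_TRUE = 2.5
def log_joint(phi):
    return np.log(Z_TRUE) - 0.5 * phi**2 - 0.5 * np.log(2 * np.pi)

# Step 1: posterior samples (drawn exactly here; MCMC in practice).
post = rng.standard_normal(5000)

# Step 2: importance function g fitted to the samples
# (a single Gaussian suffices for this unimodal toy posterior).
mu, sd = post.mean(), post.std()
def log_g(phi):
    return -0.5 * ((phi - mu) / sd) ** 2 - np.log(sd) - 0.5 * np.log(2 * np.pi)

# Step 3: bounded region Omega with 1/2 < int_Omega g < 1
# (the central ~90% of g, avoiding the tails).
lo, hi = mu - 1.645 * sd, mu + 1.645 * sd
g_mass = norm.cdf(hi, mu, sd) - norm.cdf(lo, mu, sd)   # int_Omega g

# Step 4a: reciprocal importance sampling estimate.
in_omega = (post > lo) & (post < hi)
ris = g_mass / np.mean(np.exp(log_g(post) - log_joint(post)) * in_omega)

# Step 4b: importance sampling estimate, corrected by P_Omega,
# the fraction of posterior samples falling inside Omega.
M = 5000
psi = rng.normal(mu, sd, M)
in_omega_psi = (psi > lo) & (psi < hi)
p_omega = in_omega.mean()
is_est = np.mean(np.exp(log_joint(psi) - log_g(psi)) * in_omega_psi) / p_omega

print(ris, is_est)   # both should land near Z_TRUE = 2.5
```

In a real HMM application, `log_joint` would be the log of the complete marginal over hidden states (computable by the forward algorithm times the prior), and `post` would come from an MCMC sampler; the bounded region is what keeps both estimators finite-variance.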
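The data-driven prior settings quoted in the Experiment Setup row (quantile-based prior means, diffuse variance 100², flat Dirichlet rows, and scale s = IQR(y)/(2K)) can be written out directly. This is a hedged sketch on synthetic data, not code from the paper or package; the variable names are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, 1000)   # toy observed trajectory
K = 3                            # candidate number of hidden states

# Prior means: the K equally spaced quantile levels between 0.05 and 0.95.
levels = np.linspace(0.05, 0.95, K)
prior_means = np.quantile(y, levels)
prior_mean_var = 100.0 ** 2      # diffuse prior variance 100^2 on each mean

# Scaled-inverse chi-square prior on the state variances:
# degrees of freedom nu = 3, scale s = IQR(y) / (2K).
nu = 3
s = (np.quantile(y, 0.75) - np.quantile(y, 0.25)) / (2 * K)

# Flat Dirichlet(1, ..., 1) prior on each row of the transition matrix.
dirichlet_alpha = np.ones(K)
```

Tying the prior scale to the interquartile range spread over 2K states makes the prior adapt to the range of the observed trajectory rather than requiring hand-tuning per data set.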