Sparse Markov Models for High-dimensional Inference
Authors: Guilherme Ost, Daniel Y. Takahashi
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using simulations, we show that our method performs well compared to other relevant methods. We also illustrate the usefulness of our method on weather data where the proposed method correctly recovers the long-range dependence. |
| Researcher Affiliation | Academia | Guilherme Ost (EMAIL), Institute of Mathematics, Federal University of Rio de Janeiro, Rio de Janeiro, RJ, Brazil; Daniel Y. Takahashi (EMAIL), Brain Institute, Federal University of Rio Grande do Norte, Natal, RN, Brazil |
| Pseudocode | Yes | Algorithm 1: FSC(X1, . . . , Xn). FS step: 1. S_m = ∅; 2. While \|S_m\| < ℓ: 3. Compute j = arg max_{k ∈ S_m^c} ν_{m,k,S_m} and include j in S_m. CUT step: 6. For each j ∈ S_m, remove j from S_m unless d_TV(p_{m,n}(·\|x_{S_m}), p_{m,n}(·\|y_{S_m})) > t_{m,n}(x_{S_m}, y_{S_m}) for some (S_m \ {j})-compatible pasts x_{S_m}, y_{S_m} ∈ A^{S_m}. 7. Output S_m. |
| Open Source Code | No | We used the code available at https://github.com/david-dunson/bnphomc. The maximal possible order of the Markov chain was set to d, and the number of simulations for the Gibbs sampler was set to 1000. The set of relevant lags chosen by CTF was given by the lags with non-null inclusion probability estimated using the Gibbs sampler. |
| Open Datasets | Yes | We applied the proposed method to study the relevant lags on daily weather data recording the rainy and non-rainy days in Canberra, Australia, over n = 1000 days. We obtained the data from Kaggle (https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package). |
| Dataset Splits | Yes | We used the first n/2 samples for the Forward Stepwise step and the last n/2 for the Cut step. Remember that ϵ, µ, α are used to define the random threshold for the Cut step. BSS(2) stands for the best subset selection algorithm, where we first estimated the parameters of the MTD model using n samples and the algorithm described in Berchtold (2001) with the Python implementation mtd-learn. |
| Hardware Specification | No | We were not able to run mtd-learn on models with order d larger than 15 on our computers because that algorithm did not converge. We were not able to run CTF(ℓ) when j = n/5 and d = n/4 because the algorithm did not converge when n > 10^3. |
| Software Dependencies | No | We were not able to run mtd-learn on models with order d larger than 15 on our computers because that algorithm did not converge. Finally, CTF(ℓ) stands for Conditional Tensor Factorization based Higher Order Markov Chain estimation together with the test for relevance of lags described in Sarkar and Dunson (2016), the parameter ℓ being the maximal number of relevant lags. We used the code available at https://github.com/david-dunson/bnphomc. |
| Experiment Setup | Yes | FSC(ℓ) stands for the Forward Stepwise and Cut algorithm described in Algorithm 1 with parameter ℓ, ϵ = 0.1, µ = 0.5, and α = C log(n), where the value of the constant C was chosen by optimizing the probability of selecting the relevant lags correctly only for sample size n = 100, for the given choice of d, i and j. We used the first n/2 samples for the Forward Stepwise step and the last n/2 for the Cut step. Remember that ϵ, µ, α are used to define the random threshold for the Cut step. |
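The Forward Stepwise and Cut procedure quoted in the Pseudocode row can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the dependence score `max_tv_over_lag` is a hypothetical plug-in stand-in for the paper's ν_{m,k,S_m}, and a fixed total-variation threshold replaces the random threshold t_{m,n} that the paper builds from ϵ, µ, and α.

```python
import itertools
from collections import Counter, defaultdict


def empirical_cond(sample, lags, d):
    """Empirical conditional counts of X_t given the past restricted to `lags`."""
    counts = defaultdict(Counter)
    for t in range(d, len(sample)):
        ctx = tuple(sample[t - l] for l in lags)
        counts[ctx][sample[t]] += 1
    return counts


def tv_distance(p, q, alphabet):
    """Total-variation distance between two empirical conditional distributions."""
    n_p, n_q = sum(p.values()), sum(q.values())
    return 0.5 * sum(abs(p[a] / n_p - q[a] / n_q) for a in alphabet)


def max_tv_over_lag(sample, S, j, d, alphabet):
    """Max TV distance between conditionals whose contexts agree except at lag j
    (a stand-in for both the FS score and the 'compatible pasts' test)."""
    lags = sorted(S | {j})
    counts = empirical_cond(sample, lags, d)
    pos = lags.index(j)
    best = 0.0
    for c1, c2 in itertools.combinations(counts, 2):
        if all(a == b for i, (a, b) in enumerate(zip(c1, c2)) if i != pos):
            best = max(best, tv_distance(counts[c1], counts[c2], alphabet))
    return best


def fsc(sample, ell, d, threshold):
    """Forward Stepwise (greedy lag inclusion) followed by a Cut (pruning) step."""
    alphabet = sorted(set(sample))
    S = set()
    # FS step: greedily include the lag with the largest dependence score.
    while len(S) < ell:
        best = max((k for k in range(1, d + 1) if k not in S),
                   key=lambda k: max_tv_over_lag(sample, S, k, d, alphabet))
        S.add(best)
    # Cut step: drop lag j unless some pair of pasts differing only at lag j
    # produces conditional distributions further apart than the threshold.
    for j in sorted(S):
        if max_tv_over_lag(sample, S - {j}, j, d, alphabet) <= threshold:
            S.discard(j)
    return S
```

On a simulated binary order-1 Markov chain (X_t repeats X_{t-1} with probability 0.9), `fsc(x, ell=2, d=3, threshold=0.2)` first greedily adds lag 1 and one spurious lag, then the Cut step prunes the spurious lag, returning `{1}`; this mirrors the two-phase behavior of Algorithm 1, not its exact thresholding.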