Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

How good is Good-Turing for Markov samples?

Authors: Prafulla Chandra, Andrew Thangaraj, Nived Rajaraman

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental For randomly generated t.p.ms and t.p.ms derived from New York Times and Charles Dickens corpora, we numerically exhibit such uniform-over-x relationships between λ_x and π_x. This supports the observed success of GT in language models and practical text data scenarios. For Markov chains with rank-2, diagonalizable t.p.ms having spectral gap β, we show minimax rate upper and lower bounds of 1/(nβ^5) and 1/(nβ), respectively, for the estimation of stationary missing mass. This theoretical result extends the 1/n minimax rate for i.i.d. or rank-1 t.p.ms to rank-2 Markov, and is a first such minimax rate result for missing mass of Markov samples. We also show, through experiments, that the MSE of GT decays at a slower rate as the rank of the t.p.m increases.
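The Good-Turing (GT) estimator discussed in the excerpt above estimates the missing mass (the total probability of symbols not seen in the sample) as N1/n, where N1 is the number of symbols appearing exactly once. A minimal runnable sketch, using an i.i.d. sample from a uniform distribution for simplicity (the paper's focus is the harder Markov case; all names and parameters here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (not the paper's): K symbols, sample size n.
K, n = 50, 30
pi = np.full(K, 1.0 / K)              # uniform distribution over the alphabet
sample = rng.choice(K, size=n, p=pi)  # i.i.d. sample

# True missing mass M0: total probability of symbols never observed.
seen = np.unique(sample)
M0 = pi.sum() - pi[seen].sum()

# Good-Turing estimate of M0: fraction of the sample made of singletons.
counts = np.bincount(sample, minlength=K)
N1 = np.sum(counts == 1)              # symbols seen exactly once
gt_estimate = N1 / n

print(M0, gt_estimate)
```

For i.i.d. samples the GT estimate concentrates around the true missing mass at a 1/n rate; the paper's contribution is characterizing how this degrades for low-rank Markov sources.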
Researcher Affiliation Academia Prafulla Chandra (EMAIL), Department of Electrical Engineering, IIT Madras, Chennai, India; Andrew Thangaraj (EMAIL), Department of Electrical Engineering, IIT Madras, Chennai, India; Nived Rajaraman (EMAIL), Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720-1776, USA
Pseudocode No The paper primarily presents mathematical analyses, theorems, and proofs. It does not include any explicitly labeled pseudocode blocks or algorithm figures.
Open Source Code No The paper does not contain any explicit statements about the release of source code for the methodology described, nor does it provide links to a code repository.
Open Datasets Yes NYT: The New York Times (NYT) corpus consists of randomly collected articles from the front pages of the New York Times from the years 2017 and 2018 (available under a CC0: Public Domain license at https://www.kaggle.com/datasets/mathurinache/10700-articles-from-newyork-times). To build an empirical t.p.m, we consider an ... GE: The novel Great Expectations (GE) by Charles Dickens is available through Project Gutenberg (https://www.gutenberg.org/).
Dataset Splits No The paper describes generating stationary Markov sequences of specified lengths (e.g., n = 30, 60, 120, 240, 500, 1000) and averaging over repeated trials (e.g., 5000 or 16000 trials). These are simulation parameters and repetitions for averaging results, not a typical train/test/validation split of a fixed dataset used for machine learning models.
Hardware Specification No The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, cloud platforms) used to conduct the numerical experiments or simulations.
Software Dependencies No The paper does not mention any specific software packages, libraries, or programming languages with version numbers that were used for implementation or experimentation.
Experiment Setup Yes We generated stationary Markov sequences of lengths n = 30, 60, 120, 240, 500, 1000 from the rank-2 t.p.m specified above with K = 1.2n and with β = 1/n^0.2, β = 1/n, and computed the missing mass M0 and the GT estimate over 5000 trials. ...The parameters were chosen as K = 1.2n, β = 1/n^0.2, 1/n^0.5, 1/n^0.8, 1/n, and the MSE was averaged over 16000 trials for each n. ...We generated stationary Markov sequences of lengths n = 100, 200, 400, 800, 1600, 3200, 6400 from the t.p.m specified above with K = 1.2n, L = n^0.25, n^0.5, n^0.7, 1.2n, α = 0.5, and p as the uniform distribution.
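The experimental loop quoted above can be sketched end to end: draw a stationary Markov sample from a rank-2 t.p.m, compare the true stationary missing mass M0 with the GT estimate, and average the squared error over trials. The rank-2 construction below (every row is one of two fixed distributions u and v) and the reduced trial count are assumptions for a fast, self-contained sketch; the paper's exact t.p.m parameterization (spectral gap β, etc.) differs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative scale: n as in the paper's grid, K = 1.2n as described.
n = 240
K = int(1.2 * n)

# Hypothetical rank-2 t.p.m: each row is either distribution u or v,
# so the matrix has (at most) two distinct rows and rank <= 2.
u = rng.dirichlet(np.ones(K))
v = rng.dirichlet(np.ones(K))
row_type = rng.integers(0, 2, size=K)
P = np.where(row_type[:, None] == 0, u, v)

# Stationary distribution via power iteration on the left eigenproblem.
pi = np.full(K, 1.0 / K)
for _ in range(500):
    pi = pi @ P
pi /= pi.sum()

def trial():
    # Start from stationarity, then take n Markov steps.
    x = rng.choice(K, p=pi)
    sample = [x]
    for _ in range(n - 1):
        x = rng.choice(K, p=P[x])
        sample.append(x)
    counts = np.bincount(np.array(sample), minlength=K)
    m0 = pi[counts == 0].sum()    # true stationary missing mass
    gt = np.sum(counts == 1) / n  # Good-Turing estimate
    return (gt - m0) ** 2

# Far fewer trials than the paper's 5000/16000, to keep the sketch fast.
mse = np.mean([trial() for _ in range(200)])
print(mse)
```

Sweeping n over the paper's grid and plotting the resulting MSE would reproduce the qualitative experiment: the decay rate of the GT error as n grows, and how it slows as the rank of the t.p.m increases.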