Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

How good is Good-Turing for Markov samples?

Authors: Prafulla Chandra, Andrew Thangaraj, Nived Rajaraman

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental For randomly generated t.p.ms and t.p.ms derived from New York Times and Charles Dickens corpora, we numerically exhibit such uniform-over-x relationships between λ_x and π_x. This supports the observed success of GT in language models and practical text data scenarios. For Markov chains with rank-2, diagonalizable t.p.ms having spectral gap β, we show minimax rate upper and lower bounds of 1/(nβ^5) and 1/(nβ), respectively, for the estimation of stationary missing mass. This theoretical result extends the 1/n minimax rate for i.i.d. or rank-1 t.p.ms to rank-2 Markov, and is a first such minimax rate result for missing mass of Markov samples. We also show, through experiments, that the MSE of GT decays at a slower rate as the rank of the t.p.m increases.
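The Good-Turing (GT) estimator discussed in the excerpt above estimates the missing mass (the total probability of symbols not seen in the sample) as N1/n, where N1 is the number of symbols appearing exactly once. A minimal runnable sketch, using an i.i.d. sample from a uniform distribution for simplicity (the paper's focus is the harder Markov case; all names and parameters here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (not the paper's): K symbols, sample size n.
K, n = 50, 30
pi = np.full(K, 1.0 / K)              # uniform distribution over the alphabet
sample = rng.choice(K, size=n, p=pi)  # i.i.d. sample

# True missing mass M0: total probability of symbols never observed.
seen = np.unique(sample)
M0 = pi.sum() - pi[seen].sum()

# Good-Turing estimate of M0: fraction of the sample made of singletons.
counts = np.bincount(sample, minlength=K)
N1 = np.sum(counts == 1)              # symbols seen exactly once
gt_estimate = N1 / n

print(M0, gt_estimate)
```

For i.i.d. samples the GT estimate concentrates around the true missing mass at a 1/n rate; the paper's contribution is characterizing how this degrades for low-rank Markov sources.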
Researcher Affiliation Academia Prafulla Chandra (EMAIL), Department of Electrical Engineering, IIT Madras, Chennai, India; Andrew Thangaraj (EMAIL), Department of Electrical Engineering, IIT Madras, Chennai, India; Nived Rajaraman (EMAIL), Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720-1776, USA
Pseudocode No The paper primarily presents mathematical analyses, theorems, and proofs. It does not include any explicitly labeled pseudocode blocks or algorithm figures.
Open Source Code No The paper does not contain any explicit statements about the release of source code for the methodology described, nor does it provide links to a code repository.
Open Datasets Yes NYT: The New York Times (NYT) corpus consists of randomly collected articles from the front pages of the New York Times from the years 2017 and 2018 (available under a CC0: Public Domain license at https://www.kaggle.com/datasets/mathurinache/10700-articles-from-newyork-times). To build an empirical t.p.m, we consider an ... GE: The novel Great Expectations (GE) by Charles Dickens is available through Project Gutenberg (https://www.gutenberg.org/).
Dataset Splits No The paper describes generating stationary Markov sequences of specified lengths (e.g., n = 30, 60, 120, 240, 500, 1000) and averaging over repeated trials (e.g., 5000 or 16000 trials). These are simulation parameters and repetitions for averaging results, not a typical train/test/validation split of a fixed dataset used for machine learning models.
Hardware Specification No The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, cloud platforms) used to conduct the numerical experiments or simulations.
Software Dependencies No The paper does not mention any specific software packages, libraries, or programming languages with version numbers that were used for implementation or experimentation.
Experiment Setup Yes We generated stationary Markov sequences of lengths n = 30, 60, 120, 240, 500, 1000 from the rank-2 t.p.m specified above with K = 1.2n and with β = 1/n^0.2, β = 1/n, and computed the missing mass M0 and the GT estimate over 5000 trials. ...The parameters were chosen as K = 1.2n, β = 1/n^0.2, 1/n^0.5, 1/n^0.8, 1/n, and the MSE was averaged over 16000 trials for each n. ...We generated stationary Markov sequences of lengths n = 100, 200, 400, 800, 1600, 3200, 6400 from the t.p.m specified above with K = 1.2n, L = n^0.25, n^0.5, n^0.7, 1.2n, α = 0.5, and p as the uniform distribution.
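The experimental loop quoted above can be sketched end to end: draw a stationary Markov sample from a rank-2 t.p.m, compare the true stationary missing mass M0 with the GT estimate, and average the squared error over trials. The rank-2 construction below (every row is one of two fixed distributions u and v) and the reduced trial count are assumptions for a fast, self-contained sketch; the paper's exact t.p.m parameterization (spectral gap β, etc.) differs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative scale: n as in the paper's grid, K = 1.2n as described.
n = 240
K = int(1.2 * n)

# Hypothetical rank-2 t.p.m: each row is either distribution u or v,
# so the matrix has (at most) two distinct rows and rank <= 2.
u = rng.dirichlet(np.ones(K))
v = rng.dirichlet(np.ones(K))
row_type = rng.integers(0, 2, size=K)
P = np.where(row_type[:, None] == 0, u, v)

# Stationary distribution via power iteration on the left eigenproblem.
pi = np.full(K, 1.0 / K)
for _ in range(500):
    pi = pi @ P
pi /= pi.sum()

def trial():
    # Start from stationarity, then take n Markov steps.
    x = rng.choice(K, p=pi)
    sample = [x]
    for _ in range(n - 1):
        x = rng.choice(K, p=P[x])
        sample.append(x)
    counts = np.bincount(np.array(sample), minlength=K)
    m0 = pi[counts == 0].sum()    # true stationary missing mass
    gt = np.sum(counts == 1) / n  # Good-Turing estimate
    return (gt - m0) ** 2

# Far fewer trials than the paper's 5000/16000, to keep the sketch fast.
mse = np.mean([trial() for _ in range(200)])
print(mse)
```

Sweeping n over the paper's grid and plotting the resulting MSE would reproduce the qualitative experiment: the decay rate of the GT error as n grows, and how it slows as the rank of the t.p.m increases.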