Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
How good is Good-Turing for Markov samples?
Authors: Prafulla Chandra, Andrew Thangaraj, Nived Rajaraman
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For randomly generated t.p.ms and t.p.ms derived from New York Times and Charles Dickens corpora, we numerically exhibit such uniform-over-x relationships between λ_x and π_x. This supports the observed success of GT in language models and practical text data scenarios. For Markov chains with rank-2, diagonalizable t.p.ms having spectral gap β, we show minimax rate upper and lower bounds of 1/(nβ^5) and 1/(nβ), respectively, for the estimation of stationary missing mass. This theoretical result extends the 1/n minimax rate for i.i.d. or rank-1 t.p.ms to rank-2 Markov, and is a first such minimax rate result for missing mass of Markov samples. We also show, through experiments, that the MSE of GT decays at a slower rate as the rank of the t.p.m increases. |
| Researcher Affiliation | Academia | Prafulla Chandra (EMAIL), Department of Electrical Engineering, IIT Madras, Chennai, India; Andrew Thangaraj (EMAIL), Department of Electrical Engineering, IIT Madras, Chennai, India; Nived Rajaraman (EMAIL), Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720-1776, USA |
| Pseudocode | No | The paper primarily presents mathematical analyses, theorems, and proofs. It does not include any explicitly labeled pseudocode blocks or algorithm figures. |
| Open Source Code | No | The paper does not contain any explicit statements about the release of source code for the methodology described, nor does it provide links to a code repository. |
| Open Datasets | Yes | NYT: The New York Times (NYT) corpus consists of randomly collected articles from the front pages of the New York Times from the years 2017 and 2018, available under a CC0: Public Domain license at https://www.kaggle.com/datasets/mathurinache/10700-articles-from-newyork-times; an empirical t.p.m is built from this corpus. GE: The novel Great Expectations (GE) by Charles Dickens is available from Project Gutenberg (https://www.gutenberg.org/). |
| Dataset Splits | No | The paper describes generating stationary Markov sequences of specified lengths (e.g., n = 30, 60, 120, 240, 500, 1000) and performing trials (e.g., 5000, 16000 trials). This describes simulation parameters and repetition for averaging results, but not a typical train/test/validation split of a fixed dataset used for machine learning models. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, cloud platforms) used to conduct the numerical experiments or simulations. |
| Software Dependencies | No | The paper does not mention any specific software packages, libraries, or programming languages with version numbers that were used for implementation or experimentation. |
| Experiment Setup | Yes | We generated stationary Markov sequences of lengths n = 30, 60, 120, 240, 500, 1000 from the rank-2 t.p.m specified above with K = 1.2n and with β = 1/n^0.2 and β = 1/n, and computed the missing mass M0 and the GT estimate over 5000 trials. ...The parameters were chosen as K = 1.2n, β = 1/n^0.2, 1/n^0.5, 1/n^0.8, 1/n, and the MSE was averaged over 16000 trials for each n. ...We generated stationary Markov sequences of lengths n = 100, 200, 400, 800, 1600, 3200, 6400 from the t.p.m specified above with K = 1.2n, L = n^0.25, n^0.5, n^0.7, 1.2n, α = 0.5 and p as the uniform distribution. |
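The experiment setup above (generate a stationary Markov sequence, then compare the true missing mass M0 with the Good-Turing estimate) can be sketched as follows. This is a minimal illustration, not the paper's code: the "sticky" transition matrix used here is an assumption standing in for the paper's rank-2 construction, and the parameter choices K = 1.2n and β = 1/n^0.2 follow the quoted setup.

```python
import numpy as np

def good_turing_missing_mass(sample):
    # GT estimate of missing mass: (number of symbols seen exactly once) / n
    _, counts = np.unique(sample, return_counts=True)
    return np.sum(counts == 1) / len(sample)

def true_missing_mass(sample, pi):
    # M0 = total stationary probability of symbols never observed in the sample
    seen = set(sample.tolist())
    return sum(p for x, p in enumerate(pi) if x not in seen)

def sample_markov(P, pi, n, rng):
    # Start from the stationary distribution pi, then step via rows of P
    K = len(pi)
    x = np.empty(n, dtype=int)
    x[0] = rng.choice(K, p=pi)
    for t in range(1, n):
        x[t] = rng.choice(K, p=P[x[t - 1]])
    return x

rng = np.random.default_rng(0)
n = 500
K = int(1.2 * n)             # alphabet size, per the quoted setup
beta = 1 / n**0.2            # spectral-gap-style parameter, per the quoted setup
pi = np.full(K, 1.0 / K)     # uniform stationary distribution (assumption)

# "Sticky" chain: stay put with prob 1-beta, else draw fresh from pi.
# Its stationary distribution is pi and its spectral gap is beta, but it is
# NOT the paper's rank-2 t.p.m construction; it only illustrates the protocol.
P = (1 - beta) * np.eye(K) + beta * np.tile(pi, (K, 1))

seq = sample_markov(P, pi, n, rng)
gt = good_turing_missing_mass(seq)
m0 = true_missing_mass(seq, pi)
print(f"GT estimate: {gt:.3f}, true missing mass: {m0:.3f}")
```

In the paper's protocol, this single-trial comparison is repeated over thousands of trials (5000 or 16000) and averaged to estimate the MSE of the GT estimator for each sequence length n.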