Decoding Game: On Minimax Optimality of Heuristic Text Generation Strategies

Authors: Sijin Chen, Omar Hagrass, Jason Klusowski

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerical experiments are conducted to complement our theoretical analysis. Building on the general theory, in Section 5, we propose Game sampling (Algorithm 1) and empirically evaluate its performance in open-ended text generation with GPT-2 models. The experiments suggest that Game sampling is able to outperform other strategies, which corroborates our optimality results.
Researcher Affiliation | Academia | Sijin Chen, Omar Hagrass, Jason M. Klusowski; Princeton University, Princeton, NJ 08540, USA
Pseudocode | Yes | Algorithm 1: Game sampling.
Input: 0 < ϵ ≤ 1 and τ > 0
if τ = 1 then ▷ log-likelihood objective
  compute S_I = Σ_{i=1}^{I−1} p̂_i log(p̂_i / p̂_I)
  find Î = max{I : S_I ≤ ϵ}
  set q_i = p̂_i · 1(1 ≤ i ≤ Î)
else if τ ≠ 1 then ▷ temperature-sampling objective
  compute S_I = (1 − 1/τ)^{−1} Σ_{i=1}^{I−1} p̂_i [1 − (p̂_i / p̂_I)^{1/τ − 1}]
  find Î = max{I : S_I ≤ ϵ}
  set q_i = p̂_i^{1/τ} · 1(1 ≤ i ≤ Î)
end if
normalize q: q_i ← q_i / Σ_{j=1}^d q_j
sample the next word based on the distribution q
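The pseudocode above can be sketched in NumPy. This is a minimal illustration, not the authors' released implementation: the function name `game_sampling` is ours, and the exponent in the τ ≠ 1 branch is written so that the objective recovers the τ = 1 (log-likelihood) objective in the limit τ → 1.

```python
import numpy as np

def game_sampling(p_hat, eps=0.1, tau=2.0, rng=None):
    """Sketch of Game sampling: truncate the descending-sorted next-token
    distribution once the divergence S_I exceeds the budget eps, temper
    the surviving probabilities by 1/tau, renormalize, and sample."""
    rng = rng or np.random.default_rng()
    order = np.argsort(p_hat)[::-1]      # indices sorted by descending probability
    p = p_hat[order]
    d = len(p)

    I_hat = 1                            # S_1 = 0 <= eps always holds
    for I in range(2, d + 1):
        head, pivot = p[:I - 1], p[I - 1]
        if tau == 1.0:                   # log-likelihood objective
            S = np.sum(head * np.log(head / pivot))
        else:                            # temperature-sampling objective
            S = np.sum(head * (1.0 - (head / pivot) ** (1.0 / tau - 1.0))) / (1.0 - 1.0 / tau)
        if S <= eps:
            I_hat = I
        else:
            break                        # S_I is nondecreasing in I

    exponent = 1.0 if tau == 1.0 else 1.0 / tau
    q = np.zeros(d)
    q[:I_hat] = p[:I_hat] ** exponent    # keep only the top I_hat tokens
    q /= q.sum()                         # normalize the truncated distribution

    q_full = np.zeros(d)                 # map q back to original vocabulary order
    q_full[order] = q
    token = rng.choice(d, p=q_full)
    return token, q_full, I_hat
```

For example, with p̂ = (0.5, 0.3, 0.15, 0.05), ϵ = 0.3, and τ = 1, only the top two tokens survive (Î = 2, since S_2 ≈ 0.26 ≤ 0.3 < S_3 ≈ 0.81), giving q = (0.625, 0.375, 0, 0).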
Open Source Code | Yes | The code is available at https://github.com/omar-hagrass/decoding-game.
Open Datasets | Yes | We conduct an open-ended text generation task using web text from the GPT-2 output dataset. For each of the 5,000 articles in the Webtext test set, we use the first 35 tokens as prompts, with a maximum generation length of 256 tokens.
Dataset Splits | Yes | For each of the 5,000 articles in the Webtext test set, we use the first 35 tokens as prompts, with a maximum generation length of 256 tokens. ... for comparison with human-written text, we use the corresponding human continuations from the test set, up to a maximum of 256 tokens.
Hardware Specification | No | The paper mentions various models (GPT-2, GPT-J-6B, Llama-2-7B) but does not specify the hardware (e.g., GPU, CPU) used for running the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | For each of the 5,000 articles in the Webtext test set, we use the first 35 tokens as prompts, with a maximum generation length of 256 tokens. For each type of GPT-2 model (Small, Medium, Large, XL) (Radford et al., 2019), GPT-J-6B (Wang & Komatsuzaki, 2021), and Llama-2-7B (Touvron et al., 2023), we evaluate the following metrics:
1. Perplexity: the perplexity of the generated text under the corresponding model.
2. Repetition frequency: the fraction of generations with repetitions.
3. MAUVE score (Pillutla et al., 2021): for comparison with human-written text.
We use the best-performing hyperparameters for each strategy as determined by the MAUVE score. ... We also experiment with different values of ϵ and τ, with details in Appendix B. In general, larger values of ϵ tend to produce better results in terms of higher MAUVE score, lower repetition frequency, and human-level perplexity. Smaller ϵ values reduce perplexity, but at the expense of more repetitions and lower MAUVE scores. In terms of τ, the best performance for each ϵ choice was achieved at τ = 2, yielding the highest MAUVE score.
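The first two metrics can be computed from token-level quantities alone; a small sketch follows. Both helper names are ours, and the 4-gram duplicate test is a simple proxy for the paper's repetition criterion, not its exact definition:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token,
    where token_logprobs are the model's log-probabilities of the
    generated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def has_repetition(tokens, n=4):
    """Flag a generation that contains any duplicated n-gram (a proxy for
    the repetition-frequency metric; n = 4 is an assumption)."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) < len(grams)
```

For instance, a sequence assigned probability 0.5 at every token has perplexity exactly 2, and a generation stuck in a `[1, 2, 3]` loop is flagged while a non-repeating sequence is not.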