Decoding Game: On Minimax Optimality of Heuristic Text Generation Strategies

Authors: Sijin Chen, Omar Hagrass, Jason Klusowski

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerical experiments are conducted to complement our theoretical analysis. Building on the general theory, in Section 5, we propose Game sampling (Algorithm 1) and empirically evaluate its performance in open-ended text generation with GPT-2 models. The experiments suggest that Game sampling is able to outperform other strategies, which corroborates our optimality results.
Researcher Affiliation | Academia | Sijin Chen, Omar Hagrass, Jason M. Klusowski; Princeton University, Princeton, NJ 08540, USA
Pseudocode | Yes | Algorithm 1: Game sampling.
Input: 0 < ϵ ≤ 1 and τ > 0
if τ = 1 then ▷ log-likelihood objective
  compute S_I = Σ_{i=1}^{I−1} p̂_i log(p̂_i / p̂_I)
  find Î = max{I : S_I ≤ ϵ}
  set q_i = p̂_i · 1(1 ≤ i ≤ Î)
else if τ ≠ 1 then ▷ temperature-sampling objective
  compute S_I = (1 − 1/τ)^{−1} Σ_{i=1}^{I−1} p̂_i [1 − (p̂_i / p̂_I)^{1/τ − 1}]
  find Î = max{I : S_I ≤ ϵ}
  set q_i = p̂_i^{1/τ} · 1(1 ≤ i ≤ Î)
end if
normalize q: q_i ← q_i / Σ_{j=1}^d q_j
sample the next word based on the distribution q
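The pseudocode above can be sketched in NumPy. This is a minimal illustration, not the authors' released implementation: the function name `game_sampling` is ours, and the exponent in the τ ≠ 1 branch is written so that the objective recovers the τ = 1 (log-likelihood) objective in the limit τ → 1.

```python
import numpy as np

def game_sampling(p_hat, eps=0.1, tau=2.0, rng=None):
    """Sketch of Game sampling: truncate the descending-sorted next-token
    distribution once the divergence S_I exceeds the budget eps, temper
    the surviving probabilities by 1/tau, renormalize, and sample."""
    rng = rng or np.random.default_rng()
    order = np.argsort(p_hat)[::-1]      # indices sorted by descending probability
    p = p_hat[order]
    d = len(p)

    I_hat = 1                            # S_1 = 0 <= eps always holds
    for I in range(2, d + 1):
        head, pivot = p[:I - 1], p[I - 1]
        if tau == 1.0:                   # log-likelihood objective
            S = np.sum(head * np.log(head / pivot))
        else:                            # temperature-sampling objective
            S = np.sum(head * (1.0 - (head / pivot) ** (1.0 / tau - 1.0))) / (1.0 - 1.0 / tau)
        if S <= eps:
            I_hat = I
        else:
            break                        # S_I is nondecreasing in I

    exponent = 1.0 if tau == 1.0 else 1.0 / tau
    q = np.zeros(d)
    q[:I_hat] = p[:I_hat] ** exponent    # keep only the top I_hat tokens
    q /= q.sum()                         # normalize the truncated distribution

    q_full = np.zeros(d)                 # map q back to original vocabulary order
    q_full[order] = q
    token = rng.choice(d, p=q_full)
    return token, q_full, I_hat
```

For example, with p̂ = (0.5, 0.3, 0.15, 0.05), ϵ = 0.3, and τ = 1, only the top two tokens survive (Î = 2, since S_2 ≈ 0.26 ≤ 0.3 < S_3 ≈ 0.81), giving q = (0.625, 0.375, 0, 0).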
Open Source Code | Yes | The code is available at https://github.com/omar-hagrass/decoding-game.
Open Datasets | Yes | We conduct an open-ended text generation task using web text from the GPT-2 output dataset. For each of the 5,000 articles in the Webtext test set, we use the first 35 tokens as prompts, with a maximum generation length of 256 tokens.
Dataset Splits | Yes | For each of the 5,000 articles in the Webtext test set, we use the first 35 tokens as prompts, with a maximum generation length of 256 tokens. ... for comparison with human-written text, we use the corresponding human continuations from the test set, up to a maximum of 256 tokens.
Hardware Specification | No | The paper mentions various models (GPT-2, GPT-J-6B, Llama-2-7B) but does not specify the hardware (e.g., GPU, CPU) used for running the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | For each of the 5,000 articles in the Webtext test set, we use the first 35 tokens as prompts, with a maximum generation length of 256 tokens. For each type of GPT-2 model (Small, Medium, Large, XL) (Radford et al., 2019), GPT-J-6B (Wang & Komatsuzaki, 2021), and Llama-2-7B (Touvron et al., 2023), we evaluate the following metrics:
1. Perplexity: the perplexity of the generated text under the corresponding model.
2. Repetition frequency: the fraction of generations with repetitions.
3. MAUVE score (Pillutla et al., 2021): for comparison with human-written text.
We use the best-performing hyperparameters for each strategy as determined by the MAUVE score. ... We also experiment with different values of ϵ and τ, with details in Appendix B. In general, larger values of ϵ tend to produce better results in terms of higher MAUVE score, lower repetition frequency, and human-level perplexity. Smaller ϵ values reduce perplexity, but at the expense of more repetitions and lower MAUVE scores. In terms of τ, the best performance for each ϵ choice was achieved at τ = 2, yielding the highest MAUVE score.
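The first two metrics can be computed from token-level quantities alone; a small sketch follows. Both helper names are ours, and the 4-gram duplicate test is a simple proxy for the paper's repetition criterion, not its exact definition:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token,
    where token_logprobs are the model's log-probabilities of the
    generated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def has_repetition(tokens, n=4):
    """Flag a generation that contains any duplicated n-gram (a proxy for
    the repetition-frequency metric; n = 4 is an assumption)."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) < len(grams)
```

For instance, a sequence assigned probability 0.5 at every token has perplexity exactly 2, and a generation stuck in a `[1, 2, 3]` loop is flagged while a non-repeating sequence is not.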