Decoding Game: On Minimax Optimality of Heuristic Text Generation Strategies
Authors: Sijin Chen, Omar Hagrass, Jason Klusowski
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments are conducted to complement our theoretical analysis. Building on the general theory, in Section 5, we propose Game sampling (Algorithm 1) and empirically evaluate its performance in open-ended text generation with GPT-2 models. The experiments suggest that Game sampling is able to outperform other strategies, which corroborates our optimality results. |
| Researcher Affiliation | Academia | Sijin Chen, Omar Hagrass, Jason M. Klusowski (Princeton University, Princeton, NJ 08540, USA) |
| Pseudocode | Yes | Algorithm 1 (Game sampling). Input: 0 < ϵ ≤ 1 and τ > 0. If τ = 1 (log-likelihood objective): compute S_I = Σ_{i=1}^{I−1} p̂_i log(p̂_i / p̂_I); find Î = max{I : S_I ≤ ϵ}; set q_i = p̂_i · 1(1 ≤ i ≤ Î). Else, if τ ≠ 1 (temperature-sampling objective): compute S_I = (1 − 1/τ)^{−1} Σ_{i=1}^{I−1} p̂_i [(p̂_i / p̂_I)^{1−1/τ} − 1]; find Î = max{I : S_I ≤ ϵ}; set q_i = p̂_i^{1/τ} · 1(1 ≤ i ≤ Î). Normalize q_i ← q_i / Σ_{j=1}^{d} q_j and sample the next word from the distribution q. |
| Open Source Code | Yes | The code is available at https://github.com/omar-hagrass/decoding-game. |
| Open Datasets | Yes | We conduct an open-ended text generation task using web text from the GPT-2 output dataset. For each of the 5,000 articles in the Webtext test set, we use the first 35 tokens as prompts, with a maximum generation length of 256 tokens. |
| Dataset Splits | Yes | For each of the 5,000 articles in the Webtext test set, we use the first 35 tokens as prompts, with a maximum generation length of 256 tokens. ... for comparison with human-written text, we use the corresponding human continuations from the test set, up to a maximum of 256 tokens. |
| Hardware Specification | No | The paper mentions various models (GPT-2, GPT-J-6B, Llama-2-7B) but does not specify the hardware (e.g., GPU, CPU) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | For each of the 5,000 articles in the Webtext test set, we use the first 35 tokens as prompts, with a maximum generation length of 256 tokens. For each type of GPT-2 model (Small, Medium, Large, XL) (Radford et al., 2019), GPT-J-6B (Wang & Komatsuzaki, 2021), and Llama-2-7B (Touvron et al., 2023), we evaluate the following metrics: 1. Perplexity: The perplexity of the generated text under the corresponding model. 2. Repetition frequency: The fraction of generations with repetitions. ... 3. MAUVE score (Pillutla et al., 2021): for comparison with human-written text... We use the best-performing hyperparameters for each strategy as determined by the MAUVE score. ... We also experiment with different values of ϵ and τ, with details in Appendix B. In general, larger values of ϵ tend to produce better results in terms of higher MAUVE score, lower repetition frequency, and human-level perplexity. Smaller ϵ values reduce perplexity, but at the expense of more repetitions and lower MAUVE scores. In terms of τ, the best performance for each ϵ choice was achieved around τ = 2, yielding the highest MAUVE score. |
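The Game sampling pseudocode above can be sketched in NumPy. This is an illustrative reconstruction from the paper's Algorithm 1, not the authors' released code (see their repository for the real implementation); the function and argument names are our own, and it assumes strictly positive next-token probabilities.

```python
import numpy as np

def game_sampling_distribution(p, eps=0.3, tau=2.0):
    """Sketch of Algorithm 1 (Game sampling): truncate the next-token
    distribution at the largest cutoff I with S_I <= eps, apply
    temperature 1/tau, and renormalize. Assumes p > 0 elementwise."""
    order = np.argsort(p)[::-1]          # sort probabilities descending
    p_hat = p[order]
    d = len(p_hat)

    # Find the largest cutoff I_hat = max{I : S_I <= eps}.
    I_hat = 1
    for I in range(2, d + 1):
        head, tail = p_hat[:I - 1], p_hat[I - 1]
        if tau == 1.0:                   # log-likelihood objective
            S = np.sum(head * np.log(head / tail))
        else:                            # temperature-sampling objective
            s = 1.0 - 1.0 / tau
            S = np.sum(head * ((head / tail) ** s - 1.0)) / s
        if S <= eps:
            I_hat = I

    # Keep the top-I_hat tokens, temper by 1/tau, renormalize.
    q_sorted = np.zeros(d)
    q_sorted[:I_hat] = p_hat[:I_hat] ** (1.0 / tau)
    q_sorted /= q_sorted.sum()

    q = np.zeros(d)                      # map back to original token order
    q[order] = q_sorted
    return q
```

The next token would then be drawn with `np.random.choice(len(q), p=q)`. Note that for τ = 1 the tempering exponent 1/τ is 1, so the two branches differ only in how the cutoff statistic S_I is computed, matching the pseudocode's structure.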