The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation

Authors: Fredrik Carlsson, Fangyu Liu, Daniel Ward, Murathan Kurfali, Joakim Nivre

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We find that by further fine-tuning these models to achieve a near-zero training loss on a small set of samples, a process we refer to as hyperfitting, the long-sequence generative capabilities are greatly enhanced. Greedy decoding with these hyperfitted models even outperforms Top-P sampling over long sequences, both in terms of diversity and human preferences. To thoroughly evaluate the models' ability to generate text in an open-ended setting, we conduct an extensive human evaluation study with verified English speakers independently hired as freelancers. In Table 1 we report the percentage of times that a continuation was either preferred or judged equally good to the original.
Researcher Affiliation | Collaboration | Fredrik Carlsson*, Fangyu Liu, Daniel Ward, Murathan Kurfali*, Joakim Nivre (*RISE Research Institutes of Sweden; Google DeepMind; PwC Sweden; Uppsala University)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Skeleton code and models available at: github.com/FreddeFrallan/Hyperfitting
Open Datasets | Yes | All hyperfitted LLMs use the identical samples from the Fiction-Stories dataset (Forsythe, 2024). ... each of three datasets: Wikipedia (Merity et al., 2017), Fictional Stories (Forsythe, 2024), and BBC News (Li et al., 2024). ... To investigate the hyperfitting phenomenon for an additional modality, we hyperfit ImageGPT-Large (774M parameters) (Chen et al., 2020) on 2,000 randomly selected images from CIFAR-10.
Dataset Splits | Yes | For all our experiments we train the model via the next-token prediction objective ... all LLMs use the following training setup: 20 epochs on 2000 randomly selected sequences from a given dataset, with a length of 256 tokens. ... evaluating them at the end of each epoch on a validation set of 128 images.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers.
Experiment Setup | Yes | Unless otherwise specified, all LLMs use the following training setup: 20 epochs on 2000 randomly selected sequences from a given dataset, with a length of 256 tokens. We update all the model's parameters using the Adam optimizer with a learning rate of 1e-6 without weight decay, and use a batch size of 8.
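The paper's central claim, that greedy decoding from a hyperfitted (sharpened) model can match or beat Top-P sampling, can be illustrated with a toy sketch. This is not the authors' code: the distributions, function names, and the `p=0.9` cutoff below are illustrative assumptions, used only to show how the two decoding rules behave on a sharpened versus a flat next-token distribution.

```python
# Toy contrast between greedy decoding and Top-P (nucleus) sampling.
# On a sharpened distribution, the Top-P nucleus collapses to a single
# token, so the two decoding rules coincide.
import random

def greedy(probs):
    """Pick the single most probable token."""
    return max(probs, key=probs.get)

def top_p(probs, p=0.9, rng=random):
    """Sample from the smallest set of tokens whose cumulative mass >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, total = [], 0.0
    for tok, pr in ranked:
        nucleus.append((tok, pr))
        total += pr
        if total >= p:
            break
    toks, weights = zip(*nucleus)
    return rng.choices(toks, weights=weights, k=1)[0]

# A "sharpened" (hyperfitted-like) distribution: nearly all mass on one token.
sharp = {"the": 0.97, "a": 0.02, "an": 0.01}
# A flatter distribution more typical of a regular model.
flat = {"the": 0.4, "a": 0.35, "an": 0.25}

print(greedy(sharp))        # "the"
print(top_p(sharp, p=0.9))  # nucleus is just {"the"}, so also "the"
```

With `flat`, the nucleus contains several tokens and `top_p` becomes stochastic, which is the usual motivation for sampling from non-sharpened models.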
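The reported learning rate of 1e-6 is unusually small, which matters for reproduction. The single-parameter Adam step below is a minimal sketch of why: after bias correction, the first update has magnitude approximately equal to the learning rate. This is not the authors' code; the beta1/beta2/eps values are the common framework defaults, assumed here since the paper does not state them.

```python
# One Adam update for a scalar parameter, using the paper's reported
# lr = 1e-6 and no weight decay. beta1/beta2/eps are assumed defaults.
import math

def adam_step(w, g, m, v, t, lr=1e-6, beta1=0.9, beta2=0.999, eps=1e-8):
    """Single Adam step: returns updated (parameter, moments)."""
    m = beta1 * m + (1 - beta1) * g       # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# First step with a unit gradient: the update is ~lr, i.e. ~1e-6.
w, m, v = adam_step(1.0, 1.0, 0.0, 0.0, t=1)
print(w)
```

With per-step updates this small, driving the training loss to near zero plausibly requires the full 20 epochs over the 2000 sequences the paper reports, rather than a brief fine-tune.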