The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation
Authors: Fredrik Carlsson, Fangyu Liu, Daniel Ward, Murathan Kurfali, Joakim Nivre
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find that by further fine-tuning these models to achieve a near-zero training loss on a small set of samples, a process we refer to as hyperfitting, the long-sequence generative capabilities are greatly enhanced. Greedy decoding with these hyperfitted models even outperforms Top-P sampling over long sequences, both in terms of diversity and human preferences. To thoroughly evaluate the models' ability to generate text in an open-ended setting, we conduct an extensive human evaluation study with verified English speakers independently hired as freelancers. In Table 1 we report the percentage of times that a continuation was either preferred or judged equally good to the original. |
| Researcher Affiliation | Collaboration | Fredrik Carlsson*, Fangyu Liu, Daniel Ward, Murathan Kurfali*, Joakim Nivre. *RISE Research Institutes of Sweden; Google DeepMind; PwC Sweden; Uppsala University |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Skeleton code and models available at: github.com/FreddeFrallan/Hyperfitting |
| Open Datasets | Yes | All hyperfitted LLMs use the identical samples from the Fiction-Stories dataset (Forsythe, 2024). ...each of three datasets: Wikipedia (Merity et al., 2017), Fictional Stories (Forsythe, 2024), and BBC News (Li et al., 2024). ...To investigate the hyperfitting phenomenon for an additional modality, we hyperfit ImageGPT-Large (774M parameters) (Chen et al., 2020) on 2,000 randomly selected images from CIFAR-10. |
| Dataset Splits | Yes | For all our experiments we train the model via the next-token prediction objective... all LLMs use the following training setup: 20 epochs on 2000 randomly selected sequences from a given dataset, with a length of 256 tokens. ...evaluating them at the end of each epoch on a validation set of 128 images. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. |
| Experiment Setup | Yes | Unless otherwise specified, all LLMs use the following training setup: 20 epochs on 2000 randomly selected sequences from a given dataset, with a length of 256 tokens. We update all the model's parameters using the Adam optimizer with a learning rate of 1e-6 without weight decay, and use a batch size of 8. |
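The experiment-setup row above describes the core hyperfitting recipe: train with the next-token prediction objective on a small fixed sample set for many epochs until training loss is near zero, then decode greedily. The stdlib-only sketch below illustrates that recipe on a deliberately tiny scale; the bigram model, corpus, learning rate, and epoch count are our own toy assumptions, not the paper's LLM configuration (which uses Adam at 1e-6 on full model parameters).

```python
import math

# Toy sketch of hyperfitting: drive training loss toward zero on a
# small fixed dataset, then generate with greedy decoding.
corpus = "abcabcabcabc"
vocab = sorted(set(corpus))
idx = {c: i for i, c in enumerate(vocab)}
V = len(vocab)

# Next-token "model": a table of logits[prev][next], trained by
# plain gradient descent on the cross-entropy loss.
logits = [[0.0] * V for _ in range(V)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

pairs = [(idx[a], idx[b]) for a, b in zip(corpus, corpus[1:])]

lr = 0.5
loss = float("inf")
for epoch in range(200):  # many epochs on the same small set
    loss = 0.0
    for prev, nxt in pairs:
        probs = softmax(logits[prev])
        loss += -math.log(probs[nxt])
        # Cross-entropy gradient for the logits: probs - one_hot(nxt)
        for j in range(V):
            grad = probs[j] - (1.0 if j == nxt else 0.0)
            logits[prev][j] -= lr * grad
    loss /= len(pairs)

def greedy_decode(start, n):
    """Always pick the argmax next token, as in the paper's greedy setting."""
    out = [start]
    cur = idx[start]
    for _ in range(n):
        probs = softmax(logits[cur])
        cur = probs.index(max(probs))
        out.append(vocab[cur])
    return "".join(out)

print(f"final training loss: {loss:.4f}")
print(greedy_decode("a", 8))
```

Because the toy corpus is deterministic (each token has a unique successor), the training loss collapses toward zero and greedy decoding reproduces the pattern without degenerating into repetition of a single token, mirroring the behavior the paper reports for hyperfitted models at a much larger scale.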