The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation

Authors: Fredrik Carlsson, Fangyu Liu, Daniel Ward, Murathan Kurfali, Joakim Nivre

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We find that by further fine-tuning these models to achieve a near-zero training loss on a small set of samples, a process we refer to as hyperfitting, the long-sequence generative capabilities are greatly enhanced. Greedy decoding with these hyperfitted models even outperforms Top-P sampling over long sequences, both in terms of diversity and human preferences. To thoroughly evaluate the models' ability to generate text in an open-ended setting, we conduct an extensive human evaluation study with verified English speakers independently hired as freelancers. In Table 1 we report the percentage of times that a continuation was either preferred or judged equally good to the original.
Researcher Affiliation | Collaboration | Fredrik Carlsson*, Fangyu Liu, Daniel Ward, Murathan Kurfali*, Joakim Nivre (*RISE Research Institutes of Sweden; Google DeepMind; PwC Sweden; Uppsala University)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Skeleton code and models available at: github.com/FreddeFrallan/Hyperfitting
Open Datasets | Yes | All hyperfitted LLMs use the identical samples from the Fiction-Stories dataset (Forsythe, 2024). ... each of three datasets: Wikipedia (Merity et al., 2017), Fictional Stories (Forsythe, 2024), and BBC News (Li et al., 2024). ... To investigate the hyperfitting phenomenon for an additional modality, we hyperfit ImageGPT-Large (774M parameters) (Chen et al., 2020) on 2,000 randomly selected images from CIFAR-10.
Dataset Splits | Yes | For all our experiments we train the model via the next-token prediction objective ... all LLMs use the following training setup: 20 epochs on 2000 randomly selected sequences from a given dataset, with a length of 256 tokens. ... evaluating them at the end of each epoch on a validation set of 128 images.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers.
Experiment Setup | Yes | Unless otherwise specified, all LLMs use the following training setup: 20 epochs on 2000 randomly selected sequences from a given dataset, with a length of 256 tokens. We update all the model's parameters using the Adam optimizer with a learning rate of 1e-6 without weight decay, and use a batch size of 8.
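The paper's central claim, that greedy decoding from a hyperfitted (sharpened) model can match or beat Top-P sampling, can be illustrated with a toy sketch. This is not the authors' code: the distributions, function names, and the `p=0.9` cutoff below are illustrative assumptions, used only to show how the two decoding rules behave on a sharpened versus a flat next-token distribution.

```python
# Toy contrast between greedy decoding and Top-P (nucleus) sampling.
# On a sharpened distribution, the Top-P nucleus collapses to a single
# token, so the two decoding rules coincide.
import random

def greedy(probs):
    """Pick the single most probable token."""
    return max(probs, key=probs.get)

def top_p(probs, p=0.9, rng=random):
    """Sample from the smallest set of tokens whose cumulative mass >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, total = [], 0.0
    for tok, pr in ranked:
        nucleus.append((tok, pr))
        total += pr
        if total >= p:
            break
    toks, weights = zip(*nucleus)
    return rng.choices(toks, weights=weights, k=1)[0]

# A "sharpened" (hyperfitted-like) distribution: nearly all mass on one token.
sharp = {"the": 0.97, "a": 0.02, "an": 0.01}
# A flatter distribution more typical of a regular model.
flat = {"the": 0.4, "a": 0.35, "an": 0.25}

print(greedy(sharp))        # "the"
print(top_p(sharp, p=0.9))  # nucleus is just {"the"}, so also "the"
```

With `flat`, the nucleus contains several tokens and `top_p` becomes stochastic, which is the usual motivation for sampling from non-sharpened models.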
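The reported learning rate of 1e-6 is unusually small, which matters for reproduction. The single-parameter Adam step below is a minimal sketch of why: after bias correction, the first update has magnitude approximately equal to the learning rate. This is not the authors' code; the beta1/beta2/eps values are the common framework defaults, assumed here since the paper does not state them.

```python
# One Adam update for a scalar parameter, using the paper's reported
# lr = 1e-6 and no weight decay. beta1/beta2/eps are assumed defaults.
import math

def adam_step(w, g, m, v, t, lr=1e-6, beta1=0.9, beta2=0.999, eps=1e-8):
    """Single Adam step: returns updated (parameter, moments)."""
    m = beta1 * m + (1 - beta1) * g       # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# First step with a unit gradient: the update is ~lr, i.e. ~1e-6.
w, m, v = adam_step(1.0, 1.0, 0.0, 0.0, t=1)
print(w)
```

With per-step updates this small, driving the training loss to near zero plausibly requires the full 20 epochs over the 2000 sequences the paper reports, rather than a brief fine-tune.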