How new data permeates LLM knowledge and how to dilute it

Authors: Chen Sun, Renat Aksitov, Andrey Zhmoginov, Nolan Miller, Max Vladymyrov, Ulrich Rueckert, Been Kim, Mark Sandler

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that when learning new information, LLMs exhibit a "priming" effect... To systematically study this phenomenon, we introduce "Outlandish," a carefully curated dataset... Using this dataset, we show that the degree of priming after learning new information can be predicted by measuring the token probability... This relationship holds robustly across different model architectures (PALM-2, Gemma, Llama), sizes, and training stages. Finally, we develop two novel techniques... reduce undesirable priming effects by 50-95%...
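The row above quotes the paper's central predictive claim: the pre-learning token probability of a keyword predicts how much priming it causes. A minimal sketch of that measurement, with illustrative function names and an assumed threshold (neither is from the paper's code), looks like:

```python
import math

def keyword_log_probability(conditional_logprobs):
    """Joint log-probability of a multi-token keyword: the sum of its
    tokens' conditional log-probs under the model *before* any gradient
    update on the new text."""
    return sum(conditional_logprobs)

def predicts_strong_priming(keyword_logprob, threshold=math.log(1e-3)):
    """Hypothetical decision rule mirroring the reported relationship:
    low-probability (surprising) keywords are the ones expected to
    produce stronger priming after learning. The threshold here is an
    arbitrary illustrative value, not one reported by the authors."""
    return keyword_logprob < threshold
```

For example, a rare keyword whose tokens score log-probs of -8.1 and -2.3 has joint log-prob -10.4, well below the illustrative cutoff of log(1e-3) ≈ -6.9, so this rule would flag it as likely to prime the model.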
Researcher Affiliation | Industry | Chen Sun, Renat Aksitov, Andrey Zhmoginov, Nolan Andrew Miller, Max Vladymyrov, Ulrich Rueckert, Been Kim, Mark Sandler (Google DeepMind)
Pseudocode | No | The paper includes mathematical formulas and descriptions of procedures (e.g., in Sections A.6 and A.7) but does not contain any structured pseudocode or algorithm blocks clearly labeled as such.
Open Source Code | Yes | Further materials: https://sunchipsster1.github.io/projects/outlandish/.
Open Datasets | Yes | We introduce "Outlandish," a carefully curated dataset of 1320 diverse text samples designed to probe how new knowledge permeates through an LLM's existing knowledge base. Using this dataset... Further materials: https://sunchipsster1.github.io/projects/outlandish/. ... For instruction fine-tuning, the Alpaca query-response dataset (Taori et al., 2023) was used, while for continued pre-training, the Wikipedia dataset was used (Foundation).
Dataset Splits | No | Each Outlandish sample was learned by a language model using a gradient update on a typical next-word prediction loss... Insertion of an Outlandish sample occurred as the replacement of one sample of the minibatch (size 8 for computational expediency) with the input text, for 20 to 40 consecutive minibatches. ... Each of these required 1320 separate experiments, one for each of the Outlandish samples in turn. The paper describes the learning of individual samples and the use of existing datasets such as Alpaca and Wikipedia for fine-tuning/pre-training, but does not provide explicit train/test/validation splits for any custom dataset or a specific splitting methodology for the existing ones.
Hardware Specification | No | The paper mentions evaluating models like PALM-2, Gemma, and Llama across different sizes, but does not specify any particular hardware components (e.g., GPU models, CPU types, or TPU versions) used for these experiments.
Software Dependencies | No | Section A.4 "TRAINING PROCEDURES" states "learning was conducted using the adam optimizer with constant learning rate 5e-5." While Adam is an optimizer, no specific software libraries (e.g., TensorFlow, PyTorch) or their version numbers are mentioned.
Experiment Setup | Yes | Learning took place in both instruction fine-tuning and continued pre-training tasks... Learning was conducted using the Adam optimizer with a constant learning rate of 5e-5. In all experiments a minibatch size of 8 was used for computational expediency. Models tested included PALM-2-xs, PALM-2-s, FLAN, GEMMA-2b, and LLAMA-7b. Insertion of an Outlandish sample occurred as the replacement of one sample of the minibatch with the input text, for 20 to 40 consecutive minibatches (20 for all experiments on Alpaca, 40 for experiments on Wikipedia, though 20 for Wikipedia was sufficient to exhibit the robust keyword-probability vs. priming relationship, Fig. 21).
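The insertion scheme described in this row (one sample per minibatch replaced with the Outlandish text for a fixed number of consecutive minibatches) can be sketched on the data side as follows. This is a minimal illustration with invented names; the authors' actual data pipeline is not released in this form.

```python
import random

def build_training_stream(base_pool, outlandish_text, batch_size=8,
                          num_insert_batches=20, seed=0):
    """Yield minibatches drawn from the base dataset (e.g. Alpaca or
    Wikipedia), where for the first `num_insert_batches` minibatches
    one randomly chosen slot is replaced with the Outlandish sample,
    matching the 20 (Alpaca) / 40 (Wikipedia) settings quoted above."""
    rng = random.Random(seed)
    for _ in range(num_insert_batches):
        # Draw a minibatch of size 8 from the base corpus.
        batch = [rng.choice(base_pool) for _ in range(batch_size)]
        # Replace exactly one sample with the Outlandish text.
        batch[rng.randrange(batch_size)] = outlandish_text
        yield batch
```

Each yielded batch would then feed an ordinary next-word-prediction gradient step (Adam, lr 5e-5, per the row above); the training step itself is omitted here since the paper does not specify the framework used.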