On Linear Representations and Pretraining Data Frequency in Language Models
Authors: Jack Merullo, Noah A. Smith, Sarah Wiegreffe, Yanai Elazar
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the connection between pretraining data frequency and models' linear representations of factual relations (e.g., mapping France to Paris in a capital prediction task). We find evidence that the formation of linear representations is strongly connected to pretraining term frequencies... In OLMo-7B and GPT-J (6B), we discover that a linear representation consistently (but not exclusively) forms when the subjects and objects within a relation co-occur at least 1k and 2k times, respectively... Our results are summarized in Figure 2. We report training tokens because the step count differs between the 7B and 1B models. Co-occurrence frequencies highly correlate with causality (r = 0.82). |
| Researcher Affiliation | Collaboration | Jack Merullo, Noah A. Smith, Sarah Wiegreffe, Yanai Elazar (Brown University; Allen Institute for AI (Ai2); University of Washington). Co-senior authors. jack EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods and computations using natural language and mathematical equations, such as Equation 1 for LREs, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release our code to support future work.1 1Code is available at https://github.com/allenai/freq, and for efficient batch search at https://github.com/allenai/batchsearch. |
| Open Datasets | Yes | We use a subset of the RELATIONS dataset of Hernandez et al. (2024), focusing on the 25 factual relations of the dataset... We use the OLMo model v1.7 (0424 7B and 0724 1B) (Groeneveld et al., 2024) and GPT-J (6B) (Wang & Komatsuzaki, 2021) and their corresponding datasets: Dolma (Soldaini et al., 2024) and the Pile (Gao et al., 2020), respectively. |
| Dataset Splits | Yes | Following Hernandez et al. (2024), we fit an LRE for each relation on 8 examples from that relation, each with a 5-shot prompt... We fit 24 models such that each relation is held out once per random seed across 4 seeds... In all settings, the held-out set's objects and relations are guaranteed not to have been in the training set. |
| Hardware Specification | Yes | Using our implementation, we are able to complete this on 900 CPUs in about a day. |
| Software Dependencies | No | The paper mentions 'Cython bindings' and 'existing libraries' but does not provide specific version numbers for any software dependencies used in the experiments. |
| Experiment Setup | Yes | Following Hernandez et al. (2024), we fit an LRE for each relation on 8 examples from that relation, each with a 5-shot prompt... We train a random forest regression model with 100 decision tree estimators to predict the frequency of terms... Hyperparameter sweeps are in Appendix C. Beta is exclusive to measuring faithfulness and rank is exclusive to causality. We test the same ranges for each as in Hernandez et al. (2024): beta in [0, 5] and rank in [0, full rank], at varying intervals. Those intervals are every 2 from [0,100], every 5 from [100,200], every 25 from [200, 500], every 50 from [500, 1000], every 250 from [1000, hidden size]. |
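The LRE setup quoted above (following Hernandez et al., 2024) approximates a relation as an affine map on the subject's hidden state, with a scaling factor beta for faithfulness and a low-rank truncation of the weight matrix for causal interventions. A minimal sketch of that shape, using synthetic hidden states and a least-squares fit as a stand-in for the paper's mean-Jacobian estimate (all variable names here are illustrative, not the authors'):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden size (illustrative; real models use e.g. 4096)

# Synthetic "hidden states": 8 subject/object pairs per relation, as in the paper.
W_true = rng.normal(size=(d, d)) / np.sqrt(d)
S = rng.normal(size=(8, d))
O = S @ W_true.T + 0.01 * rng.normal(size=(8, d))

# Fit an affine map O ~ W s + b by least squares (stand-in for the mean Jacobian).
S_aug = np.hstack([S, np.ones((8, 1))])
coef, *_ = np.linalg.lstsq(S_aug, O, rcond=None)
W, b = coef[:-1].T, coef[-1]

beta = 1.0  # faithfulness scaling; the paper sweeps beta over [0, 5]

def lre(s, rank=None):
    """Apply the LRE; optionally truncate W to a low rank, as used for causal edits."""
    W_r = W
    if rank is not None:
        U, sv, Vt = np.linalg.svd(W)
        W_r = (U[:, :rank] * sv[:rank]) @ Vt[:rank]
    return beta * (W_r @ s) + b

pred = lre(S[0])            # full-rank, faithfulness-style prediction
pred_low = lre(S[0], rank=4)  # low-rank variant for causality-style tests
```

With only 8 fit examples the system is underdetermined, so the least-squares map reproduces the fit set exactly; faithfulness and causality are then evaluated on held-out subjects.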
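The causal-rank sweep intervals listed in the last row can be generated programmatically. A small helper, assuming the interval endpoints are inclusive as written (the paper only says "at varying intervals", so the exact boundary handling is a guess):

```python
def rank_sweep(hidden_size):
    """Candidate ranks for the causality sweep: every 2 in [0, 100],
    every 5 in (100, 200], every 25 in (200, 500], every 50 in (500, 1000],
    every 250 in (1000, hidden_size]."""
    ranks = list(range(0, 101, 2))
    ranks += list(range(105, 201, 5))
    ranks += list(range(225, 501, 25))
    ranks += list(range(550, 1001, 50))
    ranks += list(range(1250, hidden_size + 1, 250))
    return ranks
```

For a 4096-dimensional model this yields a grid that is dense at low ranks and sparse near full rank, which is the usual shape for such sweeps.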