Attributing Culture-Conditioned Generations to Pretraining Corpora
Authors: Huihan Li, Arnav Goel, Keyu He, Xiang Ren
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using MEMOED on culture-conditioned generations about food and clothing for 110 cultures, we find that high-frequency cultures in pretraining data yield more generations with memorized symbols, while some low-frequency cultures produce none. |
| Researcher Affiliation | Academia | ¹University of Southern California, ²IIIT Delhi |
| Pseudocode | Yes | Algorithm 1 Calculate minimum token distance between two n-grams |
| Open Source Code | Yes | https://github.com/huihanlhh/Culture Gen Attr |
| Open Datasets | Yes | We conduct all of our analysis on OLMo-7B (Groeneveld et al., 2024) and its pretraining corpora Dolma (Soldaini et al., 2024), as OLMo-7B is the most capable generative large language model with open-sourced and indexed pretraining data. |
| Dataset Splits | No | The paper analyzes outputs from a pre-trained model and generates data for its analysis. It describes sampling 100 generations for male, female, and gender-agnostic settings, resulting in 300 generations per culture, but does not provide traditional training/test/validation splits for a dataset used in training or evaluation of a new model. |
| Hardware Specification | No | The paper analyzes outputs from OLMo-7B and its pretraining data but does not specify the hardware (e.g., GPU models, CPU types) used by the authors to conduct their analysis or generate these outputs. |
| Software Dependencies | No | The paper mentions software tools like huggingface, LLAMA-3-70b-instruct, OLMo-Instruct-7B, LDA, and XLM-RoBERTa-large embeddings, but does not provide specific version numbers for these software components or libraries. |
| Experiment Setup | Yes | We use the default model implementations from huggingface, setting temperature=1.0, top_p=0.95, top_k=50, max_tokens=30 and num_return_sequences=100, and a period ( . ) as the stopping criterion. ... Therefore, we set the threshold of z-score to 2.6 (> 99.5% of CG3) to find outliers in the distribution and classify the symbols as memorized for cultures whose z-score is above the threshold. |
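The pseudocode row cites "Algorithm 1: Calculate minimum token distance between two n-grams". The paper's exact implementation is not reproduced here; below is a minimal sketch of one plausible reading, where tokens are pre-tokenized strings and the function names (`find_occurrences`, `min_token_distance`) are illustrative, not from the paper.

```python
def find_occurrences(tokens, ngram):
    """Return the start indices of every occurrence of `ngram` in `tokens`."""
    n = len(ngram)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == ngram]

def min_token_distance(tokens, ngram_a, ngram_b):
    """Minimum number of tokens separating any occurrence of ngram_a
    from any occurrence of ngram_b; 0 if they overlap; None if either
    n-gram is absent from the token sequence."""
    pos_a = find_occurrences(tokens, ngram_a)
    pos_b = find_occurrences(tokens, ngram_b)
    if not pos_a or not pos_b:
        return None
    best = None
    for i in pos_a:
        for j in pos_b:
            if j >= i + len(ngram_a):      # b starts after a ends
                d = j - (i + len(ngram_a))
            elif i >= j + len(ngram_b):    # a starts after b ends
                d = i - (j + len(ngram_b))
            else:                          # overlapping spans
                d = 0
            if best is None or d < best:
                best = d
    return best
```

For example, in the token sequence `["japan", "people", "often", "eat", "sushi"]`, the 1-grams `["japan"]` and `["sushi"]` are separated by three tokens.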
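The experiment-setup row describes classifying a symbol as memorized for a culture when the culture's z-score in the symbol's count distribution exceeds 2.6. A minimal sketch of that thresholding step, assuming per-culture co-occurrence counts as input (the function name and data layout are illustrative, not from the paper):

```python
from statistics import mean, pstdev

def memorized_cultures(symbol_counts, z_threshold=2.6):
    """Given a mapping {culture: count} for one symbol, return the
    cultures whose count is an outlier (z-score above the threshold),
    i.e. the cultures for which the symbol is classified as memorized."""
    counts = list(symbol_counts.values())
    mu, sigma = mean(counts), pstdev(counts)
    if sigma == 0:          # uniform distribution: no outliers
        return []
    return [culture for culture, n in symbol_counts.items()
            if (n - mu) / sigma > z_threshold]
```

With twenty cultures at count 1 and one culture at count 100, only the latter clears the 2.6 threshold and is flagged as memorized.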