Attributing Culture-Conditioned Generations to Pretraining Corpora

Authors: Huihan Li, Arnav Goel, Keyu He, Xiang Ren

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using MEMOED on culture-conditioned generations about food and clothing for 110 cultures, we find that high-frequency cultures in pretraining data yield more generations with memorized symbols, while some low-frequency cultures produce none.
Researcher Affiliation | Academia | 1: University of Southern California; 2: IIIT Delhi
Pseudocode | Yes | Algorithm 1: Calculate minimum token distance between two n-grams
Open Source Code | Yes | https://github.com/huihanlhh/Culture Gen Attr
Open Datasets | Yes | We conduct all of our analysis on OLMo-7B (Groeneveld et al., 2024) and its pretraining corpora Dolma (Soldaini et al., 2024), as OLMo-7B is the most capable generative large language model with open-sourced and indexed pretraining data.
Dataset Splits | No | The paper analyzes outputs from a pre-trained model and generates data for its own analysis. It samples 100 generations each for male, female, and gender-agnostic settings (300 generations per culture), but provides no training/validation/test splits, since no new model is trained or evaluated.
Hardware Specification | No | The paper analyzes outputs from OLMo-7B and its pretraining data but does not specify the hardware (e.g., GPU models, CPU types) used by the authors to conduct their analysis or generate these outputs.
Software Dependencies | No | The paper mentions software tools such as Hugging Face Transformers, Llama-3-70B-Instruct, OLMo-7B-Instruct, LDA, and XLM-RoBERTa-large embeddings, but does not provide version numbers for these software components or libraries.
Experiment Setup | Yes | We use the default model implementations from Hugging Face, setting temperature=1.0, top_p=0.95, top_k=50, max_tokens=30, and num_return_sequences=100, with a period (.) as the stopping criterion. ... Therefore, we set the z-score threshold to 2.6 (> 99.5% of CG3) to find outliers in the distribution and classify a symbol as memorized for cultures whose z-score is above the threshold.
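Two of the technical details reported above lend themselves to short illustrations: the minimum token distance between two n-grams (Algorithm 1) and the z-score rule for classifying memorized symbols. The sketch below is a hedged reconstruction under assumed interfaces (sorted lists of occurrence start indices for the n-grams, and a culture-to-count mapping for the z-score rule), not the paper's exact implementation.

```python
import statistics

def min_token_distance(positions_a, positions_b, n_a, n_b):
    """Minimum token distance between any occurrence of two n-grams.

    positions_a / positions_b: sorted start indices of each n-gram's
    occurrences in a document; n_a / n_b: the n-gram lengths.
    (Hypothetical interface, not the paper's exact Algorithm 1.)
    """
    best = float("inf")
    i = j = 0
    # Two-pointer sweep over the two sorted occurrence lists.
    while i < len(positions_a) and j < len(positions_b):
        a, b = positions_a[i], positions_b[j]
        if a + n_a <= b:          # occurrence of A ends before B starts
            best = min(best, b - (a + n_a))
            i += 1
        elif b + n_b <= a:        # occurrence of B ends before A starts
            best = min(best, a - (b + n_b))
            j += 1
        else:                     # overlapping occurrences: distance 0
            return 0
    return best

def memorized_cultures(counts, threshold=2.6):
    """Cultures whose symbol count is an outlier (z-score > threshold).

    counts: hypothetical mapping culture -> co-occurrence count for one
    symbol. The paper sets the threshold at 2.6, i.e. above ~99.5% of
    its reference distribution (CG3).
    """
    values = list(counts.values())
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [c for c, v in counts.items() if (v - mean) / std > threshold]
```

The two-pointer sweep keeps the distance computation linear in the number of occurrences, which matters when an n-gram appears many times in a large pretraining document.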