Attributing Culture-Conditioned Generations to Pretraining Corpora
Authors: Huihan Li, Arnav Goel, Keyu He, Xiang Ren
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using MEMOED on culture-conditioned generations about food and clothing for 110 cultures, we find that high-frequency cultures in pretraining data yield more generations with memorized symbols, while some low-frequency cultures produce none. |
| Researcher Affiliation | Academia | ¹University of Southern California, ²IIIT Delhi |
| Pseudocode | Yes | Algorithm 1 Calculate minimum token distance between two n-grams |
| Open Source Code | Yes | https://github.com/huihanlhh/Culture Gen Attr |
| Open Datasets | Yes | We conduct all of our analysis on OLMo-7B (Groeneveld et al., 2024) and its pretraining corpora Dolma (Soldaini et al., 2024), as OLMo-7B is the most capable generative large language model with open-sourced and indexed pretraining data. |
| Dataset Splits | No | The paper analyzes outputs from a pre-trained model and generates data for its analysis. It describes sampling 100 generations for male, female, and gender-agnostic settings, resulting in 300 generations per culture, but does not provide traditional training/test/validation splits for a dataset used in training or evaluation of a new model. |
| Hardware Specification | No | The paper analyzes outputs from OLMo-7B and its pretraining data but does not specify the hardware (e.g., GPU models, CPU types) used by the authors to conduct their analysis or generate these outputs. |
| Software Dependencies | No | The paper mentions software tools like huggingface, LLAMA-3-70b-instruct, OLMo-Instruct-7B, LDA, and XLM-RoBERTa-large embeddings, but does not provide specific version numbers for these software components or libraries. |
| Experiment Setup | Yes | We use the default model implementations from huggingface, setting temperature=1.0, top_p=0.95, top_k=50, max_tokens=30 and num_return_sequences=100, and a period ( . ) as the stopping criterion. ... Therefore, we set the threshold of z-score to 2.6 (> 99.5% of CG3) to find outliers in the distribution and classify the symbols as memorized for cultures whose z-score is above the threshold. |
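The pseudocode row cites "Algorithm 1: Calculate minimum token distance between two n-grams". The paper's exact implementation is not reproduced here; below is a minimal sketch of one plausible reading, where tokens are pre-tokenized strings and the function names (`find_occurrences`, `min_token_distance`) are illustrative, not from the paper.

```python
def find_occurrences(tokens, ngram):
    """Return the start indices of every occurrence of `ngram` in `tokens`."""
    n = len(ngram)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == ngram]

def min_token_distance(tokens, ngram_a, ngram_b):
    """Minimum number of tokens separating any occurrence of ngram_a
    from any occurrence of ngram_b; 0 if they overlap; None if either
    n-gram is absent from the token sequence."""
    pos_a = find_occurrences(tokens, ngram_a)
    pos_b = find_occurrences(tokens, ngram_b)
    if not pos_a or not pos_b:
        return None
    best = None
    for i in pos_a:
        for j in pos_b:
            if j >= i + len(ngram_a):      # b starts after a ends
                d = j - (i + len(ngram_a))
            elif i >= j + len(ngram_b):    # a starts after b ends
                d = i - (j + len(ngram_b))
            else:                          # overlapping spans
                d = 0
            if best is None or d < best:
                best = d
    return best
```

For example, in the token sequence `["japan", "people", "often", "eat", "sushi"]`, the 1-grams `["japan"]` and `["sushi"]` are separated by three tokens.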
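The experiment-setup row describes classifying a symbol as memorized for a culture when the culture's z-score in the symbol's count distribution exceeds 2.6. A minimal sketch of that thresholding step, assuming per-culture co-occurrence counts as input (the function name and data layout are illustrative, not from the paper):

```python
from statistics import mean, pstdev

def memorized_cultures(symbol_counts, z_threshold=2.6):
    """Given a mapping {culture: count} for one symbol, return the
    cultures whose count is an outlier (z-score above the threshold),
    i.e. the cultures for which the symbol is classified as memorized."""
    counts = list(symbol_counts.values())
    mu, sigma = mean(counts), pstdev(counts)
    if sigma == 0:          # uniform distribution: no outliers
        return []
    return [culture for culture, n in symbol_counts.items()
            if (n - mu) / sigma > z_threshold]
```

With twenty cultures at count 1 and one culture at count 100, only the latter clears the 2.6 threshold and is flagged as memorized.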