Data Shapley in One Training Run

Authors: Jiachen (Tianhao) Wang, Prateek Mittal, Dawn Song, Ruoxi Jia

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "We present several case studies that offer fresh insights into pretraining data's contribution and discuss their implications for copyright in generative AI and pretraining data curation. ... We performed various case studies that provide fresh insights into training data's contribution to the foundation model pretraining. ... We empirically assess the computational efficiency of In-Run Data Shapley with 'ghost dot-product' and 'ghost vector-Hessian-vector product' techniques developed in Section 4.2. ... In this section, we directly assess the approximation accuracy of first- and second-order In-Run Data Shapley. ... In this section, we present a case study to demonstrate the use cases of In-Run Data Shapley by pretraining on the well-known Pile dataset (Gao et al., 2020)."
Researcher Affiliation: Academia. Jiachen T. Wang (Princeton University), Prateek Mittal (Princeton University), Dawn Song (UC Berkeley), Ruoxi Jia (Virginia Tech)
Pseudocode: No. The paper describes the algorithm steps in Sections 3 and 4 with mathematical formulations and textual explanations, but does not include a dedicated pseudocode or algorithm block.
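Since the paper provides no pseudocode block, the following is a minimal illustrative sketch of the core idea behind first-order In-Run Data Shapley: during a single training run, each training point's value accumulates the (learning-rate-scaled) dot product between its per-sample gradient and the validation gradient, which first-order-approximates how much that point's share of each update reduces validation loss. The toy linear-regression setting and function name here are assumptions for illustration; the paper applies the idea to GPT-2 pretraining and uses "ghost dot-product" tricks to avoid materializing per-sample gradients.

```python
import numpy as np

def in_run_shapley_first_order(X, y, X_val, y_val, lr=0.1, epochs=100):
    """Toy first-order In-Run Data Shapley for linear least squares.

    Each point's value accumulates lr * <g_i, g_val> / n, the first-order
    estimate of its contribution to reducing validation loss at each step.
    """
    n, d = X.shape
    w = np.zeros(d)
    values = np.zeros(n)
    for _ in range(epochs):
        # Per-sample gradients of squared error: g_i = (x_i . w - y_i) x_i
        residual = X @ w - y                        # shape (n,)
        per_sample_grads = residual[:, None] * X    # shape (n, d)
        # Validation gradient at the current parameters
        g_val = (X_val @ w - y_val) @ X_val / len(y_val)
        # First-order contribution of each point to validation-loss decrease
        values += lr * (per_sample_grads @ g_val) / n
        # Ordinary full-batch gradient step
        w -= lr * per_sample_grads.mean(axis=0)
    return values, w
```

A mislabeled point's gradient tends to oppose the validation gradient, so its accumulated value comes out negative, which is exactly the signal the paper's curation experiments exploit.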
Open Source Code: No. "Our codebase is adapted from https://github.com/karpathy/nanoGPT/tree/master. We first tokenize and split the entire dataset into chunks, and store them in the disk in numpy array format, which significantly speeds up data loading."
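The quoted preprocessing follows the usual nanoGPT pattern: tokenize once, dump the token ids as a flat on-disk numpy array, then memory-map it so training batches are cheap random slices. A sketch of that pattern, assuming a uint16 token file (the file name and helper names are illustrative, not from the paper's codebase):

```python
import numpy as np

def write_token_file(token_ids, path):
    """Store token ids as a flat uint16 array on disk (one-time preprocessing)."""
    np.array(token_ids, dtype=np.uint16).tofile(path)

def get_batch(path, batch_size, block_size, rng):
    """Memory-map the token file and slice out random (input, target) pairs."""
    data = np.memmap(path, dtype=np.uint16, mode="r")
    ix = rng.integers(0, len(data) - block_size - 1, size=batch_size)
    x = np.stack([data[i : i + block_size] for i in ix]).astype(np.int64)
    # Targets are the inputs shifted by one token (next-token prediction)
    y = np.stack([data[i + 1 : i + 1 + block_size] for i in ix]).astype(np.int64)
    return x, y
```

Because `np.memmap` never loads the whole file, this keeps data loading fast and memory-light even for Pile-scale corpora.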
Open Datasets: Yes. "In this section, we present a case study to demonstrate the use cases of In-Run Data Shapley by pretraining on the well-known Pile dataset (Gao et al., 2020)."
Dataset Splits: Yes. "In this experiment, we first conduct one training run for 20,000 iterations. Among all the corpus that has been used in the training, we compute their In-Run Data Shapley and filter out all corpus among this subset that has negative contributions to model training. After filtering out the 16% negative valued corpus, we train another model on the remaining dataset for 10,000 iterations with all hyperparameters staying the same. ... We use a subset of 1,000 samples from CIFAR10 and ResNet18 architecture. ... for each training run we randomly sample a size-1000 subset of CIFAR10 dataset (with 10% data points being mislabeled)."
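The curation step quoted above (drop the roughly 16% of corpora with negative In-Run Data Shapley, then retrain on the rest) reduces to a simple mask over per-example values. A hypothetical helper, not from the paper's code:

```python
import numpy as np

def filter_negative_contributors(dataset, shapley_values):
    """Keep only examples whose In-Run Data Shapley value is non-negative.

    Returns the filtered dataset plus the boolean keep-mask, so the same
    mask can be reused for logging which corpora were removed.
    """
    keep = np.asarray(shapley_values) >= 0
    kept = [ex for ex, k in zip(dataset, keep) if k]
    return kept, keep
```

Retraining then proceeds on `kept` with hyperparameters unchanged, matching the experimental protocol in the quote.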
Hardware Specification: Yes. "The experiment is conducted by training GPT2-Small on a single 80GB A100 GPU."
Software Dependencies: No. The paper mentions using GPT2 and Pythia-410M models and the AdamW optimizer, but it does not specify version numbers for Python, PyTorch, TensorFlow, or other key software libraries.
Experiment Setup: Yes. "The maximum sequence length is set to 1024. The learning rate is set at a maximum of 0.0006, with a minimum learning rate of 0.00006. We use AdamW as the optimizer with a weight decay of 0.1, and beta values set to 0.9 and 0.95. Gradients are clipped at a maximum value of 1.0 to maintain stability during training. The batch size is set to 16, with a learning rate warmup of 2000 iterations. Due to the shortage of computation resources, we stop the training at 500,000 iterations."
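The quote pins down the schedule's endpoints (max LR 6e-4, min LR 6e-5, 2,000 warmup iterations, training stopped at 500,000 iterations). A sketch of a matching schedule, assuming linear warmup followed by cosine decay — the nanoGPT default the codebase is adapted from; the decay shape is an assumption, only the endpoints come from the paper:

```python
import math

MAX_LR, MIN_LR = 6e-4, 6e-5
WARMUP_ITERS, MAX_ITERS = 2_000, 500_000

def get_lr(it):
    """Learning rate at iteration `it`: linear warmup, then cosine decay."""
    if it < WARMUP_ITERS:
        # Linear warmup from ~0 up to MAX_LR over the first 2,000 iterations
        return MAX_LR * (it + 1) / WARMUP_ITERS
    if it >= MAX_ITERS:
        return MIN_LR
    # Cosine decay from MAX_LR down to MIN_LR over the remaining iterations
    progress = (it - WARMUP_ITERS) / (MAX_ITERS - WARMUP_ITERS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```

This would typically be combined with `torch.optim.AdamW(params, lr=get_lr(it), betas=(0.9, 0.95), weight_decay=0.1)` and gradient clipping at 1.0, per the quoted setup.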