Data Shapley in One Training Run
Authors: Jiachen T. Wang, Prateek Mittal, Dawn Song, Ruoxi Jia
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present several case studies that offer fresh insights into pretraining data's contribution and discuss their implications for copyright in generative AI and pretraining data curation. ... We performed various case studies that provide fresh insights into training data's contribution to the foundation model pretraining. ... We empirically assess the computational efficiency of In-Run Data Shapley with "ghost dot-product" and "ghost vector-Hessian-vector product" techniques developed in Section 4.2. ... In this section, we directly assess the approximation accuracy of first- and second-order In-Run Data Shapley. ... In this section, we present a case study to demonstrate the use cases of In-Run Data Shapley by pretraining on the well-known Pile dataset (Gao et al., 2020). |
| Researcher Affiliation | Academia | Jiachen T. Wang Princeton University Prateek Mittal Princeton University Dawn Song UC Berkeley Ruoxi Jia Virginia Tech |
| Pseudocode | No | The paper describes the algorithm steps in Section 3 and 4 with mathematical formulations and textual explanations but does not include a dedicated pseudocode or algorithm block. |
| Open Source Code | No | Our codebase is adapted from https://github.com/karpathy/nanoGPT/tree/master. We first tokenize and split the entire dataset into chunks, and store them in the disk in numpy array format, which significantly speeds up data loading. |
| Open Datasets | Yes | In this section, we present a case study to demonstrate the use cases of In-Run Data Shapley by pretraining on the well-known Pile dataset (Gao et al., 2020). |
| Dataset Splits | Yes | In this experiment, we first conduct one training run for 20,000 iterations. Among all the corpus that has been used in the training, we compute their In-Run Data Shapley and filter out all corpus among this subset that has negative contributions to model training. After filtering out the 16% negative valued corpus, we train another model on the remaining dataset for 10,000 iterations with all hyperparameters staying the same. ... We use a subset of 1,000 samples from CIFAR10 and ResNet18 architecture. ... for each training run we randomly sample a size-1000 subset of CIFAR10 dataset (with 10% data points being mislabeled). |
| Hardware Specification | Yes | The experiment is conducted by training GPT2-Small on a single 80GB A100 GPU. |
| Software Dependencies | No | The paper mentions using GPT2 and Pythia-410M models and AdamW optimizer, but it does not specify version numbers for Python, PyTorch, TensorFlow, or other key software libraries. |
| Experiment Setup | Yes | The maximum sequence length is set to 1024. The learning rate is set at a maximum of 0.0006, with a minimum learning rate of 0.00006. We use AdamW as the optimizer with a weight decay of 0.1, and beta values set to 0.9 and 0.95. Gradients are clipped at a maximum value of 1.0 to maintain stability during training. The batch size is set to 16, with a learning rate warmup of 2000 iterations. Due to the shortage of computation resources, we stop the training at 500,000 iterations. |
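The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. Note the decay shape of the learning-rate schedule is an assumption: the paper specifies only the maximum/minimum rates and the warmup length, so the warmup-plus-cosine-decay helper below follows the convention of nanoGPT, which the paper's codebase is adapted from; all names are illustrative, not taken from that codebase.

```python
import math

# Hyperparameters quoted from the paper's experiment setup.
MAX_SEQ_LEN = 1024
BATCH_SIZE = 16
MAX_LR = 6e-4          # maximum learning rate 0.0006
MIN_LR = 6e-5          # minimum learning rate 0.00006
WARMUP_ITERS = 2000    # learning-rate warmup of 2000 iterations
TOTAL_ITERS = 500_000  # training stopped at 500,000 iterations
ADAMW_KWARGS = {"weight_decay": 0.1, "betas": (0.9, 0.95)}
GRAD_CLIP = 1.0        # gradients clipped at a maximum value of 1.0

def lr_at(it: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR.

    The cosine shape is an assumption based on nanoGPT's default schedule,
    not a detail stated in the paper.
    """
    if it < WARMUP_ITERS:
        return MAX_LR * (it + 1) / WARMUP_ITERS
    frac = min((it - WARMUP_ITERS) / (TOTAL_ITERS - WARMUP_ITERS), 1.0)
    return MIN_LR + 0.5 * (1.0 + math.cos(math.pi * frac)) * (MAX_LR - MIN_LR)
```

Under this sketch, the schedule reaches `MAX_LR` at iteration 2000 and decays to `MIN_LR` by the final iteration.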