Capturing the Temporal Dependence of Training Data Influence

Authors: Jiachen (Tianhao) Wang, Dawn Song, James Y. Zou, Prateek Mittal, Ruoxi Jia

ICLR 2025

Reproducibility assessment (each variable, its result, and the supporting LLM response):
Research Type: Experimental. "In this section, we evaluate the effectiveness of our proposed data value embedding method. First, we assess its fidelity in accurately reflecting data importance using small-scale experimental setups (Section 5.1), as well as its computational efficiency (Section 5.2). We then apply data value embedding to analyze the training dynamics during foundation model pretraining (Section 5.3 and Appendix E.4). The implementation details and additional results are deferred to Appendix E."
Researcher Affiliation: Academia. "Jiachen T. Wang (Princeton University), Dawn Song (UC Berkeley), James Zou (Stanford University), Prateek Mittal (Princeton University), Ruoxi Jia (Virginia Tech). Correspondence to Jiachen T. Wang and Ruoxi Jia (EMAIL, EMAIL)."
Pseudocode: Yes. "Algorithm 1: Backpropagation for computing data value embedding from the final checkpoint. Algorithm 2: Parallel Influence Checkpointing for Data Value Embedding."
Open Source Code: No. "The paper does not contain any explicit statements about the release of source code or links to a code repository for its methodology."
Open Datasets: Yes. "We conduct our experiments on MNIST (LeCun et al., 1989)... For Pythia-410M trained on 1% of the Pile dataset... with Pythia-410M trained on 1% of the Pile dataset as an example... For both settings, the sequence length is set to 1024. The learning rate is set at a maximum of 3 × 10^-4. We use AdamW as the optimizer with a weight decay of 0.1, and beta values set to 0.9 and 0.95. Gradients are clipped at a maximum value of 1.0 to maintain stability during training. The batch size is set to 16, with a learning rate warmup of 2000 iterations followed by cosine decay."
Dataset Splits: No. "The paper mentions training Pythia-410M on "1% of Pile" and using "a subset of 1,000 samples from CIFAR-10 dataset" with "10% random label noise". It also discusses using a "validation batch sampled from Pile". However, it does not provide specific, reproducible training, validation, and test splits (e.g., percentages, exact counts, or specific predefined splits) for the datasets used in its experiments."
Hardware Specification: Yes. "The experiment is conducted on one A100 GPU with 80GB VRAM."
Software Dependencies: No. "The paper mentions using "standard SGD" and "AdamW as the optimizer" but does not specify software versions for programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries with version numbers."
Experiment Setup: Yes. "To validate the effectiveness of our proposed data value embedding algorithm, we assess its accuracy in approximating TSLOO scores... We conduct our experiments on MNIST (LeCun et al., 1989) using a small MLP trained with standard SGD. We consider two settings: (1) single-epoch removal, where a data point is excluded from training during a single epoch but remains in the other training epochs. Here, we remove the data point from the last epoch. (2) All-epoch removal, where a data point is excluded in all epochs. In this case, the approximation provided by data value embedding is obtained by summing the data value embeddings of the data point from all epochs, as discussed in Appendix C.10."
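As a concrete reading of the optimizer settings quoted under Open Datasets, the sketch below reproduces the learning-rate schedule (2000-iteration linear warmup to a peak of 3 × 10^-4, then cosine decay). The total iteration count is an illustrative assumption, not a value reported in the excerpt.

```python
import math

# Reported hyperparameters from the quoted setup: peak LR 3e-4,
# 2000-iteration linear warmup, then cosine decay.
PEAK_LR = 3e-4
WARMUP_ITERS = 2000

def lr_at(step, total_iters=10000):
    """Learning rate at a given training step (linear warmup + cosine decay).

    total_iters is a placeholder: the excerpt does not state the total
    number of training iterations.
    """
    if step < WARMUP_ITERS:
        return PEAK_LR * step / WARMUP_ITERS
    progress = (step - WARMUP_ITERS) / max(1, total_iters - WARMUP_ITERS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

print(lr_at(1000))   # halfway through warmup, ≈ 1.5e-4
print(lr_at(2000))   # peak, ≈ 3e-4
print(lr_at(10000))  # end of cosine decay, ≈ 0.0
```

In a PyTorch training loop this curve would typically be applied via a LambdaLR-style scheduler on an AdamW optimizer (weight decay 0.1, betas 0.9/0.95), with gradients clipped at norm 1.0 before each step, matching the quoted configuration.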
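The two removal settings described in the Experiment Setup row can be sketched with hypothetical per-epoch influence scores; the data layout and function names here are illustrative assumptions, not the paper's actual interface.

```python
# epoch_influence[e][i]: influence score that data point i accrues from
# epoch e (hypothetical values, illustrating the aggregation only).

def single_epoch_influence(epoch_influence, i, epoch=-1):
    """Setting (1): the point is excluded from one epoch only (here, the last)."""
    return epoch_influence[epoch][i]

def all_epoch_influence(epoch_influence, i):
    """Setting (2): all-epoch removal is approximated by summing the point's
    per-epoch scores (cf. Appendix C.10 of the paper)."""
    return sum(scores[i] for scores in epoch_influence)

# Three epochs, two data points:
epoch_influence = [[0.10, -0.02], [0.05, 0.01], [0.02, 0.00]]
print(single_epoch_influence(epoch_influence, 0))  # 0.02 (last epoch only)
print(all_epoch_influence(epoch_influence, 0))     # ≈ 0.17
```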