SAVA: Scalable Learning-Agnostic Data Valuation
Authors: Samuel Kessler, Tam Le, Vu Nguyen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive experiments to demonstrate that SAVA can scale to large datasets with millions of data points and does not trade off data valuation performance. |
| Researcher Affiliation | Collaboration | Samuel Kessler (Microsoft); Tam Le (The Institute of Statistical Mathematics / RIKEN AIP); Vu Nguyen (Amazon) |
| Pseudocode | Yes | Algorithm 1: Scalable Data Valuation (SAVA) algorithm. More concretely, in Lines 1–5, we solve multiple OT problems between batches. In Line 6, we solve the OT problem across batches, $\mathrm{OT}_C(\mu_t, \mu_v)$, to obtain $\pi(\mu_t, \mu_v)$. In Lines 7–10, we estimate valuation scores for training data using the plan $\pi(\mu_t, \mu_v)$ and the potentials $f(\mu_{B_i}, \mu_{B_j})$ computed in the previous steps. |
| Open Source Code | Yes | Our code is available at https://github.com/skezle/sava. |
| Open Datasets | Yes | We test the scalability of SAVA versus LAVA (Just et al., 2023) by leveraging the CIFAR10 dataset, introducing a corruption to a percentage of the training data, but keeping the validation set clean. We consider the web-scraped dataset Clothing1M (Xiao et al., 2015), where the training set has over 1M images whose labels are noisy and unreliable. |
| Dataset Splits | Yes | We sort training examples by the highest OT gradients in Eq. (6) and Eq. (12) for LAVA and SAVA respectively, and use the fraction of corrupted data recovered for a prefix of size N/4 as the detection rate (where N is the training set size). We split the CIFAR10 dataset into 5 equally sized partitions with all classes, and we incrementally build up the dataset such that it grows in size as one would train a production system. |
| Hardware Specification | Yes | All experiments are run on Nvidia Tesla K80 GPUs with 12GB of GPU RAM. We use a single Nvidia K80 GPU to run all experiments. For all experiments, we use a node with 8 Nvidia K80 GPUs. |
| Software Dependencies | No | The paper mentions using a pre-trained ResNet18 model and various optimizers (SGD, Adam) but does not provide specific version numbers for software libraries, programming languages, or operating systems used for implementation. |
| Experiment Setup | Yes | For the pruning experiments we greedily prune N/4 of the ranked points with the lowest value, i.e. the highest OT gradient for SAVA and LAVA. We then train a ResNet18 with the SGD optimizer with a weight decay of $5 \times 10^{-4}$ and momentum of 0.9 for 200 epochs, with a learning rate schedule where the learning rate is 0.1 for the first 100 epochs, 0.01 for the next 50 epochs, and 0.001 for the final 50 epochs. We use an Adam optimizer with a weight decay of 0.002. Since the pruned datasets can be of different sizes depending on the amount of pruning, we train for a fixed number of 100k steps. We use a learning rate schedule where the learning rate is 0.1 for the first 30k steps, 0.05 for the next 30k steps, 0.01 for the next 20k steps, 0.001 for the next 10k steps, 0.0001 for the next 5k steps, and 0.00001 for the final 5k steps. We use a batch size of 2048 for valuation and we use label-2-label matrix caching (Appendix H). |
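
The Dataset Splits and Experiment Setup rows describe three small procedures: ranking by OT gradient and measuring the detection rate on a prefix of size N/4, pruning that same prefix, and two piecewise-constant learning rate schedules. The following is a minimal sketch of those procedures in plain Python, not the code from the SAVA repository. All function and constant names (`detection_rate`, `prune_highest_gradient`, `epoch_lr`, `step_lr`) are hypothetical, and the `scores` list is assumed to hold one OT-gradient value per training example, computed elsewhere.

```python
from bisect import bisect_right


def _rank_descending(scores):
    """Indices of `scores` sorted from highest to lowest OT gradient."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def detection_rate(scores, is_corrupted, fraction=0.25):
    """Share of all corrupted examples found in the top-`fraction` prefix.

    Assumes a higher score (OT gradient) marks a lower-value example,
    as described in the table above.
    """
    prefix = int(len(scores) * fraction)
    flagged = _rank_descending(scores)[:prefix]
    found = sum(1 for i in flagged if is_corrupted[i])
    total = sum(1 for flag in is_corrupted if flag)
    return found / total if total else 0.0


def prune_highest_gradient(scores, fraction=0.25):
    """Indices kept after dropping the `fraction` of examples with the highest score."""
    prefix = int(len(scores) * fraction)
    return sorted(_rank_descending(scores)[prefix:])


# Epoch-based schedule (200 epochs): 0.1 for 100, 0.01 for 50, 0.001 for 50.
EPOCH_BOUNDARIES = [100, 150]
EPOCH_RATES = [0.1, 0.01, 0.001]

# Step-based schedule (100k steps): 30k, 30k, 20k, 10k, 5k, 5k steps per rate.
STEP_BOUNDARIES = [30_000, 60_000, 80_000, 90_000, 95_000]
STEP_RATES = [0.1, 0.05, 0.01, 0.001, 0.0001, 0.00001]


def epoch_lr(epoch):
    """Learning rate for a zero-indexed epoch under the 200-epoch schedule."""
    return EPOCH_RATES[bisect_right(EPOCH_BOUNDARIES, epoch)]


def step_lr(step):
    """Learning rate for a zero-indexed step under the 100k-step schedule."""
    return STEP_RATES[bisect_right(STEP_BOUNDARIES, step)]
```

With eight examples, `fraction=0.25` inspects the two highest-scoring points; if one of two corrupted examples is among them, `detection_rate` returns 0.5. The schedules are written as lookup tables so the boundaries can be checked directly against the step counts quoted in the table.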