reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

SAVA: Scalable Learning-Agnostic Data Valuation

Authors: Samuel Kessler, Tam Le, Vu Nguyen

ICLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We perform extensive experiments to demonstrate that SAVA can scale to large datasets with millions of data points and does not trade off data valuation performance.
Researcher Affiliation	Collaboration	Samuel Kessler (Microsoft); Tam Le (The Institute of Statistical Mathematics / RIKEN AIP); Vu Nguyen (Amazon)
Pseudocode	Yes	Algorithm 1: Scalable Data Valuation (SAVA) algorithm. More concretely, in Lines 1–5, we solve multiple OT problems between batches. In Line 6, we solve the OT problem across batches, $\mathrm{OT}_C(\mu_t, \mu_v)$, to obtain $\pi(\mu_t, \mu_v)$. In Lines 7–10, we estimate valuation scores for training data using the plan $\pi(\mu_t, \mu_v)$ and the potentials $f(\mu_{B_i}, \mu_{B_j})$ computed in the previous steps.
Open Source Code	Yes	Our code is available at https://github.com/skezle/sava.
Open Datasets	Yes	We test the scalability of SAVA versus LAVA (Just et al., 2023) by leveraging the CIFAR10 dataset, introducing a corruption to a percentage of the training data, but keeping the validation set clean. We consider the web-scraped dataset Clothing1M (Xiao et al., 2015) where the training set has over 1M images whose labels are noisy and unreliable.
Dataset Splits	Yes	We sort training examples by the highest OT gradients in Eq. (6) and Eq. (12) for LAVA and SAVA respectively, and use the fraction of corrupted data recovered for a prefix of size N/4 as the detection rate (where N is the training set size). We split the CIFAR10 dataset into 5 equally sized partitions with all classes, and we incrementally build up the dataset such that it grows in size as one would train a production system.
Hardware Specification	Yes	All experiments are run on Nvidia Tesla K80 GPUs with 12GB of GPU RAM. We use a single Nvidia K80 GPU to run all experiments. For all experiments, we use a node with 8 Nvidia K80 GPUs.
Software Dependencies	No	The paper mentions using a pre-trained ResNet18 model and various optimizers (SGD, Adam) but does not provide specific version numbers for software libraries, programming languages, or operating systems used for implementation.
Experiment Setup	Yes	For the pruning experiments we greedily prune N/4 of the ranked points with the lowest value, i.e., the highest OT gradient for SAVA and LAVA. We then train a ResNet18 with the SGD optimizer with weight decay of 5 * 10^-4 and momentum of 0.9 for 200 epochs, with a learning rate schedule where the learning rate is 0.1 for the first 100 epochs, 0.01 for the next 50 epochs, and 0.001 for the final 50 epochs. We use an Adam optimizer with a weight decay of 0.002. Since the pruned datasets can be of different sizes depending on the amount of pruning, we train for a fixed number of 100k steps. We use a learning rate schedule where the learning rate is 0.1 for the first 30k steps, 0.05 for the next 30k steps, 0.01 for the next 20k steps, 0.001 for the next 10k steps, 0.0001 for the next 5k steps, and 0.00001 for the final 5k steps. We use a batch size of 2048 for valuation and we use label-2-label matrix caching (Appendix H).
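
The Dataset Splits and Experiment Setup rows describe two simple calculations: the detection rate over a prefix of size N/4, and piecewise-constant learning rate schedules. The following is a minimal Python sketch of both, not the authors' implementation. The function names (detection_rate, piecewise_constant_lr) and the toy scores are hypothetical; only the schedule values and the N/4 prefix come from the quoted text.

```python
from bisect import bisect_right
from itertools import accumulate


def detection_rate(scores, is_corrupted):
    """Fraction of corrupted points found in the top N/4 by score.

    `scores` stands in for the OT gradients (higher = more suspicious);
    the name and the list-based interface are assumptions of this sketch.
    """
    n = len(scores)
    total = sum(is_corrupted)
    if total == 0:
        raise ValueError("no corrupted points to detect")
    # Rank indices from highest to lowest score and keep a prefix of size N/4.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    found = sum(is_corrupted[i] for i in ranked[: n // 4])
    return found / total


def piecewise_constant_lr(step, durations, rates):
    """Learning rate for a 0-indexed step (or epoch) of a phased schedule."""
    if len(durations) != len(rates):
        raise ValueError("durations and rates must have the same length")
    boundaries = list(accumulate(durations))  # cumulative end of each phase
    if not 0 <= step < boundaries[-1]:
        raise ValueError("step is outside the schedule")
    return rates[bisect_right(boundaries, step)]


# Epoch-based schedule quoted for the SGD runs (200 epochs in total).
EPOCH_DURATIONS = [100, 50, 50]
EPOCH_RATES = [0.1, 0.01, 0.001]

# Step-based schedule quoted for the fixed 100k-step runs.
STEP_DURATIONS = [30_000, 30_000, 20_000, 10_000, 5_000, 5_000]
STEP_RATES = [0.1, 0.05, 0.01, 0.001, 0.0001, 0.00001]

if __name__ == "__main__":
    # Toy data (made up for illustration): 8 points, 3 of them corrupted.
    toy_scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.4, 0.05, 0.7]
    toy_flags = [True, False, True, False, False, False, False, True]
    print(detection_rate(toy_scores, toy_flags))
    print(piecewise_constant_lr(120, EPOCH_DURATIONS, EPOCH_RATES))
    print(piecewise_constant_lr(95_000, STEP_DURATIONS, STEP_RATES))
```

The step-based durations sum to 100k, matching the fixed step budget in the quoted setup. Keeping the schedule as plain lists of durations and rates makes it easy to compare against whatever scheduler the released code actually uses.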