DRoP: Distributionally Robust Data Pruning
Authors: Artem Vysogorets, Kartik Ahuja, Julia Kempe
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct the first systematic study of this effect and reveal that existing data pruning algorithms can produce highly biased classifiers. We present theoretical analysis... We thus propose DRoP, a distributionally robust approach to pruning and empirically demonstrate its performance on standard computer vision benchmarks. |
| Researcher Affiliation | Collaboration | Artem Vysogorets (Data Science Platform, Rockefeller University); Kartik Ahuja (Meta FAIR); Julia Kempe (New York University; Meta FAIR) |
| Pseudocode | Yes | Algorithm 1 (DRoP). Input: target dataset density d ∈ [0, 1]; for each class k ∈ [K], original size N_k and validation recall r_k ∈ [0, 1]. Initialize: unsaturated class set U ← [K], excess E ← d·N, class densities d_k ← 0 for all k ∈ [K]. While E > 0: Z ← Σ_{k∈U} N_k(1 − r_k); for each k ∈ U: δ_k ← E(1 − r_k)/Z; d_k ← d_k + δ_k; E ← E − N_k·δ_k; if d_k > 1 then U ← U \ {k}, E ← E + N_k(d_k − 1), d_k ← 1. Return {d_k}_{k=1}^{K}. |
| Open Source Code | Yes | We make our code available at https://github.com/avysogorets/drop-data-pruning. |
| Open Datasets | Yes | Our empirical work encompasses three standard computer vision benchmarks (Table 1). ... VGG-16 and VGG-19 (Simonyan & Zisserman, 2015) on CIFAR-10 and CIFAR-100 (Krizhevsky, 2009), respectively, ResNet-18 (He et al., 2016) on Tiny ImageNet (MIT License) (Le & Yang, 2015), ImageNet-pretrained ResNet-50 on Waterbirds (Sagawa* et al., 2020) (MIT License), and ResNet-50 on ImageNet (Deng et al., 2009). |
| Dataset Splits | Yes | Since some of the methods require a hold-out validation set (e.g., DRoP, CDB-W), we reserve 50% of the test set for this purpose. This split is never used when reporting the final model performance. (Also implied by usage of standard benchmark datasets like CIFAR-10, ImageNet). |
| Hardware Specification | Yes | All code is implemented in PyTorch (Paszke et al., 2017) and run on an internal cluster equipped with NVIDIA RTX8000 GPUs. |
| Software Dependencies | No | All code is implemented in PyTorch (Paszke et al., 2017) and run on an internal cluster equipped with NVIDIA RTX8000 GPUs. This mentions PyTorch but does not provide a specific version number (e.g., PyTorch 1.x or 2.x). |
| Experiment Setup | Yes | Table 1: Summary of experimental work and hyperparameters. All architectures include batch normalization (Ioffe & Szegedy, 2015) layers followed by ReLU activations. Models are initialized with Kaiming normal (He et al., 2015) and optimized by SGD (momentum 0.9) with a stepwise LR schedule (0.2 drop factor applied on specified Drop Epochs) and categorical cross-entropy. |
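The pseudocode quoted in the table is a water-filling allocation: it spreads the sample budget d·N across classes in proportion to their validation error 1 − r_k, capping each class at its full size and redistributing any overshoot. A minimal Python sketch of that allocation (the function name, the tolerance, and the zero-denominator guard are our additions, not part of the paper):

```python
def drop_densities(d, sizes, recalls, tol=1e-9):
    """Sketch of DRoP's per-class density allocation (Algorithm 1).

    d       -- target overall dataset density in [0, 1]
    sizes   -- N_k: original number of samples in each class
    recalls -- r_k: validation recall of each class in [0, 1]
    Returns per-class densities d_k with sum_k N_k * d_k == d * sum_k N_k.
    """
    K = len(sizes)
    densities = [0.0] * K           # d_k, all start at 0
    unsaturated = set(range(K))     # U: classes not yet at density 1
    excess = d * sum(sizes)         # E: remaining sample budget
    while excess > tol and unsaturated:
        # Z normalizes the error-proportional shares over unsaturated classes.
        shares = {k: sizes[k] * (1.0 - recalls[k]) for k in unsaturated}
        Z = sum(shares.values())
        if Z <= tol:                # guard (assumption): all remaining r_k ~ 1
            shares = {k: float(sizes[k]) for k in unsaturated}
            Z = sum(shares.values())
        budget = excess             # allocate this pass from a snapshot of E
        for k in list(unsaturated):
            delta = budget * shares[k] / (Z * sizes[k])  # density increment
            densities[k] += delta
            excess -= sizes[k] * delta
            if densities[k] > 1.0:  # class saturated: cap it, refund overshoot
                unsaturated.remove(k)
                excess += sizes[k] * (densities[k] - 1.0)
                densities[k] = 1.0
    return densities
```

Classes with lower recall keep proportionally more of their data. For example, with two classes of 100 samples each, recalls 0.9 and 0.5, and d = 0.5, the allocation is (1/6, 5/6), so the harder class retains five times as many samples while the total kept equals the 100-sample budget.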
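The stepwise LR schedule described in the Experiment Setup row multiplies the learning rate by 0.2 at each listed drop epoch (in PyTorch this corresponds to `torch.optim.lr_scheduler.MultiStepLR` with `gamma=0.2`). A tiny sketch of the resulting rule, with an illustrative base LR and drop epochs (the paper's Table 1 specifies the actual values per benchmark):

```python
def stepwise_lr(base_lr, epoch, drop_epochs, gamma=0.2):
    """Learning rate after applying the 0.2 drop factor at each passed drop epoch."""
    n_drops = sum(1 for e in drop_epochs if epoch >= e)
    return base_lr * gamma ** n_drops
```

With a base LR of 0.1 and drops at epochs 60 and 120, the schedule runs at 0.1 until epoch 60, 0.02 until epoch 120, and 0.004 afterwards.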