Data Pruning by Information Maximization

Authors: Haoru Tan, Sitong Wu, Wei Huang, Shizhen Zhao, Xiaojuan Qi

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the superior performance of InfoMax in various data pruning tasks, including image classification, vision-language pre-training, and instruction tuning for large language models.
Researcher Affiliation | Academia | Haoru Tan 1, Sitong Wu 2, Wei Huang 1, Shizhen Zhao 1, Xiaojuan Qi 1; 1 The University of Hong Kong, 2 The Chinese University of Hong Kong.
Pseudocode | Yes | Algorithm 1: InfoMax Coreset Selection.
Open Source Code | No | The paper does not contain any explicit statement about providing source code for the described methodology, nor does it provide a direct link to a code repository.
Open Datasets | Yes | The image classification task encompasses experiments on three datasets, namely CIFAR-10, CIFAR-100 (Krizhevsky, 2009), and ImageNet-1K (Russakovsky et al., 2015). For multi-modality pre-training tasks, we conducted experiments on the popular vision-language dataset CC12M (Changpinyo et al., 2021)... Following (Xia et al., 2024), we also conduct coreset selection on a mixed dataset containing data from FLAN-V2 (Longpre et al., 2023), COT (Wei et al., 2022), DOLLY (Conover et al., 2023), and OPEN-ASSISTANT-1 (Köpf et al., 2023).
Dataset Splits | No | The paper describes a dataset partitioning strategy for its InfoMax algorithm (dividing the original data into smaller random subsets for processing), but it does not provide specific training/validation/test splits (e.g., percentages or counts) for the benchmark datasets used for model evaluation (CIFAR, ImageNet, etc.), nor does it explicitly state that the standard splits of these benchmarks are used.
Hardware Specification | Yes | Our experiments were conducted on a server equipped with 8 Tesla V100 GPUs. For coreset selection on the vision-language dataset CC12M (Changpinyo et al., 2021), all experiments are conducted on 2 servers with a total of 16 NVIDIA V100 GPUs. This experiment is conducted on a server with 8 A100 GPUs.
Software Dependencies | No | We utilized PyTorch (Paszke et al., 2017) to implement our method. The paper mentions PyTorch but does not specify a version number for it or for any other key software component.
Experiment Setup | Yes | For CIFAR-100, we utilize the SGD optimizer with weight decay set to 5e-4, a learning rate of 0.1, and a batch size of 128. For Tiny-ImageNet, we use the SGD optimizer with weight decay set to 5e-4, a learning rate of 0.3, and a batch size of 64. For ImageNet-1K, we use the SGD optimizer with weight decay set to 1e-4, warmup for 5 epochs, a learning rate of 0.4, and a batch size of 256. Regarding data augmentation, we solely adopt Random Resized Crop and Random Horizontal Flip for all experiments. Specifically, the CLIP model (Radford et al., 2021) is trained for 32 epochs with the AdamW optimizer, weight decay 0.2, and a batch size of 2048. After 1 warmup epoch, the learning rate gradually decreases from 1e-4 following the cosine strategy. The specific settings for LoRA fine-tuning are as follows: the LoRA rank is 64, bf16 precision is used, the number of epochs is 4, the LoRA target modules include q-proj, k-proj, v-proj, and o-proj, the learning rate is 1e-5, the batch size is 8, the gradient accumulation steps is 16, and the AdamW optimizer is used.
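For reference, the hyperparameters quoted in the Experiment Setup row can be collected into plain config dictionaries. This is a minimal sketch for anyone re-running the setups: the dictionary keys and names (TRAIN_CONFIGS, grad_accum_steps, etc.) are our own labels, not identifiers from the paper's code, and any value not quoted above is omitted rather than guessed.

```python
# Sketch of the per-task training settings quoted in the paper's setup
# description. Key names are illustrative; values come from the excerpt above.
TRAIN_CONFIGS = {
    "cifar100": {
        "optimizer": "SGD", "weight_decay": 5e-4, "lr": 0.1, "batch_size": 128,
    },
    "tiny_imagenet": {
        "optimizer": "SGD", "weight_decay": 5e-4, "lr": 0.3, "batch_size": 64,
    },
    "imagenet1k": {
        "optimizer": "SGD", "weight_decay": 1e-4, "lr": 0.4, "batch_size": 256,
        "warmup_epochs": 5,
    },
    "clip_cc12m": {
        "optimizer": "AdamW", "weight_decay": 0.2, "peak_lr": 1e-4,
        "batch_size": 2048, "epochs": 32, "warmup_epochs": 1,
        "lr_schedule": "cosine",
    },
    "lora_instruction_tuning": {
        "optimizer": "AdamW", "lora_rank": 64, "precision": "bf16",
        "epochs": 4, "lr": 1e-5, "batch_size": 8, "grad_accum_steps": 16,
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    },
}

# Effective batch size for the LoRA runs, assuming the usual meaning of
# gradient accumulation (micro-batch size x accumulation steps).
lora = TRAIN_CONFIGS["lora_instruction_tuning"]
effective_batch = lora["batch_size"] * lora["grad_accum_steps"]  # 8 * 16 = 128
```

The image-classification rows also share the augmentation pipeline quoted above (Random Resized Crop plus Random Horizontal Flip), which is left out of the dictionaries since it is common to all three.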