Data Pruning by Information Maximization

Authors: Haoru Tan, Sitong Wu, Wei Huang, Shizhen Zhao, Xiaojuan Qi

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the superior performance of InfoMax in various data pruning tasks, including image classification, vision-language pre-training, and instruction tuning for large language models.
Researcher Affiliation | Academia | Haoru Tan 1, Sitong Wu 2, Wei Huang 1, Shizhen Zhao 1, Xiaojuan Qi 1; 1 The University of Hong Kong, 2 The Chinese University of Hong Kong.
Pseudocode | Yes | Algorithm 1: InfoMax Coreset Selection.
Open Source Code | No | The paper does not contain any explicit statement about providing source code for the described methodology, nor does it provide a direct link to a code repository.
Open Datasets | Yes | The image classification task encompasses experiments on three datasets, namely CIFAR-10, CIFAR-100 (Krizhevsky, 2009), and ImageNet-1K (Russakovsky et al., 2015). For multi-modality pre-training tasks, we conducted experiments on the popular vision-language dataset CC12M (Changpinyo et al., 2021)... Following (Xia et al., 2024), we also conduct coreset selection on a mixed dataset containing data from FLAN-V2 (Longpre et al., 2023), COT (Wei et al., 2022), DOLLY (Conover et al., 2023), and OPEN-ASSISTANT-1 (Köpf et al., 2023).
Dataset Splits | No | The paper describes a dataset partitioning strategy for its InfoMax algorithm (dividing the original data into smaller random subsets for processing), but it does not provide specific training/validation/test splits (e.g., percentages or counts) for the benchmark datasets used for model evaluation (CIFAR, ImageNet, etc.), nor does it explicitly state that the standard splits of these benchmarks are used.
Hardware Specification | Yes | Our experiments were conducted on a server equipped with 8 Tesla V100 GPUs. For coreset selection on the vision-language dataset CC12M (Changpinyo et al., 2021), all experiments are conducted on 2 servers with a total of 16 NVIDIA V100 GPUs. This experiment is conducted on a server with 8 A100 GPUs.
Software Dependencies | No | We utilized PyTorch (Paszke et al., 2017) to implement our method. The paper mentions PyTorch but does not specify a version number for it or for any other key software component.
Experiment Setup | Yes | For CIFAR-100, we utilize the SGD optimizer with weight decay set to 5e-4, a learning rate of 0.1, and a batch size of 128. For Tiny-ImageNet, we use the SGD optimizer with weight decay set to 5e-4, a learning rate of 0.3, and a batch size of 64. For ImageNet-1K, we use the SGD optimizer with weight decay set to 1e-4, warmup for 5 epochs, a learning rate of 0.4, and a batch size of 256. Regarding data augmentation, we solely adopt Random Resized Crop and Random Horizontal Flip for all experiments. Specifically, the CLIP model (Radford et al., 2021) is trained for 32 epochs with the AdamW optimizer, weight decay 0.2, and a batch size of 2048. After 1 warmup epoch, the learning rate gradually decreases from 1e-4 following the cosine strategy. The specific settings for LoRA fine-tuning are as follows: the LoRA rank is 64, bf16 precision is used, the number of epochs is 4, the LoRA target modules include q-proj, k-proj, v-proj, and o-proj, the learning rate is 1e-5, the batch size is 8, the gradient accumulation steps is 16, and the AdamW optimizer is used.
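For reference, the hyperparameters quoted in the Experiment Setup row can be collected into plain config dictionaries. This is a minimal sketch for anyone re-running the setups: the dictionary keys and names (TRAIN_CONFIGS, grad_accum_steps, etc.) are our own labels, not identifiers from the paper's code, and any value not quoted above is omitted rather than guessed.

```python
# Sketch of the per-task training settings quoted in the paper's setup
# description. Key names are illustrative; values come from the excerpt above.
TRAIN_CONFIGS = {
    "cifar100": {
        "optimizer": "SGD", "weight_decay": 5e-4, "lr": 0.1, "batch_size": 128,
    },
    "tiny_imagenet": {
        "optimizer": "SGD", "weight_decay": 5e-4, "lr": 0.3, "batch_size": 64,
    },
    "imagenet1k": {
        "optimizer": "SGD", "weight_decay": 1e-4, "lr": 0.4, "batch_size": 256,
        "warmup_epochs": 5,
    },
    "clip_cc12m": {
        "optimizer": "AdamW", "weight_decay": 0.2, "peak_lr": 1e-4,
        "batch_size": 2048, "epochs": 32, "warmup_epochs": 1,
        "lr_schedule": "cosine",
    },
    "lora_instruction_tuning": {
        "optimizer": "AdamW", "lora_rank": 64, "precision": "bf16",
        "epochs": 4, "lr": 1e-5, "batch_size": 8, "grad_accum_steps": 16,
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    },
}

# Effective batch size for the LoRA runs, assuming the usual meaning of
# gradient accumulation (micro-batch size x accumulation steps).
lora = TRAIN_CONFIGS["lora_instruction_tuning"]
effective_batch = lora["batch_size"] * lora["grad_accum_steps"]  # 8 * 16 = 128
```

The image-classification rows also share the augmentation pipeline quoted above (Random Resized Crop plus Random Horizontal Flip), which is left out of the dictionaries since it is common to all three.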