Efficient Top-m Data Values Identification for Data Selection

Authors: Xiaoqiang Lin, Xinyi Xu, See-Kiong Ng, Bryan Kian Hsiang Low

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate that GPGapE outperforms other baselines in top-m data values identification, noisy data detection, and data subset selection on real-world datasets. We also demonstrate the efficiency of GPGapE in data selection for large language model fine-tuning. We perform experiments on top-m data values identification with closed-form SV. We use the MNIST dataset with 10k data points in the training dataset and 10k data points in the validation dataset. We perform noisy data detection on MNIST, Fashion MNIST, and CIFAR10. We perform experiments using top-m data values to select a data subset of size m, following the setting in Sec. 5.2. We use the TYDIQA dataset (Clark et al., 2020) and the LLAMA-2-7B model (Touvron et al., 2023). Figures 2–8 and Tables 1–3 present empirical results.
Researcher Affiliation | Academia | 1 Department of Computer Science, National University of Singapore, Singapore; 2 Institute of Data Science, National University of Singapore, Singapore. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | Yes | The pseudo-code for GPGapE is given in Algorithm 1 ("GPGapE for top-m data values identification").
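The paper's Algorithm 1 is not reproduced in this summary. For intuition only, a generic gap-based (GapE-style) top-m identification loop is sketched below; the `pull` oracle, the exploration constant `b`, and the index formula are illustrative assumptions, not the authors' GPGapE.

```python
import numpy as np

def topm_gape(pull, n_arms, m, budget, b=1.0):
    """Generic GapE-style top-m identification sketch (not the paper's GPGapE).

    `pull(i)` returns a noisy sample of arm i's value, e.g. a stochastic
    data-value estimate. Arms whose empirical gap to the top-m boundary is
    small, or whose estimate is uncertain, get sampled more often.
    """
    counts = np.zeros(n_arms, dtype=int)
    means = np.zeros(n_arms)
    # Initialise with one pull per arm.
    for i in range(n_arms):
        means[i] = pull(i)
        counts[i] = 1
    for _ in range(budget - n_arms):
        order = np.argsort(-means)            # arms sorted by descending mean
        top, rest = order[:m], order[m:]
        # Empirical gap to the boundary between the top-m and the rest.
        gaps = np.empty(n_arms)
        gaps[top] = means[top] - means[rest[0]]   # vs. the (m+1)-th best
        gaps[rest] = means[top[-1]] - means[rest]  # vs. the m-th best
        # GapE index: small gap + high uncertainty => explore more.
        index = -gaps + b * np.sqrt(1.0 / counts)
        i = int(np.argmax(index))
        x = pull(i)
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]    # running-mean update
    return set(np.argsort(-means)[:m].tolist())
```

On a toy instance with well-separated arm values, the loop recovers the true top-m set once the budget is a few hundred pulls per arm.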
Open Source Code | Yes | Our code is available at https://github.com/xqlin98/data-selection-efficient-topm
Open Datasets | Yes | We use the MNIST dataset with 10k data points... We perform noisy data detection on MNIST, Fashion MNIST, and CIFAR10. We use the TYDIQA dataset (Clark et al., 2020)... A.1 LICENSE FOR DATASETS: MNIST (Le Cun et al., 1990): Attribution-Share Alike 3.0 License; CIFAR10 (Krizhevsky, 2009): MIT License; Fashion MNIST (Xiao et al., 2017): MIT License.
Dataset Splits | Yes | We use the MNIST dataset with 10k data points in the training dataset and 10k data points in the validation dataset. We select 3k data points from each dataset to perform top-m identification. We randomly sample 1k data points as the validation dataset to further accelerate the utility evaluation (i.e., computation of model accuracy).
Hardware Specification | Yes | A.2 COMPUTATIONAL RESOURCES: Experiments are run on a server with an AMD EPYC 7763 64-core processor, 1008GB RAM, and 8 NVIDIA L40 GPUs.
Software Dependencies | No | We use the Adam optimizer (Kingma & Ba, 2014) for all NN training. We use the RBF kernel for GP... we use Sentence-BERT (Reimers & Gurevych, 2019) as the embedding model... We use the model from https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
Experiment Setup | Yes | For training the MLP on MNIST, we set the learning rate to 0.01 and the number of epochs to 10. For training the MLP on Fashion MNIST and the CNN on CIFAR10, the learning rate is 0.001 and the number of epochs is 30. We use a batch size of 200 and the Adam optimizer (Kingma & Ba, 2014) for all NN training. We use the RBF kernel for GP... We set p = 0.1n in GPGapE-small. We set p = 0.3n in GPGapE-small. The length-scale parameter of the RBF kernel is searched over [0.5, 1, 10] and the noise parameter λ over [1, 5, 10].
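The reported hyperparameter sweep (RBF length scale over [0.5, 1, 10], noise λ over [1, 5, 10]) can be sketched generically with a plain NumPy GP regressor; the synthetic data and the selection-by-validation-MSE criterion below are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def rbf_kernel(X1, X2, ls):
    # Squared-exponential kernel: k(x, x') = exp(-||x - x'||^2 / (2 ls^2)).
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def gp_predict(Xtr, ytr, Xte, ls, lam):
    # Standard GP regression posterior mean with noise variance lam.
    K = rbf_kernel(Xtr, Xtr, ls) + lam * np.eye(len(Xtr))
    Ks = rbf_kernel(Xte, Xtr, ls)
    return Ks @ np.linalg.solve(K, ytr)

def grid_search(Xtr, ytr, Xval, yval, ls_grid=(0.5, 1, 10), lam_grid=(1, 5, 10)):
    # Pick the (length scale, noise) pair minimising validation MSE,
    # sweeping the same grids as reported in the setup above.
    best = None
    for ls in ls_grid:
        for lam in lam_grid:
            mse = np.mean((gp_predict(Xtr, ytr, Xval, ls, lam) - yval) ** 2)
            if best is None or mse < best[0]:
                best = (mse, ls, lam)
    return best[1], best[2]
```

The validation-MSE criterion is one common way to choose GP hyperparameters; marginal-likelihood maximisation would be an equally standard alternative.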