Efficient Top-m Data Values Identification for Data Selection
Authors: Xiaoqiang Lin, Xinyi Xu, See-Kiong Ng, Bryan Kian Hsiang Low
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that GPGapE outperforms other baselines in top-m data values identification, noisy data detection, and data subset selection on real-world datasets. We also demonstrate the efficiency of our GPGapE in data selection for large language model fine-tuning. We perform experiments on top-m data values identification with closed-form SV. We use the MNIST dataset with 10k data points in the training dataset and 10k data points in the validation dataset. We perform noisy data detection on MNIST, Fashion MNIST, and CIFAR10. We perform experiments using top-m data values to select a data subset of size m, following the setting in Sec. 5.2. We use the TYDIQA dataset (Clark et al., 2020) and LLAMA-2-7B model (Touvron et al., 2023). Figures 2-8 and Tables 1-3 present empirical results. |
| Researcher Affiliation | Academia | 1Department of Computer Science, National University of Singapore, Singapore; 2Institute of Data Science, National University of Singapore, Singapore. EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | The pseudo-code for GPGapE is in Algorithm 1. Algorithm 1 GPGapE for top-m data values identification |
| Open Source Code | Yes | Our code is available at https://github.com/xqlin98/data-selection-efficient-topm |
| Open Datasets | Yes | We use the MNIST dataset with 10k data points... We perform noisy data detection on MNIST, Fashion MNIST, and CIFAR10. We use the TYDIQA dataset (Clark et al., 2020)... A.1 LICENSE FOR DATASETS MNIST (LeCun et al., 1990): Attribution-Share Alike 3.0 License; CIFAR10 (Krizhevsky, 2009): MIT License; Fashion MNIST (Xiao et al., 2017): MIT License. |
| Dataset Splits | Yes | We use the MNIST dataset with 10k data points in the training dataset and 10k data points in the validation dataset. We select 3k data points from each dataset to perform top-m identification. We randomly sample 1k data points as the validation dataset to further accelerate the utility evaluation (i.e., computation of model accuracy). |
| Hardware Specification | Yes | A.2 COMPUTATIONAL RESOURCES Experiments are run on a server with AMD EPYC 7763 64-Core Processor, 1008GB RAM, and 8 NVIDIA L40 GPUs. |
| Software Dependencies | No | We use Adam optimizer (Kingma & Ba, 2014) for all NN training. We use the RBF kernel for GP... we use Sentence-BERT (Reimers & Gurevych, 2019) as the embedding model... We use the model from https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 |
| Experiment Setup | Yes | For training MLP on MNIST, we set the learning rate to 0.01 and the number of epochs to 10. For training MLP on Fashion MNIST and CNN on CIFAR10, the learning rate is 0.001 and the number of epochs is 30. We use a batch size of 200 and the Adam optimizer (Kingma & Ba, 2014) for all NN training. We use the RBF kernel for GP... We set p = 0.1n in GPGapE-small. We set p = 0.3n in GPGapE. The length scale parameter of RBF is searched over [0.5, 1, 10] and the noise parameter λ over [1, 5, 10]. |
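The quoted setup searches the RBF length scale over [0.5, 1, 10] and the noise parameter λ over [1, 5, 10]. A minimal sketch of such a grid search, assuming scikit-learn's `GaussianProcessRegressor` and selection by log marginal likelihood on toy data (the paper's actual GP implementation, utility data, and selection criterion may differ):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy stand-in for (feature, utility) observations; purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

# Grid from the quoted setup: length scale in [0.5, 1, 10], noise λ in [1, 5, 10].
best = None
for ls in [0.5, 1, 10]:
    for lam in [1, 5, 10]:
        gp = GaussianProcessRegressor(
            kernel=RBF(length_scale=ls),
            alpha=lam,       # observation-noise variance added to the kernel diagonal
            optimizer=None,  # keep the grid values fixed; no marginal-likelihood tuning
        )
        gp.fit(X, y)
        score = gp.log_marginal_likelihood()
        if best is None or score > best[0]:
            best = (score, ls, lam)

print(f"selected length scale = {best[1]}, noise lambda = {best[2]}")
```

Setting `optimizer=None` is what makes this a pure grid search: each candidate kernel is evaluated as given rather than being re-optimized during `fit`.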