Efficient Top-m Data Values Identification for Data Selection

Authors: Xiaoqiang Lin, Xinyi Xu, See-Kiong Ng, Bryan Kian Hsiang Low

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate that GPGapE outperforms other baselines in top-m data values identification, noisy data detection, and data subset selection on real-world datasets. We also demonstrate the efficiency of GPGapE in data selection for large language model fine-tuning. We perform experiments on top-m data values identification with closed-form SV. We use the MNIST dataset with 10k data points in the training dataset and 10k data points in the validation dataset. We perform noisy data detection on MNIST, Fashion MNIST, and CIFAR10. We perform experiments using top-m data values to select a data subset of size m, following the setting in Sec. 5.2. We use the TYDIQA dataset (Clark et al., 2020) and the LLAMA-2-7B model (Touvron et al., 2023). Figures 2–8 and Tables 1–3 present empirical results.
Researcher Affiliation | Academia | 1 Department of Computer Science, National University of Singapore, Singapore; 2 Institute of Data Science, National University of Singapore, Singapore. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | Yes | The pseudo-code for GPGapE is given in Algorithm 1 ("GPGapE for top-m data values identification").
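The paper's Algorithm 1 is not reproduced in this summary. For intuition only, a generic gap-based (GapE-style) top-m identification loop is sketched below; the `pull` oracle, the exploration constant `b`, and the index formula are illustrative assumptions, not the authors' GPGapE.

```python
import numpy as np

def topm_gape(pull, n_arms, m, budget, b=1.0):
    """Generic GapE-style top-m identification sketch (not the paper's GPGapE).

    `pull(i)` returns a noisy sample of arm i's value, e.g. a stochastic
    data-value estimate. Arms whose empirical gap to the top-m boundary is
    small, or whose estimate is uncertain, get sampled more often.
    """
    counts = np.zeros(n_arms, dtype=int)
    means = np.zeros(n_arms)
    # Initialise with one pull per arm.
    for i in range(n_arms):
        means[i] = pull(i)
        counts[i] = 1
    for _ in range(budget - n_arms):
        order = np.argsort(-means)            # arms sorted by descending mean
        top, rest = order[:m], order[m:]
        # Empirical gap to the boundary between the top-m and the rest.
        gaps = np.empty(n_arms)
        gaps[top] = means[top] - means[rest[0]]   # vs. the (m+1)-th best
        gaps[rest] = means[top[-1]] - means[rest]  # vs. the m-th best
        # GapE index: small gap + high uncertainty => explore more.
        index = -gaps + b * np.sqrt(1.0 / counts)
        i = int(np.argmax(index))
        x = pull(i)
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]    # running-mean update
    return set(np.argsort(-means)[:m].tolist())
```

On a toy instance with well-separated arm values, the loop recovers the true top-m set once the budget is a few hundred pulls per arm.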
Open Source Code | Yes | Our code is available at https://github.com/xqlin98/data-selection-efficient-topm
Open Datasets | Yes | We use the MNIST dataset with 10k data points... We perform noisy data detection on MNIST, Fashion MNIST, and CIFAR10. We use the TYDIQA dataset (Clark et al., 2020)... A.1 LICENSE FOR DATASETS: MNIST (Le Cun et al., 1990): Attribution-Share Alike 3.0 License; CIFAR10 (Krizhevsky, 2009): MIT License; Fashion MNIST (Xiao et al., 2017): MIT License.
Dataset Splits | Yes | We use the MNIST dataset with 10k data points in the training dataset and 10k data points in the validation dataset. We select 3k data points from each dataset to perform top-m identification. We randomly sample 1k data points as the validation dataset to further accelerate the utility evaluation (i.e., computation of model accuracy).
Hardware Specification | Yes | A.2 COMPUTATIONAL RESOURCES: Experiments are run on a server with an AMD EPYC 7763 64-core processor, 1008GB RAM, and 8 NVIDIA L40 GPUs.
Software Dependencies | No | We use the Adam optimizer (Kingma & Ba, 2014) for all NN training. We use the RBF kernel for GP... we use Sentence-BERT (Reimers & Gurevych, 2019) as the embedding model... We use the model from https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
Experiment Setup | Yes | For training the MLP on MNIST, we set the learning rate to 0.01 and the number of epochs to 10. For training the MLP on Fashion MNIST and the CNN on CIFAR10, the learning rate is 0.001 and the number of epochs is 30. We use a batch size of 200 and the Adam optimizer (Kingma & Ba, 2014) for all NN training. We use the RBF kernel for GP... We set p = 0.1n in GPGapE-small. We set p = 0.3n in GPGapE-small. The length-scale parameter of the RBF kernel is searched over [0.5, 1, 10] and the noise parameter λ over [1, 5, 10].
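The reported hyperparameter sweep (RBF length scale over [0.5, 1, 10], noise λ over [1, 5, 10]) can be sketched generically with a plain NumPy GP regressor; the synthetic data and the selection-by-validation-MSE criterion below are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def rbf_kernel(X1, X2, ls):
    # Squared-exponential kernel: k(x, x') = exp(-||x - x'||^2 / (2 ls^2)).
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def gp_predict(Xtr, ytr, Xte, ls, lam):
    # Standard GP regression posterior mean with noise variance lam.
    K = rbf_kernel(Xtr, Xtr, ls) + lam * np.eye(len(Xtr))
    Ks = rbf_kernel(Xte, Xtr, ls)
    return Ks @ np.linalg.solve(K, ytr)

def grid_search(Xtr, ytr, Xval, yval, ls_grid=(0.5, 1, 10), lam_grid=(1, 5, 10)):
    # Pick the (length scale, noise) pair minimising validation MSE,
    # sweeping the same grids as reported in the setup above.
    best = None
    for ls in ls_grid:
        for lam in lam_grid:
            mse = np.mean((gp_predict(Xtr, ytr, Xval, ls, lam) - yval) ** 2)
            if best is None or mse < best[0]:
                best = (mse, ls, lam)
    return best[1], best[2]
```

The validation-MSE criterion is one common way to choose GP hyperparameters; marginal-likelihood maximisation would be an equally standard alternative.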