Data Valuation in the Absence of a Reliable Validation Set
Authors: Himanshu Jahagirdar, Jiachen T. Wang, Ruoxi Jia
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical Evaluation. We demonstrate the effectiveness of LOOCV-based data valuation techniques on important downstream tasks. Compared with validation-based techniques, we show that LOOCV-based data valuation techniques achieve comparable performance on the weighted accuracy task and (often) superior performance on the noisy label detection task. We also show that RLS with a Gaussian kernel is an effective proxy model for valuation: the computed data value scores perform better on these downstream tasks than their validation-based counterparts. |
| Researcher Affiliation | Academia | Himanshu Jahagirdar EMAIL Virginia Tech Jiachen T. Wang EMAIL Princeton University Ruoxi Jia EMAIL Virginia Tech |
| Pseudocode | No | The paper describes methods using mathematical formulations and prose, but does not contain any clearly labeled pseudocode or algorithm blocks. For example, Section 4 details the proposed approach and its components but without a dedicated algorithm box. |
| Open Source Code | No | The paper does not contain any explicit statement about providing source code, nor does it include links to a code repository in the main text or supplementary materials. Phrases like 'We release our code...' or links to GitHub are absent. |
| Open Datasets | Yes | We evaluate data values over 9 classification datasets popularly used in data valuation literature (refer Appendix B.1). For example, the paper mentions using the 'Census Dataset from the UCI Repository (Dua & Graff, 2017)', 'Credit Card Data (Yeh & Lien, 2009)', 'CIFAR10', and 'MNIST'. |
| Dataset Splits | Yes | A validation-free paradigm for data valuation using Leave-One-Out Cross-Validation (LOOCV). Recognizing the limitations of validation-based data valuation techniques, we propose a novel validation-free approach using Leave-One-Out Cross-Validation (LOOCV) to estimate performance scores on the population. Cross-Validation (CV) is a widely-used technique in statistical machine learning for estimating the generalizability of a trained model to the population distribution. In a K-fold CV, data is randomly partitioned into K equal-sized subsets. The model is trained on K−1 subsets and tested on the remaining one, repeating this process K times and averaging the validation performance over the remaining subset. Leave-one-out cross-validation (LOOCV) is a special case of K-fold where K equals the total sample size. That is, it trains the model on all data points except one, and repeats this for each data point. |
| Hardware Specification | No | The paper makes a general statement about future potential, 'Additionally, it opens the potential for parallel computation of f_i for all i via GPU operations', but does not specify any actual hardware (like CPU, GPU models, or cloud configurations) used for the experiments presented in the paper. |
| Software Dependencies | No | The paper mentions using 'standard models (either binary MLP or logistic regression)' and an 'RLS model' but does not specify any software frameworks, libraries, or their version numbers (e.g., 'PyTorch 1.9', 'Scikit-learn 0.24') that would be necessary for reproducibility. |
| Experiment Setup | Yes | LOOCV calculation (outlined in Theorem 2) involves computing the efficient cross-validation accuracy (using Theorem 5) on an RLS model (λ = 0.1) with a Gaussian kernel. Additionally, we perform an ablation study on the effect of changing the parameter λ in Appendix B.6. In all experiments, labels have been randomly flipped with a fixed poison ratio of 10%. |
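The "efficient cross-validation accuracy" referenced in the Experiment Setup row relies on a well-known closed-form LOOCV identity for regularized least squares: with hat matrix H = K(K + λI)⁻¹, the leave-one-out prediction for point i is (ŷᵢ − Hᵢᵢ yᵢ)/(1 − Hᵢᵢ), so all n leave-one-out fits come from a single training pass. A minimal sketch of this identity (not the authors' code; the kernel width `gamma` and the function names are illustrative assumptions):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix via broadcasting over pairwise squared distances."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def rls_loocv_predictions(X, y, lam=0.1, gamma=1.0):
    """Closed-form LOOCV predictions for kernel RLS.

    Uses the identity f_{-i}(x_i) = (f(x_i) - H_ii * y_i) / (1 - H_ii),
    where H = K (K + lam*I)^{-1}, so no model is ever retrained.
    """
    n = len(y)
    K = gaussian_kernel(X, X, gamma)
    H = K @ np.linalg.inv(K + lam * np.eye(n))  # smoother ("hat") matrix
    y_hat = H @ y                               # full-data fitted values
    h = np.diag(H)                              # leverage of each point
    return (y_hat - h * y) / (1.0 - h)
```

For classification with ±1 labels (as in the paper's noisy-label setting), the LOOCV accuracy is then the fraction of points where `sign` of the leave-one-out prediction matches the label; the closed form makes this O(n³) once, rather than O(n⁴) for n brute-force refits.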