Notice: The reproducibility variables underlying each score are classified by an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Data Cleansing for Models Trained with SGD

Authors: Satoshi Hara, Atsushi Nitanda, Takanori Maehara

NeurIPS 2019 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Through experiments, we demonstrate that the proposed method can accurately infer the influential instances. Moreover, we used MNIST and CIFAR10 to show that the models can be effectively improved by removing the influential instances suggested by the proposed method." |
| Researcher Affiliation | Academia | EMAIL, Osaka University, Japan; EMAIL, The University of Tokyo, Japan; EMAIL, RIKEN AIP, Japan |
| Pseudocode | Yes | "Algorithm 1 LIE for SGD: Training Phase" and "Algorithm 2 LIE for SGD: Inference Phase" are presented on page 4. |
| Open Source Code | Yes | "The codes are available at https://github.com/sato9hara/sgd-influence" |
| Open Datasets | Yes | "We used three datasets: Adult [Dua and Karra Taniskidou, 2017], 20Newsgroups, and MNIST [LeCun et al., 1998]." and "We used MNIST and CIFAR10 [Krizhevsky and Hinton, 2009]." |
| Dataset Splits | Yes | "In the experiments, we randomly subsampled 200 instances for the training set D and validation set D′." and "From the original training set, we held out randomly selected 10,000 instances for the validation set and used the remaining instances as the training set." |
| Hardware Specification | Yes | "The experiments were conducted on 64bit Ubuntu 16.04 with six Intel Xeon E5-1650 3.6GHz CPU, 128GB RAM, and four GeForce GTX 1080 Ti." |
| Software Dependencies | Yes | "We used Python 3 and PyTorch 1.0 for the experiments." |
| Experiment Setup | Yes | "In SGD, we set the epoch K = 20, batch size \|S_t\| = 64, and learning rate η_t = 0.05." |
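As a rough illustration of the reported training configuration (epochs K = 20, batch size |S_t| = 64, learning rate η_t = 0.05), the following minimal sketch runs a plain mini-batch SGD loop. The toy regression data and one-parameter linear model are hypothetical stand-ins, not the paper's MNIST/CIFAR10 setup or its influence-estimation method.

```python
import random

# Minimal SGD sketch using the hyperparameters reported in the paper:
# epochs K = 20, batch size |S_t| = 64, learning rate 0.05.
# The toy data and linear model below are illustrative stand-ins only.
random.seed(0)

# Toy regression data: y = 3x + small Gaussian noise, with x in [0, 1).
data = [(i / 1000.0, 3.0 * i / 1000.0 + random.gauss(0.0, 0.01))
        for i in range(1000)]

K = 20            # number of epochs
batch_size = 64   # mini-batch size |S_t|
lr = 0.05         # learning rate eta_t

w = 0.0  # single weight of the linear model y_hat = w * x
for epoch in range(K):
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Gradient of the mean squared error 0.5 * (w*x - y)^2 w.r.t. w
        grad = sum((w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad

print(f"learned weight: {w:.3f}")  # should approach the true slope of 3
```

With this learning rate and batch size, 20 epochs over 1000 points give roughly 320 update steps, enough for the weight to settle near the true slope on this toy problem.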