Gradient Methods Never Overfit On Separable Data

Authors: Ohad Shamir

JMLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we formally show that standard gradient methods (in particular, gradient flow, gradient descent and stochastic gradient descent) never overfit on separable data... We conclude our paper by providing a numerical experiment supporting our theoretical results on a real-world dataset.
Researcher Affiliation | Academia | Ohad Shamir EMAIL Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel
Pseudocode | No | The paper describes methods using mathematical equations and text, but does not present any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement or link regarding the availability of source code for the described methodology.
Open Datasets | Yes | Specifically, we used a linearly-separable subset of the well-known MNIST digit recognition dataset, focusing on the binary classification task of distinguishing between the digits 4 and 7 using a linear predictor.
Dataset Splits | Yes | Given this linearly-separable dataset, we randomly split it into a training set of size 10^4, and a test set.
Hardware Specification | No | The paper describes a numerical experiment but does not provide specific details about the hardware used to run these experiments.
Software Dependencies | No | The paper mentions using gradient descent with logistic loss and a quadratic programming solver but does not specify any software versions for reproducibility.
Experiment Setup | Yes | Then, we trained an unbiased linear classifier using gradient descent (with step size 1 and using the logistic loss) until all but 212 examples were correctly classified. After removing these examples, we were left with a dataset of 13905 examples, which by definition is linearly separable. Using a quadratic programming solver, we computed the exact max-margin predictor w, with a margin of γ := 0.0187... Given this linearly-separable dataset, we randomly split it into a training set of size 10^4, and a test set. We again ran gradient descent (with the logistic loss and step size of 1) on the training set.
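The gradient-descent part of the setup above can be sketched as follows. This is a minimal illustration, not the paper's code: synthetic Gaussian clusters stand in for the MNIST 4-vs-7 subset, and the dataset size, dimension, and iteration count are arbitrary choices; only the step size of 1, the logistic loss, and the unbiased linear predictor come from the paper.

```python
import numpy as np

# Minimal sketch of the reported experiment: an unbiased linear classifier
# trained by gradient descent with step size 1 on the logistic loss.
# Assumption: two synthetic Gaussian clusters stand in for the paper's
# linearly-separable MNIST 4-vs-7 subset; sizes and iterations are arbitrary.

rng = np.random.default_rng(0)
n, d = 200, 10
X = np.vstack([rng.normal(+2.0, 1.0, (n // 2, d)),
               rng.normal(-2.0, 1.0, (n // 2, d))])
y = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])  # labels in {+1, -1}

def logistic_loss(w):
    # mean log(1 + exp(-y * <w, x>)), computed stably via logaddexp
    return np.logaddexp(0.0, -y * (X @ w)).mean()

w = np.zeros(d)
for _ in range(500):
    margins = y * (X @ w)
    # sigma(-m) = exp(-log(1 + exp(m))) is a stable form of 1 / (1 + exp(m))
    sig = np.exp(-np.logaddexp(0.0, margins))
    grad = -(X.T @ (y * sig)) / n   # gradient of the mean logistic loss
    w -= 1.0 * grad                 # step size 1, as in the paper

# Normalized margin: positive iff every example is classified correctly.
gamma_hat = np.min(y * (X @ w)) / np.linalg.norm(w)
print(f"final loss = {logistic_loss(w):.2e}, normalized margin = {gamma_hat:.3f}")
```

On separable data such as this, the training loss keeps decreasing and the normalized margin stays positive, consistent with the paper's thesis that gradient descent does not overfit here.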
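The max-margin computation can likewise be sketched as a small quadratic program: minimize ½‖w‖² subject to y_i⟨w, x_i⟩ ≥ 1 for all i, whose solution gives margin γ = 1/‖w‖. The paper only says "a quadratic programming solver"; the sketch below uses SciPy's general-purpose SLSQP method on the same kind of synthetic separable data, which is an assumption for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Hedged sketch of the max-margin step: solve min 0.5*||w||^2 subject to
# y_i * <w, x_i> >= 1, so the margin of the solution is 1 / ||w||.
# Synthetic separable data replaces the paper's MNIST 4-vs-7 subset, and
# SLSQP is an assumed solver choice -- the paper does not name one.

rng = np.random.default_rng(0)
n, d = 200, 10
X = np.vstack([rng.normal(+2.0, 1.0, (n // 2, d)),
               rng.normal(-2.0, 1.0, (n // 2, d))])
y = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])

res = minimize(
    fun=lambda w: 0.5 * w @ w,          # objective: half squared norm
    x0=np.zeros(d),
    jac=lambda w: w,
    constraints={"type": "ineq",        # y_i * <w, x_i> - 1 >= 0 for all i
                 "fun": lambda w: y * (X @ w) - 1.0,
                 "jac": lambda w: y[:, None] * X},
    method="SLSQP",
)
w_star = res.x
gamma = 1.0 / np.linalg.norm(w_star)    # max margin of the dataset
print(f"solver success: {res.success}, margin gamma = {gamma:.4f}")
```

The paper reports γ := 0.0187 for its MNIST subset; the value here is only for the synthetic stand-in data.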