Gradient Methods Never Overfit On Separable Data
Author: Ohad Shamir
JMLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we formally show that standard gradient methods (in particular, gradient flow, gradient descent and stochastic gradient descent) never overfit on separable data... We conclude our paper by providing a numerical experiment supporting our theoretical results on a real-world dataset. |
| Researcher Affiliation | Academia | Ohad Shamir EMAIL Department of Computer Science and Applied Mathematics Weizmann Institute of Science Rehovot, Israel |
| Pseudocode | No | The paper describes methods using mathematical equations and text, but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement or link regarding the availability of source code for the described methodology. |
| Open Datasets | Yes | Specifically, we used a linearly-separable subset of the well-known MNIST digit recognition dataset, focusing on the binary classification task of distinguishing between the digits 4 and 7 using a linear predictor. |
| Dataset Splits | Yes | Given this linearly-separable dataset, we randomly split it into a training set of size 10^4, and a test set. |
| Hardware Specification | No | The paper describes a numerical experiment but does not provide specific details about the hardware used to run these experiments. |
| Software Dependencies | No | The paper mentions using gradient descent with logistic loss and a quadratic programming solver but does not specify any software versions for reproducibility. |
| Experiment Setup | Yes | Then, we trained an unbiased linear classifier using gradient descent (with step size 1 and using the logistic loss) until all but 212 examples were correctly classified. After removing these examples, we were left with a dataset of 13905 examples, which by definition is linearly separable. Using a quadratic programming solver, we computed the exact max-margin predictor w, with a margin of γ := 0.0187... Given this linearly-separable dataset, we randomly split it into a training set of size 10^4, and a test set. We again ran gradient descent (with the logistic loss and step size of 1) on the training set. |
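The first step of the quoted setup, gradient descent with step size 1 on the logistic loss over a linearly separable dataset, can be sketched as follows. This is a minimal illustration, not the paper's code: the data here is a small synthetic Gaussian sample filtered to guarantee a margin (a hypothetical stand-in for the filtered MNIST 4-vs-7 subset), and all sizes and the iteration count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the paper's linearly separable MNIST 4-vs-7 subset
# (hypothetical data; the actual experiment uses MNIST images).
n, d = 300, 5
u = np.array([1.0, 0.0, 0.0, 0.0, 0.0])        # ground-truth separating direction
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm examples
keep = np.abs(X @ u) > 0.5                      # keep only points with a clear
X = X[keep]                                     # margin, mimicking the paper's
y = np.sign(X @ u)                              # removal of misclassified points

def logistic_loss(w):
    # Mean logistic loss: (1/n) sum log(1 + exp(-y_i <x_i, w>))
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

# Gradient descent with step size 1 on the logistic loss, as in the setup.
w = np.zeros(d)                                 # unbiased linear classifier
for _ in range(5000):
    margins = y * (X @ w)
    grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / len(y)
    w -= 1.0 * grad

train_err = np.mean(np.sign(X @ w) != y)
print(f"training error: {train_err}, loss: {logistic_loss(w):.4f}")
```

On separable data the training error reaches zero while the loss keeps shrinking as the iterate's norm grows, which is the regime the paper's theory addresses.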
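The quoted setup also computes the exact max-margin predictor with a quadratic programming solver. The paper does not name the solver, so this sketch uses SciPy's SLSQP on the standard hard-margin QP (minimize ½||w||² subject to y_i⟨x_i, w⟩ ≥ 1, with margin γ = 1/||w||); the tiny 2-D dataset is hypothetical, not the paper's data, whose reported margin is γ = 0.0187.

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable dataset standing in for the paper's filtered MNIST subset
# (hypothetical points chosen only so the QP is easy to inspect).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
d = X.shape[1]

# Hard-margin SVM as a QP: minimize (1/2)||w||^2 s.t. y_i <x_i, w> >= 1.
# The max margin of the data is then gamma = 1 / ||w||.
constraints = [{"type": "ineq",
                "fun": (lambda w, i=i: y[i] * (X[i] @ w) - 1.0),
                "jac": (lambda w, i=i: y[i] * X[i])}
               for i in range(len(y))]
res = minimize(fun=lambda w: 0.5 * (w @ w),
               x0=np.zeros(d),
               jac=lambda w: w,
               constraints=constraints,
               method="SLSQP")
w_star = res.x
gamma = 1.0 / np.linalg.norm(w_star)
print(f"max-margin predictor: {w_star}, margin: {gamma:.4f}")
```

Normalizing the constraint right-hand side to 1 is the usual trick that turns the margin-maximization problem into a convex QP; any QP solver with linear inequality constraints would do.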