The Implicit Bias of Gradient Descent on Separable Data

Authors: Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, Nathan Srebro

JMLR 2018

Reproducibility Variable Result LLM Response
Research Type Experimental A numerical illustration of the convergence is depicted in Figure 1. As predicted by the theory, the norm w(t) grows logarithmically (note the semi-log scaling), and w(t) converges to the max-margin separator, but only logarithmically, while the loss itself decreases very rapidly (note the log-log scaling). An important practical consequence of our theory, is that although the margin of w(t) keeps improving, and so we can expect the population (or test) misclassification error of w(t) to improve for many datasets, the same cannot be said about the expected population loss (or test loss)! ... These effects are demonstrated in Figure 2 and Table 1 which portray typical training of a convolutional neural network using unregularized gradient descent.
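The quoted behavior can be illustrated numerically. The following is a minimal sketch (not the paper's code, and using a simplified four-point dataset rather than the paper's Figure 1 setup): gradient descent on the logistic loss over separable data, where the norm of w(t) keeps growing slowly while its direction converges to the L2 max-margin separator.

```python
import numpy as np

# Sketch: gradient descent on logistic loss over a separable 2D dataset.
# Expected behavior (per the quoted passage): ||w(t)|| grows roughly
# logarithmically in t, while w(t)/||w(t)|| converges to the L2 max-margin
# direction. The dataset here is a symmetric toy example, an assumption
# of this sketch rather than the paper's exact Figure 1 data.
X = np.array([[0.5, 1.5], [1.5, 0.5], [-0.5, -1.5], [-1.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(2)
eta = 0.1
norms = []
for t in range(20000):
    margins = np.minimum(y * (X @ w), 50.0)   # clip to avoid overflow in exp
    weights = 1.0 / (1.0 + np.exp(margins))   # per-sample sigmoid(-margin)
    grad = -(X * (y * weights)[:, None]).mean(axis=0)
    w -= eta * grad
    norms.append(np.linalg.norm(w))

direction = w / np.linalg.norm(w)
w_hat = np.array([1.0, 1.0]) / np.sqrt(2)  # known L2 max-margin direction here
```

Running this, `direction` aligns with `w_hat` while the norm keeps increasing, but only sub-linearly: the last norm is well under twice the norm reached at a tenth of the iterations, consistent with logarithmic growth.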
Researcher Affiliation Academia Department of Electrical Engineering, Technion, Haifa 320003, Israel; Toyota Technological Institute at Chicago, Chicago, Illinois 60637, USA
Pseudocode No The paper describes mathematical proofs, theorems, and derivations. It does not include a clearly labeled "Pseudocode" or "Algorithm" block, nor does it present structured steps in a code-like format.
Open Source Code Yes Code available here: https://github.com/paper-submissions/MaxMargin
Open Datasets Yes Training of a convolutional neural network on CIFAR10 using stochastic gradient descent with constant learning rate and momentum, softmax output and a cross entropy loss... and Visualization of our main results on a synthetic dataset in which the L2 max margin vector ŵ is precisely known.
Dataset Splits No Figure 2: Training of a convolutional neural network on CIFAR10 using stochastic gradient descent with constant learning rate and momentum, softmax output and a cross entropy loss, where we achieve 8.3% final validation error. Table 1: Sample values from various epochs in the experiment depicted in Fig. 2. ... Validation loss ... Validation error. The paper mentions the use of a validation set for CIFAR10 experiments but does not provide specific details on how the dataset was split (e.g., percentages or exact numbers of samples for training, validation, or test sets).
Hardware Specification No The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. It only generally refers to training models.
Software Dependencies No The paper mentions optimization methods like "stochastic gradient descent" and "ADAM (Kingma and Ba, 2015)", but it does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the experiments.
Experiment Setup Yes Implementation details: The dataset includes four support vectors: x1 = (0.5, 1.5), x2 = (1.5, 0.5) with y1 = y2 = 1, and x3 = −x1, x4 = −x2 with y3 = y4 = −1 (the L2 normalized max margin vector is then ŵ = (1, 1)/√2 with margin equal to √2), and 12 other random datapoints (6 from each class) that are not on the margin. We used a learning rate η = 1/σ²_max(X), where σ_max(X) is the maximal singular value of X, momentum γ = 0.9 for GDMO, and initialized at the origin. and Training of a convolutional neural network on CIFAR10 using stochastic gradient descent with constant learning rate and momentum, softmax output and a cross entropy loss...
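The quoted synthetic setup can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the distribution of the 12 non-support points is not specified in the excerpt, so placing them uniformly well inside each class's half-plane is an assumption, and GDMO is implemented here as plain gradient descent with heavy-ball momentum γ = 0.9.

```python
import numpy as np

# Sketch of the quoted setup: four support vectors, 12 interior points
# (placement is an assumption), learning rate eta = 1/sigma_max(X)^2,
# momentum 0.9 (GDMO), initialized at the origin.
rng = np.random.default_rng(0)
sv = np.array([[0.5, 1.5], [1.5, 0.5], [-0.5, -1.5], [-1.5, -0.5]])
y_sv = np.array([1.0, 1.0, -1.0, -1.0])
# 6 points per class, drawn strictly inside each half-plane so they have
# larger margin than the support vectors (assumed distribution).
pos = rng.uniform(1.5, 3.0, size=(6, 2))
neg = -rng.uniform(1.5, 3.0, size=(6, 2))
X = np.vstack([sv, pos, neg])
y = np.concatenate([y_sv, np.ones(6), -np.ones(6)])

eta = 1.0 / np.linalg.svd(X, compute_uv=False)[0] ** 2  # eta = 1/sigma_max^2
w = np.zeros(2)   # initialized at the origin
v = np.zeros(2)   # momentum buffer, gamma = 0.9
for t in range(5000):
    m = np.minimum(y * (X @ w), 50.0)         # clip to avoid overflow in exp
    grad = -(X * (y / (1.0 + np.exp(m)))[:, None]).sum(axis=0)
    v = 0.9 * v - eta * grad
    w = w + v

u = w / np.linalg.norm(w)  # should approach the max-margin direction (1,1)/sqrt(2)
```

Because the 12 extra points lie farther from the decision boundary than the four support vectors, the max-margin direction stays ŵ = (1, 1)/√2, and the iterate's direction `u` approaches it.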