Benefits of Early Stopping in Gradient Descent for Overparameterized Logistic Regression

Authors: Jingfeng Wu, Peter Bartlett, Matus Telgarsky, Bin Yu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Figure 1: The logistic risk and zero-one error along the GD path for an overparameterized logistic regression problem. Here d = 2000, n = 1000, λ_i = i^{-2}, w*_{0:100} = 1, and w*_{100:} = 0. The optimization length is measured by ηt. The plots show that the excess logistic risk and excess zero-one error are both small for GD with appropriate early stopping, and that both grow larger when GD enters the interpolation regime, demonstrating the regularization effect of early stopping in GD.
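The experiment described in the Figure 1 caption can be sketched end to end, scaled down for speed. The paper releases no code, so everything here is an assumption: the function names are illustrative, and the i^{-2} eigendecay is a reconstruction of the garbled caption. The sketch tracks the training logistic risk and zero-one error along the GD path (the paper's figure plots excess population quantities):

```python
import numpy as np

def run_gd_path(n=200, d=400, eta=0.5, steps=300, seed=0):
    """Track logistic risk and zero-one error along the GD path on
    synthetic data mimicking the Figure 1 setup (scaled down)."""
    rng = np.random.default_rng(seed)
    lam = np.arange(1, d + 1) ** -2.0               # spectrum lambda_i = i^{-2}
    X = rng.standard_normal((n, d)) * np.sqrt(lam)  # anisotropic Gaussian design
    w_star = np.zeros(d)
    w_star[:20] = 1.0                               # leading signal coordinates
    p = 1.0 / (1.0 + np.exp(-X @ w_star))           # logistic label model
    y = np.where(rng.random(n) < p, 1.0, -1.0)
    w = np.zeros(d)                                 # w_0 = 0
    risks, errs = [], []
    for _ in range(steps):
        m = y * (X @ w)                             # margins y_i <x_i, w>
        risks.append(np.mean(np.logaddexp(0.0, -m)))  # logistic risk
        errs.append(np.mean(m <= 0))                  # zero-one error
        grad = -(X.T @ ((1.0 / (1.0 + np.exp(m))) * y)) / n
        w -= eta * grad
    return np.array(risks), np.array(errs)
```

Early stopping as studied in the paper would pick an iterate t along this path (e.g., by held-out risk) rather than running GD to interpolation.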
Researcher Affiliation | Collaboration | ¹University of California, Berkeley; ²Google DeepMind; ³New York University. Correspondence to: Jingfeng Wu <EMAIL>, Peter L. Bartlett <EMAIL>, Matus Telgarsky <EMAIL>, Bin Yu <EMAIL>.
Pseudocode | No | No explicit pseudocode or algorithm blocks are provided. The gradient descent steps are described with mathematical equations rather than structured algorithmic formatting, for example: 'w_0 = 0, w_{t+1} = w_t − η∇L̂(w_t), t ≥ 0. (GD)'
Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide links to a code repository.
Open Datasets | No | 'We focus on a well-specified setting where the feature vector follows an anisotropic Gaussian design and the binary label conditional on the feature is given by a logistic model (see Assumption 1 in Section 2).' This describes a synthetic data-generation process rather than an existing publicly available dataset with concrete access information.
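This synthetic generation process can be sketched as follows. The function name and the i^{-2} eigendecay (reconstructed from the garbled Figure 1 caption) are assumptions, not the authors' code:

```python
import numpy as np

def sample_logistic_data(n=1000, d=2000, k=100, seed=0):
    """Sample n pairs (x, y): x ~ N(0, diag(lambda)) with
    lambda_i = i^{-2}, and y | x drawn from a logistic model."""
    rng = np.random.default_rng(seed)
    lam = np.arange(1, d + 1) ** -2.0               # eigenvalues lambda_i = i^{-2}
    X = rng.standard_normal((n, d)) * np.sqrt(lam)  # anisotropic Gaussian features
    w_star = np.zeros(d)
    w_star[:k] = 1.0                                # first k signal coordinates = 1
    p = 1.0 / (1.0 + np.exp(-X @ w_star))           # P(y = +1 | x), logistic model
    y = np.where(rng.random(n) < p, 1.0, -1.0)      # labels in {-1, +1}
    return X, y, w_star
```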
Dataset Splits | No | 'Let (x_i, y_i)_{i=1}^n be n independent copies of (x, y). Define the empirical risk as L̂(w) := (1/n) Σ_{i=1}^n ℓ(y_i x_i^⊤ w), w ∈ H.' The paper describes data generation and a sample size n, but does not specify any training/validation/test splits for experimental reproduction.
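With ℓ the logistic loss ℓ(z) = log(1 + e^{-z}), the empirical risk above amounts to a one-liner (a sketch with an illustrative function name):

```python
import numpy as np

def empirical_logistic_risk(w, X, y):
    """L_hat(w) = (1/n) * sum_i log(1 + exp(-y_i <x_i, w>))."""
    margins = y * (X @ w)
    # log(1 + exp(-m)) computed stably as logaddexp(0, -m)
    return np.mean(np.logaddexp(0.0, -margins))
```

At the GD initialization w = 0 every margin is zero, so the risk equals log 2.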
Hardware Specification | No | The paper does not provide any specific hardware details used for running its experiments.
Software Dependencies | No | The paper does not name ancillary software with version numbers. It refers to general methods such as gradient descent and logistic regression, not to specific libraries or frameworks.
Experiment Setup | No | 'The iterates of gradient descent (GD) are given by w_0 = 0, w_{t+1} = w_t − η∇L̂(w_t), t ≥ 0, (GD) where η > 0 is a fixed stepsize.' Although a stepsize η is mentioned, specific values or ranges for experimental hyperparameters such as learning rate, batch size, epochs, or optimizer settings are not provided. The parameters in the Figure 1 caption (d = 2000, n = 1000, λ_i = i^{-2}, w*_{0:100} = 1 and w*_{100:} = 0) describe the data-generation model, not the experimental setup.
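The (GD) recursion quoted above is full-batch gradient descent on the empirical logistic risk, which is why batch size does not apply. A minimal self-contained sketch (the function name and stopping after a fixed number of steps are illustrative choices, not from the paper):

```python
import numpy as np

def gd_logistic(X, y, eta=0.5, steps=100):
    """Run (GD): w_0 = 0, w_{t+1} = w_t - eta * grad L_hat(w_t),
    where L_hat is the empirical logistic risk."""
    n, d = X.shape
    w = np.zeros(d)                            # w_0 = 0
    for _ in range(steps):
        margins = y * (X @ w)                  # m_i = y_i <x_i, w>
        # grad L_hat(w) = -(1/n) * sum_i sigmoid(-m_i) * y_i * x_i
        sig = 1.0 / (1.0 + np.exp(margins))    # sigmoid(-m_i)
        grad = -(X.T @ (sig * y)) / n
        w = w - eta * grad                     # fixed-stepsize update
    return w
```

Early stopping, the paper's subject, corresponds to returning w at some intermediate t rather than iterating toward the interpolation regime.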