Stability and Generalization for Stochastic (Compositional) Optimizations

Authors: Xiaokang Pan, Jin Liu, Hulin Kuang, Youqi Li, Lixing Chen, Zhe Qu

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted a toy experiment to compare SGD and STORM in convex and non-convex settings, as shown in Figure 1. In the non-convex setting, STORM demonstrated a faster convergence rate than SGD, as shown in Figure 1b, and it also exhibited superior generalization performance because the gap between training loss and testing loss is smaller. However, in the convex setting (Figure 1a), the trend reversed. It can be noticed that the estimator changes not only the convergence rate of the algorithm but also its generalization. This observation raises an interesting question: How does the estimator affect generalization? Therefore, in this paper, we analyze the effect of estimators on generalization under the SO and SCO problems. We first conduct a generalization analysis of STORM, comparing the results to SGD and exploring the estimator's impact on generalization as an initial step. Considering the complexity of the SCO problem, we then propose a generalized algorithmic framework that encompasses many existing algorithms and applies to multiple estimators and various estimation strategies, i.e., the objects to which the estimator is applied. Subsequently, we analyze the generalization of this framework in both convex and non-convex settings. Through our analysis, we identify the estimators that are crucial and those that are less important. Furthermore, we perform a generalization analysis of three representative algorithms (SCGD, SCSC, and COVER) under the SCO problem to compare how different estimation strategies impact algorithm generalization. Based on this analysis, we also establish a relationship between the convergence rate and generalization, summarized in Table 2.
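The excerpt contrasts plain SGD with STORM's momentum-based variance-reduced estimator. As a minimal sketch of the difference (function names and the noisy quadratic objective are illustrative, not from the paper): SGD uses the fresh stochastic gradient directly, while STORM corrects the previous direction with a gradient difference evaluated on the same fresh sample, d_t = ∇f(x_t; ξ_t) + (1 − β)(d_{t−1} − ∇f(x_{t−1}; ξ_t)).

```python
import numpy as np

def sgd_step(x, grad_fn, sample, lr):
    # Plain SGD: the update direction is the fresh stochastic gradient.
    return x - lr * grad_fn(x, sample)

def storm_step(x, x_prev, d_prev, grad_fn, sample, lr, beta):
    # STORM estimator: reuse the previous direction d_prev, corrected by a
    # gradient difference computed on the SAME fresh sample at both iterates.
    d = grad_fn(x, sample) + (1.0 - beta) * (d_prev - grad_fn(x_prev, sample))
    return x - lr * d, d

if __name__ == "__main__":
    # Toy noisy quadratic: minimize 0.5 * ||x - b||^2 with additive
    # Gaussian gradient noise (illustrative stand-in for the paper's setup).
    rng = np.random.default_rng(0)
    b = np.ones(5)
    grad_fn = lambda x, noise: (x - b) + noise

    x = np.zeros(5)
    x_prev = x.copy()
    d_prev = grad_fn(x, rng.normal(0.0, 0.1, 5))  # d_0: plain stochastic gradient
    x = x - 0.1 * d_prev
    for _ in range(100):
        noise = rng.normal(0.0, 0.1, 5)
        x_new, d_prev = storm_step(x, x_prev, d_prev, grad_fn, noise, 0.1, 0.5)
        x_prev, x = x, x_new
    print(np.linalg.norm(x - b))  # distance to the optimum after 100 steps
```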
Researcher Affiliation | Academia | Xiaokang Pan (1,2), Jin Liu (1,2), Hulin Kuang (1,2), Youqi Li (3), Lixing Chen (4,5), Zhe Qu (1,2). (1) School of Computer Science and Engineering, Central South University; (2) Xiangjiang Laboratory; (3) School of Computer Science and Technology, Beijing Institute of Technology; (4) School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University; (5) Shanghai Key Laboratory of Integrated Administration Technologies for Information Security. EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Framework of SCO
1: Inputs: Training data S = {ν_1, …, ν_n, ω_1, …, ω_m}; number of iterations T; parameter sequences {η_t}, {β_t}
2: Initialize x_0 ∈ X and y_0 ∈ R^d
3: for t = 0 to T − 1 do
4:   Randomly sample j_t ∈ [1, m]; obtain g_{ω_{j_t}}(x_t) and ∇g_{ω_{j_t}}(x_t) ∈ R^{d×d_2}
5:   Estimate the inner function value u_t according to Eq. (5)
6:   Estimate the inner function gradient v_t according to Eq. (6) or Eq. (7)
7:   Randomly sample i_t ∈ [1, n]; obtain ∇f_{ν_{i_t}}(u_t) ∈ R^{d_2}
8:   Calculate the total gradient v̂_t = v_t ∇f_{ν_{i_t}}(u_t)
9:   Update: x_{t+1} = Π_X(x_t − η_t v̂_t)
10: end for
11: Outputs: A(S) = x_T or x_τ with τ ∼ Unif({1, …, T})
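A minimal executable sketch of Algorithm 1's loop structure follows. The excerpt does not reproduce Eqs. (5)-(7), so simple moving-average estimators stand in for the inner value and inner Jacobian estimates; the function name `sco_framework`, the sampler callbacks, and the unconstrained setting (projection Π_X omitted, i.e., X = R^d) are all assumptions made for illustration.

```python
import numpy as np

def sco_framework(x0, T, eta, beta, g, grad_g, grad_f, sample_omega, sample_nu):
    """Sketch of Algorithm 1 (Framework of SCO) for F(x) = E_nu f(E_omega g(x)).

    g(x, omega) returns the stochastic inner value; grad_g(x, omega) its
    Jacobian; grad_f(u, nu) the stochastic outer gradient. Moving averages
    with weight beta stand in for the paper's estimators in Eqs. (5)-(7).
    """
    x = x0.copy()
    u = g(x, sample_omega())       # inner function value estimate u_t
    J = grad_g(x, sample_omega())  # inner function gradient estimate v_t
    for _ in range(T):
        omega = sample_omega()
        u = (1.0 - beta) * u + beta * g(x, omega)        # Eq. (5) stand-in
        J = (1.0 - beta) * J + beta * grad_g(x, omega)   # Eq. (6)/(7) stand-in
        v_hat = J.T @ grad_f(u, sample_nu())             # chain-rule total gradient
        x = x - eta * v_hat                              # projection omitted: X = R^d
    return x

if __name__ == "__main__":
    # Toy composition: g(x) = x + noise, f(u) = 0.5 * ||u - b||^2 + noise term.
    rng = np.random.default_rng(1)
    b = np.ones(3)
    x_out = sco_framework(
        x0=np.zeros(3), T=200, eta=0.1, beta=0.2,
        g=lambda x, w: x + w,
        grad_g=lambda x, w: np.eye(3),
        grad_f=lambda u, nu: (u - b) + nu,
        sample_omega=lambda: rng.normal(0.0, 0.05, 3),
        sample_nu=lambda: rng.normal(0.0, 0.05, 3),
    )
    print(np.linalg.norm(x_out - b))  # distance to the optimum
```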
Open Source Code | No | The paper does not contain any explicit statements about releasing source code or links to a code repository.
Open Datasets | No | For SO problems, we use linear regression in both convex and non-convex settings. For SCO problems, we generate two datasets, S1 and S2, with |S1| = |S2|. The inner function g(·) fits S1, and f(g(·)) fits S2. We aim to minimize the overall loss of the fits on both datasets.
Dataset Splits | No | For SCO problems, we generate two datasets, S1 and S2, with |S1| = |S2|. The inner function g(·) fits S1, and f(g(·)) fits S2. We aim to minimize the overall loss of the fits on both datasets. The paper mentions generating datasets but does not specify train/test/validation splits for these datasets.
Hardware Specification | No | We conduct simulations to validate our theoretical results by examining the relationship between convergence and generalization. To simulate high-dimensional issues, we set the dimension of X to 100. The paper mentions conducting simulations but does not specify the hardware used for these simulations.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers.
Experiment Setup | Yes | To simulate high-dimensional issues, we set the dimension of X to 100. We generate data using the objective function and add Gaussian noise with mean 0 and variance 1 to each dimension to mimic stochastic optimization. We use MSE loss for the convex setting and combine the tanh activation function with MSE loss for the non-convex setting, resulting in a total loss of MSE loss + α·tanh(y_predict), where α = 0.1 in both SO and SCO problems. For SO problems, we use linear regression in both convex and non-convex settings. For SCO problems, we generate two datasets, S1 and S2, with |S1| = |S2|. The inner function g(·) fits S1, and f(g(·)) fits S2. We aim to minimize the overall loss of the fits on both datasets. For iteration counts of different methods, we use the results from Table 2 and round to the nearest integer.
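The two losses and the data-generation recipe described above can be sketched directly. This is a hedged reconstruction from the setup text only: the helper names (`convex_loss`, `nonconvex_loss`, `make_data`) and the linear ground-truth model are assumptions; the paper specifies only dimension 100, unit-variance Gaussian noise, MSE, and the α·tanh term with α = 0.1.

```python
import numpy as np

ALPHA = 0.1  # from the setup: alpha = 0.1 in both SO and SCO problems

def convex_loss(y_pred, y_true):
    # Convex setting: plain MSE over the batch.
    return np.mean((y_pred - y_true) ** 2)

def nonconvex_loss(y_pred, y_true, alpha=ALPHA):
    # Non-convex setting: MSE plus alpha * tanh(y_pred), averaged over the batch.
    return np.mean((y_pred - y_true) ** 2 + alpha * np.tanh(y_pred))

def make_data(n, w_true, rng):
    # Synthetic regression data as described: inputs of dimension 100, with
    # Gaussian noise (mean 0, variance 1) added to mimic stochastic optimization.
    X = rng.normal(size=(n, w_true.shape[0]))
    y = X @ w_true + rng.normal(0.0, 1.0, size=n)
    return X, y
```

A perfect prediction gives zero loss in both settings when y_pred = y_true = 0, since tanh(0) = 0; for nonzero predictions the tanh term makes the non-convex objective asymmetric around the MSE minimum.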