Sharpness-Aware Black-Box Optimization
Authors: Feiyang Ye, Yueming Lyu, Xuehao Wang, Masashi Sugiyama, Yu Zhang, Ivor Tsang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Theoretically, we prove the convergence rate and generalization bound of the proposed SABO algorithm. Empirically, extensive experiments on the black-box prompt fine-tuning tasks demonstrate the effectiveness of the proposed SABO method in improving model generalization performance. |
| Researcher Affiliation | Academia | 1 Department of Computer Science and Engineering, Southern University of Science and Technology; 2 Australian Artificial Intelligence Institute, University of Technology Sydney; 3 Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore; 4 Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore; 5 RIKEN Center for Advanced Intelligence Project; 6 Graduate School of Frontier Sciences, The University of Tokyo; 7 College of Computing and Data Science, Nanyang Technological University |
| Pseudocode | Yes | Algorithm 1 SABO. Require: neighborhood size ρ, learning rate β_t. 1: Initialize θ_0 = (µ_0, Σ_0); 2: for t = 0 to T−1 do; 3: take i.i.d. samples z̃_j ∼ N(0, I) and set x̃_j = µ_t + Σ_t^{1/2} z̃_j for j ∈ {1, …, N}; 4: query the batch observations {F(x̃_1), …, F(x̃_N)}; 5: compute the gradient g̃_t via Eq. (16) and the gradient G̃_t via Eq. (17); 6: compute λ = (1/ρ) (‖Σ_t G̃_t‖_F² + 0.5 ‖Σ_t^{1/2} g̃_t‖_2²)^{1/2}; 7: compute δµ_t and δΣ_t via Eq. (18); 8: take i.i.d. samples z_j ∼ N(0, I) for j ∈ {1, …, N}; 9: set x_j = µ_t + δµ_t + (Σ_t + δΣ_t)^{1/2} z_j for j ∈ {1, …, N}; 10: query the batch observations {F(x_1), …, F(x_N)}; 11: compute the gradient g_t via Eq. (19) and the gradient G_t via Eq. (20); 12: set µ_{t+1} = µ_t − β_t Σ_t g_t and Σ_{t+1}^{−1} = Σ_t^{−1} + 2β_t G_t; 13: end for; 14: return θ_T = (µ_T, Σ_T). |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We conduct experiments on six language understanding benchmark datasets: SST-2 (Socher et al., 2013) and Yelp polarity (Zhang et al., 2015) for sentiment analysis, AG's News (Zhang et al., 2015) for topic classification, MRPC (Dolan & Brockett, 2005) for paraphrase detection, and RTE (Wang et al., 2018) and SNLI (Bowman et al., 2015) for natural language inference. Each dataset contains a classification task. The statistics of the six datasets are summarized in Table 1 of (Sun et al., 2022b). ... Fashion-MNIST (Xiao et al., 2017) is an image classification dataset. |
| Dataset Splits | Yes | The statistics of the six datasets are summarized in Table 1 of (Sun et al., 2022b). Following (Sun et al., 2022b), the testing accuracy is used to measure the performance of all the methods on the SST-2, AG's News, RTE, SNLI, and Yelp P. datasets, and the F1 score is used to measure performance on the MRPC dataset. ... Fashion-MNIST (Xiao et al., 2017) is an image classification dataset. It contains 60,000 training samples and 10,000 test samples, each a 28×28-pixel grayscale image of a fashion item from 10 classes. |
| Hardware Specification | Yes | All the experiments are conducted on a single NVIDIA GeForce RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions using "RoBERTa LARGE model (Liu et al., 2019)" but does not specify version numbers for any programming languages, libraries, or other software components used for the experiments. |
| Experiment Setup | Yes | For the CMA-ES, MMES, BES, INGO, and SABO methods, we employ the cross-entropy loss on the training data as the black-box objective for the six datasets and optimize the vector v for 100 iterations. The Gaussian distribution is initialized as µ_0 = 0 and Σ_0 = I, and the population size N is set to 100. We perform a grid search over the hyperparameters of the INGO, SABO, and BES methods. Specifically, we search the learning rate β over {0.1, 0.5, 1, 5} for INGO, SABO, and BES, the neighborhood size ρ over {10, 50, 100, 500} for SABO, and the spacing c over {0.1, 1, 10} for BES. Additionally, we evaluate the performance of all methods for different dimensions of v, specifically d ∈ {200, 500, 1000}. All experiments are run three times independently, and the mean objective ± std is reported. |
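The flow of Algorithm 1 can be sketched in code. The following is a minimal, diagonal-covariance sketch only, not the paper's implementation: the paper's exact gradient estimators (Eqs. 16–20) are not reproduced here, so standard Monte-Carlo score-function estimators with fitness normalization are used as stand-ins, the full covariance Σ is simplified to a diagonal one, the λ scaling is reconstructed under assumption from the garbled pseudocode, and the function name `sabo_diag` is hypothetical.

```python
import numpy as np

def sabo_diag(F, d, T=50, N=64, beta=0.5, rho=1.0, seed=0):
    """Diagonal-covariance sketch of SABO (Algorithm 1), under the
    assumptions stated above. Each iteration: (i) estimate search
    gradients at the current Gaussian, (ii) take a SAM-style ascent
    perturbation of (mu, sigma2) scaled to neighborhood size rho,
    (iii) re-estimate gradients at the perturbed Gaussian, and
    (iv) apply the natural-gradient update of step 12."""
    rng = np.random.default_rng(seed)
    mu, sigma2 = np.zeros(d), np.ones(d)  # theta_0 = (mu_0, Sigma_0) = (0, I)

    def grads(mu_c, s2_c):
        # Monte-Carlo score-function estimators (stand-ins for Eqs. 16-20).
        z = rng.standard_normal((N, d))
        f = np.array([F(mu_c + np.sqrt(s2_c) * z_j) for z_j in z])
        f = (f - f.mean()) / (f.std() + 1e-12)           # fitness normalization
        g = (f[:, None] * z / np.sqrt(s2_c)).mean(axis=0)        # grad wrt mu
        G = (f[:, None] * (z**2 - 1) / (2 * s2_c)).mean(axis=0)  # grad wrt diag(Sigma)
        return g, G

    for _ in range(T):
        g_t, G_t = grads(mu, sigma2)                     # steps 3-5
        # Step 6: perturbation scale lambda, reconstructed under assumption.
        lam = np.sqrt(np.sum((sigma2 * G_t) ** 2)
                      + 0.5 * np.sum(sigma2 * g_t ** 2)) / rho + 1e-12
        d_mu = sigma2 * g_t / (2 * lam)                  # step 7 (ascent direction)
        # Clip the perturbed variance to stay positive (a sketch-level guard).
        d_s2 = np.maximum(sigma2 + sigma2**2 * G_t / lam, 1e-8) - sigma2
        g, G = grads(mu + d_mu, sigma2 + d_s2)           # steps 8-11
        mu = mu - beta * sigma2 * g                      # step 12
        sigma2 = 1.0 / np.maximum(1.0 / sigma2 + 2 * beta * G, 1e-8)
    return mu, sigma2
```

On a toy quadratic such as `F = lambda x: float(np.sum((x - 1.0) ** 2))`, the returned mean drifts toward the minimizer while the variances contract, mirroring the exploration-then-concentration behavior the algorithm is designed for.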
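The hyperparameter search described in the experiment setup is a plain grid search over β and ρ. A hypothetical sketch, where `run_trial` stands in for a full black-box prompt-tuning run returning the mean metric over the three independent seeds:

```python
from itertools import product

def grid_search(run_trial):
    """Sketch of the grid search from the experiment setup: the beta and
    rho grids are the ones listed for SABO; `run_trial(beta=..., rho=...)`
    is a hypothetical stand-in returning a score to maximize."""
    betas = [0.1, 0.5, 1, 5]       # learning rate grid (INGO, SABO, BES)
    rhos = [10, 50, 100, 500]      # neighborhood size grid (SABO only)
    best_cfg, best_score = None, float("-inf")
    for beta, rho in product(betas, rhos):
        score = run_trial(beta=beta, rho=rho)
        if score > best_score:
            best_cfg, best_score = {"beta": beta, "rho": rho}, score
    return best_cfg, best_score
```

Note that the dimension d ∈ {200, 500, 1000} is not part of this search: per the setup, each d is evaluated separately rather than selected by validation.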