Sharpness-Aware Black-Box Optimization
Authors: Feiyang Ye, Yueming Lyu, Xuehao Wang, Masashi Sugiyama, Yu Zhang, Ivor Tsang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Theoretically, we prove the convergence rate and generalization bound of the proposed SABO algorithm. Empirically, extensive experiments on the black-box prompt fine-tuning tasks demonstrate the effectiveness of the proposed SABO method in improving model generalization performance. |
| Researcher Affiliation | Academia | 1 Department of Computer Science and Engineering, Southern University of Science and Technology; 2 Australian Artificial Intelligence Institute, University of Technology Sydney; 3 Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore; 4 Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore; 5 RIKEN Center for Advanced Intelligence Project; 6 Graduate School of Frontier Sciences, The University of Tokyo; 7 College of Computing and Data Science, Nanyang Technological University |
| Pseudocode | Yes | Algorithm 1 SABO. Require: neighborhood size ρ, learning rate β_t. 1: Initialize θ_0 = (µ_0, Σ_0); 2: for t = 0 to T−1 do; 3: take i.i.d. samples z̃_j ∼ N(0, I) and set x̃_j = µ_t + Σ_t^{1/2} z̃_j for j ∈ {1, …, N}; 4: query the batch observations {F(x̃_1), …, F(x̃_N)}; 5: compute the gradient g̃_t via Eq. (16) and the gradient G̃_t via Eq. (17); 6: compute λ = (1/ρ) (‖Σ_t G̃_t‖_F² + 0.5 ‖Σ_t^{1/2} g̃_t‖_2²)^{1/2}; 7: compute δµ_t and δΣ_t via Eq. (18); 8: take i.i.d. samples z_j ∼ N(0, I) for j ∈ {1, …, N}; 9: set x_j = µ_t + δµ_t + (Σ_t + δΣ_t)^{1/2} z_j for j ∈ {1, …, N}; 10: query the batch observations {F(x_1), …, F(x_N)}; 11: compute the gradient g_t via Eq. (19) and the gradient G_t via Eq. (20); 12: set µ_{t+1} = µ_t − β_t Σ_t g_t and Σ_{t+1}^{−1} = Σ_t^{−1} + 2β_t G_t; 13: end for; 14: return θ_T = (µ_T, Σ_T). |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We conduct experiments on six language understanding benchmark datasets: SST-2 (Socher et al., 2013) and Yelp polarity (Zhang et al., 2015) for sentiment analysis, AG's News (Zhang et al., 2015) for topic classification, MRPC (Dolan & Brockett, 2005) for paraphrase detection, and RTE (Wang et al., 2018) and SNLI (Bowman et al., 2015) for natural language inference. Each dataset contains a classification task. The statistics of the six datasets are summarized in Table 1 of (Sun et al., 2022b). ... Fashion-MNIST (Xiao et al., 2017) is an image classification dataset. |
| Dataset Splits | Yes | The statistics of the six datasets are summarized in Table 1 of (Sun et al., 2022b). Following (Sun et al., 2022b), the testing accuracy is used to measure the performance of all the methods on the SST-2, AG's News, RTE, SNLI, and Yelp P. datasets, and the F1 score is used to measure performance on the MRPC dataset. ... Fashion-MNIST (Xiao et al., 2017) is an image classification dataset. It contains 60,000 training samples and 10,000 test samples, each a 28×28-pixel grayscale image of a fashion item from 10 classes. |
| Hardware Specification | Yes | All the experiments are conducted on a single NVIDIA GeForce RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions using "RoBERTa LARGE model (Liu et al., 2019)" but does not specify version numbers for any programming languages, libraries, or other software components used for the experiments. |
| Experiment Setup | Yes | For the CMA-ES, MMES, BES, INGO, and SABO methods, we employ the cross-entropy loss on the training data as the black-box objective for the six datasets and optimize the vector v for 100 iterations. The Gaussian distribution is initialized as µ_0 = 0 and Σ_0 = I, and the population size N is set to 100. We perform a grid search over the hyperparameters of the INGO, SABO, and BES methods. Specifically, we search the learning rate β over {0.1, 0.5, 1, 5} for INGO, SABO, and BES, the neighborhood size ρ over {10, 50, 100, 500} for SABO, and the spacing c over {0.1, 1, 10} for BES. Additionally, we evaluate the performance of all methods for different dimensions of v, specifically d ∈ {200, 500, 1000}. All experiments are run three times independently, and the mean objective ± std is reported. |
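The flow of Algorithm 1 can be sketched in code. The following is a minimal, diagonal-covariance sketch only, not the paper's implementation: the paper's exact gradient estimators (Eqs. 16–20) are not reproduced here, so standard Monte-Carlo score-function estimators with fitness normalization are used as stand-ins, the full covariance Σ is simplified to a diagonal one, the λ scaling is reconstructed under assumption from the garbled pseudocode, and the function name `sabo_diag` is hypothetical.

```python
import numpy as np

def sabo_diag(F, d, T=50, N=64, beta=0.5, rho=1.0, seed=0):
    """Diagonal-covariance sketch of SABO (Algorithm 1), under the
    assumptions stated above. Each iteration: (i) estimate search
    gradients at the current Gaussian, (ii) take a SAM-style ascent
    perturbation of (mu, sigma2) scaled to neighborhood size rho,
    (iii) re-estimate gradients at the perturbed Gaussian, and
    (iv) apply the natural-gradient update of step 12."""
    rng = np.random.default_rng(seed)
    mu, sigma2 = np.zeros(d), np.ones(d)  # theta_0 = (mu_0, Sigma_0) = (0, I)

    def grads(mu_c, s2_c):
        # Monte-Carlo score-function estimators (stand-ins for Eqs. 16-20).
        z = rng.standard_normal((N, d))
        f = np.array([F(mu_c + np.sqrt(s2_c) * z_j) for z_j in z])
        f = (f - f.mean()) / (f.std() + 1e-12)           # fitness normalization
        g = (f[:, None] * z / np.sqrt(s2_c)).mean(axis=0)        # grad wrt mu
        G = (f[:, None] * (z**2 - 1) / (2 * s2_c)).mean(axis=0)  # grad wrt diag(Sigma)
        return g, G

    for _ in range(T):
        g_t, G_t = grads(mu, sigma2)                     # steps 3-5
        # Step 6: perturbation scale lambda, reconstructed under assumption.
        lam = np.sqrt(np.sum((sigma2 * G_t) ** 2)
                      + 0.5 * np.sum(sigma2 * g_t ** 2)) / rho + 1e-12
        d_mu = sigma2 * g_t / (2 * lam)                  # step 7 (ascent direction)
        # Clip the perturbed variance to stay positive (a sketch-level guard).
        d_s2 = np.maximum(sigma2 + sigma2**2 * G_t / lam, 1e-8) - sigma2
        g, G = grads(mu + d_mu, sigma2 + d_s2)           # steps 8-11
        mu = mu - beta * sigma2 * g                      # step 12
        sigma2 = 1.0 / np.maximum(1.0 / sigma2 + 2 * beta * G, 1e-8)
    return mu, sigma2
```

On a toy quadratic such as `F = lambda x: float(np.sum((x - 1.0) ** 2))`, the returned mean drifts toward the minimizer while the variances contract, mirroring the exploration-then-concentration behavior the algorithm is designed for.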
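The hyperparameter search described in the experiment setup is a plain grid search over β and ρ. A hypothetical sketch, where `run_trial` stands in for a full black-box prompt-tuning run returning the mean metric over the three independent seeds:

```python
from itertools import product

def grid_search(run_trial):
    """Sketch of the grid search from the experiment setup: the beta and
    rho grids are the ones listed for SABO; `run_trial(beta=..., rho=...)`
    is a hypothetical stand-in returning a score to maximize."""
    betas = [0.1, 0.5, 1, 5]       # learning rate grid (INGO, SABO, BES)
    rhos = [10, 50, 100, 500]      # neighborhood size grid (SABO only)
    best_cfg, best_score = None, float("-inf")
    for beta, rho in product(betas, rhos):
        score = run_trial(beta=beta, rho=rho)
        if score > best_score:
            best_cfg, best_score = {"beta": beta, "rho": rho}, score
    return best_cfg, best_score
```

Note that the dimension d ∈ {200, 500, 1000} is not part of this search: per the setup, each d is evaluated separately rather than selected by validation.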