Scalable Acceleration for Classification-Based Derivative-Free Optimization
Authors: Tianyi Han, Jingya Li, Zhipeng Guo, Yuan Jin
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the synthetic functions as well as black-box tuning for language-model-as-a-service demonstrate empirically the efficiency of RACE-CARS. An ablation experiment on the introduced hyper-parameters is also conducted, revealing the mechanism of RACE-CARS and putting forward an empirical hyper-parameter tuning guidance. |
| Researcher Affiliation | Industry | Tianyi Han*, Jingya Li, Zhipeng Guo, Yuan Jin* Beijing Supreium Technology, Haidian District, Beijing, China |
| Pseudocode | Yes | Algorithm 1: Batch-Mode Classification-Based Optimization Algorithm; Algorithm 2: RACOS; Algorithm 3: Sequential-Mode Classification-Based Optimization Algorithm; Algorithm 4: Accelerated Sequential-Mode Classification Based Optimization Algorithm |
| Open Source Code | No | The paper provides a link only for a third-party tool used for comparison: "Code can be found in https://github.com/txsun1997/Black-Box-Tuning". There is no statement or link indicating the release of the authors' own implementation of RACE-CARS. |
| Open Datasets | Yes | We evaluate performance on datasets SST-2 (Socher et al. 2013), Yelp Polarity and AG's News (Zhang, Zhao, and LeCun 2015), and RTE (Wang et al. 2018a). |
| Dataset Splits | Yes | In this part we follow the experiments designed by Sun et al. (2022), where the language understanding task is formulated as a classification task predicting, for a batch of PTM-modified input texts X, the labels Y in the PTM vocabulary; that is, we tune the prompt so that the black-box PTM inference API f takes a continuous prompt p satisfying Y = f(p; X). ... We assess the algorithms based on the mean and deviation of training loss, training accuracy, development loss, and development accuracy. The SST-2 dataset results are highlighted in Figure 3, with additional findings for Yelp Polarity, AG's News, and RTE detailed in the appendix. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or processor types used for running the experiments. It mentions using 'RoBERTa' as a backbone model, which implies computational resources, but no specifications are given. |
| Software Dependencies | No | The paper mentions "RoBERTa (Liu et al. 2019a) serving as the backbone model" but does not specify any software libraries, frameworks, or their version numbers required to replicate the experiments. |
| Experiment Setup | Yes | Region shrinking rate is configured to be γ = 0.9 and 0.95, with shrinking frequency of ρ = 0.01 and 0.001 for n = 50, 500, respectively. ... For our tests, the shrinking rate is γ = 0.7, with shrinking frequency of ρ = 0.002. Each algorithm is repeated 5 times independently with unique seeds. ... In our experimental setup, we configure the search space dimension to d = 500 and the prompt length to 50, with RoBERTa (Liu et al. 2019a) serving as the backbone model. We evaluate performance on datasets SST-2 (Socher et al. 2013), Yelp Polarity and AG's News (Zhang, Zhao, and LeCun 2015), and RTE (Wang et al. 2018a). With a fixed API call budget of T = 8000 |
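The hyper-parameters quoted in the Experiment Setup row can be collected into a small configuration sketch for anyone attempting replication. This is not the authors' code; the dictionary keys and structure are illustrative assumptions, only the numeric values come from the paper.

```python
# Hypothetical configuration summarizing the hyper-parameters reported in the
# paper. Key names are invented for readability; values are as quoted above.

# Synthetic-function experiments: shrinking rate gamma and frequency rho
# depend on the problem dimension n.
synthetic_config = {
    50:  {"gamma": 0.9,  "rho": 0.01},
    500: {"gamma": 0.95, "rho": 0.001},
}

# Black-box prompt-tuning experiments (language-model-as-a-service setting).
prompt_tuning_config = {
    "gamma": 0.7,              # region shrinking rate
    "rho": 0.002,              # shrinking frequency
    "search_space_dim": 500,   # d = 500
    "prompt_length": 50,
    "api_call_budget": 8000,   # T = 8000
    "independent_runs": 5,     # repeated with unique seeds
    "backbone": "RoBERTa",
    "datasets": ["SST-2", "Yelp Polarity", "AG's News", "RTE"],
}
```

Such a sketch makes the missing reproducibility details concrete: everything not listed here (hardware, library versions, seeds) is unspecified in the paper.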