Efficient Dynamic Ensembling for Multiple LLM Experts

Authors: Jinwu Hu, Yufeng Wang, Shuhai Zhang, Kai Zhou, Guohao Chen, Yu Hu, Bin Xiao, Mingkui Tan

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that our method uses fewer computational resources to achieve better performance compared to state-of-the-art baselines.
Researcher Affiliation | Academia | ¹South China University of Technology, ²Pazhou Laboratory, ³Peng Cheng Laboratory, ⁴Hong Kong Polytechnic University, ⁵Chongqing University of Posts and Telecommunications, ⁶Key Laboratory of Big Data and Intelligent Robot, Ministry of Education
Pseudocode | Yes | Algorithm 1: PPO Training for DER
Open Source Code | Yes | Code and appendix are available at https://github.com/Fhujinwu/DER.
Open Datasets | Yes | Following the settings of PairRanker [Jiang et al., 2023], we use MixInstruct as the benchmark. In addition, we use GSM8K [Cobbe et al., 2021] and the Multidomain dataset we constructed (see Appendix 2.3) for further evaluation.
Dataset Splits | No | The paper mentions a 'MixInstruct test set', discusses the 'answer route length generated by DER for all samples' in Table 6, and refers to a 'Prompt-Answer dataset D' for training the DER-Agent. However, it does not explicitly specify the training/validation/test splits, percentages, or sample counts needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not provide hardware details such as GPU models, CPU types, or memory sizes used for the experiments; it refers only to 'computational resources' in general terms.
Software Dependencies | No | The paper mentions Proximal Policy Optimization (PPO) [Schulman et al., 2017] for training, BERTScore [Zhang et al., 2019] for evaluation, and OPT-125M [Zhang et al., 2022] for the actor and critic models, but it does not give version numbers for any of these software components, libraries, or frameworks.
Experiment Setup | Yes | We formulate the selection of the sequential execution route of LLMs as a Markov Decision Process (MDP) [Van Otterlo and Wiering, 2012]: ⟨S, A, T, R, π⟩. ... where P(·) is BERTScore, which is commonly used to evaluate the quality of generated text and correlates highly with human judgment [Zhang et al., 2019]. Here ŷ_t is the output answer of the selected LLM M_{a_t}, C(·) is the computation cost of M_{a_t}, ΔP(ŷ) = P(ŷ_t) − P(ŷ_{t−1}) is the increment of the BERTScore of the answer from step t−1 to step t, and α and β are coefficients that weight the computation cost and the score increment, respectively. ... where p_0 is the BERTScore threshold at which the environment ends an episode, T_max is the maximum step size, and γ is the bias for extra rewards or penalties. ... ε is a hyperparameter, usually set to 0.2 [Schulman et al., 2017].
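The reward shaping, termination rule, and PPO clipping quoted above can be sketched in a few lines of Python. This is an illustrative reconstruction from the excerpt only, not the authors' code: the function names, the default values for α, β, p_0, and T_max, and the standalone clipped-surrogate helper are all assumptions.

```python
def step_reward(p_curr, p_prev, cost, alpha=1.0, beta=0.1):
    """Per-step reward: alpha * BERTScore increment minus beta * compute cost.

    p_curr, p_prev: BERTScore P(y_t), P(y_{t-1}) of the answer at steps t and t-1.
    cost: computation cost C(M_{a_t}) of the LLM selected at step t.
    alpha, beta: trade-off coefficients from the paper (default values assumed).
    """
    delta_p = p_curr - p_prev          # increment of BERTScore from t-1 to t
    return alpha * delta_p - beta * cost


def is_done(p_curr, t, p0=0.9, t_max=5):
    """Episode ends when BERTScore reaches threshold p0 or max step T_max is hit."""
    return p_curr >= p0 or t >= t_max


def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A),
    with eps = 0.2 as in the excerpt [Schulman et al., 2017]."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)
```

For example, a step that raises BERTScore by 0.1 at unit cost yields reward α·0.1 − β·1.0, and a policy ratio of 1.5 with positive advantage is clipped to 1 + ε = 1.2 before entering the surrogate loss.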