Efficient Dynamic Ensembling for Multiple LLM Experts

Authors: Jinwu Hu, Yufeng Wang, Shuhai Zhang, Kai Zhou, Guohao Chen, Yu Hu, Bin Xiao, Mingkui Tan

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that our method uses fewer computational resources to achieve better performance compared to state-of-the-art baselines.
Researcher Affiliation | Academia | ¹South China University of Technology, ²Pazhou Laboratory, ³Peng Cheng Laboratory, ⁴Hong Kong Polytechnic University, ⁵Chongqing University of Posts and Telecommunications, ⁶Key Laboratory of Big Data and Intelligent Robot, Ministry of Education
Pseudocode | Yes | Algorithm 1: PPO Training for DER
Open Source Code | Yes | Code and appendix are available at https://github.com/Fhujinwu/DER.
Open Datasets | Yes | Following the settings of PairRanker [Jiang et al., 2023], we use MixInstruct as the benchmark. In addition, we use GSM8K [Cobbe et al., 2021] and the Multidomain dataset we constructed (see Appendix 2.3) for further evaluation.
Dataset Splits | No | The paper mentions a 'MixInstruct test set', discusses the 'answer route length generated by DER for all samples' in Table 6, and refers to a 'Prompt-Answer dataset D' for training the DER-Agent. However, it does not explicitly specify the training/validation/test splits, percentages, or sample counts needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not provide hardware details such as GPU models, CPU types, or memory sizes used for the experiments; it refers only to 'computational resources' in general terms.
Software Dependencies | No | The paper mentions Proximal Policy Optimization (PPO) [Schulman et al., 2017] for training, BERTScore [Zhang et al., 2019] for evaluation, and OPT-125M [Zhang et al., 2022] for the actor and critic models, but it does not give version numbers for any of these software components, libraries, or frameworks.
Experiment Setup | Yes | We formulate the selection of the sequential execution route of LLMs as a Markov Decision Process (MDP) [Van Otterlo and Wiering, 2012]: ⟨S, A, T, R, π⟩. ... where P(·) is BERTScore, which is commonly used to evaluate the quality of generated text and correlates highly with human judgment [Zhang et al., 2019]. Here ŷ_t is the output answer of the selected LLM M_{a_t}, C(·) is the computation cost of M_{a_t}, ΔP(ŷ) = P(ŷ_t) − P(ŷ_{t−1}) is the increment of the BERTScore of the answer from step t−1 to step t, and α and β are coefficients that weight the computation cost and the score increment, respectively. ... where p_0 is the BERTScore threshold at which the environment ends an episode, T_max is the maximum step size, and γ is the bias for extra rewards or penalties. ... ε is a hyperparameter, usually set to 0.2 [Schulman et al., 2017].
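The reward shaping, termination rule, and PPO clipping quoted above can be sketched in a few lines of Python. This is an illustrative reconstruction from the excerpt only, not the authors' code: the function names, the default values for α, β, p_0, and T_max, and the standalone clipped-surrogate helper are all assumptions.

```python
def step_reward(p_curr, p_prev, cost, alpha=1.0, beta=0.1):
    """Per-step reward: alpha * BERTScore increment minus beta * compute cost.

    p_curr, p_prev: BERTScore P(y_t), P(y_{t-1}) of the answer at steps t and t-1.
    cost: computation cost C(M_{a_t}) of the LLM selected at step t.
    alpha, beta: trade-off coefficients from the paper (default values assumed).
    """
    delta_p = p_curr - p_prev          # increment of BERTScore from t-1 to t
    return alpha * delta_p - beta * cost


def is_done(p_curr, t, p0=0.9, t_max=5):
    """Episode ends when BERTScore reaches threshold p0 or max step T_max is hit."""
    return p_curr >= p0 or t >= t_max


def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A),
    with eps = 0.2 as in the excerpt [Schulman et al., 2017]."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)
```

For example, a step that raises BERTScore by 0.1 at unit cost yields reward α·0.1 − β·1.0, and a policy ratio of 1.5 with positive advantage is clipped to 1 + ε = 1.2 before entering the surrogate loss.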