BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute
Authors: Dujian Ding, Ankur Mallick, Shaokun Zhang, Chi Wang, Daniel Madrigal, Mirian Del Carmen Hipolito Garcia, Menglin Xia, Laks V. S. Lakshmanan, Qingyun Wu, Victor Rühle
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on real-world datasets demonstrate that our method reduces costs by up to 60% with less than 1% performance drop. [...] Experiments on large-scale, real-world datasets (Section 5) demonstrate that our method achieves up to 60% cost reduction with less than 1% performance degradation, significantly improving upon prior routing techniques and contributing toward more efficient LLM service deployment. |
| Researcher Affiliation | Collaboration | 1) University of British Columbia (work performed during internship at Microsoft); 2) Microsoft; 3) Pennsylvania State University; 4) Google (work performed while at Microsoft); 5) University of British Columbia; 6) AG2AI, Inc. |
| Pseudocode | Yes | Algorithm 1 BEST-Route. Input: query q; maximal sample number n; match probability threshold t; proxy reward model R_proxy; models {M1, ..., MK}; reference model M_ref; average output lengths avg_output_length[M]; input and output token prices input_token_price[M], output_token_price[M]. Output: final response. |
| Open Source Code | No | Codes will be released upon acceptance of this work. |
| Open Datasets | Yes | We introduce a large-scale dataset covering diverse tasks, including question answering, coding, and safety evaluation, with examples collected from multiple sources (see Appendix A.1). The dataset consists of 10K instruction examples, split into 8K/1K/1K for training, validation, and testing. We evaluate BEST-Route across 8 popular LLMs: GPT-4o, GPT-3.5-turbo, Llama-3.1-8B, Mistral-7B, Mixtral-8x7B, Phi-3-mini, Phi-3-medium, and Codestral-22B, by generating 20 responses per example. We further perform out-of-distribution (OOD) evaluation of BEST-Route using MT-Bench (Zheng et al., 2023). |
| Dataset Splits | Yes | The dataset consists of 10K instruction examples, split into 8K/1K/1K for training, validation, and testing. |
| Hardware Specification | Yes | All inference experiments are conducted using paid API access from OpenAI, Azure ML, and Mistral AI, while router training and inference are performed on an NVIDIA A100 GPU (80GB RAM). |
| Software Dependencies | No | The paper mentions specific models like "DeBERTa-v3-small" and "OpenAssistant RM, a DeBERTa-v3-large model" but does not provide version numbers for general software dependencies like programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | We train both Multi-Head Router and the proxy reward model with the corresponding loss from Section 4 for 5 epochs and use the validation set to choose the best checkpoints for final evaluation. |
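The Algorithm 1 description quoted above (query, maximal sample number, match probability threshold, proxy reward model, per-model prices and average output lengths) can be illustrated with a minimal routing sketch. This is not the authors' released implementation; the `generate`, `match_prob`, and `r_proxy` callables, the whitespace token count, and the cheapest-first candidate ordering are all hypothetical stand-ins for components the paper trains or queries via API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ModelSpec:
    """Pricing/length metadata plus a (hypothetical) generation callable."""
    name: str
    generate: Callable[[str], str]   # stand-in for an LLM API call
    avg_output_length: float         # average output tokens for this model
    input_token_price: float         # price per input token
    output_token_price: float        # price per output token


def estimated_cost(model: ModelSpec, n_input_tokens: int, n_samples: int) -> float:
    """Expected cost of best-of-n querying: pay for the input once,
    and for n sampled outputs of average length."""
    return (n_input_tokens * model.input_token_price
            + n_samples * model.avg_output_length * model.output_token_price)


def best_route(query: str,
               models: List[ModelSpec],
               m_ref: ModelSpec,
               match_prob: Callable[[str, ModelSpec, int], float],
               r_proxy: Callable[[str, str], float],
               n_max: int,
               t: float) -> str:
    """Sketch of the routing loop: pick the cheapest non-reference model
    whose best-of-n response is predicted (by a router-style estimate
    `match_prob`) to match the reference model with probability >= t;
    rank its n samples with the proxy reward model; otherwise fall back
    to the reference model."""
    n_input = len(query.split())  # crude token-count stand-in
    candidates = sorted(models,
                        key=lambda m: estimated_cost(m, n_input, n_max))
    for model in candidates:
        for n in range(1, n_max + 1):
            if match_prob(query, model, n) >= t:
                samples = [model.generate(query) for _ in range(n)]
                # Proxy reward model selects the best of the n drafts.
                return max(samples, key=lambda s: r_proxy(query, s))
    return m_ref.generate(query)
```

Under this sketch, raising the threshold `t` pushes more queries to the expensive reference model (higher quality, higher cost), while raising `n_max` lets a cheap model close the quality gap through best-of-n sampling at extra output-token cost.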