BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute
Authors: Dujian Ding, Ankur Mallick, Shaokun Zhang, Chi Wang, Daniel Madrigal, Mirian Del Carmen Hipolito Garcia, Menglin Xia, Laks V. S. Lakshmanan, Qingyun Wu, Victor Rühle
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on real-world datasets demonstrate that our method reduces costs by up to 60% with less than 1% performance drop. [...] Experiments on large-scale, real-world datasets (Section 5) demonstrate that our method achieves up to 60% cost reduction with less than 1% performance degradation, significantly improving upon prior routing techniques and contributing toward more efficient LLM service deployment. |
| Researcher Affiliation | Collaboration | 1) University of British Columbia (work performed during internship at Microsoft); 2) Microsoft; 3) Pennsylvania State University; 4) Google (work performed while at Microsoft); 5) University of British Columbia; 6) AG2AI, Inc. |
| Pseudocode | Yes | Algorithm 1 BEST-Route. Input: query q; maximal sample number n; match probability threshold t; proxy reward model R_proxy; models {M1, ..., MK}; reference model M_ref; average output lengths avg_output_length[M]; input and output token prices input_token_price[M], output_token_price[M]. Output: final response. |
| Open Source Code | No | Codes will be released upon acceptance of this work. |
| Open Datasets | Yes | We introduce a large-scale dataset covering diverse tasks, including question answering, coding, and safety evaluation, with examples collected from multiple sources (see Appendix A.1). The dataset consists of 10K instruction examples, split into 8K/1K/1K for training, validation, and testing. We evaluate BEST-Route across 8 popular LLMs: GPT-4o, GPT-3.5-turbo, Llama-3.1-8B, Mistral-7B, Mixtral-8x7B, Phi-3-mini, Phi-3-medium, and Codestral-22B, by generating 20 responses per example. We further perform out-of-distribution (OOD) evaluation of BEST-Route using MT-Bench (Zheng et al., 2023). |
| Dataset Splits | Yes | The dataset consists of 10K instruction examples, split into 8K/1K/1K for training, validation, and testing. |
| Hardware Specification | Yes | All inference experiments are conducted using paid API access from OpenAI, Azure ML, and Mistral AI, while router training and inference are performed on an NVIDIA A100 GPU (80GB RAM). |
| Software Dependencies | No | The paper mentions specific models like "DeBERTa-v3-small" and "OpenAssistant RM, a DeBERTa-v3-large model" but does not provide version numbers for general software dependencies like programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | We train both Multi-Head Router and the proxy reward model with the corresponding loss from Section 4 for 5 epochs and use the validation set to choose the best checkpoints for final evaluation. |
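The Algorithm 1 description quoted above (query, maximal sample number, match probability threshold, proxy reward model, per-model prices and average output lengths) can be illustrated with a minimal routing sketch. This is not the authors' released implementation; the `generate`, `match_prob`, and `r_proxy` callables, the whitespace token count, and the cheapest-first candidate ordering are all hypothetical stand-ins for components the paper trains or queries via API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ModelSpec:
    """Pricing/length metadata plus a (hypothetical) generation callable."""
    name: str
    generate: Callable[[str], str]   # stand-in for an LLM API call
    avg_output_length: float         # average output tokens for this model
    input_token_price: float         # price per input token
    output_token_price: float        # price per output token


def estimated_cost(model: ModelSpec, n_input_tokens: int, n_samples: int) -> float:
    """Expected cost of best-of-n querying: pay for the input once,
    and for n sampled outputs of average length."""
    return (n_input_tokens * model.input_token_price
            + n_samples * model.avg_output_length * model.output_token_price)


def best_route(query: str,
               models: List[ModelSpec],
               m_ref: ModelSpec,
               match_prob: Callable[[str, ModelSpec, int], float],
               r_proxy: Callable[[str, str], float],
               n_max: int,
               t: float) -> str:
    """Sketch of the routing loop: pick the cheapest non-reference model
    whose best-of-n response is predicted (by a router-style estimate
    `match_prob`) to match the reference model with probability >= t;
    rank its n samples with the proxy reward model; otherwise fall back
    to the reference model."""
    n_input = len(query.split())  # crude token-count stand-in
    candidates = sorted(models,
                        key=lambda m: estimated_cost(m, n_input, n_max))
    for model in candidates:
        for n in range(1, n_max + 1):
            if match_prob(query, model, n) >= t:
                samples = [model.generate(query) for _ in range(n)]
                # Proxy reward model selects the best of the n drafts.
                return max(samples, key=lambda s: r_proxy(query, s))
    return m_ref.generate(query)
```

Under this sketch, raising the threshold `t` pushes more queries to the expensive reference model (higher quality, higher cost), while raising `n_max` lets a cheap model close the quality gap through best-of-n sampling at extra output-token cost.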