A Unified Approach to Routing and Cascading for LLMs

Authors: Jasper Dekoninck, Maximilian Baader, Martin Vechev

ICML 2025

Reproducibility assessment: each entry below lists the variable, the result, and the supporting LLM response (quoted from the paper where applicable).
Research Type: Experimental. "To address these issues, we first derive a novel optimal strategy for cascading and prove the optimality of an existing routing strategy. Further, we propose cascade routing, a unified framework that integrates routing and cascading into a theoretically optimal strategy. Through our analysis, we identify good quality estimators as the critical factor for the success of model selection paradigms. Finally, in our experiments, we show that cascade routing consistently outperforms the individual approaches by a large margin, and we analyze quality estimators to determine when routing and/or cascading are useful paradigms for model selection."
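The cascading half of this framework can be illustrated with a minimal, hypothetical sketch (the function names, thresholding rule, and toy models below are illustrative assumptions, not the paper's actual algorithm): models are queried cheapest-first, and the cascade stops as soon as a quality estimate clears a confidence threshold.

```python
def cascade(models, query, threshold):
    """Run models cheapest-first; stop once the quality estimate is confident.

    models: list of (run, estimate_quality) pairs, ordered cheap -> expensive.
    Hypothetical interface, for illustration only.
    """
    answer = None
    for run, estimate_quality in models:
        answer = run(query)
        if estimate_quality(query, answer) >= threshold:
            return answer  # confident enough: stop early and save cost
    return answer  # otherwise keep the last (strongest) model's answer

# Toy stand-ins: a weak model that is only confident on short queries,
# and a strong model that is always confident.
weak = (lambda q: f"weak:{q}", lambda q, a: 0.9 if len(q) < 5 else 0.2)
strong = (lambda q: f"strong:{q}", lambda q, a: 0.95)

print(cascade([weak, strong], "hi", threshold=0.8))            # weak answer suffices
print(cascade([weak, strong], "a long query", threshold=0.8))  # escalates to strong
```

Routing, by contrast, commits to a single model up front; cascade routing generalizes both by re-deciding which models may still be queried after each answer.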
Researcher Affiliation: Academia. "Department of Computer Science, ETH Zurich, Switzerland. Correspondence to: Jasper Dekoninck <EMAIL>."
Pseudocode: Yes. Algorithm 1 (Optimal Routing), Algorithm 2 (Optimal Cascading), and Algorithm 3 (Optimal Cascade Routing).
Open Source Code: Yes. Code available at https://github.com/eth-sri/cascade-routing.
Open Datasets: Yes. "We evaluate cascade routing on a range of tasks, demonstrating that it significantly outperforms both routing and cascading. Notably, cascade routing consistently outperforms other methods, improving performance by up to 8% on the RouterBench benchmark (Hu et al., 2024) and by 14% on the SWE-Bench benchmark (Jimenez et al., 2024). Further, we show that our new cascading strategy outperforms existing cascades in several scenarios by over 10%. RouterBench (Hu et al., 2024) is a benchmark developed to evaluate the efficacy of different model selection strategies. It includes questions from seven diverse benchmarks, such as MMLU (Hendrycks et al., 2021), GSM8k (Cobbe et al., 2021), and MBPP (Austin et al., 2021). We therefore use SWE-Bench (Jimenez et al., 2024) as a benchmark where accurate post-hoc quality estimation is available. We use the Math and Coder models from the Qwen2.5 model family (Yang et al., 2024; Hui et al., 2024) and evaluate them on a combination of Minerva Math (Lewkowycz et al., 2022) and LiveCodeBench (Jain et al., 2024). The classification benchmarks include ARC-Challenge (Clark et al., 2018), MMLU-Pro (Wang et al., 2024), and MixEval (Ni et al., 2024). For open-form reasoning tasks, we use MMLU-Pro and GSM8k (Cobbe et al., 2021)."
Dataset Splits: Yes. "We use 5% of the RouterBench data (around 2000 samples) to optimize the hyperparameters of cascading, routing, and cascade routing. The remaining 95% is used for evaluation. For the SWE-Bench benchmark, we use its verified data split and divide the dataset into training and calibration subsets, each comprising 50% of the data. For the Minerva Math and LiveCodeBench benchmarks, we only include the Algebra portion of Minerva Math to ensure that both benchmarks have a comparable number of samples for evaluation. Similarly, we also perform a 50% split of this dataset into training and calibration sets. We split each dataset in each benchmark into a training set and a test set, each comprising 50% of the data. For all datasets except GSM8k, the training set is created by splitting the original test data. In the case of GSM8k, since a separate training set is already available, we use this pre-existing training data, leaving the original test set unchanged. The training set is then further divided, with 50% used for training quality and cost estimators and the remaining 50% reserved for hyperparameter optimization through validation."
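The nested splits described above are simple to reproduce; here is a minimal sketch (the helper name `split` and the seeding scheme are illustrative choices, not taken from the paper):

```python
import random

def split(items, fraction, seed=0):
    """Reproducible two-way split; the first part gets ~fraction of the items."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed so the split is repeatable
    k = round(fraction * len(items))
    return items[:k], items[k:]

data = list(range(1000))
train, test = split(data, 0.5)            # 50/50 train/test split per dataset
estim, hyper = split(train, 0.5, seed=1)  # train halved again: estimator fitting vs. validation
print(len(train), len(test), len(estim), len(hyper))  # 500 500 250 250
```

Seeding the shuffle keeps the split deterministic across runs, which matters when the same partition must feed both the estimator-training and hyperparameter-validation stages.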
Hardware Specification: No. The paper discusses various models and benchmarks but does not specify the hardware (e.g., GPU/CPU models, memory) used to run the experiments. It only mentions using API-based prices per token for cost estimation, which implies cloud inference but gives no hardware specifics.
Software Dependencies: No. The paper mentions using a logistic regression model and a linear regression model for estimation, and the LM Evaluation Harness (Gao et al., 2024) for evaluation, but does not provide version numbers for any software libraries, frameworks, or programming languages.
Experiment Setup: Yes. "For a given cost budget B, there exists a λ ∈ ℝ+ and a γ ∈ [0, 1] such that the optimal routing strategy s_opt equals γ·s_λ^min + (1 − γ)·s_λ^max. In App. A, we show how to obtain the optimal λ and γ for a cost budget B using a validation dataset D. In Algorithm 1, we provide pseudocode for the optimal routing algorithm. To determine these parameters, we estimate the cost of a strategy using a validation dataset D that is representative of the query distribution X. We then perform a hyperparameter search to find optimal values of λ and γ."
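Read literally, the quoted theorem says the optimal router randomizes between two tie-breaking rules for argmax_i (q_i − λ·c_i): a strategy that picks the cheapest maximizer and one that picks the most expensive, mixed with probability γ. A hypothetical sketch under that reading (the function name and interface are mine, not the paper's code):

```python
import random

def route(qualities, costs, lam, gamma, rng=random.random):
    """Pick a model index per the mixed strategy gamma*s_min + (1-gamma)*s_max.

    Sketch under the stated reading of the theorem; illustrative only.
    """
    scores = [q - lam * c for q, c in zip(qualities, costs)]
    best = max(scores)
    tied = [i for i, s in enumerate(scores) if abs(s - best) < 1e-12]
    s_min = min(tied, key=lambda i: costs[i])  # cheapest maximizing model
    s_max = max(tied, key=lambda i: costs[i])  # most expensive maximizing model
    return s_min if rng() < gamma else s_max

# Two models tied at lambda = 1: both score 0.6 - 0.1 = 0.8 - 0.3 = 0.5.
print(route([0.6, 0.8], [0.1, 0.3], lam=1.0, gamma=1.0))  # 0 (always the cheap model)
print(route([0.6, 0.8], [0.1, 0.3], lam=1.0, gamma=0.0))  # 1 (always the expensive model)
```

Sweeping λ trades quality against cost, and γ interpolates the expected cost between the two tie-breaking extremes, which is what lets a validation-set search hit a given budget B.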