RouteLLM: Learning to Route LLMs from Preference Data

Authors: Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M. Waleed Kadous, Ion Stoica

ICLR 2025

Reproducibility variables, results, and supporting LLM-response excerpts:
Research Type: Experimental
  "Evaluations on public benchmarks show that our approach can reduce costs by over 2 times without sacrificing response quality. Moreover, our routers exhibit strong generalization capabilities, maintaining performance even when routing between LLMs not included in training. This highlights the potential of our framework to deliver cost-effective, high-performance LLM solutions."
Researcher Affiliation: Collaboration
  UC Berkeley, Anyscale, Canva
Pseudocode: No
  The paper describes the routing approaches (similarity-weighted ranking, matrix factorization, BERT classifier, causal LLM classifier) using mathematical equations and textual descriptions, but it does not include any clearly labeled pseudocode or algorithm blocks.
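Since the paper itself gives no pseudocode, the following is only a minimal sketch of one of the listed approaches, matrix factorization, under assumptions: a router scores a query embedding against learned per-model latent vectors and sends the query to the strong model when the predicted win probability clears a cost/quality threshold. All function and parameter names here (`mf_win_probability`, `route`, `threshold`) are hypothetical, not the authors' API.

```python
import math

def sigmoid(x):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def mf_win_probability(query_vec, strong_vec, weak_vec, bias=0.0):
    """Hypothetical matrix-factorization scorer: the probability that the
    strong model beats the weak model on this query is modeled as a
    logistic function of the dot product between the query embedding and
    the difference of the two models' learned latent vectors."""
    score = sum(q * (s - w) for q, s, w in zip(query_vec, strong_vec, weak_vec))
    return sigmoid(score + bias)

def route(query_vec, strong_vec, weak_vec, threshold=0.5):
    """Route to the strong model only when its predicted win probability
    exceeds the threshold; otherwise use the cheaper weak model."""
    p = mf_win_probability(query_vec, strong_vec, weak_vec)
    return "strong" if p >= threshold else "weak"
```

Raising `threshold` trades quality for cost: more queries fall through to the weak model, which is the cost/quality dial the paper's evaluation sweeps over.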
Open Source Code: Yes
  "We open source our framework for training, serving, and evaluating LLM routers, allowing users to easily train their own routers and compare router performance across benchmarks."
Open Datasets: Yes
  "Our primary source for preference data is the 80k battles from the online Chatbot Arena platform (Chiang et al., 2024)... We augment our training data with labeled datasets of the form D_gold = {(q, l_g, l_{s,w}) | q ∈ Q, l_g ∈ R, l_{s,w} ∈ L}, where a golden label l_g is the known correct answer, e.g. in multiple-choice questions. Specifically, we use the validation split of the MMLU multiple choice benchmark (Hendrycks et al., 2020)... Fortunately, the Nectar dataset (Zhu et al., 2023) offers a wide variety of queries with corresponding model responses... We evaluate our routers on three widely-used academic benchmarks: MMLU (Hendrycks et al., 2020) consisting of 14,042 questions across 57 subjects, MT Bench (Zheng et al., 2023) with 160 open-ended questions using LLM-as-a-judge, and GSM8K (Cobbe et al., 2021) with over 1,000 grade school math problems."
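The quote defines golden-labeled data D_gold, where a known correct answer l_g stands in for a human preference vote. One plausible way to turn such golden labels into pairwise preference labels, sketched here as an assumption rather than the paper's exact recipe, is to check each model's answer against the answer key:

```python
def preference_from_golden(golden, strong_answer, weak_answer):
    """Derive a pairwise preference label from a golden label l_g
    (e.g. an MMLU answer key). Hypothetical helper: the paper defines
    D_gold but this exact conversion rule is an illustration."""
    strong_correct = strong_answer == golden
    weak_correct = weak_answer == golden
    if strong_correct and not weak_correct:
        return "strong_wins"
    if weak_correct and not strong_correct:
        return "weak_wins"
    return "tie"  # both correct or both wrong: no preference signal
```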
Dataset Splits: Yes
  "As mentioned in Sec. 4.1, we primarily use the 80K Chatbot Arena data D_arena for training our models, but hold out 5k samples for validation. We prune all prompt samples shorter than 16 characters, resulting in 65k pairwise comparisons between 64 different models. ... Specifically, we use the validation split of the MMLU multiple choice benchmark (Hendrycks et al., 2020)..."
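The preprocessing described in that quote (prune prompts shorter than 16 characters, then hold out 5k samples for validation) could be sketched as follows; the `"prompt"` field name and shuffling seed are assumptions, not details from the paper:

```python
import random

def prepare_arena_split(battles, holdout=5000, min_prompt_len=16, seed=0):
    """Prune short prompts, then hold out a fixed validation slice,
    mirroring the preprocessing described in the report.
    Returns (train, validation)."""
    kept = [b for b in battles if len(b["prompt"]) >= min_prompt_len]
    rng = random.Random(seed)
    rng.shuffle(kept)
    return kept[holdout:], kept[:holdout]
```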
Hardware Specification: Yes
  "We train the model on an 8GB GPU for 10 epochs... We train the model on 2x L4 24GB GPUs for 2000 steps... We train the model on 8x A100 80GB GPUs for 2000 steps... For routers that use GPUs, namely matrix factorization and the classifier methods, we utilize Google Cloud's g2-standard-4 VM containing a single NVIDIA L4 GPU. For similarity-weighted ranking, we use Google Cloud's CPU-only n2-standard-8 VM."
Software Dependencies: Yes
  "We use a BERT-base architecture (Devlin et al., 2018)... We finally expand the capacity of our router by parameterizing it with Llama 3 8B (AI@Meta, 2024b)... For both the matrix factorization router and the SW ranking router, we use OpenAI's embedding model text-embedding-3-small to embed the input query."
Experiment Setup: Yes
  "We train the model on an 8GB GPU for 10 epochs, using batch size 64 and the Adam optimizer (Kingma & Ba, 2017) with learning rate 3e-4 and weight decay 1e-5. ... We train the model on 2x L4 24GB GPUs for 2000 steps using a batch size of 16, maximum sequence length of 512, learning rate of 1e-5 and a weight decay of 0.01. ... We train the model on 8x A100 80GB GPUs for 2000 steps using a batch size of 8, maximum sequence length of 2048, and a learning rate of 1e-6."
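The three quoted training setups can be collected in one place for comparison. The grouping below is an illustration, not the authors' config format, and mapping each setup to a specific router (matrix factorization on the 8GB GPU, the BERT classifier on the L4s, the causal LLM classifier on the A100s) is inferred from the hardware scale described, so it should be treated as an assumption:

```python
# Hyperparameters as quoted in the report; the dict layout and the
# router-to-setup mapping are illustrative assumptions.
TRAINING_CONFIGS = {
    "matrix_factorization": {
        "hardware": "1x 8GB GPU",
        "epochs": 10,
        "batch_size": 64,
        "optimizer": "Adam",
        "learning_rate": 3e-4,
        "weight_decay": 1e-5,
    },
    "bert_classifier": {
        "hardware": "2x NVIDIA L4 24GB",
        "steps": 2000,
        "batch_size": 16,
        "max_seq_len": 512,
        "learning_rate": 1e-5,
        "weight_decay": 0.01,
    },
    "causal_llm_classifier": {
        "hardware": "8x NVIDIA A100 80GB",
        "steps": 2000,
        "batch_size": 8,
        "max_seq_len": 2048,
        "learning_rate": 1e-6,
    },
}
```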