GraphRouter: A Graph-based Router for LLM Selections

Authors: Tao Feng, Yanzhen Shen, Jiaxuan You

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments across three distinct effect-cost weight scenarios show that GraphRouter substantially surpasses existing routers, delivering a minimum performance improvement of 12.3%. It also generalizes to new LLM settings and supports diverse tasks, with at least a 9.5% boost in effect and a significant reduction in computational demands.
Researcher Affiliation | Academia | Tao Feng, Yanzhen Shen, Jiaxuan You; Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL, USA; EMAIL
Pseudocode | Yes | Algorithm 1: Training of GraphRouter
Open Source Code | Yes | Our code for GraphRouter is released at https://github.com/ulab-uiuc/GraphRouter.
Open Datasets | Yes | Alpaca (Taori et al., 2023) is a hybrid question-answer (QA) dataset containing 52k samples used for fine-tuning the Alpaca model. GSM8K (Cobbe et al., 2021) evaluates a model's ability for multi-step mathematical reasoning with 8.5k linguistically diverse grade school math word problems. SQuAD (Rajpurkar, 2016) is a crowdsourced reading comprehension dataset based on Wikipedia articles. Multi-News (Fabbri et al., 2019) is a benchmark for multi-document summarization. HumanEval (Chen et al., 2021) measures LLMs' coding capabilities. HotpotQA (Yang et al., 2018) is a question-answering dataset with 113k entries featuring natural, multi-hop questions.
Dataset Splits | Yes | The data is divided into training, validation, and test sets in a 70% : 10% : 20% ratio, based on distinct queries.
Hardware Specification | Yes | All the experiments are conducted on a single NVIDIA A100 Tensor Core GPU.
Software Dependencies | No | The paper mentions PyTorch and PyG but does not provide specific version numbers for these software components; it only links to their general websites.
Experiment Setup | Yes | In the training stage, we set the graph neural network as a two-layer graph attention network with a 32-dimensional hidden layer. The batch size is 32, and the maximum number of training epochs is 1000. We use the Adam optimizer (Diederik, 2014) and gradually decay the learning rate from 1e-3 to 0 with a LambdaLR scheduler.
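The query-level 70% : 10% : 20% split described above can be sketched as follows. This is a minimal illustration, not the paper's released code; `split_by_query` and the `seed` parameter are hypothetical names introduced here.

```python
import random

def split_by_query(queries, seed=0):
    """Partition distinct queries into train/val/test sets at a
    70% / 10% / 20% ratio, mirroring the query-level split described
    in the report. Splitting by query (rather than by sample) keeps
    all samples for a given query in the same partition."""
    queries = list(queries)
    random.Random(seed).shuffle(queries)  # deterministic shuffle
    n = len(queries)
    n_train = int(0.7 * n)
    n_val = int(0.1 * n)
    return (queries[:n_train],
            queries[n_train:n_train + n_val],
            queries[n_train + n_val:])
```

For example, splitting 100 query IDs yields partitions of sizes 70, 10, and 20 that together cover every query exactly once.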
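The training configuration in the last row can be summarized as a small sketch. The report does not give the exact lambda function used by the scheduler, so a linear decay from 1e-3 to 0 over the 1000 epochs is assumed here; `CONFIG` and `lr_at_epoch` are illustrative names, not from the paper's code.

```python
# Reported hyperparameters for training GraphRouter.
CONFIG = {
    "gnn": "GAT",          # two-layer graph attention network
    "num_layers": 2,
    "hidden_dim": 32,
    "batch_size": 32,
    "max_epochs": 1000,
    "optimizer": "Adam",
    "base_lr": 1e-3,       # decayed toward 0 by a LambdaLR scheduler
}

def lr_at_epoch(epoch, base_lr=1e-3, max_epochs=1000):
    """Learning rate under an ASSUMED linear LambdaLR schedule:
    lr(t) = base_lr * (1 - t / max_epochs), reaching 0 at the
    final epoch. The paper's actual lambda function is not given."""
    return base_lr * max(0.0, 1.0 - epoch / max_epochs)
```

In PyTorch this decay factor would be passed as the `lr_lambda` argument of `torch.optim.lr_scheduler.LambdaLR`, which multiplies the optimizer's base learning rate by the returned factor at each step.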