EmbedLLM: Learning Compact Representations of Large Language Models

Authors: Richard Zhuang, Tianhao Wu, Zhaojin Wen, Andrew Li, Jiantao Jiao, Kannan Ramchandran

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results show that EmbedLLM outperforms prior methods in model routing in both accuracy and latency. Additionally, we demonstrate that our method can forecast a model's performance on multiple benchmarks without incurring additional inference cost. Extensive probing experiments validate that the learned embeddings capture key model characteristics, e.g. whether the model is specialized for coding tasks, even without being explicitly trained on them.
Researcher Affiliation | Academia | Richard Zhuang, Tianhao Wu, Zhaojin Wen, Andrew Li, Jiantao Jiao, Kannan Ramchandran; University of California, Berkeley
Pseudocode | No | The paper includes a section titled "4.3 ALGORITHM" that describes the methodology, but it is presented in prose with mathematical equations rather than as a structured pseudocode block or code-like format with explicit steps or line numbers.
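Since the paper presents its algorithm only in prose, the following is a hedged sketch of the general idea it describes: learning compact model embeddings by factorizing a binary model-question correctness matrix. All sizes, the dot-product scoring rule, the synthetic data, and the plain gradient-descent optimizer here are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_questions, dim = 9, 90, 3  # toy sizes, far smaller than the paper's

# Synthetic correctness matrix: each model masters one of three skills and each
# question tests one skill, giving the matrix clean low-rank structure.
model_skill = np.arange(n_models) % 3
question_skill = np.arange(n_questions) % 3
C = (model_skill[:, None] == question_skill[None, :]).astype(float)

M = 0.1 * rng.standard_normal((n_models, dim))     # learned model embeddings
Q = 0.1 * rng.standard_normal((n_questions, dim))  # learned question embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 1.0
for _ in range(2000):
    P = sigmoid(M @ Q.T)   # predicted P(model i answers question j correctly)
    G = P - C              # gradient of the logistic loss w.r.t. the logits
    M -= lr * (G @ Q) / n_questions
    Q -= lr * (G.T @ M) / n_models

# The learned embeddings should reconstruct the correctness matrix well.
acc = ((sigmoid(M @ Q.T) > 0.5) == C).mean()
print(f"reconstruction accuracy: {acc:.2f}")
```

A predictor of this shape also enables the routing use case: for a new question, score every model embedding against the question embedding and route to the highest-scoring model.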
Open Source Code | Yes | We open source our dataset, code and embedder to facilitate further research and application: https://github.com/richardzhuang0412/EmbedLLM.
Open Datasets | Yes | We open source our dataset, code and embedder to facilitate further research and application: https://github.com/richardzhuang0412/EmbedLLM. We aggregated responses of every model to 36,054 questions from the test sets of MMLU (Hendrycks et al., 2021), TruthfulQA (Lin et al., 2022), Social IQa (Sap et al., 2019), PIQA (Bisk et al., 2019), MedMCQA (Pal et al., 2022), MathQA (Amini et al., 2019), LogiQA (Liu et al., 2020), GSM8K (Cobbe et al., 2021), GPQA (Rein et al., 2023), and ASDiv (Miao et al., 2020).
Dataset Splits | Yes | We performed a random 80%-10%-10% train-validation-test split on the questions and used the sentence transformer all-mpnet-base-v2 (Reimers & Gurevych, 2019) to convert the questions into an initial embedding state of dimension dim_q = 768.
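The reported 80%-10%-10% random split over the 36,054 questions can be sketched as follows; the seed and the exact splitting tool are not specified in the paper, so this is an illustration rather than the authors' script.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed is an assumption; the paper gives none

n_questions = 36054              # total questions aggregated across benchmarks
perm = rng.permutation(n_questions)

n_train = int(0.8 * n_questions)
n_val = int(0.1 * n_questions)

train_idx = perm[:n_train]
val_idx = perm[n_train:n_train + n_val]
test_idx = perm[n_train + n_val:]

# Each question would then be encoded with all-mpnet-base-v2 into its
# 768-dimensional initial embedding (dim_q = 768), e.g. via the
# sentence-transformers package.
print(len(train_idx), len(val_idx), len(test_idx))
```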
Hardware Specification | Yes | On one NVIDIA A100 80GB GPU, it takes on average 3.80 seconds for the EmbedLLM router to route 3,000 questions over 50 repeated trials, which is essentially free compared to the downstream model inference time.
Software Dependencies | No | The paper mentions the lm-evaluation-harness package (Gao et al., 2023) and the sentence transformer all-mpnet-base-v2 (Reimers & Gurevych, 2019), but does not provide specific version numbers for key software components such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions) that would be needed for reproducibility.
Experiment Setup | Yes | We conduct hyperparameter tuning (number of neighbors for KNN, model embedding dimension for EmbedLLM) on a fixed validation set and evaluate prediction accuracy on a fixed test set. Training EmbedLLM on a correctness matrix of around 20,000 questions and 112 models for 50 epochs with batch size of 2,048 costs 107.71 TFLOPs, approximately equivalent to querying a 7B model 60 times using an input of length 128.
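The cost comparison above can be sanity-checked with the standard estimate of roughly 2 × n_params × n_tokens FLOPs per transformer forward pass; this accounting rule is an assumption, since the paper does not spell out how it converted TFLOPs into query counts.

```python
# Rough FLOPs accounting (assumption): one forward pass costs about
# 2 * n_params * n_tokens floating-point operations.
params_7b = 7e9
tokens = 128
flops_per_query = 2 * params_7b * tokens  # ~1.79 TFLOPs per 128-token query

training_cost = 107.71e12                 # reported EmbedLLM training cost
equivalent_queries = training_cost / flops_per_query
print(f"{equivalent_queries:.1f}")        # close to the ~60 queries quoted
```

Under this accounting the reported 107.71 TFLOPs indeed works out to about 60 queries of a 7B model at input length 128, consistent with the quoted claim.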