LensLLM: Unveiling Fine-Tuning Dynamics for LLM Selection

Authors: Xinyue Zeng, Haohui Wang, Junhong Lin, Jun Wu, Tyler Cody, Dawei Zhou

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive empirical results on 3 large-scale benchmarks demonstrate that our model achieves up to 91.1% accuracy and reduces computational cost by up to 88.5% in LLM selection, outperforming 5 state-of-the-art methods. We open-source our proposed LENSLLM model and corresponding results at LensLLM.io.
Researcher Affiliation | Academia | (1) Department of Computer Science, Virginia Tech, Blacksburg, VA, USA; (2) Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA; (3) Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA; (4) Intelligent Systems Division, Virginia Tech, Blacksburg, VA, USA.
Pseudocode | Yes | Algorithm 1: LENSLLM Algorithm
Open Source Code | Yes | We open-source our proposed LENSLLM model and corresponding results at LensLLM.io.
Open Datasets | Yes | For robust evaluation across various tasks, we experiment with three benchmark datasets: FLAN (Wei et al., 2022a), Wikitext (Merity et al., 2016), and Gigaword (See et al., 2017). All of them are open-sourced on Hugging Face.
Dataset Splits | No | To analyze performance scaling, the authors create smaller datasets by randomly sampling between 200 and 1,638,400 examples (doubling at each step), then fine-tune and evaluate models on a separate test set (a sketch of this schedule follows the table). The paper describes how these subsets are created and used for fine-tuning and evaluation, but does not give the specific percentages or counts for distinct training, validation, and test splits needed to fully reproduce the data partitioning.
Hardware Specification | Yes | All experiments are conducted on a single NVIDIA A100 GPU with 80 GB of memory.
Software Dependencies | No | All fine-tuning was performed using PyTorch and the Hugging Face Transformers library. The paper names PyTorch and the Hugging Face Transformers library but does not specify their version numbers.
Experiment Setup | Yes | All models are fine-tuned using the AdamW optimizer with a weight decay of 0.01... To characterize the fine-tuning dynamics across different model architectures and data sizes, we estimate B, E, $\beta$, and t for each model by minimizing the loss function
$\min_{B, E, \beta, t} \sum_i \big[\, \mathrm{LSE}\big(\log B - \log(F(\Theta, t) + D_i^{\beta}),\ \log E\big) - \log L(D_i) \,\big]$
(a fitting sketch follows the table)... Table 8 reports the sensitivity study:

Table 8. Impact of Learning Rate and Batch Size on Pearson Correlation on FLAN

Batch size | LR 3e-5 | LR 1e-4 | LR 3e-4 | LR 1e-3
64  | 78.36 | 78.41 | 78.40 | 78.39
128 | 78.32 | 78.34 | 78.43 | 78.36
256 | 78.37 | 78.36 | 78.36 | 78.34
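
For concreteness, here is a minimal Python sketch of the data-scaling protocol described under Dataset Splits: subset sizes double from 200 to 1,638,400, and each subset is used to fine-tune a model with AdamW at weight decay 0.01 (the only hyperparameter the excerpt pins down). The dataset and model identifiers ("wikitext", "gpt2"), the learning rate, and the training loop itself are placeholders, not the authors' exact configuration.

    # Hypothetical data-scaling loop; dataset/model names and lr are assumptions.
    import torch
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Doubling schedule: 200, 400, ..., 1,638,400 (14 subset sizes).
    sizes = [200 * 2**k for k in range(14)]
    assert sizes[-1] == 1_638_400

    train = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    for n in sizes:
        subset = train.shuffle(seed=0).select(range(n))  # random n-example subsample
        model = AutoModelForCausalLM.from_pretrained("gpt2")  # fresh model per run
        optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
        # ... fine-tune on `subset`, then record the test loss L(D_n) on a held-out set ...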
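
The loss-fitting step in Experiment Setup can likewise be sketched. Assuming the rectified form $L(D) \approx B / (F(\Theta, t) + D^{\beta}) + E$, the log-loss is exactly $\mathrm{LSE}(\log B - \log(F(\Theta, t) + D^{\beta}), \log E)$, which the objective above matches against the measured $\log L(D_i)$. In the sketch below, $F(\Theta, t)$ is treated as a single free scalar per model, and the squared-error loss, Nelder-Mead optimizer, and initialization are assumptions rather than the paper's exact fitting procedure.

    # Sketch only: fit (B, E, beta, F) to measured losses in log space.
    import numpy as np
    from scipy.optimize import minimize

    def fit_scaling_law(D, L):
        """D: subset sizes D_i; L: measured test losses L(D_i)."""
        D = np.asarray(D, dtype=float)
        logL = np.log(np.asarray(L, dtype=float))

        def objective(params):
            logB, logE, beta, logF = params
            # LSE(log B - log(F + D^beta), log E), computed stably via logaddexp
            pred = np.logaddexp(logB - np.log(np.exp(logF) + D**beta), logE)
            return np.sum((pred - logL) ** 2)

        # Initialize E near the smallest observed loss (the irreducible floor).
        x0 = np.array([0.0, logL.min(), 0.5, 0.0])
        res = minimize(objective, x0, method="Nelder-Mead")
        logB, logE, beta, logF = res.x
        return np.exp(logB), np.exp(logE), beta, np.exp(logF)

Once fitted on small subsets, such a curve can be extrapolated to larger D to compare candidate models, which is the selection setting the paper targets.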