LensLLM: Unveiling Fine-Tuning Dynamics for LLM Selection

Authors: Xinyue Zeng, Haohui Wang, Junhong Lin, Jun Wu, Tyler Cody, Dawei Zhou

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive empirical results on 3 large-scale benchmarks demonstrate that our model achieves up to 91.1% accuracy and reduces computational cost by up to 88.5% in LLM selection, outperforming 5 state-of-the-art methods. We open-source our proposed LENSLLM model and corresponding results at LensLLM.io.
Researcher Affiliation | Academia | (1) Department of Computer Science, Virginia Tech, Blacksburg, VA, USA; (2) Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA; (3) Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA; (4) Intelligent Systems Division, Virginia Tech, Blacksburg, VA, USA.
Pseudocode | Yes | Algorithm 1: LENSLLM Algorithm
Open Source Code | Yes | We open-source our proposed LENSLLM model and corresponding results at LensLLM.io.
Open Datasets | Yes | For robust evaluation across various tasks, we experiment with three benchmark datasets: FLAN (Wei et al., 2022a), Wikitext (Merity et al., 2016), and Gigaword (See et al., 2017). All of them are open-sourced on Hugging Face.
Dataset Splits | No | To analyze performance scaling, the authors create smaller datasets by randomly sampling between 200 and 1,638,400 examples (doubling at each step), then fine-tune and evaluate models on a separate test set (a sketch of this schedule follows the table). The paper describes how these subsets are created and used for fine-tuning and evaluation, but does not give the specific percentages or counts for distinct training, validation, and test splits needed to fully reproduce the data partitioning.
Hardware Specification | Yes | All experiments are conducted on a single NVIDIA A100 GPU with 80 GB of memory.
Software Dependencies | No | All fine-tuning was performed using PyTorch and the Hugging Face Transformers library. The paper names PyTorch and the Hugging Face Transformers library but does not specify their version numbers.
Experiment Setup | Yes | All models are fine-tuned using the AdamW optimizer with a weight decay of 0.01... To characterize the fine-tuning dynamics across different model architectures and data sizes, we estimate B, E, $\beta$, and t for each model by minimizing the loss function
$\min_{B, E, \beta, t} \sum_i \big[\, \mathrm{LSE}\big(\log B - \log(F(\Theta, t) + D_i^{\beta}),\ \log E\big) - \log L(D_i) \,\big]$
(a fitting sketch follows the table)... Table 8 reports the sensitivity study:

Table 8. Impact of Learning Rate and Batch Size on Pearson Correlation on FLAN

Batch size | LR 3e-5 | LR 1e-4 | LR 3e-4 | LR 1e-3
64  | 78.36 | 78.41 | 78.40 | 78.39
128 | 78.32 | 78.34 | 78.43 | 78.36
256 | 78.37 | 78.36 | 78.36 | 78.34
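
For concreteness, here is a minimal Python sketch of the data-scaling protocol described under Dataset Splits: subset sizes double from 200 to 1,638,400, and each subset is used to fine-tune a model with AdamW at weight decay 0.01 (the only hyperparameter the excerpt pins down). The dataset and model identifiers ("wikitext", "gpt2"), the learning rate, and the training loop itself are placeholders, not the authors' exact configuration.

    # Hypothetical data-scaling loop; dataset/model names and lr are assumptions.
    import torch
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Doubling schedule: 200, 400, ..., 1,638,400 (14 subset sizes).
    sizes = [200 * 2**k for k in range(14)]
    assert sizes[-1] == 1_638_400

    train = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    for n in sizes:
        subset = train.shuffle(seed=0).select(range(n))  # random n-example subsample
        model = AutoModelForCausalLM.from_pretrained("gpt2")  # fresh model per run
        optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
        # ... fine-tune on `subset`, then record the test loss L(D_n) on a held-out set ...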
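
The loss-fitting step in Experiment Setup can likewise be sketched. Assuming the rectified form $L(D) \approx B / (F(\Theta, t) + D^{\beta}) + E$, the log-loss is exactly $\mathrm{LSE}(\log B - \log(F(\Theta, t) + D^{\beta}), \log E)$, which the objective above matches against the measured $\log L(D_i)$. In the sketch below, $F(\Theta, t)$ is treated as a single free scalar per model, and the squared-error loss, Nelder-Mead optimizer, and initialization are assumptions rather than the paper's exact fitting procedure.

    # Sketch only: fit (B, E, beta, F) to measured losses in log space.
    import numpy as np
    from scipy.optimize import minimize

    def fit_scaling_law(D, L):
        """D: subset sizes D_i; L: measured test losses L(D_i)."""
        D = np.asarray(D, dtype=float)
        logL = np.log(np.asarray(L, dtype=float))

        def objective(params):
            logB, logE, beta, logF = params
            # LSE(log B - log(F + D^beta), log E), computed stably via logaddexp
            pred = np.logaddexp(logB - np.log(np.exp(logF) + D**beta), logE)
            return np.sum((pred - logL) ** 2)

        # Initialize E near the smallest observed loss (the irreducible floor).
        x0 = np.array([0.0, logL.min(), 0.5, 0.0])
        res = minimize(objective, x0, method="Nelder-Mead")
        logB, logE, beta, logF = res.x
        return np.exp(logB), np.exp(logE), beta, np.exp(logF)

Once fitted on small subsets, such a curve can be extrapolated to larger D to compare candidate models, which is the selection setting the paper targets.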