LensLLM: Unveiling Fine-Tuning Dynamics for LLM Selection
Authors: Xinyue Zeng, Haohui Wang, Junhong Lin, Jun Wu, Tyler Cody, Dawei Zhou
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive empirical results on 3 large-scale benchmarks demonstrate that our model achieves up to 91.1% accuracy and reduces up to 88.5% computational cost in LLM selection, outperforming 5 state-of-the-art methods. We open-source our proposed LENSLLM model and corresponding results at LensLLM.io. |
| Researcher Affiliation | Academia | 1. Department of Computer Science, Virginia Tech, Blacksburg, VA, USA. 2. Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA. 3. Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA. 4. Intelligent Systems Division, Virginia Tech, Blacksburg, VA, USA. |
| Pseudocode | Yes | Algorithm 1 LENSLLM Algorithm |
| Open Source Code | Yes | We open-source our proposed LENSLLM model and corresponding results at LensLLM.io. |
| Open Datasets | Yes | For robust evaluation across various tasks, we experiment with three benchmark datasets: FLAN (Wei et al., 2022a), Wikitext (Merity et al., 2016), and Gigaword (See et al., 2017). All of them are open-sourced on Hugging Face. |
| Dataset Splits | No | To analyze performance scaling, we create smaller datasets by randomly sampling examples ranging from 200 to 1,638,400 (doubling at each step), then fine-tune and evaluate models on a separate test set. The paper describes how datasets are created and used for fine-tuning and evaluation, but does not provide specific percentages or counts for distinct training, validation, and test splits typically needed for full reproducibility of data partitioning. |
| Hardware Specification | Yes | all experiments are conducted on a single NVIDIA A100 GPU with 80GB of memory. |
| Software Dependencies | No | All fine-tuning was performed by using PyTorch and the Hugging Face Transformers library. The paper mentions the use of PyTorch and the Hugging Face Transformers library but does not specify their version numbers. |
| Experiment Setup | Yes | All models are fine-tuned using the AdamW optimizer with a weight decay of 0.01... To characterize the fine-tuning dynamics across different model architectures and data sizes, we estimate B, E, β, t for each model by minimizing the loss function: min over B, E, β, t of Σᵢ [ LSE(log B − log(F(Θ, t) + Dᵢ^β), log E) − log L(Dᵢ) ]... Table 8 (impact of learning rate and batch size on Pearson correlation on FLAN; columns are learning rates 3e-5 / 1e-4 / 3e-4 / 1e-3): batch size 64: 78.36 / 78.41 / 78.40 / 78.39; batch size 128: 78.32 / 78.34 / 78.43 / 78.36; batch size 256: 78.37 / 78.36 / 78.36 / 78.34. |
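The parameter-estimation step quoted in the Experiment Setup row can be sketched as a small least-squares fit. This is a hedged illustration, not the authors' released code: it assumes the fitted form L(D) ≈ B / (F + D^β) + E (whose log is the LSE expression in the objective), treats the model-dependent term F(Θ, t) as a fixed constant `F`, and all function and variable names (`fit_scaling_law`, `objective`) are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def fit_scaling_law(D, L, F=1.0):
    """Estimate (B, E, beta) by least squares on the LSE-form log-loss.

    Assumes log L(D) = LSE(log B - log(F + D**beta), log E), i.e.
    L(D) = B / (F + D**beta) + E. Illustrative sketch only.
    """
    logL = np.log(L)

    def objective(p):
        logB, logE, beta = p
        # np.logaddexp is a numerically stable two-term log-sum-exp.
        pred = np.logaddexp(logB - np.log(F + D**beta), logE)
        return np.sum((pred - logL) ** 2)

    res = minimize(objective, x0=np.array([0.0, -2.0, 0.5]),
                   method="Nelder-Mead",
                   options={"maxiter": 5000, "xatol": 1e-9, "fatol": 1e-12})
    logB, logE, beta = res.x
    return np.exp(logB), np.exp(logE), beta

# Usage on synthetic data: dataset sizes double from 200 to 1,638,400,
# mirroring the sampling schedule quoted in the Dataset Splits row.
D = np.array([200 * 2**k for k in range(14)], dtype=float)
L = 50.0 / (1.0 + D**0.4) + 0.1  # generated from known B=50, E=0.1, beta=0.4
B, E, beta = fit_scaling_law(D, L)
```

Fitting in log space with an LSE keeps the irreducible-loss term E from being swamped by the power-law term at small dataset sizes, which is why scaling-law fits are typically posed this way.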