Sketch to Adapt: Fine-Tunable Sketches for Efficient LLM Adaptation

Authors: Tianyi Zhang, Junda Su, Aditya Desai, Oscar Wu, Zhaozhuo Xu, Anshumali Shrivastava

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our extensive evaluations with Llama and Mistral models demonstrate that SketchTune outperforms leading PEFT methods across diverse tasks while using substantially smaller base models and comparable trainable parameters. As a highlight, SketchTune outperforms LoRA, DoRA, and S2FT on commonsense and math benchmarks using 2.6-3.5x smaller base models and exceeds LoftQ in accuracy by 14.48% on GSM8K with 7.3x fewer trainable parameters.
Researcher Affiliation Collaboration 1Rice University, Houston, TX; 2xMAD.ai; 3University of California, Berkeley, Berkeley, CA; 4Stevens Institute of Technology, Hoboken, NJ; 5ThirdAI Corp.; 6Ken Kennedy Institute.
Pseudocode Yes Algorithm 1 Learning to Sketch LLM Weights
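The report quotes only the title of Algorithm 1, not its body. As a rough, hypothetical illustration of the general idea behind learning a small fine-tunable sketch of model weights (a hash-based, count-sketch-style parameter-sharing scheme; the sizes, names, and fitting rule below are this sketch's own assumptions, not the paper's actual algorithm):

```python
import random

random.seed(0)

d, k = 4096, 256  # d original weights compressed into a k-entry trainable sketch

# Fixed random hash: each weight index maps to one sketch bucket with a sign.
bucket = [random.randrange(k) for _ in range(d)]
sign = [random.choice((-1.0, 1.0)) for _ in range(d)]

def decompress(sketch):
    """Reconstruct an approximate length-d weight vector from the k-entry sketch."""
    return [sign[i] * sketch[bucket[i]] for i in range(d)]

# "Learning to sketch": fit the sketch to approximate given weights W.
# For this simple scheme the per-bucket least-squares optimum is the
# mean of sign * W over the indices hashed to that bucket.
W = [random.gauss(0.0, 1.0) for _ in range(d)]
sums, counts = [0.0] * k, [0] * k
for i in range(d):
    sums[bucket[i]] += sign[i] * W[i]
    counts[bucket[i]] += 1
sketch = [s / c if c else 0.0 for s, c in zip(sums, counts)]

W_hat = decompress(sketch)
mse = sum((w - wh) ** 2 for w, wh in zip(W, W_hat)) / d
```

Fine-tuning would then update only the k sketch entries rather than the d original weights, which is where the trainable-parameter savings come from in this style of method.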
Open Source Code Yes Our code and model checkpoints are available publicly at https://github.com/LeanModels/SketchTune.
Open Datasets Yes For math problem-solving, we fine-tune these models on the Math10K dataset and evaluate on 7 different math reasoning datasets (Hu et al., 2023). For commonsense reasoning, we fine-tune on the Commonsense170K dataset and evaluate on 8 different commonsense reasoning datasets (Hu et al., 2023). To compare SketchTune against efficient quantized model fine-tuning methods, we follow the settings in Li et al. (2023b) to fine-tune and test Llama-2 models on the language modeling dataset WikiText-2 (Merity et al., 2016) and the math reasoning dataset GSM8K (Cobbe et al., 2021).
Dataset Splits Yes The WikiText-2 dataset (Merity et al., 2016) consists of 44.8K examples in total: 36.7K training data, 3.76K validation data, and 4.36K test data. Following LoftQ (Li et al., 2023b), we used the training set to perform fine-tuning and the validation set to evaluate the fine-tuned model's performance.
Hardware Specification Yes We sketch each model using a single Quadro RTX 8000-48GB GPU. All fine-tuning experiments are performed on a single NVIDIA A100-40GB GPU.
Software Dependencies No The paper mentions "PyTorch (Paszke et al., 2019)" and the "Transformers library (Wolf et al., 2020)" but does not provide specific version numbers for these software dependencies.
Experiment Setup Yes We optimize SketchTune's hyperparameters, including learning rate and batch size, through a parameter sweep, and we report the hyperparameters for training in Appendix I. Appendix I contains tables with the per-task hyperparameter selections for fine-tuning SketchTune: learning rate, optimizer, batch size, epochs, LR scheduler, and warmup steps.
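A parameter sweep of this kind can be sketched as a grid search over candidate settings. The grid values and the scoring function below are hypothetical placeholders (the paper's actual per-task choices are in its Appendix I); in practice `evaluate` would fine-tune the model and return a validation metric:

```python
from itertools import product

# Hypothetical sweep grid, not the paper's actual search space.
grid = {
    "lr": [1e-4, 3e-4, 1e-3],
    "batch_size": [16, 32],
}

def evaluate(lr, batch_size):
    # Placeholder for fine-tuning + validation; here a mock score that
    # happens to peak at lr=3e-4 with the smaller batch size.
    return -abs(lr - 3e-4) - 0.001 * batch_size

# Exhaustively score every (lr, batch_size) combination, keep the best.
best = max(
    ({"lr": lr, "batch_size": bs} for lr, bs in product(grid["lr"], grid["batch_size"])),
    key=lambda cfg: evaluate(**cfg),
)
print(best)
```

With real training in `evaluate`, the same loop yields the per-task selections that a report like Appendix I tabulates.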