Text-to-LoRA: Instant Transformer Adaption

Authors: Rujikorn Charakorn, Edoardo Cetin, Yujin Tang, Robert Tjarko Lange

ICML 2025

Reproducibility variables, each listed with its Result and the supporting LLM Response:
Research Type
Result: Experimental
LLM Response: "After training T2L on a suite of 9 pre-trained LoRA adapters (GSM8K, Arc, etc.), we show that the ad-hoc reconstructed LoRA instances match the performance of task-specific adapters across the corresponding test sets. Furthermore, T2L can compress hundreds of LoRA instances and zero-shot generalize to entirely unseen tasks."
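The LoRA adapters discussed above share a simple low-rank structure: the frozen base weight W is adapted by adding a scaled product of two small factors. A minimal plain-Python sketch (dimensions and values are illustrative, not taken from the paper):

```python
# Sketch of applying a LoRA update: W' = W + (alpha / r) * B @ A,
# where B is (d_out, r), A is (r, d_in), and r << min(d_out, d_in).
# Pure-Python matrices (lists of rows); no external libraries assumed.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def apply_lora(W, A, B, alpha, r):
    """Return the LoRA-adapted weight W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Tiny example: d_out = d_in = 2, rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # (2, 1)
A = [[0.5, 0.5]]     # (1, 2)
W_adapted = apply_lora(W, A, B, alpha=1.0, r=1)
```

T2L's contribution is generating the A and B factors directly from a text description of the task, rather than training them per task.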
Researcher Affiliation
Result: Industry
LLM Response: "Sakana AI. Correspondence to: Rujikorn Charakorn <EMAIL>, Robert T. Lange <EMAIL>."
Pseudocode
Result: No
LLM Response: The paper describes its method only in regular paragraph text, without structured formatting, and contains no figure, block, or section explicitly labeled "Pseudocode" or "Algorithm".
Open Source Code
Result: Yes
LLM Response: "Our code is available at https://github.com/SakanaAI/text-to-lora."
Open Datasets
Result: Yes
LLM Response: "We utilize the SNI dataset (Wang et al., 2022) for training LoRA adapters. We use 500 SNI datasets publicly available at https://huggingface.co/Lots-of-LoRAs. For evaluation, we choose 10 widely used benchmarks that collectively cover a variety of LLM capability assessments, e.g., reasoning, math, science, coding, and world knowledge. Specifically, we include the following benchmarks: Arc-challenge (ArcC) and Arc-easy (ArcE) (Clark et al., 2018), BoolQ (Clark et al., 2019), GSM8K (Cobbe et al., 2021), Hellaswag (HS) (Zellers et al., 2019), OpenBookQA (OQA) (Mihaylov et al., 2018), PIQA (Bisk et al., 2020), Winogrande (WG) (Keisuke et al., 2019), HumanEval (HE) (Chen et al., 2021), and MBPP (Austin et al., 2021)."
Dataset Splits
Result: Yes
LLM Response: "We use 11 tasks for hold-out validation and removed 10 datasets due to data contamination from the evaluation benchmark tasks, leaving 479 datasets for training. All samples are in English. More details of the datasets can be found in Appendix J." For evaluation, the paper chooses 10 widely used benchmarks. "We evaluate the models on the test split, using chain-of-thought response pre-filling: Let’s think step by step." (J.1.1 GSM8K) "HumanEval only has the test split, therefore it is always evaluated against in the zero-shot manner." (J.1.2 HUMANEVAL AND MBPP)
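Chain-of-thought response pre-filling, as quoted above, means seeding the assistant turn with a fixed prefix so the model continues with step-by-step reasoning. A sketch of the idea, using generic Mistral-style instruction markers as an assumption (the paper's exact prompt template is not given in this report):

```python
# Illustrative chain-of-thought pre-filling: the assistant's response is
# started for it, so generation continues the reasoning rather than
# answering immediately. The [INST] markers are an assumed template.

COT_PREFIX = "Let's think step by step."

def build_prefilled_prompt(question: str) -> str:
    return (
        f"[INST] {question} [/INST] "  # user turn
        f"{COT_PREFIX}"                # pre-filled start of the assistant turn
    )

prompt = build_prefilled_prompt("A question from the GSM8K test split.")
```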
Hardware Specification
Result: Yes
LLM Response: "All models trained in this work fit in a single H100 GPU (80GB of VRAM)."
Software Dependencies
Result: No
LLM Response: The paper mentions software such as Mistral-7B-Instruct (Jiang et al., 2023) as the base LLM and gte-large-en-v1.5 (Li et al., 2023; Zhang et al., 2024) for task embedding, implying the use of libraries such as PyTorch (from torch.cuda.FloatTensor). However, it does not provide version numbers for these or other key software components (e.g., the Python version or specific library versions).
Experiment Setup
Result: Yes
LLM Response: Table 11: Hyperparameters for training a task-specific LoRA adapter.

Hyperparameter | Task-specific LoRA | T2L (SFT) | T2L (recon)
Batch size | 8 | 8 | Number of the target LoRAs
Gradient accumulation steps | 1 | 1 | 1
Max learning rate | 8e-5 | 2.5e-5 | 1e-3
Max gradient norm | 1.0 | 1.0 | 1.0
NEFTune noise alpha | 5.0 | 5.0 | 5.0
Warmup fraction | 0.1 | 0.1 | 0.1
Learning rate scheduler | Linear with warmup | Linear with warmup | Linear with warmup
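The "linear with warmup" schedule in Table 11 ramps the learning rate from zero to the maximum over the first warmup fraction of steps, then decays it linearly. A minimal sketch using the task-specific LoRA column's values (max lr 8e-5, warmup fraction 0.1); decaying to exactly zero at the end is an assumption, as the table does not state the final value:

```python
# Linear warmup followed by linear decay, per the Table 11 settings.
# `total_steps` is illustrative; the paper's actual step count is not
# given in this report.

def lr_at_step(step, total_steps, max_lr=8e-5, warmup_fraction=0.1):
    """Learning rate at a given 0-indexed step."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Ramp linearly up to max_lr over the warmup phase.
        return max_lr * (step + 1) / warmup_steps
    # Decay linearly toward zero over the remaining steps (assumed).
    remaining = total_steps - warmup_steps
    return max_lr * max(0.0, (total_steps - step) / remaining)

schedule = [lr_at_step(s, total_steps=100) for s in range(100)]
```

With 100 steps and a 0.1 warmup fraction, the rate peaks at 8e-5 around step 9 and declines afterward.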