Transformer-Squared: Self-adaptive LLMs

Authors: Qi Sun, Edoardo Cetin, Yujin Tang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate SVF and the full Transformer² framework through extensive experiments across a diverse range of LLMs and tasks. First, when trained on domain-specific datasets, we show that SVF consistently outperforms traditional strategies for efficient fine-tuning such as LoRA, and at the same time, with orders of magnitude fewer parameters. Then we show that Transformer² is able to push performance far further, effectively adapting the weights of the base model even in entirely out-of-distribution applications such as visual question answering.
Researcher Affiliation | Collaboration | Qi Sun¹,²*, Edoardo Cetin¹*, Yujin Tang¹*; ¹Sakana AI, Japan; ²Institute of Science Tokyo, Japan; EMAIL; *Equal contribution
Pseudocode | Yes | We illustrate a complete CEM step in the Python pseudocode below.
Open Source Code | Yes | We provide our full source code at https://github.com/SakanaAI/self-adaptive-llms.
Open Datasets | Yes | To validate the generality of Transformer² we consider three pre-trained LLMs ranging across different model families and architecture sizes: LLAMA3-8B-INSTRUCT, MISTRAL-7B-INSTRUCT-V0.3, and LLAMA3-70B-INSTRUCT. For each model, we obtain three sets of SVF-trained z vectors to maximize performance for GSM8K (Cobbe et al., 2021), MBPP-pro (Austin et al., 2021), and ARC-Easy (Clark et al., 2018), respectively. Additionally, we also train a set of z vectors for LLAMA3-8B-INSTRUCT, when applied as the language backbone for TextVQA (Singh et al., 2019), in order to assess SVF's applicability to the vision-language modeling (VLM) domain. We provide SVF's main learning curves on each of these tasks in Figure 4. Finally, we evaluate the full Transformer² adaptation framework on four unseen tasks: MATH (Hendrycks et al., 2021), HumanEval (Chen et al., 2021), ARC-Challenge (Clark et al., 2018), and OKVQA (Marino et al., 2019).
Dataset Splits | No | We divide each dataset to produce equal-sized training and validation splits. We then apply our RL-based approach, optimizing θz with AdamW using a learning rate of 2×10⁻³ with cosine decay, a batch size of 256, and gradient clipping. We employ early stopping and select the best λ (the coefficient of the KL divergence term) based on validation performance.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running its experiments.
Software Dependencies | No | The paper mentions 'Python pseudocode' but does not specify version numbers for Python or any other libraries, frameworks, or software used in the experiments.
Experiment Setup | Yes | We obtain the expert vectors z as the base components in Transformer² by training the SVF fine-tunes with a consistent recipe across the considered training tasks and language models. We divide each dataset to produce equal-sized training and validation splits. We then apply our RL-based approach, optimizing θz with AdamW using a learning rate of 2×10⁻³ with cosine decay, a batch size of 256, and gradient clipping. We employ early stopping and select the best λ (the coefficient of the KL divergence term) based on validation performance. Table 6: Hyper-parameters used for SVF and LoRA training. SVF hyperparameters: initial mean value of z = 0.1; initial variance value of z = 1×10⁻³; global batch size = 256; learning rate = 2×10⁻³; clip max norm = 1×10⁻³; KL coefficient λ ∈ {0.0, 0.1, 0.2, 0.3}. LoRA hyperparameters: rank = 16; LoRA alpha = 32; LoRA dropout = 0.05; global batch size = 256; learning rate ∈ {2×10⁻⁴, 5×10⁻⁴, 2×10⁻⁵, 5×10⁻⁵, 2×10⁻⁶, 5×10⁻⁶}; clip max norm ∈ {1×10⁻³, 1.0}.
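The "orders of magnitude fewer parameters" claim from the abstract can be illustrated with a back-of-the-envelope count (the matrix dimensions below are illustrative assumptions, not figures from the paper): LoRA at rank r adds r·(m+n) parameters per m×n weight matrix, while SVF learns one scale per singular value, i.e. min(m, n) parameters.

```python
def lora_params(m: int, n: int, rank: int = 16) -> int:
    """LoRA adds two low-rank factors A (m x rank) and B (rank x n)."""
    return rank * (m + n)

def svf_params(m: int, n: int) -> int:
    """SVF learns one scale z_i per singular value of an m x n matrix."""
    return min(m, n)

# Illustrative 4096 x 4096 projection, typical of a 7B-8B transformer layer.
m = n = 4096
print(lora_params(m, n))  # 131072 parameters per matrix at rank 16
print(svf_params(m, n))   # 4096 parameters per matrix, 32x fewer here
```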
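The "complete CEM step" mentioned in the Pseudocode row can be sketched in plain Python. This is a generic Cross-Entropy Method update with placeholder population size and elite fraction, not the paper's exact pseudocode:

```python
import random
import statistics

def cem_step(mean, std, fitness, pop_size=32, elite_frac=0.25):
    """One Cross-Entropy Method step: sample, rank by fitness, refit the Gaussian."""
    dim = len(mean)
    # Sample a population from the current diagonal Gaussian.
    population = [
        [random.gauss(mean[i], std[i]) for i in range(dim)]
        for _ in range(pop_size)
    ]
    # Keep the top elite_frac candidates by fitness.
    n_elite = max(1, int(pop_size * elite_frac))
    elites = sorted(population, key=fitness, reverse=True)[:n_elite]
    # Refit mean and std to the elite set.
    new_mean = [statistics.fmean(e[i] for e in elites) for i in range(dim)]
    new_std = [statistics.pstdev([e[i] for e in elites]) for i in range(dim)]
    return new_mean, new_std

# Toy usage: maximize -sum(x^2); the mean should drift toward zero.
random.seed(0)
mean, std = [1.0, -1.0], [0.5, 0.5]
for _ in range(20):
    mean, std = cem_step(mean, std, fitness=lambda x: -sum(v * v for v in x))
```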
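The cosine-decayed learning rate from the training recipe (base rate 2×10⁻³) can be reproduced in a few lines; `total_steps` here is an assumed placeholder, since this excerpt does not state the training length:

```python
import math

def cosine_decay_lr(step: int, total_steps: int,
                    base_lr: float = 2e-3, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate from base_lr down to min_lr."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Starts at 2e-3, halves at the midpoint, and decays smoothly to zero.
print(cosine_decay_lr(0, 1000))     # 0.002
print(cosine_decay_lr(500, 1000))   # 0.001
print(cosine_decay_lr(1000, 1000))  # 0.0
```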