Transformer-Squared: Self-adaptive LLMs

Authors: Qi Sun, Edoardo Cetin, Yujin Tang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate SVF and the full Transformer² framework through extensive experiments across a diverse range of LLMs and tasks. First, when trained on domain-specific datasets, we show that SVF consistently outperforms traditional strategies for efficient fine-tuning such as LoRA, and at the same time, with orders of magnitude fewer parameters. Then we show that Transformer² is able to push performance far further, effectively adapting the weights of the base model even in entirely out-of-distribution applications such as visual question answering.
Researcher Affiliation | Collaboration | Qi Sun¹,²*, Edoardo Cetin¹*, Yujin Tang¹*; ¹Sakana AI, Japan; ²Institute of Science Tokyo, Japan; EMAIL; *Equal contribution
Pseudocode | Yes | We illustrate a complete CEM step in the Python pseudocode below.
Open Source Code | Yes | We provide our full source code at https://github.com/SakanaAI/self-adaptive-llms.
Open Datasets | Yes | To validate the generality of Transformer² we consider three pre-trained LLMs ranging across different model families and architecture sizes: LLAMA3-8B-INSTRUCT, MISTRAL-7B-INSTRUCT-V0.3, and LLAMA3-70B-INSTRUCT. For each model, we obtain three sets of SVF-trained z vectors to maximize performance for GSM8K (Cobbe et al., 2021), MBPP-pro (Austin et al., 2021), and ARC-Easy (Clark et al., 2018), respectively. Additionally, we also train a set of z vectors for LLAMA3-8B-INSTRUCT, when applied as the language backbone for TextVQA (Singh et al., 2019), in order to assess SVF's applicability to the vision-language modeling (VLM) domain. We provide SVF's main learning curves on each of these tasks in Figure 4. Finally, we evaluate the full Transformer² adaptation framework on four unseen tasks: MATH (Hendrycks et al., 2021), HumanEval (Chen et al., 2021), ARC-Challenge (Clark et al., 2018), and OKVQA (Marino et al., 2019).
Dataset Splits | No | We divide each dataset to produce equal-sized training and validation splits. We then apply our RL-based approach, optimizing θz with AdamW using a learning rate of 2×10⁻³ with cosine decay, a batch size of 256, and gradient clipping. We employ early stopping and select the best λ (the coefficient of the KL divergence term) based on validation performance.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running its experiments.
Software Dependencies | No | The paper mentions 'Python pseudocode' but does not specify version numbers for Python or any other libraries, frameworks, or software used in the experiments.
Experiment Setup | Yes | We obtain the expert vectors z as the base components in Transformer² by training the SVF fine-tunes with a consistent recipe across the considered training tasks and language models. We divide each dataset to produce equal-sized training and validation splits. We then apply our RL-based approach, optimizing θz with AdamW using a learning rate of 2×10⁻³ with cosine decay, a batch size of 256, and gradient clipping. We employ early stopping and select the best λ (the coefficient of the KL divergence term) based on validation performance. Table 6: Hyper-parameters used for SVF and LoRA training. SVF hyperparameters: initial mean value of z = 0.1; initial variance value of z = 1×10⁻³; global batch size = 256; learning rate = 2×10⁻³; clip max norm = 1×10⁻³; KL coefficient λ ∈ {0.0, 0.1, 0.2, 0.3}. LoRA hyperparameters: rank = 16; LoRA alpha = 32; LoRA dropout = 0.05; global batch size = 256; learning rate ∈ {2×10⁻⁴, 5×10⁻⁴, 2×10⁻⁵, 5×10⁻⁵, 2×10⁻⁶, 5×10⁻⁶}; clip max norm ∈ {1×10⁻³, 1.0}.
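The "orders of magnitude fewer parameters" claim from the abstract can be illustrated with a back-of-the-envelope count (the matrix dimensions below are illustrative assumptions, not figures from the paper): LoRA at rank r adds r·(m+n) parameters per m×n weight matrix, while SVF learns one scale per singular value, i.e. min(m, n) parameters.

```python
def lora_params(m: int, n: int, rank: int = 16) -> int:
    """LoRA adds two low-rank factors A (m x rank) and B (rank x n)."""
    return rank * (m + n)

def svf_params(m: int, n: int) -> int:
    """SVF learns one scale z_i per singular value of an m x n matrix."""
    return min(m, n)

# Illustrative 4096 x 4096 projection, typical of a 7B-8B transformer layer.
m = n = 4096
print(lora_params(m, n))  # 131072 parameters per matrix at rank 16
print(svf_params(m, n))   # 4096 parameters per matrix, 32x fewer here
```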
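The "complete CEM step" mentioned in the Pseudocode row can be sketched in plain Python. This is a generic Cross-Entropy Method update with placeholder population size and elite fraction, not the paper's exact pseudocode:

```python
import random
import statistics

def cem_step(mean, std, fitness, pop_size=32, elite_frac=0.25):
    """One Cross-Entropy Method step: sample, rank by fitness, refit the Gaussian."""
    dim = len(mean)
    # Sample a population from the current diagonal Gaussian.
    population = [
        [random.gauss(mean[i], std[i]) for i in range(dim)]
        for _ in range(pop_size)
    ]
    # Keep the top elite_frac candidates by fitness.
    n_elite = max(1, int(pop_size * elite_frac))
    elites = sorted(population, key=fitness, reverse=True)[:n_elite]
    # Refit mean and std to the elite set.
    new_mean = [statistics.fmean(e[i] for e in elites) for i in range(dim)]
    new_std = [statistics.pstdev([e[i] for e in elites]) for i in range(dim)]
    return new_mean, new_std

# Toy usage: maximize -sum(x^2); the mean should drift toward zero.
random.seed(0)
mean, std = [1.0, -1.0], [0.5, 0.5]
for _ in range(20):
    mean, std = cem_step(mean, std, fitness=lambda x: -sum(v * v for v in x))
```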
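The cosine-decayed learning rate from the training recipe (base rate 2×10⁻³) can be reproduced in a few lines; `total_steps` here is an assumed placeholder, since this excerpt does not state the training length:

```python
import math

def cosine_decay_lr(step: int, total_steps: int,
                    base_lr: float = 2e-3, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate from base_lr down to min_lr."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Starts at 2e-3, halves at the midpoint, and decays smoothly to zero.
print(cosine_decay_lr(0, 1000))     # 0.002
print(cosine_decay_lr(500, 1000))   # 0.001
print(cosine_decay_lr(1000, 1000))  # 0.0
```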