Transformer-Squared: Self-adaptive LLMs
Authors: Qi Sun, Edoardo Cetin, Yujin Tang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SVF and the full Transformer² framework through extensive experiments across a diverse range of LLMs and tasks. First, when trained on domain-specific datasets, we show that SVF consistently outperforms traditional strategies for efficient fine-tuning such as LoRA, and at the same time, with orders of magnitude fewer parameters. Then we show that Transformer² is able to push performance far further, effectively adapting the weights of the base model even in entirely out-of-distribution applications such as visual question answering. |
| Researcher Affiliation | Collaboration | Qi Sun¹²*, Edoardo Cetin¹*, Yujin Tang¹*; ¹Sakana AI, Japan; ²Institute of Science Tokyo, Japan; *Equal contribution |
| Pseudocode | Yes | We illustrate a complete CEM step in the Python pseudocode below. |
| Open Source Code | Yes | We provide our full source code at https://github.com/SakanaAI/self-adaptive-llms. |
| Open Datasets | Yes | To validate the generality of Transformer² we consider three pre-trained LLMs ranging across different model families and architecture sizes: LLAMA3-8B-INSTRUCT, MISTRAL-7B-INSTRUCT-V0.3, and LLAMA3-70B-INSTRUCT. For each model, we obtain three sets of SVF-trained z vectors to maximize performance for GSM8K (Cobbe et al., 2021), MBPP-pro (Austin et al., 2021), and ARC-Easy (Clark et al., 2018), respectively. Additionally, we also train a set of z vectors for LLAMA3-8B-INSTRUCT, when applied as the language backbone for TextVQA (Singh et al., 2019), in order to assess SVF's applicability to the vision-language modeling (VLM) domain. We provide SVF's main learning curves on each of these tasks in Figure 4. Finally, we evaluate the full Transformer² adaptation framework on four unseen tasks: MATH (Hendrycks et al., 2021), HumanEval (Chen et al., 2021), ARC-Challenge (Clark et al., 2018), and OKVQA (Marino et al., 2019). |
| Dataset Splits | No | We divide each dataset to produce equal-sized training and validation splits. We then apply our RL-based approach, optimizing θz with AdamW using a learning rate of 2×10⁻³ with cosine decay, a batch size of 256, and gradient clipping. We employ early stopping and select the best λ (the coefficient of the KL divergence term) based on validation performance. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running its experiments. |
| Software Dependencies | No | The paper mentions 'Python pseudocode' but does not specify version numbers for Python or any other libraries, frameworks, or software used in the experiments. |
| Experiment Setup | Yes | We obtain the expert vectors z as the base components in Transformer² by training the SVF finetunes with a consistent recipe across the considered training tasks and language models. We divide each dataset to produce equal-sized training and validation splits. We then apply our RL-based approach, optimizing θz with AdamW using a learning rate of 2×10⁻³ with cosine decay, a batch size of 256, and gradient clipping. We employ early stopping and select the best λ (the coefficient of the KL divergence term) based on validation performance. Table 6: Hyper-parameters used for SVF and LoRA training. SVF hyperparameters: initial mean value of z 0.1; initial variance value of z 1×10⁻³; global batch size 256; learning rate 2×10⁻³; clip max norm 1×10⁻³; KL coefficient λ ∈ {0.0, 0.1, 0.2, 0.3}. LoRA hyperparameters: rank 16; LoRA alpha 32; LoRA dropout 0.05; global batch size 256; learning rate ∈ {2×10⁻⁴, 5×10⁻⁴, 2×10⁻⁵, 5×10⁻⁵, 2×10⁻⁶, 5×10⁻⁶}; clip max norm ∈ {1×10⁻³, 1.0}. |
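The parameter-efficiency claim quoted above (SVF matching LoRA with "orders of magnitude fewer parameters") follows from SVF's reparameterization: rather than learning a low-rank additive update, it learns a single vector z that rescales the singular values of a frozen weight matrix. The following is an illustrative NumPy sketch of that idea, not the authors' code; the function name `svf_adapt` is our own.

```python
import numpy as np

# Hypothetical sketch of SVF's reparameterization: a learned vector z
# rescales the singular values of a frozen base weight matrix W.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))

# One-time SVD of the frozen base weight: W = U diag(s) V^T.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

def svf_adapt(z):
    """Return the adapted weight W' = U diag(s * z) V^T."""
    return (U * (s * z)) @ Vt

# With z = 1 the base weight is recovered exactly.
W_base = svf_adapt(np.ones_like(s))
assert np.allclose(W_base, W)

# A learned z has only min(m, n) entries per matrix, versus
# r * (m + n) for a rank-r LoRA update on the same matrix.
z = np.full_like(s, 0.9)
W_adapted = svf_adapt(z)
```

Because z has one entry per singular value, an 8×6 matrix here needs only 6 trainable parameters, compared with 16 × (8 + 6) = 224 for the rank-16 LoRA configuration in Table 6.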
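The Pseudocode row notes that the paper illustrates a complete CEM step in Python pseudocode (used at inference time to search for a good combination of expert z vectors). As a hedged sketch of what one cross-entropy-method step looks like in general, with a toy objective standing in for actual task performance (all names here are illustrative, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def score(z):
    # Toy stand-in for task performance; the real method would
    # evaluate the adapted LLM on a few sample prompts.
    # This surrogate peaks at z = 0.5 in every dimension.
    return -np.sum((z - 0.5) ** 2)

def cem_step(mean, var, pop_size=32, elite_frac=0.25):
    # Sample a population of candidate z vectors from a Gaussian,
    # keep the top-scoring elites, and refit mean/variance to them.
    samples = rng.normal(mean, np.sqrt(var), size=(pop_size, mean.shape[0]))
    scores = np.array([score(z) for z in samples])
    elites = samples[np.argsort(scores)[-int(pop_size * elite_frac):]]
    return elites.mean(axis=0), elites.var(axis=0)

# Iterating the step concentrates the distribution on high-scoring z.
mean, var = np.zeros(4), np.ones(4)
for _ in range(20):
    mean, var = cem_step(mean, var)
```

After a few iterations the mean converges toward the maximizer of the toy objective, mirroring how iterated CEM steps would concentrate on a well-performing mixture of expert vectors.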