LiFT: Learning to Fine-Tune via Bayesian Parameter Efficient Meta Fine-Tuning
Authors: Minyoung Kim, Timothy Hospedales
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of LiFT on NLP and vision multi-task meta learning benchmarks. On a range of NLP and vision tasks, our experimental results show that LiFT outperforms both classic meta-learning methods (MAML (Finn et al., 2017a), Reptile (Nichol et al., 2018), etc.) and previous Bayesian meta-learning algorithms (e.g., BMAML (Yoon et al., 2018) and ABML (Ravi & Beatson, 2019a)) by a large margin. Our meta learning approach also exhibits comparable or often superior performance to recent library-based approaches, including LoRA-Retriever (Zhao et al., 2024b) and the model-based PEFT clustering method (Ostapenko et al., 2024) on cross-task benchmarks. We test our LiFT algorithm on three benchmark datasets from NLP and vision for the cross-task PEFT transfer learning problem: i) the CrossFit text-to-text generation NLP problem; ii) the VTAB image classification/prediction vision problem; and iii) the Shakespeare next-word prediction NLP problem. The results are summarized in Table 1, Table 2, and Table 3. |
| Researcher Affiliation | Collaboration | Minyoung Kim¹ & Timothy M. Hospedales¹,² — ¹Samsung AI Center Cambridge, UK; ²University of Edinburgh, UK |
| Pseudocode | Yes | SGLD-Gibbs Sampling. More specifically, each SGLD-Gibbs step consists of: ϕ ← ϕ + (η/2)·(∇_ϕ log p(ϕ) + J) + √η·z_ϕ (9); θᵃᵢ ← θᵃᵢ + (η/2)·∇_{θᵃᵢ}(log p(θᵃᵢ\|ϕ) + log p(Dᵢ\|θᵃᵢ)) + √η·z_{θᵃᵢ} (10); J ← J − ∇_ϕ log p(θᵃ⁽ᵒˡᵈ⁾ᵢ\|ϕ) + ∇_ϕ log p(θᵃ⁽ⁿᵉʷ⁾ᵢ\|ϕ) (11). ... Upon observing a posterior sample ϕ⁽ᵐ⁾, our online EM algorithm updates the parameters of the mixture by the following equations (detailed derivations can be found in Appendix B): (E-step) Compute the following component assignment probabilities at the current mixture: q_j = α_j·N(ϕ⁽ᵐ⁾; μ_j, Σ_j) / Σ_{j′=1..K} α_{j′}·N(ϕ⁽ᵐ⁾; μ_{j′}, Σ_{j′}) for j = 1, …, K (13). (M-step) Update the mixture parameters as follows (for j = 1, …, K): α_j ← (n_j + q_j)/(1 + Σ_{j′} n_{j′}), μ_j ← (n_j μ_j + q_j ϕ⁽ᵐ⁾)/(n_j + q_j), S_j ← (n_j S_j + q_j ϕ⁽ᵐ⁾ϕ⁽ᵐ⁾ᵀ)/(n_j + q_j), Σ_j = S_j − μ_j μ_jᵀ (followed by the update n_j ← n_j + q_j) (14). |
| Open Source Code | No | The paper does not provide a direct link to a code repository, an explicit statement of code release, or mention of code in supplementary materials for the methodology described in this paper. |
| Open Datasets | Yes | We test our LiFT algorithm on three benchmark datasets from NLP and vision for the cross-task PEFT transfer learning problem: i) the CrossFit text-to-text generation NLP problem; ii) the VTAB image classification/prediction vision problem; and iii) the Shakespeare next-word prediction NLP problem. Following the seminal work from Ye et al. (2021) on the cross-task NLP benchmark... Next we test our LiFT on the cross-task vision problem formed using the VTAB-1K benchmark (Zhai et al., 2019)... From the LEAF benchmark (Caldas et al., 2019), we collect Shakespeare play lines... |
| Dataset Splits | Yes | There are 160 NLP tasks in the text-to-text format. Each task consists of train/dev/test sets, where the train and dev sets contain 32 examples each, and the test set has the rest. ... Each of the 19 datasets consists of 1K training examples, and we use the conventional splits of 80% train and 20% validation. ... For each split, we randomly draw 60%/10%/30% and 80%/10%/10% train/dev/test task splits. |
| Hardware Specification | Yes | All experiments are conducted on a single V100 GPU. ... We ran all methods on a single A100 80GB GPU, where i-MAML (and its mixtures) and the BMAML baselines incurred out-of-memory issues. Our LiFT model runs well, and as shown we have large improvements over the competing methods. ... More than 5 iterations in MAML incurred out-of-memory issues on a V100 GPU. |
| Software Dependencies | No | The paper mentions software components like "Hugging Face", the "BART-base checkpoint", and an "ImageNet-pre-trained ViT-B/16", but does not specify their version numbers or other key software dependencies with specific versions. |
| Experiment Setup | Yes | For the scale hyperparameters in our LiFT model, we always use σ = 0.01 (scale of the prior p(ϕ)) and β = 0.01 (scale of p(θᵢ\|ϕ)). ... The learning rates are 10⁻³ for the NLP tasks and 10⁻⁴ for VTAB. We use batch size 16 with 20K steps for the NLP tasks, while we take batch size 128 and 10K steps for the vision task. For the meta-learning baselines (e.g., MAML, its variants, and Reptile), we use similar hyperparameters, but the learning rates are adjusted for numerical stability. The inner-loop learning rates are typically chosen to be 5 times the outer learning rate. |
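The SGLD-Gibbs pseudocode quoted in the table (Eqs. 9–10) is built on the standard stochastic gradient Langevin dynamics update, x ← x + (η/2)·∇ log p(x) + √η·z with z ~ N(0, I). A minimal NumPy sketch of that generic update follows; the exact-gradient toy target and all names here are illustrative, not the paper's implementation (the paper interleaves updates of ϕ, the θᵢ, and the correction term J):

```python
import numpy as np

def sgld_step(x, grad_log_p, eta, rng):
    """One SGLD step: x <- x + (eta/2) * grad log p(x) + sqrt(eta) * z.
    Here grad_log_p is exact; in practice it is a minibatch estimate."""
    z = rng.standard_normal(x.shape)
    return x + 0.5 * eta * grad_log_p(x) + np.sqrt(eta) * z

# Toy check: sample from a standard normal, p(x) ∝ exp(-x²/2),
# so grad log p(x) = -x.
rng = np.random.default_rng(0)
x = np.zeros(1)
chain = []
for _ in range(20000):
    x = sgld_step(x, lambda v: -v, eta=0.1, rng=rng)
    chain.append(x[0])
chain = np.asarray(chain[2000:])  # drop burn-in
```

With a small step size η the chain's empirical mean and variance approach those of the target (0 and roughly 1; the discretization inflates the variance slightly).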
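The online EM updates quoted in the Pseudocode cell (Eqs. 13–14) can be sketched directly from the stated formulas. The following is a minimal NumPy version, assuming full covariances; the jitter term and the pseudo-count initialization are stabilizing assumptions of this sketch, not taken from the paper:

```python
import numpy as np

def online_em_step(phi, alpha, mu, S, n):
    """One online EM update for a K-component Gaussian mixture (Eqs. 13-14).
    phi: new posterior sample, shape (d,)
    alpha: (K,) weights; mu: (K, d) means; S: (K, d, d) second moments
    E[phi phi^T]; n: (K,) accumulated pseudo-counts."""
    K, d = mu.shape
    # E-step: responsibility of each component for the new sample (Eq. 13)
    q = np.empty(K)
    for j in range(K):
        Sigma = S[j] - np.outer(mu[j], mu[j]) + 1e-6 * np.eye(d)  # jitter (assumption)
        diff = phi - mu[j]
        _, logdet = np.linalg.slogdet(Sigma)
        logpdf = -0.5 * (diff @ np.linalg.solve(Sigma, diff)
                         + logdet + d * np.log(2 * np.pi))
        q[j] = alpha[j] * np.exp(logpdf)
    q /= q.sum()
    # M-step: count-weighted moving averages of the parameters (Eq. 14)
    alpha = (n + q) / (1.0 + n.sum())
    for j in range(K):
        mu[j] = (n[j] * mu[j] + q[j] * phi) / (n[j] + q[j])
        S[j] = (n[j] * S[j] + q[j] * np.outer(phi, phi)) / (n[j] + q[j])
    n = n + q
    return alpha, mu, S, n

# Tiny demo: feed a stream of samples into a K=2 mixture in d=3.
rng = np.random.default_rng(0)
K, d = 2, 3
alpha = np.full(K, 1.0 / K)
mu = rng.normal(size=(K, d))
S = np.stack([np.eye(d) + np.outer(m, m) for m in mu])  # so Σ_j starts at I
n = np.ones(K)  # pseudo-count initialization (assumption)
for _ in range(50):
    alpha, mu, S, n = online_em_step(rng.normal(size=d), alpha, mu, S, n)
```

Note that the weights stay normalized by construction (Σⱼ qⱼ = 1, so Σⱼ(nⱼ + qⱼ) = 1 + Σⱼ nⱼ), and Σⱼ = Sⱼ − μⱼμⱼᵀ remains positive semidefinite because each update is a convex combination plus a rank-one spread term.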