Canonical Rank Adaptation: An Efficient Fine-Tuning Strategy for Vision Transformers

Authors: Lokesh Veeramacheneni, Moritz Wolter, Hilde Kuehne, Juergen Gall

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, CaRA outperforms existing Parameter-Efficient Fine-Tuning (PEFT) methods on visual classification benchmarks such as the Visual Task Adaptation Benchmark (VTAB-1k) and the Fine-Grained Visual Categorization (FGVC) benchmark.
Researcher Affiliation | Collaboration | 1University of Bonn, 2Tuebingen AI Center, 3MIT-IBM Watson AI Lab, 4Lamarr Institute for Machine Learning and Artificial Intelligence. Correspondence to: Lokesh Veeramacheneni <EMAIL>.
Pseudocode | No | The paper provides mathematical derivations for the gradients in Section 3.3 and Appendix A, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The source code is available at https://github.com/BonnBytes/CaRA.
Open Datasets | Yes | To evaluate the performance of CaRA, we follow the experimental setup from (Jia et al., 2022) and benchmark on all VTAB-1k datasets (Zhai et al., 2019). FGVC is a collection of five large datasets: CUB-200-2011, NABirds, Oxford Flowers, Stanford Dogs, and Stanford Cars. Following Kopiczko et al. (2024), we fine-tune on CIFAR100, Food101, Flowers102, and Resisc45 using 10 randomly sampled training examples per class.
Dataset Splits | Yes | CaRA is trained on a subset of 1000 samples with an 80-20 split for training and validation, while the original test set is used for evaluation. The validation split is done with statistics from (Jia et al., 2022) with seed 0. Following Kopiczko et al. (2024), we fine-tune on CIFAR100, Food101, Flowers102, and Resisc45 using 10 randomly sampled training examples per class. Evaluation is performed on the CIFAR100, Food101, and Flowers102 test sets, and on the remaining samples for Resisc45. Further implementation and hyperparameter details are provided in Section C.4. ... We use the numpy random choice with seed 6 for sampling to ensure reproducibility.
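The seeded few-shot sampling the report quotes (10 training examples per class, drawn with numpy's random choice under seed 6) can be sketched as below. This is a minimal illustrative sketch, not the authors' code: the function name, label-array layout, and per-class loop are assumptions; only the seeded `numpy.random.choice` call and the 10-shot/seed-6 settings come from the report.

```python
import numpy as np

def sample_few_shot_indices(labels, shots_per_class=10, seed=6):
    """Pick `shots_per_class` example indices for each class label.

    Illustrative sketch: seeds numpy's global RNG (seed 6, as reported)
    and draws without replacement via numpy.random.choice.
    """
    np.random.seed(seed)
    labels = np.asarray(labels)
    picked = []
    for c in np.unique(labels):
        class_idx = np.flatnonzero(labels == c)
        picked.append(
            np.random.choice(class_idx, size=shots_per_class, replace=False)
        )
    return np.concatenate(picked)

# Example: a toy 2-class label array yields 10 distinct indices per class.
toy_labels = [0] * 20 + [1] * 20
few_shot_idx = sample_few_shot_indices(toy_labels)
```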
Hardware Specification | Yes | We fine-tuned the ViT on one Nvidia GA100 GPU for the VTAB-1k benchmark and one Nvidia H100 GPU for the FGVC benchmark. For evaluation, we use an Nvidia RTX A5000. In the case of language experiments, we use a maximum of 8 Nvidia GA100 GPUs for fine-tuning and evaluation.
Software Dependencies | No | The paper mentions software like PyTorch (Paszke et al., 2017), TensorLy (Kossaifi et al., 2019), and PyTorch Image Models (Wightman, 2019) but does not provide specific version numbers for these software components.
Experiment Setup | Yes | We present the hyperparameters, such as rank, in Table 8 of the Appendix, and additional information on the datasets is further provided in Section C.2 of the Appendix. CaRA is trained with rank 32 across all the datasets. Section C.3 of the Appendix provides more details about hyperparameters. Table 8: Hyperparameter details for the VTAB-1k benchmark using the ViT-Base model. The standard deviation (std) is computed over 10 runs. Table 9: Hyperparameter details for the FGVC benchmark using the ViT-Base model. Table 10: Hyperparameter details for four image classification datasets using the ViT-Large model. The standard deviation (std) is computed over five runs.