StarFT: Robust Fine-tuning of Zero-shot Models via Spuriosity Alignment

Authors: Younghyun Kim, Jongheon Jeong, Sangkyung Kwak, Kyungmin Lee, Juho Lee, Jinwoo Shin

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments validate the robust generalization of StarFT and its emergent properties: zero-shot group robustness and improved zero-shot classification. Notably, StarFT boosts worst-group and average accuracy by 14.30% and 3.02%, respectively, in the Waterbirds group shift scenario, where other robust fine-tuning baselines show even degraded performance. The paper also includes sections such as '4 Experiments', '4.1 Evaluation on domain shifts', '4.2 Evaluation on group shifts', '4.3 Zero-shot classification', '4.4 Transfer learning', and '4.5 Ablation study', all of which describe empirical evaluation with data analysis and performance metrics.
Researcher Affiliation | Collaboration | Younghyun Kim1, Jongheon Jeong2, Sangkyung Kwak3, Kyungmin Lee4, Juho Lee4 and Jinwoo Shin4; 1Samsung, 2Korea University, 3General Robotics, 4KAIST; EMAIL, EMAIL, EMAIL, EMAIL. The affiliations include Samsung and General Robotics (industry) and Korea University and KAIST (academia), indicating a collaboration.
Pseudocode | No | The paper describes the proposed method, StarFT, using mathematical equations (Eq. 1, 2, 3) and descriptive text in Section 3.2 'StarFT: Fine-tuning with Spurious Textual Alignment Regularization'. It also includes Figure 1 'Overview of StarFT', which is a diagram, but there are no structured pseudocode or algorithm blocks present.
Open Source Code | Yes | Code: https://github.com/alinlab/StarFT (Footnote 1 in the 'Contribution' subsection of the '1 Introduction' section provides a direct link to the source code repository).
Open Datasets | Yes | We train StarFT on ImageNet (IN) [Russakovsky et al., 2015], which comprises over a million natural images across 1,000 classes. We then evaluate our fine-tuned models on 4 well-known ImageNet OOD benchmarks: ImageNet-R (IN-R) [Hendrycks et al., 2021a], ImageNet-A (IN-A) [Hendrycks et al., 2021b], ImageNet-Sketch (IN-S) [Wang et al., 2019], and ImageNet-V2 (IN-V2) [Recht et al., 2019]. We further evaluate our ImageNet fine-tuned models on three different group shift benchmarks: Waterbirds [Sagawa et al., 2020], PACS [Li et al., 2017], and CIFAR-10.02 [Zhang and Ré, 2022]. We also conduct zero-shot classification evaluation from the ImageNet fine-tuned models on 4 natural image benchmarks: CIFAR-10 [Krizhevsky, 2009], CIFAR-100 [Krizhevsky, 2009], Caltech101 [Li et al., 2022], and STL10 [Coates et al., 2011]. We use 6 common object classification datasets: Caltech101 [Li et al., 2022], Stanford Cars [Krause et al., 2013], Flowers102 [Nilsback and Zisserman, 2008], ImageNet [Russakovsky et al., 2015], WILDS-iWildCam [Beery et al., 2020; Koh et al., 2021], and WILDS-FMoW [Christie et al., 2018; Koh et al., 2021]. All these datasets are well known and cited, indicating public availability.
Dataset Splits | Yes | OOD datasets are used only for evaluation; the best-performing model is selected based on ID validation set accuracy. The paper implicitly relies on the standard training/validation/test splits of well-established benchmarks such as ImageNet, Waterbirds, PACS, and CIFAR, which is common practice in the field.
Hardware Specification | No | The paper mentions using CLIP ViT-B/16 and ViT-L/14 models and training with specific batch sizes ('batch size of 512 for ImageNet, while all other datasets use a batch size of 256'). However, it does not specify any particular GPU models, CPU types, or other hardware components used to conduct the experiments.
Software Dependencies | No | The paper mentions utilizing 'CLIP' models and the 'AdamW' optimizer, but it does not provide specific version numbers for any programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other relevant libraries.
Experiment Setup | Yes | Throughout experiments, we utilize CLIP [Radford et al., 2021] ViT-B/16 and ViT-L/14 trained on the LAION dataset [Schuhmann et al., 2021] and fine-tune the model using the AdamW [Kingma and Ba, 2017] optimizer with a cosine learning rate scheduler. We train models with a batch size of 512 for ImageNet, while all other datasets use a batch size of 256. In our experiments, we use λStar = 0.5 by default. This hyperparameter λStar is linearly decayed during the course of fine-tuning.