StarFT: Robust Fine-tuning of Zero-shot Models via Spuriosity Alignment

Authors: Younghyun Kim, Jongheon Jeong, Sangkyung Kwak, Kyungmin Lee, Juho Lee, Jinwoo Shin

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments validate the robust generalization of StarFT and its emergent properties: zero-shot group robustness and improved zero-shot classification. Notably, StarFT boosts worst-group and average accuracy by 14.30% and 3.02%, respectively, in the Waterbirds group shift scenario, where other robust fine-tuning baselines show even degraded performance. The paper also includes sections such as '4 Experiments', '4.1 Evaluation on domain shifts', '4.2 Evaluation on group shifts', '4.3 Zero-shot classification', '4.4 Transfer learning', and '4.5 Ablation study', all of which describe empirical evaluation with data analysis and performance metrics.
Researcher Affiliation | Collaboration | Younghyun Kim1, Jongheon Jeong2, Sangkyung Kwak3, Kyungmin Lee4, Juho Lee4 and Jinwoo Shin4; 1Samsung, 2Korea University, 3General Robotics, 4KAIST; EMAIL, EMAIL, EMAIL, EMAIL. The affiliations include Samsung and General Robotics (industry) and Korea University and KAIST (academia), indicating a collaboration.
Pseudocode | No | The paper describes the proposed method, StarFT, using mathematical equations (Eq. 1, 2, 3) and descriptive text in Section 3.2 'StarFT: Fine-tuning with Spurious Textual Alignment Regularization'. It also includes Figure 1 'Overview of StarFT', which is a diagram, but there are no structured pseudocode or algorithm blocks present.
Open Source Code | Yes | Code: https://github.com/alinlab/StarFT (Footnote 1 in the 'Contribution' subsection of the '1 Introduction' section provides a direct link to the source code repository).
Open Datasets | Yes | We train StarFT on ImageNet (IN) [Russakovsky et al., 2015], which comprises over a million natural images across 1,000 classes. We then evaluate our fine-tuned models on 4 well-known ImageNet OOD benchmarks: ImageNet-R (IN-R) [Hendrycks et al., 2021a], ImageNet-A (IN-A) [Hendrycks et al., 2021b], ImageNet-Sketch (IN-S) [Wang et al., 2019], and ImageNet-V2 (IN-V2) [Recht et al., 2019]. We further evaluate our ImageNet fine-tuned models on three different group shift benchmarks: Waterbirds [Sagawa et al., 2020], PACS [Li et al., 2017], and CIFAR-10.02 [Zhang and Ré, 2022]. We also conduct zero-shot classification evaluation from the ImageNet fine-tuned models on 4 natural image benchmarks: CIFAR-10 [Krizhevsky, 2009], CIFAR-100 [Krizhevsky, 2009], Caltech101 [Li et al., 2022], and STL10 [Coates et al., 2011]. We use 6 common object classification datasets: Caltech101 [Li et al., 2022], Stanford Cars [Krause et al., 2013], Flowers102 [Nilsback and Zisserman, 2008], ImageNet [Russakovsky et al., 2015], WILDS-iWildCam [Beery et al., 2020; Koh et al., 2021], and WILDS-FMoW [Christie et al., 2018; Koh et al., 2021]. All these datasets are well known and cited, indicating public availability.
Dataset Splits | Yes | OOD datasets are used only for evaluation; the best-performing model is selected based on ID validation set accuracy. The paper implicitly relies on the standard training/validation/test splits of well-established benchmarks such as ImageNet, Waterbirds, PACS, and CIFAR, which is common practice in the field.
Hardware Specification | No | The paper mentions using CLIP ViT-B/16 and ViT-L/14 models and training with specific batch sizes ('batch size of 512 for ImageNet, while all other datasets use a batch size of 256'). However, it does not specify any particular GPU models, CPU types, or other hardware components used to conduct the experiments.
Software Dependencies | No | The paper mentions utilizing 'CLIP' models and the 'AdamW' optimizer, but it does not provide specific version numbers for any programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other relevant libraries.
Experiment Setup | Yes | Throughout experiments, we utilize CLIP [Radford et al., 2021] ViT-B/16 and ViT-L/14 trained on the LAION dataset [Schuhmann et al., 2021] and fine-tune the model using the AdamW [Kingma and Ba, 2017] optimizer with a cosine learning rate scheduler. We train models with a batch size of 512 for ImageNet, while all other datasets use a batch size of 256. In our experiments, we use λStar = 0.5 by default. This hyperparameter λStar is linearly decayed during the course of fine-tuning.