ERICT: Enhancing Robustness by Identifying Concept Tokens in Zero-Shot Vision Language Models

Authors: Xinpeng Dong, Min Zhang, Didi Zhu, Ye Jun Jian, Zhang Keli, Aimin Zhou, Fei Wu, Kun Kuang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that ERICT significantly improves the overall performance, including that of the worst group, and achieves new state-of-the-art results. (Sections: Abstract; 6. Experiments)
Researcher Affiliation | Collaboration | 1. Department of Computer Science and Technology, Zhejiang University, Hangzhou, China; 2. East China Normal University; 3. Huawei Noah's Ark Lab
Pseudocode | Yes | C. Pseudocode: Algorithm 1 (Step 1 of ERICT-C); Algorithm 2 (Step 2 of ERICT-C)
Open Source Code | No | The paper contains no explicit statement or link indicating that source code for the described methodology is publicly available or provided in supplementary materials.
Open Datasets | Yes | We evaluate our approach on three widely used spurious correlation datasets, including Waterbirds (Sagawa et al., 2019), CelebA (Liu et al., 2015), and UrbanCars (Li et al., 2023b). ImageNet (Deng et al., 2009) is a widely used large-scale vision dataset containing more than 14 million images covering 1,000 categories.
Dataset Splits | Yes | For Waterbirds and CelebA, we follow the setting of previous works (Sarridis et al., 2024; Yang et al., 2024; You et al., 2024).
Hardware Specification | Yes | All of our experiments are conducted on a single NVIDIA GeForce RTX 4090 GPU.
Software Dependencies | No | All the images were generated using the default t-SNE parameters from the scikit-learn package. This mention of the scikit-learn package does not include a specific version number, and no other software dependencies are listed with version numbers.
Experiment Setup | Yes | The temperature parameter controls the sharpness of the similarity-score distribution, thereby influencing the mask ratio during the inference phase. For ERICT, an auxiliary prompt x_t^a is used for every task, and the auxiliary text feature is obtained from the text encoder. For ERICT-C, the auxiliary embedding is obtained by aggregating class prompt embeddings; when the dataset contains a large number of classes (e.g., ImageNet), ERICT-C adopts a top-K strategy.
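The described setup can be illustrated with a minimal sketch. This is not the authors' code: the function names, tensor shapes, and default values (`temperature`, `keep_ratio`, `top_k`) are assumptions chosen to show how a temperature-scaled similarity distribution could drive a token mask, and how class prompt embeddings could be aggregated with a top-K strategy.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def concept_token_mask(token_feats, aux_text_feat, temperature=0.01, keep_ratio=0.5):
    """Hypothetical sketch: score vision tokens against an auxiliary text
    feature and keep the highest-scoring (concept) tokens.

    token_feats:   (N, D) L2-normalized patch-token features
    aux_text_feat: (D,)   L2-normalized auxiliary text feature
    temperature:   smaller values sharpen the similarity distribution,
                   which changes which tokens survive the mask
    """
    sims = token_feats @ aux_text_feat           # cosine similarities, shape (N,)
    probs = softmax(sims / temperature)          # temperature-scaled distribution
    k = max(1, int(keep_ratio * len(probs)))
    keep = np.argsort(probs)[-k:]                # indices of retained concept tokens
    mask = np.zeros(len(probs), dtype=bool)
    mask[keep] = True
    return mask

def aggregate_class_embeddings(class_embs, image_feat, top_k=5):
    """Hypothetical top-K aggregation for an ERICT-C-style auxiliary embedding:
    average the top-K class prompt embeddings most similar to the image feature."""
    sims = class_embs @ image_feat               # (C,) class-image similarities
    top = np.argsort(sims)[-top_k:]              # indices of the K best classes
    agg = class_embs[top].mean(axis=0)
    return agg / np.linalg.norm(agg)             # re-normalize the aggregate
```

The design choice illustrated here is that temperature and keep ratio jointly determine the effective mask: a low temperature concentrates probability mass on a few tokens, so the same `keep_ratio` selects a much more peaked set of candidates.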