ERICT: Enhancing Robustness by Identifying Concept Tokens in Zero-Shot Vision Language Models
Authors: Xinpeng Dong, Min Zhang, Didi Zhu, Ye Jun Jian, Zhang Keli, Aimin Zhou, Fei Wu, Kun Kuang
ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that ERICT significantly improves the overall performance, including that of the worst group, and achieves new state-of-the-art results. (Section: Abstract) 6. Experiments |
| Researcher Affiliation | Collaboration | 1Department of Computer Science and Technology, Zhejiang University, Hangzhou, China 2East China Normal University 3Huawei Noah's Ark Lab. |
| Pseudocode | Yes | C. Pseudocode Algorithm 1: Step 1 of ERICT-C Algorithm 2: Step 2 of ERICT-C |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that source code for the described methodology is publicly available or provided in supplementary materials. |
| Open Datasets | Yes | We evaluate our approach on three widely used spurious correlation datasets, including Waterbirds (Sagawa et al., 2019), CelebA (Liu et al., 2015), and UrbanCars (Li et al., 2023b). ImageNet (Deng et al., 2009) is a widely used large-scale vision dataset containing more than 14 million images covering 1,000 categories. |
| Dataset Splits | Yes | For Waterbirds and CelebA, we follow the setting of previous works (Sarridis et al., 2024; Yang et al., 2024; You et al., 2024). |
| Hardware Specification | Yes | All of our experiments are conducted on a single NVIDIA GeForce RTX 4090 GPU. |
| Software Dependencies | No | All the images were generated using the default t-SNE parameters from the scikit-learn package. The scikit-learn mention does not include a version number, and no other software dependencies are listed with version numbers. |
| Experiment Setup | Yes | The temperature parameter controls the sharpness of the similarity score matrix distribution, thereby influencing the mask ratio during the inference phase. For ERICT, we use an auxiliary prompt xa t for every task and get the auxiliary text feature via the text encoder. For ERICT-C, the auxiliary embedding is obtained by aggregating class prompt embeddings. When the dataset contains a large number of classes (e.g., ImageNet), ERICT-C adopts a top-K strategy. |
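The setup row describes temperature-scaled similarity scores between image tokens and an auxiliary text feature, with the temperature shaping how many tokens end up masked. The paper's code is not public, so the following is only a minimal sketch of that mechanism; the function name, the top-ratio selection rule, and the mean-aggregation of class prompts are assumptions, not the authors' implementation.

```python
import numpy as np

def concept_token_mask(token_feats, aux_text_feat, temperature=0.07, keep_ratio=0.5):
    """Hypothetical sketch: score image tokens against an auxiliary text
    feature and keep the fraction most aligned with it.

    token_feats:   (N, D) patch/token features from the image encoder.
    aux_text_feat: (D,) auxiliary text feature from the text encoder.
    """
    # L2-normalise so dot products are cosine similarities.
    t = token_feats / np.linalg.norm(token_feats, axis=1, keepdims=True)
    a = aux_text_feat / np.linalg.norm(aux_text_feat)
    # Temperature sharpens or flattens the similarity distribution, which
    # changes how concentrated the retained-token set is.
    scores = np.exp((t @ a) / temperature)
    probs = scores / scores.sum()
    # Keep the top `keep_ratio` fraction of tokens by score (assumed rule).
    k = max(1, int(round(keep_ratio * len(probs))))
    mask = np.zeros(len(probs), dtype=bool)
    mask[np.argsort(-probs)[:k]] = True
    return mask

# For ERICT-C, the auxiliary embedding aggregates class prompt embeddings;
# a mean is one plausible aggregator (assumption), e.g.:
#   aux_text_feat = class_prompt_embeds.mean(axis=0)
# with a top-K subset of classes used when the label space is large.
```

The temperature here plays the role the quote describes: a small value pushes `probs` toward a few dominant tokens, a large value flattens it, so any probability-threshold variant of the selection rule would mask more or fewer tokens accordingly.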