Enhancing Target-unspecific Tasks through a Features Matrix
Authors: Fangming Cui, Yonggang Zhang, Xuan Wang, Xinmei Tian, Jun Yu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Representative evaluations demonstrate that: 1) the FM is compatible with existing frameworks as a generic and flexible module, and 2) the FM significantly enhances target-unspecific tasks (base-to-novel generalization, domain generalization, and cross-dataset generalization), achieving state-of-the-art performance. We evaluate our method on generalization tasks. As evidenced by systematic benchmarking in Figure 1, our framework, when integrated with MaPLe (Khattak et al., 2023a) and PromptSRC (Khattak et al., 2023b), achieves state-of-the-art performance across three critical generalization dimensions (base-to-novel generalization, domain generalization, and cross-dataset generalization) spanning 11 heterogeneous benchmarks. |
| Researcher Affiliation | Collaboration | 1Shanghai Jiao Tong University 2Hong Kong Baptist University 3Meituan Inc. 4University of Science and Technology of China 5Harbin Institute of Technology (Shenzhen). Correspondence to: Jun Yu <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in narrative text and provides conceptual diagrams (e.g., Figure 3), but it does not include a dedicated section or figure with structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that source code is provided or made available, nor does it include a link to a code repository. The closest mention is in the related work section about other tools, but not for the authors' own implementation. |
| Open Datasets | Yes | Datasets. In Table 10, the datasets cover multiple recognition tasks including ImageNet (Deng et al., 2009) and Caltech101 (Fei-Fei et al., 2004), which consist of generic objects; Oxford Pets (Parkhi et al., 2012), Stanford Cars (Krause et al., 2013), Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), and FGVCAircraft (Maji et al., 2013) for fine-grained classification; SUN397 (Xiao et al., 2010) for scene recognition; UCF101 (Soomro et al., 2012) for action recognition; DTD (Cimpoi et al., 2014) for texture classification; and EuroSAT (Helber et al., 2019), which consists of satellite images. We leverage datasets such as ImageNet-A (Hendrycks et al., 2021b), ImageNet-R (Hendrycks et al., 2021a), ImageNet-Sketch (Wang et al., 2019), and ImageNet-V2 (Recht et al., 2019) to assess the model's performance across different domain distributions. |
| Dataset Splits | Yes | In the base-to-novel generalization task, the datasets are divided into base and novel classes. The model is trained on the base classes in a 16-shot setting, and tested on both the base and novel classes across 11 different datasets. The base and novel groups contain the same number of classes, i.e., all classes in a dataset are evenly divided into two groups, with the assignment of classes to groups made at random. We train our model on ImageNet in 16 shots and leverage ImageNet-A, ImageNet-R, ImageNet-Sketch, and ImageNet-V2 to assess the model's performance. We train our model with 16 shots on the ImageNet dataset and test the model on 10 other unseen datasets. |
| Hardware Specification | No | We use an SGD optimizer with a learning rate of 0.0025 on a single GPU. The compute cost analysis is performed using the SUN397 dataset over 10 epochs on a single GPU. |
| Software Dependencies | No | The paper mentions using the CLIP model and ViT architecture, but it does not specify any software libraries (e.g., PyTorch, TensorFlow) or their version numbers, nor the version of the programming language used. |
| Experiment Setup | Yes | Implementation Details. We employ the CLIP (Radford et al., 2021) model based on the ViT-B/16 architecture. For the PromptSRC-based and MaPLe-based frameworks, we set the visual and textual embedding length to 4. We set the easy-to-use module γ to 0.1 and the matching scores (top and low) β to 5. We train for 30 epochs in the base-to-novel setting, using the first 9 transformer layers, and for 20 epochs in the domain generalization and cross-dataset evaluation settings, using the first 3 transformer layers. We use an SGD optimizer with a learning rate of 0.0025 on a single GPU. |
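The base-to-novel split described in the Dataset Splits row (classes divided evenly at random into base and novel halves, with 16-shot training on the base classes) can be sketched as follows. The function names and seed handling are illustrative assumptions; the authors' own code is not released.

```python
import random

def base_novel_split(class_names, seed=0):
    """Randomly divide a dataset's classes into equal base and novel halves,
    as described for the base-to-novel generalization setting."""
    rng = random.Random(seed)
    shuffled = list(class_names)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def few_shot_sample(samples_by_class, base_classes, k=16, seed=0):
    """Draw k training examples per base class (the paper's 16-shot setting)."""
    rng = random.Random(seed)
    return {c: rng.sample(samples_by_class[c], k) for c in base_classes}
```

Training sees only the sampled base-class shots; evaluation then covers both the base and the held-out novel classes.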
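The hyperparameters reported in the Experiment Setup row can be collected into a single configuration sketch, which is useful for reproduction attempts given that no code is released. The field names below are illustrative, not taken from the authors' implementation.

```python
# Hedged reconstruction of the reported training configuration.
# Values come from the paper's Implementation Details; key names are assumed.
CONFIG = {
    "backbone": "CLIP ViT-B/16",
    "embedding_length": 4,       # visual and textual prompt embedding length
    "gamma": 0.1,                # easy-to-use module weight (γ)
    "beta": 5,                   # top/low matching-score parameter (β)
    "optimizer": "SGD",
    "learning_rate": 0.0025,
    "epochs": {
        "base_to_novel": 30,
        "domain_generalization": 20,
        "cross_dataset": 20,
    },
    "prompted_layers": {         # number of leading transformer layers used
        "base_to_novel": 9,
        "domain_generalization": 3,
        "cross_dataset": 3,
    },
}
```

Note that batch size and any learning-rate schedule are not reported, so they remain open parameters for reproduction.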