From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection
Authors: Lincan Cai, Jingxuan Kang, Shuang Li, Wenxuan Ma, Binhui Xie, Zhida Qin, Jian Liang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we first introduce the datasets and baselines relevant to our work, along with our implementation details. Then, we validate the effectiveness of ABS on two benchmarks with three different backbones, comprising a total of 10 datasets. Finally, through a series of analytical experiments, including component ablation, parameter sensitivity, and visualization, we showcase the superiority of each module within ABS compared to alternative approaches. |
| Researcher Affiliation | Collaboration | 1Beijing Institute of Technology 2University of Illinois Urbana-Champaign 3Beihang University 4Kuaishou Technology. Correspondence to: Shuang Li <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Attention-Based Selection. Require: input image x ∈ ℝ^{H×W×3}, DINO-sampled patches P = {p_i}_{i=1}^N, crop size bounds α, β ∈ (0, 1). 1: mid_fea = CLIP(x, layer = l−1); 2: for each patch p ∈ P do; 3: sample crop size c_size ~ U(α, β); 4: # Raw Space Selection; 5: x_crop = φ(x, p.center, c_size); 6: f_raw = CLIP(x_crop); 7: raw_crops.append(f_raw); 8: # Feature Space Selection; 9: f_crop = φ(mid_fea, p.center, c_size); 10: f_resize = Interpolate(f_crop); 11: f_fea = CLIP.final_layer(f_resize); 12: fea_crops.append(f_fea); 13: end for; 14: com_fea = Concat(raw_crops ∪ fea_crops) |
| Open Source Code | Yes | Our code is available at https://github.com/BIT-DA/ABS. |
| Open Datasets | Yes | Datasets. In alignment with recent studies (Li et al., 2024), we conduct evaluations across two established benchmarks: (1) out-of-distribution generalization and (2) zero-shot classification. For out-of-distribution generalization, we evaluate our methods on the variants of ImageNet. ImageNet-V2 (Recht et al., 2019) presents a distribution shift that simulates real-world scenarios, while ImageNet-Sketch (Wang et al., 2019) consists of black-and-white sketches that challenge models to recognize objects based on outlines rather than photographic details. ImageNet-A (Hendrycks et al., 2021b) includes naturally occurring images that serve as adversarial examples, testing the robustness of classification models against atypical inputs. Lastly, ImageNet-R (Hendrycks et al., 2021a) features a diverse set of images that vary in style, blurriness, geographic location, and camera operation, aiming to evaluate the adaptability of models to different visual conditions. For the zero-shot classification benchmark, we adhere to the methodology outlined in (Menon & Vondrick, 2022). This benchmark encompasses several datasets, including ImageNet (Deng et al., 2009), a comprehensive object recognition dataset; CUB (Welinder et al.), which focuses on fine-grained bird classification; Oxford Pets (Parkhi et al., 2012), an animal classification dataset; DTD (Cimpoi et al., 2014), a texture recognition dataset; Food101 (Bossard et al., 2014), which contains a diverse range of food images; and Places365 (Zhou et al., 2017), designed for scene classification tasks. |
| Dataset Splits | Yes | For the zero-shot classification benchmark, we adhere to the methodology outlined in (Menon & Vondrick, 2022). This benchmark encompasses several datasets, including ImageNet (Deng et al., 2009), a comprehensive object recognition dataset; CUB (Welinder et al.), which focuses on fine-grained bird classification; Oxford Pets (Parkhi et al., 2012), an animal classification dataset; DTD (Cimpoi et al., 2014), a texture recognition dataset; Food101 (Bossard et al., 2014), which contains a diverse range of food images; and Places365 (Zhou et al., 2017), designed for scene classification tasks. |
| Hardware Specification | Yes | All experiments are performed on an NVIDIA 4090 GPU. |
| Software Dependencies | No | Our experiments are conducted using the CLIP model with various backbones, including ViT-B/32, ViT-B/16, and ViT-L/14. |
| Experiment Setup | Yes | Our method incorporates four key parameters: the crop size lower and upper bounds (α, β), the number of top-importance patches (K), and the number of crops (N). In our study, we maintain consistent parameters across all architectures and datasets. Specifically, we set α = 0.5, β = 0.9, K = 20, N = 60, and M = 50. |
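The pseudocode above can be sketched in Python. This is a minimal, hedged illustration, not the authors' implementation: `clip_encode` is a trivial stand-in for the CLIP encoder, patch centers are passed in directly (the paper samples them from DINO attention maps), and the feature-space interpolation step is omitted. The crop-size bounds default to the paper's α = 0.5, β = 0.9.

```python
import random

def crop(img, center, size_frac):
    """Stand-in for phi: square crop of side size_frac * min(H, W) around center."""
    H, W = len(img), len(img[0])
    side = max(1, int(size_frac * min(H, W)))
    cy, cx = center
    y0 = min(max(cy - side // 2, 0), H - side)
    x0 = min(max(cx - side // 2, 0), W - side)
    return [row[x0:x0 + side] for row in img[y0:y0 + side]]

def clip_encode(img):
    """Placeholder 'encoder': global mean of pixel values (not real CLIP)."""
    vals = [v for row in img for v in row]
    return sum(vals) / len(vals)

def attention_based_selection(img, patch_centers, alpha=0.5, beta=0.9, seed=0):
    """Sketch of Algorithm 1: per-patch crops in raw and feature space,
    concatenated into one feature list."""
    rng = random.Random(seed)
    mid_fea = img  # stand-in for the CLIP layer l-1 feature map
    raw_crops, fea_crops = [], []
    for center in patch_centers:
        csize = rng.uniform(alpha, beta)  # crop size ~ U(alpha, beta)
        # Raw Space Selection: crop the input image, then encode
        raw_crops.append(clip_encode(crop(img, center, csize)))
        # Feature Space Selection: crop the mid-layer features, then encode
        # (Interpolate + final-layer pass are elided in this toy version)
        fea_crops.append(clip_encode(crop(mid_fea, center, csize)))
    return raw_crops + fea_crops  # com_fea = Concat(raw_crops, fea_crops)
```

For example, an 8×8 image with two patch centers yields four features (one raw-space and one feature-space crop per patch), mirroring step 14 of the algorithm.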