From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection
Authors: Lincan Cai, Jingxuan Kang, Shuang Li, Wenxuan Ma, Binhui Xie, Zhida Qin, Jian Liang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we first introduce the datasets and baselines relevant to our work, along with our implementation details. Then, we validate the effectiveness of ABS on two benchmarks with three different backbones, comprising a total of 10 datasets. Finally, through a series of analytical experiments, including component ablation, parameter sensitivity, and visualization, we showcase the superiority of each module within ABS compared to alternative approaches. |
| Researcher Affiliation | Collaboration | 1Beijing Institute of Technology 2University of Illinois Urbana-Champaign 3Beihang University 4Kuaishou Technology. Correspondence to: Shuang Li <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Attention-Based Selection. Require: input image x ∈ ℝ^{H×W×3}, DINO-sampled patches P = {p_i}_{i=1}^N, crop size bounds α, β ∈ (0, 1). 1: mid_fea = CLIP(x, layer = l−1); 2: for each patch p ∈ P do; 3: sample crop size c_size ~ U(α, β); 4: # Raw Space Selection; 5: x_crop = φ(x, p.center, c_size); 6: f_raw = CLIP(x_crop); 7: raw_crops.append(f_raw); 8: # Feature Space Selection; 9: f_crop = φ(mid_fea, p.center, c_size); 10: f_resize = Interpolate(f_crop); 11: f_fea = CLIP.final_layer(f_resize); 12: fea_crops.append(f_fea); 13: end for; 14: com_fea = Concat(raw_crops ∪ fea_crops) |
| Open Source Code | Yes | Our code is available at https://github.com/BIT-DA/ABS. |
| Open Datasets | Yes | Datasets. In alignment with recent studies (Li et al., 2024), we conduct evaluations across two established benchmarks: (1) out-of-distribution generalization and (2) zero-shot classification. For out-of-distribution generalization, we evaluate our methods on the variants of ImageNet. ImageNet-V2 (Recht et al., 2019) presents a distribution shift that simulates real-world scenarios, while ImageNet-Sketch (Wang et al., 2019) consists of black-and-white sketches that challenge models to recognize objects based on outlines rather than photographic details. ImageNet-A (Hendrycks et al., 2021b) includes naturally occurring images that serve as adversarial examples, testing the robustness of classification models against atypical inputs. Lastly, ImageNet-R (Hendrycks et al., 2021a) features a diverse set of images that vary in style, blurriness, geographic location, and camera operation, aiming to evaluate the adaptability of models to different visual conditions. For the zero-shot classification benchmark, we adhere to the methodology outlined in (Menon & Vondrick, 2022). This benchmark encompasses several datasets, including ImageNet (Deng et al., 2009), a comprehensive object recognition dataset; CUB (Welinder et al.), which focuses on fine-grained bird classification; Oxford Pets (Parkhi et al., 2012), an animal classification dataset; DTD (Cimpoi et al., 2014), a texture recognition dataset; Food101 (Bossard et al., 2014), which contains a diverse range of food images; and Places365 (Zhou et al., 2017), designed for scene classification tasks. |
| Dataset Splits | Yes | For the zero-shot classification benchmark, we adhere to the methodology outlined in (Menon & Vondrick, 2022). This benchmark encompasses several datasets, including ImageNet (Deng et al., 2009), a comprehensive object recognition dataset; CUB (Welinder et al.), which focuses on fine-grained bird classification; Oxford Pets (Parkhi et al., 2012), an animal classification dataset; DTD (Cimpoi et al., 2014), a texture recognition dataset; Food101 (Bossard et al., 2014), which contains a diverse range of food images; and Places365 (Zhou et al., 2017), designed for scene classification tasks. |
| Hardware Specification | Yes | All experiments are performed on an NVIDIA 4090 GPU. |
| Software Dependencies | No | Our experiments are conducted using the CLIP model with various backbones, including ViT-B/32, ViT-B/16, and ViT-L/14. |
| Experiment Setup | Yes | Our method incorporates four key parameters: the crop size lower and upper bounds (α, β), the number of top-importance patches (K), and the number of crops (N). In our study, we maintain consistent parameters across all architectures and datasets. Specifically, we set α = 0.5, β = 0.9, K = 20, N = 60, and M = 50. |
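The pseudocode above can be sketched in Python. This is a minimal, hedged illustration, not the authors' implementation: `clip_encode` is a trivial stand-in for the CLIP encoder, patch centers are passed in directly (the paper samples them from DINO attention maps), and the feature-space interpolation step is omitted. The crop-size bounds default to the paper's α = 0.5, β = 0.9.

```python
import random

def crop(img, center, size_frac):
    """Stand-in for phi: square crop of side size_frac * min(H, W) around center."""
    H, W = len(img), len(img[0])
    side = max(1, int(size_frac * min(H, W)))
    cy, cx = center
    y0 = min(max(cy - side // 2, 0), H - side)
    x0 = min(max(cx - side // 2, 0), W - side)
    return [row[x0:x0 + side] for row in img[y0:y0 + side]]

def clip_encode(img):
    """Placeholder 'encoder': global mean of pixel values (not real CLIP)."""
    vals = [v for row in img for v in row]
    return sum(vals) / len(vals)

def attention_based_selection(img, patch_centers, alpha=0.5, beta=0.9, seed=0):
    """Sketch of Algorithm 1: per-patch crops in raw and feature space,
    concatenated into one feature list."""
    rng = random.Random(seed)
    mid_fea = img  # stand-in for the CLIP layer l-1 feature map
    raw_crops, fea_crops = [], []
    for center in patch_centers:
        csize = rng.uniform(alpha, beta)  # crop size ~ U(alpha, beta)
        # Raw Space Selection: crop the input image, then encode
        raw_crops.append(clip_encode(crop(img, center, csize)))
        # Feature Space Selection: crop the mid-layer features, then encode
        # (Interpolate + final-layer pass are elided in this toy version)
        fea_crops.append(clip_encode(crop(mid_fea, center, csize)))
    return raw_crops + fea_crops  # com_fea = Concat(raw_crops, fea_crops)
```

For example, an 8×8 image with two patch centers yields four features (one raw-space and one feature-space crop per patch), mirroring step 14 of the algorithm.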