Enhance Vision-Language Alignment with Noise
Authors: Sida Huang, Hongyuan Zhang, Xuelong Li
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The evaluation across 11 datasets demonstrates its effectiveness. (Section 4, Experiments) |
| Researcher Affiliation | Collaboration | (1) School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi'an 710072, P. R. China; (2) Institute of Artificial Intelligence (TeleAI), China Telecom, P. R. China; (3) The University of Hong Kong |
| Pseudocode | No | The paper describes the methodology in prose and mathematical formulations but does not contain a distinct pseudocode or algorithm block. |
| Open Source Code | Yes | Code: https://github.com/hyzhang98/PiNI |
| Open Datasets | Yes | To evaluate the performance of PiNI, 11 datasets covering a wide range of visual concepts are selected: two generic object datasets, ImageNet (Deng et al. 2009) and Caltech101 (Fei-Fei, Fergus, and Perona 2004); five fine-grained datasets, Oxford Pets (Parkhi et al. 2012), Stanford Cars (Krause et al. 2013), Flowers102 (Nilsback and Zisserman 2008), Food101 (Bossard, Guillaumin, and Van Gool 2014), and FGVCAircraft (Maji et al. 2013), which contain fine-grained categories of pets, cars, flowers, food, and aircraft, respectively; and four further datasets: the scene recognition dataset SUN397 (Xiao et al. 2010), the action recognition dataset UCF101 (Soomro, Zamir, and Shah 2012), the Describable Textures Dataset DTD (Cimpoi et al. 2014), and the satellite image dataset EuroSAT (Helber et al. 2019). |
| Dataset Splits | Yes | In the few-shot learning experiments, the training set is randomly subsampled to 1, 2, 4, 8, or 16 shots per category. The model is evaluated on the full test set. |
| Hardware Specification | No | The paper does not specify the hardware used for running the experiments. It only mentions model architectures such as ViT-B/16 and RN-50. |
| Software Dependencies | No | The paper does not provide specific software dependencies or their version numbers. |
| Experiment Setup | Yes | The number of noise samples m in Eq. (13) is set to 1. For a fair comparison, the best-performing ViT-B/16 is used as the visual encoder unless otherwise noted, and the baselines are run with their default parameter configurations. |
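The Dataset Splits row describes drawing a fixed number of training examples per category at random (1, 2, 4, 8, or 16 shots). The following is a minimal sketch of that kind of per-class K-shot sampling. It assumes labels arrive as a flat list. `sample_few_shot` is a hypothetical helper written for illustration. It is not taken from the PiNI repository, and it does not reproduce the authors' actual sampling code or seeds.

```python
import random
from collections import defaultdict


def sample_few_shot(labels, shots, seed=0):
    """Hypothetical helper: pick `shots` random indices per class.

    Returns the selected dataset indices in sorted order.
    Raises ValueError if any class has fewer than `shots` examples.
    """
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    selected = []
    for label in sorted(by_class):  # fixed class order keeps sampling deterministic per seed
        pool = by_class[label]
        if len(pool) < shots:
            raise ValueError(f"class {label!r} has only {len(pool)} examples, need {shots}")
        selected.extend(rng.sample(pool, shots))  # sample without replacement within the class
    return sorted(selected)


if __name__ == "__main__":
    toy_labels = ["cat", "dog", "bird"] * 20  # 20 examples per class
    for k in (1, 2, 4, 8, 16):
        subset = sample_few_shot(toy_labels, k, seed=1)
        print(k, len(subset))  # 3 classes, so 3 * k indices
```

Fixing the seed makes each K-shot subset reproducible. The rows above do not state which seeds or how many sampling repetitions were used.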