Context-aware Attentional Pooling (CAP) for Fine-grained Visual Classification

Authors: Ardhendu Behera, Zachary Wharton, Pradeep R P G Hewage, Asish Bera

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach using six state-of-the-art (SotA) backbone networks and eight benchmark datasets. Our method significantly outperforms the SotA approaches on six datasets and is very competitive with the remaining two.
Researcher Affiliation | Academia | Ardhendu Behera, Zachary Wharton, Pradeep R P G Hewage and Asish Bera, Department of Computer Science, Edge Hill University, St Helen Road, Lancashire, United Kingdom, L39 4QP. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the approach in text and figures, but does not include explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://ardhendubehera.github.io/cap/.
Open Datasets | Yes | We comprehensively evaluate our model on eight widely used benchmark FGVC datasets: Aircraft (Maji et al. 2013), Food-101 (Bossard, Guillaumin, and Gool 2014), Stanford Cars (Krause et al. 2013), Stanford Dogs (Khosla et al. 2011), Caltech Birds (CUB-200) (Wah et al. 2011), Oxford Flower (Nilsback and Zisserman 2008), Oxford-IIIT Pets (Parkhi et al. 2012), and NABirds (Van Horn et al. 2015).
Dataset Splits | Yes | Statistics of datasets and their train/test splits are shown in Table 1. We use the top-1 accuracy (%) for evaluation. Experimental settings: in all our experiments, we resize images to 256 × 256, apply the data augmentation techniques of random rotation (±15 degrees) and random scaling (1 ± 0.15), and then random cropping to select the final size of 224 × 224 from 256 × 256.
Hardware Specification | Yes | The model is trained for 150 epochs using an NVIDIA Titan V GPU (12 GB).
Software Dependencies | No | We use Keras+TensorFlow to implement our algorithm. The paper does not specify version numbers for these software dependencies.
Experiment Setup | Yes | In all our experiments, we resize images to 256 × 256, apply the data augmentation techniques of random rotation (±15 degrees) and random scaling (1 ± 0.15), and then random cropping to select the final size of 224 × 224 from 256 × 256. We set the cluster size to 32 in our learnable pooling approach. We apply the Stochastic Gradient Descent (SGD) optimizer to optimize the categorical cross-entropy loss function. The SGD is initialized with a momentum of 0.99 and an initial learning rate of 1e-4, which is multiplied by 0.1 after every 50 epochs. The model is trained for 150 epochs using an NVIDIA Titan V GPU (12 GB).
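The reported training recipe amounts to a step-decay learning-rate schedule plus a fixed augmentation pipeline. The sketch below is not the authors' released code; it is a minimal pure-Python illustration of the quoted hyperparameters (function and dictionary names are our own, chosen for clarity):

```python
def step_decay_lr(epoch, base_lr=1e-4, drop=0.1, epochs_per_drop=50):
    """Schedule quoted above: start at 1e-4 and multiply
    the learning rate by 0.1 after every 50 epochs."""
    return base_lr * (drop ** (epoch // epochs_per_drop))

# Quoted augmentation settings, expressed as an illustrative config dict
# (key names are our own, not from the released code).
augmentation = {
    "resize": (256, 256),        # initial resize of every image
    "random_rotation_deg": 15,   # rotation sampled in +/- 15 degrees
    "random_scale": 0.15,        # scaling sampled in 1 +/- 0.15
    "random_crop": (224, 224),   # final crop taken from the 256 x 256 image
}

# Distinct learning rates over the 150-epoch run: 1e-4, then 1e-5, then 1e-6.
rates = sorted({step_decay_lr(e) for e in range(150)}, reverse=True)
```

With this schedule the 150-epoch run splits into three equal phases, one per learning rate, matching the "multiplied by 0.1 after every 50 epochs" statement.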