Image Captioning using Facial Expression and Attention
Authors: Omid Mohamad Nezami, Mark Dras, Stephen Wan, Cecile Paris
JAIR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare a comprehensive collection of image captioning models with and without facial features using all standard evaluation metrics. The evaluation metrics indicate that applying facial features with an attention mechanism achieves the best performance, showing more expressive and more correlated image captions, on an image caption dataset extracted from the standard Flickr 30K dataset, consisting of around 11K images containing faces. |
| Researcher Affiliation | Collaboration | Omid Mohamad Nezami, Macquarie University, Sydney, NSW, Australia; CSIRO's Data61, Sydney, NSW, Australia |
| Pseudocode | No | The paper describes the models using mathematical equations and textual descriptions of the steps involved, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Only the dataset splits and labels are released, not the model code: "Our dataset splits and labels are publicly available: https://github.com/omidmnezami/Face-Cap" |
| Open Datasets | Yes | To train FACE-CAP and FACE-ATTEND, we have extracted a subset of the Flickr 30K dataset with image captions (Young et al., 2014) that we name Flickr Face11K. ... To train our facial expression recognition model, we use the facial expression recognition 2013 (FER-2013) dataset (Goodfellow et al., 2013). ... Our dataset splits and labels are publicly available: https://github.com/omidmnezami/Face-Cap |
| Dataset Splits | Yes | Facial Expression Recognition To train our facial expression recognition model, we use the facial expression recognition 2013 (FER-2013) dataset ... It consists of 35,887 examples (standard splits are 28,709 for training, 3589 for public and 3589 for private test)... For our purposes, we split the standard training set of FER-2013 into two sections after removing 11 completely black examples: 25,109 for training our models and 3589 for development and validation. ... Image Captioning ... We split the Flickr Face11K samples into 8696 for training, 2000 for validation and 1000 for testing. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU or CPU models, for running the experiments. |
| Software Dependencies | No | The paper does not list software libraries or version numbers; it only names the optimization method: "The Adam optimization algorithm (Kingma & Ba, 2014) is used for optimizing all models." |
| Experiment Setup | Yes | The size of the word embedding layer, initialized via a uniform distribution, is set to 300 except for UP-DOWN and JOINT-FACE-ATT which is set to 512. We fixed 512 dimensions for the memory cell and the hidden state in this work. We use the mini-batch size of 100 and the initial learning rate of 0.001 to train each image captioning model except UP-DOWN and JOINT-FACE-ATT where we set the mini-batch size to 64 and the initial learning rate to 0.005. The Adam optimization algorithm (Kingma & Ba, 2014) is used for optimizing all models. During the training phase, if the model does not have an improvement in METEOR score on the validation set in two successive epochs, we divide the learning rate by two (the minimum learning rate is set to 0.0001) and the previous trained model with the best METEOR is reloaded. ... The epoch limit is set to 30. ... λ and β1 in Equation 14 are empirically set to 0.8 and 0.2, respectively. β2 in Equation 20 is also set to 0.4. Multilayer perceptrons in Equation 6, 13 and 19 use tanh as an activation function. |
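The training schedule quoted under "Experiment Setup" (Adam with an initial learning rate of 0.001, halved after two successive epochs without a METEOR improvement on the validation set, floored at 0.0001, with the best checkpoint reloaded, and an epoch limit of 30) can be sketched as a short training loop. This is a minimal illustration under stated assumptions, not the authors' implementation; all function names (`train_epoch`, `eval_meteor`, `save_ckpt`, `load_ckpt`) are hypothetical placeholders.

```python
def train_with_meteor_schedule(train_epoch, eval_meteor, save_ckpt, load_ckpt,
                               init_lr=0.001, min_lr=0.0001,
                               patience=2, max_epochs=30):
    """Hedged sketch of the schedule described in the paper's setup.

    train_epoch(lr) -- hypothetical: runs one training epoch at learning rate lr
    eval_meteor()   -- hypothetical: returns validation METEOR after the epoch
    save_ckpt()/load_ckpt() -- hypothetical: save / reload the best-METEOR model
    """
    lr = init_lr
    best_meteor = float("-inf")
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_epoch(lr)
        meteor = eval_meteor()
        if meteor > best_meteor:
            best_meteor = meteor
            epochs_without_improvement = 0
            save_ckpt()                    # keep the best-METEOR model
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            lr = max(lr / 2, min_lr)       # halve LR, floor at 0.0001
            load_ckpt()                    # reload previous best-METEOR model
            epochs_without_improvement = 0
    return best_meteor
```

This is essentially a reduce-on-plateau schedule keyed to METEOR rather than loss, with the extra step of rolling the model back to the best checkpoint whenever the learning rate is cut.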