Image Captioning using Facial Expression and Attention
Authors: Omid Mohamad Nezami, Mark Dras, Stephen Wan, Cecile Paris
JAIR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare a comprehensive collection of image captioning models with and without facial features using all standard evaluation metrics. The evaluation metrics indicate that applying facial features with an attention mechanism achieves the best performance, showing more expressive and more correlated image captions, on an image caption dataset extracted from the standard Flickr 30K dataset, consisting of around 11K images containing faces. |
| Researcher Affiliation | Collaboration | Omid Mohamad Nezami, Macquarie University, Sydney, NSW, Australia; CSIRO's Data61, Sydney, NSW, Australia |
| Pseudocode | No | The paper describes the models using mathematical equations and textual descriptions of the steps involved, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Only the dataset splits and labels are released, not the model code: "Our dataset splits and labels are publicly available: https://github.com/omidmnezami/Face-Cap" |
| Open Datasets | Yes | To train FACE-CAP and FACE-ATTEND, we have extracted a subset of the Flickr 30K dataset with image captions (Young et al., 2014) that we name Flickr Face11K. ... To train our facial expression recognition model, we use the facial expression recognition 2013 (FER-2013) dataset (Goodfellow et al., 2013). ... Our dataset splits and labels are publicly available: https://github.com/omidmnezami/Face-Cap |
| Dataset Splits | Yes | Facial Expression Recognition To train our facial expression recognition model, we use the facial expression recognition 2013 (FER-2013) dataset ... It consists of 35,887 examples (standard splits are 28,709 for training, 3589 for public and 3589 for private test)... For our purposes, we split the standard training set of FER-2013 into two sections after removing 11 completely black examples: 25,109 for training our models and 3589 for development and validation. ... Image Captioning ... We split the Flickr Face11K samples into 8696 for training, 2000 for validation and 1000 for testing. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU or CPU models, for running the experiments. |
| Software Dependencies | No | The paper does not list software libraries or version numbers; it only names the optimization method: "The Adam optimization algorithm (Kingma & Ba, 2014) is used for optimizing all models." |
| Experiment Setup | Yes | The size of the word embedding layer, initialized via a uniform distribution, is set to 300 except for UP-DOWN and JOINT-FACE-ATT which is set to 512. We fixed 512 dimensions for the memory cell and the hidden state in this work. We use the mini-batch size of 100 and the initial learning rate of 0.001 to train each image captioning model except UP-DOWN and JOINT-FACE-ATT where we set the mini-batch size to 64 and the initial learning rate to 0.005. The Adam optimization algorithm (Kingma & Ba, 2014) is used for optimizing all models. During the training phase, if the model does not have an improvement in METEOR score on the validation set in two successive epochs, we divide the learning rate by two (the minimum learning rate is set to 0.0001) and the previous trained model with the best METEOR is reloaded. ... The epoch limit is set to 30. ... λ and β1 in Equation 14 are empirically set to 0.8 and 0.2, respectively. β2 in Equation 20 is also set to 0.4. Multilayer perceptrons in Equation 6, 13 and 19 use tanh as an activation function. |
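The training schedule quoted under "Experiment Setup" (Adam with an initial learning rate of 0.001, halved after two successive epochs without a METEOR improvement on the validation set, floored at 0.0001, with the best checkpoint reloaded, and an epoch limit of 30) can be sketched as a short training loop. This is a minimal illustration under stated assumptions, not the authors' implementation; all function names (`train_epoch`, `eval_meteor`, `save_ckpt`, `load_ckpt`) are hypothetical placeholders.

```python
def train_with_meteor_schedule(train_epoch, eval_meteor, save_ckpt, load_ckpt,
                               init_lr=0.001, min_lr=0.0001,
                               patience=2, max_epochs=30):
    """Hedged sketch of the schedule described in the paper's setup.

    train_epoch(lr) -- hypothetical: runs one training epoch at learning rate lr
    eval_meteor()   -- hypothetical: returns validation METEOR after the epoch
    save_ckpt()/load_ckpt() -- hypothetical: save / reload the best-METEOR model
    """
    lr = init_lr
    best_meteor = float("-inf")
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_epoch(lr)
        meteor = eval_meteor()
        if meteor > best_meteor:
            best_meteor = meteor
            epochs_without_improvement = 0
            save_ckpt()                    # keep the best-METEOR model
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            lr = max(lr / 2, min_lr)       # halve LR, floor at 0.0001
            load_ckpt()                    # reload previous best-METEOR model
            epochs_without_improvement = 0
    return best_meteor
```

This is essentially a reduce-on-plateau schedule keyed to METEOR rather than loss, with the extra step of rolling the model back to the best checkpoint whenever the learning rate is cut.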