Hierarchical Multi-Source Uncertainty Aggregation for Interactive Video Captioning

Authors: Ervine Zheng, Qi Yu

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments We conduct experiments on two captioning datasets. The UCFCAP dataset (Chatzikonstantinou et al. 2022) is an extension of the UCF dataset and contains 2k surveillance videos with descriptions of crimes. The MSR-VTT dataset (Xu et al. 2016) is a diverse dataset for video captioning that consists of over 10k open-domain video clips from 20 categories, and each clip is associated with 20 sentences. ... We evaluate the quality of generated captions based on standard metrics (BLEU-4, ROUGE, and CIDEr scores) (Hossain et al. 2019). ... We conduct an ablation study to compare with alternative settings, including the decoding strategies for caption generation, and varied evidence-based temperatures, to evaluate the contribution of the proposed selection method.
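The evaluation relies on standard captioning metrics (BLEU-4, ROUGE, CIDEr). As an illustration of what BLEU-4 measures, here is a minimal single-reference, sentence-level sketch; real evaluations use corpus-level implementations with smoothing, so this simplification is ours, not the paper's:

```python
import math
from collections import Counter

def bleu4(candidate, reference):
    """Minimal sentence-level BLEU-4: geometric mean of modified 1-4-gram
    precisions times a brevity penalty, for a single reference."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, 5):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # clipped n-gram matches against the reference counts
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # any zero n-gram precision zeroes unsmoothed BLEU
        log_precisions.append(math.log(overlap / total))
    # brevity penalty: penalize candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / 4)
```

A perfect match scores 1.0, and any candidate sharing no 4-grams with the reference scores 0.0 under this unsmoothed variant.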
Researcher Affiliation Academia Ervine Zheng, Qi Yu Rochester Institute of Technology EMAIL, EMAIL
Pseudocode Yes
    Algorithm 1: Training Process
    for data sample V in unlabeled set do
        Sample tag using Eq (4)
        Calculate evidence and sample word using Eq (6)
        Aggregate uncertainty via Eq (14)
        Estimate alignment uncertainty via Eq (15)
        Calculate holistic uncertainty via Eq (16)
    end for
    Rank videos via holistic uncertainty and query annotation
    Model update using Eq (9)
Open Source Code Yes The appendix and source code are presented in (Zheng and Yu 2023). https://github.com/ritmininglab/HMSUA/.
Open Datasets Yes Experiments We conduct experiments on two captioning datasets. The UCFCAP dataset (Chatzikonstantinou et al. 2022) is an extension of the UCF dataset and contains 2k surveillance videos with descriptions of crimes. The MSR-VTT dataset (Xu et al. 2016) is a diverse dataset for video captioning that consists of over 10k open-domain video clips from 20 categories, and each clip is associated with 20 sentences.
Dataset Splits Yes Specifically, we train a transformer-based captioning model on the MSR-VTT dataset with a 50/50 train-test split... We used ten percent of the videos and corresponding captions from each dataset to pretrain those modules. After that, the learning process is performed in four rounds, each related to five percent of the data. During one round of the interaction, the model selects five percent of the videos with the highest holistic uncertainty score to collect annotation (i.e., making the ground-truth captions available to the model). Once a batch of data is processed, the model is further trained based on the annotated videos, and evaluated on its performance on the hold-out test set, which includes the remaining data from the dataset.
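The acquisition schedule described above (pretrain on ten percent of the data, then four rounds that each label the five percent of videos with the highest holistic uncertainty) can be sketched as follows. `uncertainty_fn` is a hypothetical stand-in for the paper's holistic uncertainty score (Eq 16), and the model retraining step is elided:

```python
import random

def interactive_rounds(video_ids, uncertainty_fn, pretrain_frac=0.10,
                       round_frac=0.05, n_rounds=4):
    """Sketch of the interactive labeling schedule: pretrain on a random
    10% subset, then run four rounds, each querying annotations for the
    5% of remaining videos with the highest holistic uncertainty."""
    n = len(video_ids)
    pool = list(video_ids)
    random.shuffle(pool)
    n_pre = int(n * pretrain_frac)
    labeled = pool[:n_pre]            # pretraining subset
    unlabeled = pool[n_pre:]
    for _ in range(n_rounds):
        # rank the unlabeled pool by holistic uncertainty, highest first
        unlabeled.sort(key=uncertainty_fn, reverse=True)
        k = int(n * round_frac)
        queried, unlabeled = unlabeled[:k], unlabeled[k:]
        labeled.extend(queried)       # ground-truth captions become available
        # (model would be retrained on `labeled` here)
    return labeled, unlabeled         # hold-out test set = remaining data
```

After four rounds, 30% of the data has been labeled (10% pretraining plus 4 × 5% queries) and the remaining 70% serves as the hold-out test set, matching the split described in the excerpt.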
Hardware Specification No No specific hardware details (GPU/CPU models, memory, etc.) were mentioned in the paper for running the experiments. The paper mentions using "CLIP model" and "GPT2 model" as components but does not specify the hardware used to run their experiments or train their adapters.
Software Dependencies No No specific software dependencies with version numbers were provided. The paper mentions using the CLIP model and GPT2 model, but not the specific software environment or libraries (e.g., PyTorch, TensorFlow, Python version) with their versions used for implementation.
Experiment Setup Yes Hyperparameter λ is set to 0.2. We use stochastic gradient descent and Adam optimizer with a learning rate set to 0.0001.
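For reference, the stated optimizer setting corresponds to a standard Adam update with learning rate 0.0001. The sketch below shows one scalar-parameter step; β1, β2, and ε are common defaults assumed here, not given in the excerpt:

```python
def adam_step(theta, grad, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta at step t (1-indexed),
    using the paper's stated learning rate of 0.0001."""
    m = b1 * m + (1 - b1) * grad            # first-moment estimate
    v = b2 * v + (1 - b2) * grad * grad     # second-moment estimate
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v
```

On the first step the bias-corrected update magnitude is close to the learning rate itself, which is characteristic of Adam's scale-invariant behavior.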