Hierarchical Multi-Source Uncertainty Aggregation for Interactive Video Captioning

Authors: Ervine Zheng, Qi Yu

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments We conduct experiments on two captioning datasets. The UCFCAP dataset (Chatzikonstantinou et al. 2022) is an extension of the UCF dataset and contains 2k surveillance videos with descriptions of crimes. The MSR-VTT dataset (Xu et al. 2016) is a diverse dataset for video captioning that consists of over 10k open-domain video clips from 20 categories, and each clip is associated with 20 sentences. ... We evaluate the quality of generated captions based on standard metrics (BLEU-4, ROUGE, and CIDEr scores) (Hossain et al. 2019). ... We conduct an ablation study to compare with alternative settings, including the decoding strategies for caption generation, and varied evidence-based temperatures, to evaluate the contribution of the proposed selection method.
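The evaluation relies on standard captioning metrics (BLEU-4, ROUGE, CIDEr). As an illustration of what BLEU-4 measures, here is a minimal single-reference, sentence-level sketch; real evaluations use corpus-level implementations with smoothing, so this simplification is ours, not the paper's:

```python
import math
from collections import Counter

def bleu4(candidate, reference):
    """Minimal sentence-level BLEU-4: geometric mean of modified 1-4-gram
    precisions times a brevity penalty, for a single reference."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, 5):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # clipped n-gram matches against the reference counts
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # any zero n-gram precision zeroes unsmoothed BLEU
        log_precisions.append(math.log(overlap / total))
    # brevity penalty: penalize candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / 4)
```

A perfect match scores 1.0, and any candidate sharing no 4-grams with the reference scores 0.0 under this unsmoothed variant.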
Researcher Affiliation Academia Ervine Zheng, Qi Yu Rochester Institute of Technology EMAIL, EMAIL
Pseudocode Yes
    Algorithm 1: Training Process
    for data sample V in unlabeled set do
        Sample tag using Eq (4)
        Calculate evidence and sample word using Eq (6)
        Aggregate uncertainty via Eq (14)
        Estimate alignment uncertainty via Eq (15)
        Calculate holistic uncertainty via Eq (16)
    end for
    Rank videos via holistic uncertainty and query annotation
    Model update using Eq (9)
Open Source Code Yes The appendix and source code are presented in (Zheng and Yu 2023). https://github.com/ritmininglab/HMSUA/.
Open Datasets Yes Experiments We conduct experiments on two captioning datasets. The UCFCAP dataset (Chatzikonstantinou et al. 2022) is an extension of the UCF dataset and contains 2k surveillance videos with descriptions of crimes. The MSR-VTT dataset (Xu et al. 2016) is a diverse dataset for video captioning that consists of over 10k open-domain video clips from 20 categories, and each clip is associated with 20 sentences.
Dataset Splits Yes Specifically, we train a transformer-based captioning model on the MSR-VTT dataset with a 50/50 train-test split... We used ten percent of the videos and corresponding captions from each dataset to pretrain those modules. After that, the learning process is performed in four rounds, each related to five percent of the data. During one round of the interaction, the model selects five percent of the videos with the highest holistic uncertainty score to collect annotation (i.e., making the ground-truth captions available to the model). Once a batch of data is processed, the model is further trained based on the annotated videos, and evaluated on its performance on the hold-out test set, which includes the remaining data from the dataset.
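The acquisition schedule described above (pretrain on ten percent of the data, then four rounds that each label the five percent of videos with the highest holistic uncertainty) can be sketched as follows. `uncertainty_fn` is a hypothetical stand-in for the paper's holistic uncertainty score (Eq 16), and the model retraining step is elided:

```python
import random

def interactive_rounds(video_ids, uncertainty_fn, pretrain_frac=0.10,
                       round_frac=0.05, n_rounds=4):
    """Sketch of the interactive labeling schedule: pretrain on a random
    10% subset, then run four rounds, each querying annotations for the
    5% of remaining videos with the highest holistic uncertainty."""
    n = len(video_ids)
    pool = list(video_ids)
    random.shuffle(pool)
    n_pre = int(n * pretrain_frac)
    labeled = pool[:n_pre]            # pretraining subset
    unlabeled = pool[n_pre:]
    for _ in range(n_rounds):
        # rank the unlabeled pool by holistic uncertainty, highest first
        unlabeled.sort(key=uncertainty_fn, reverse=True)
        k = int(n * round_frac)
        queried, unlabeled = unlabeled[:k], unlabeled[k:]
        labeled.extend(queried)       # ground-truth captions become available
        # (model would be retrained on `labeled` here)
    return labeled, unlabeled         # hold-out test set = remaining data
```

After four rounds, 30% of the data has been labeled (10% pretraining plus 4 × 5% queries) and the remaining 70% serves as the hold-out test set, matching the split described in the excerpt.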
Hardware Specification No No specific hardware details (GPU/CPU models, memory, etc.) were mentioned in the paper for running the experiments. The paper mentions using "CLIP model" and "GPT2 model" as components but does not specify the hardware used to run their experiments or train their adapters.
Software Dependencies No No specific software dependencies with version numbers were provided. The paper mentions using the CLIP model and GPT2 model, but not the specific software environment or libraries (e.g., PyTorch, TensorFlow, Python version) with their versions used for implementation.
Experiment Setup Yes Hyperparameter λ is set to 0.2. We use stochastic gradient descent and Adam optimizer with a learning rate set to 0.0001.
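For reference, the stated optimizer setting corresponds to a standard Adam update with learning rate 0.0001. The sketch below shows one scalar-parameter step; β1, β2, and ε are common defaults assumed here, not given in the excerpt:

```python
def adam_step(theta, grad, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta at step t (1-indexed),
    using the paper's stated learning rate of 0.0001."""
    m = b1 * m + (1 - b1) * grad            # first-moment estimate
    v = b2 * v + (1 - b2) * grad * grad     # second-moment estimate
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v
```

On the first step the bias-corrected update magnitude is close to the learning rate itself, which is characteristic of Adam's scale-invariant behavior.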