Dense Video Object Captioning from Disjoint Supervision

Authors: Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that our end-to-end trained Dense VOC model outperforms baselines consisting of strong per-task models by a substantial margin, producing more accurate and inherently temporally consistent captions. Moreover, we achieve significant improvements from our disjoint, multi-dataset training. We show that our model improves upon a number of strong baselines for this new task. Furthermore, we can apply our model to the task of spatial grounding, outperforming prior state of the art on VidSTG and VLN, without explicitly training for it.
Researcher Affiliation | Industry | Xingyi Zhou*, Anurag Arnab*, Chen Sun, Cordelia Schmid (Google DeepMind)
Pseudocode | Yes | Algorithm 1: Identity assignment from association matrix. This greedy algorithm can be implemented efficiently on accelerators, enabling end-to-end training.
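The paper's exact Algorithm 1 is not reproduced here, but a generic greedy assignment from an association matrix can be sketched as follows. The `threshold` parameter and the row/column masking strategy are assumptions for illustration, not the authors' stated procedure.

```python
import numpy as np

def greedy_identity_assignment(assoc, threshold=0.0):
    """Greedy matching sketch: repeatedly take the highest-scoring
    remaining (track, detection) pair from the association matrix,
    then mask out that pair's row and column.

    assoc: (num_tracks, num_detections) score matrix.
    Returns a list of (track_index, detection_index) matches.
    """
    a = assoc.astype(float).copy()
    matches = []
    while True:
        # Index of the current best remaining pair.
        i, j = np.unravel_index(np.argmax(a), a.shape)
        if a[i, j] <= threshold:
            break  # no remaining pair scores above the threshold
        matches.append((int(i), int(j)))
        # Remove this track and detection from further consideration.
        a[i, :] = -np.inf
        a[:, j] = -np.inf
    return matches
```

On accelerators, the same loop can be expressed with fixed-size masked argmax steps instead of data-dependent control flow, which is presumably what makes it efficient inside end-to-end training.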
Open Source Code | Yes | Our code is available at https://github.com/google-research/scenic.
Open Datasets | Yes | We use VidSTG (Zhang et al., 2020) and VLN (Voigtlaender et al., 2023)... We use COCO (Lin et al., 2014) as it is the most popular dataset for this task. Here, we use Visual Genome (Krishna et al., 2017b)... In particular, we use Spoken Moments in Time (SMiT) (Monfort et al., 2021)... BenSMOT (Li et al., 2024b) contains person bounding-box trajectories and their manually annotated captions for 3,292 YouTube videos.
Dataset Splits | Yes | The dataset contains 5,436 training videos and 602 validation videos... The dataset contains a total of 5,136 training and 2,451 validation videos. During disjoint pretraining, we sample batches from different datasets with an even ratio (1:1:1:1), thus avoiding additional hyperparameters.
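The even 1:1:1:1 sampling during disjoint pretraining can be sketched as a simple round-robin over per-dataset batch iterators. This is an illustrative interpretation of "even ratio", not the authors' actual input pipeline.

```python
import itertools

def round_robin_batches(dataset_iters):
    """Yield batches by cycling through the dataset iterators in turn,
    giving each dataset an equal share of training steps (1:1:...:1)."""
    for it in itertools.cycle(dataset_iters):
        yield next(it)
```

A uniform-random choice over datasets each step would give the same 1:1:1:1 ratio in expectation; round-robin makes it exact per cycle.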
Hardware Specification | Yes | We train our model on 32 GPUs, which means we have an effective batch size of 256 images or 32 videos... approximately 20 hours on 32 16GB V100 GPUs... Inference on VidSTG requires 32GB of GPU memory to fit 200 frames. We report the runtime on a 16GB V100 GPU for a 64-frame video in Tab. 8.
Software Dependencies | No | Our implementation is based on the public release of GRiT (Wu et al., 2022a)... We use the AdamW optimizer.
Experiment Setup | Yes | We use a local batch size of either 1 video (consisting of 8 sampled frames) or 8 images. As we use 32 GPUs, this means that our global batch size is either 32 videos or 256 images. We use the AdamW optimizer with a learning rate of 2×10⁻⁴, weight decay of 0.05, and a layerwise learning rate decay of 0.7 (Li et al., 2022b; Wu et al., 2022a). We train for 22.5×10³ iterations per dataset, decreasing the learning rate by a factor of 10 after 90% and 97.5% of the training schedule (Wu et al., 2022a). For pretraining on all the 4 datasets in Sec. 3.4, this corresponds to a total of 90×10³ iterations...
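The described schedule (base learning rate 2×10⁻⁴, divided by 10 after 90% and again after 97.5% of training) can be sketched as a step-decay function. The function name and argument defaults are illustrative; only the rates and breakpoints come from the setup above.

```python
def step_lr(step, total_steps=90_000, base_lr=2e-4):
    """Step-decay schedule: base LR until 90% of training, then /10,
    then /100 after 97.5% of training."""
    frac = step / total_steps
    if frac >= 0.975:
        return base_lr / 100
    if frac >= 0.9:
        return base_lr / 10
    return base_lr
```

The layerwise learning rate decay of 0.7 would additionally scale this value per transformer layer (multiplying by 0.7 per layer from the top down), which is omitted here.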