Dense Video Object Captioning from Disjoint Supervision
Authors: Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our end-to-end trained Dense VOC model outperforms baselines consisting of strong, per-task models by a substantial margin, producing more accurate and inherently temporally consistent captions. Moreover, we achieve significant improvements from our disjoint, multi-dataset training. We show that our model improves upon a number of strong baselines for this new task. Furthermore, we can apply our model to the task of spatial grounding, outperforming prior state-of-the-art on VidSTG and VLN, without explicitly training for it. |
| Researcher Affiliation | Industry | Xingyi Zhou* Anurag Arnab* Chen Sun Cordelia Schmid Google DeepMind |
| Pseudocode | Yes | Algorithm 1: Identity assignment from association matrix. This greedy algorithm can be implemented efficiently on accelerators, enabling end-to-end training. |
| Open Source Code | Yes | Our code is available at https://github.com/google-research/scenic. |
| Open Datasets | Yes | We use VidSTG (Zhang et al., 2020) and VLN (Voigtlaender et al., 2023)... We use COCO (Lin et al., 2014) as it is the most popular dataset for this task. Here, we use Visual Genome (Krishna et al., 2017b)... In particular, we use Spoken Moments in Time (SMiT) (Monfort et al., 2021)... BenSMOT (Li et al., 2024b) contains person bounding-box trajectories and their manually-annotated captions for 3292 YouTube videos. |
| Dataset Splits | Yes | The dataset contains 5,436 training videos and 602 validation videos... The dataset contains a total of 5,136 training and 2,451 validation videos. During disjoint pretraining, we sample batches from different datasets with an even ratio (1:1:1:1), thus avoiding additional hyperparameters. |
| Hardware Specification | Yes | We train our model on 32 GPUs, which means we have an effective batch size of 256 images or 32 videos... approximately 20 hours on 32 16GB V100 GPUs... Inference on VidSTG requires 32GB of GPU memory to fit 200 frames. We report the runtime on a 16GB V100 GPU below for a 64-frame video in Tab. 8. |
| Software Dependencies | No | Our implementation is based on the public release of GRiT (Wu et al., 2022a)... We use the AdamW optimizer. |
| Experiment Setup | Yes | We use a local batch size of either 1 video (consisting of 8 sampled frames), or 8 images. As we use 32 GPUs, this means that our global batch size is either 32 videos or 256 images. We use the AdamW optimizer with a learning rate of 2×10⁻⁴, weight decay of 0.05, and a layerwise learning rate decay of 0.7 (Li et al., 2022b; Wu et al., 2022a). We train for 22.5×10³ iterations per dataset, decreasing the learning rate by a factor of 10 after 90% and 97.5% of the training schedule (Wu et al., 2022a). For pretraining on all the 4 datasets in Sec. 3.4, this corresponds to a total of 90×10³ iterations... |
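The Pseudocode row quotes the paper's "Algorithm 1: Identity assignment from association matrix," a greedy procedure. The paper's exact algorithm is not reproduced here; the following is a minimal sketch of one common greedy matching scheme it plausibly resembles, in which the highest-scoring (track, detection) pair is selected repeatedly and its row and column are masked out. The function name, `threshold` parameter, and NumPy implementation are assumptions for illustration, not the authors' code.

```python
import numpy as np

def greedy_identity_assignment(assoc, threshold=0.0):
    """Greedily match rows (existing identities) to columns (new detections).

    assoc: (num_tracks, num_detections) association-score matrix,
           where a higher score means a more likely match.
    Returns a list of (track_idx, det_idx) pairs.
    Hypothetical sketch; not the paper's exact Algorithm 1.
    """
    scores = assoc.astype(float).copy()
    pairs = []
    for _ in range(min(scores.shape)):
        # Pick the globally best remaining pair.
        i, j = np.unravel_index(np.argmax(scores), scores.shape)
        if scores[i, j] <= threshold:
            break  # no sufficiently confident match left
        pairs.append((int(i), int(j)))
        scores[i, :] = -np.inf  # each track matched at most once
        scores[:, j] = -np.inf  # each detection matched at most once
    return pairs
```

Because each step is an argmax plus a masking operation on a dense matrix, a loop like this maps naturally onto accelerator primitives, consistent with the paper's claim that the greedy assignment can run efficiently on accelerators during end-to-end training.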