Dense Video Object Captioning from Disjoint Supervision
Authors: Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our end-to-end trained Dense VOC model outperforms baselines consisting of strong, per-task models by a substantial margin, producing more accurate and inherently temporally consistent captions. Moreover, we achieve significant improvements from our disjoint, multi-dataset training. We show that our model improves upon a number of strong baselines for this new task. Furthermore, we can apply our model to the task of spatial grounding, outperforming prior state-of-the-art on VidSTG and VLN, without explicitly training for it. |
| Researcher Affiliation | Industry | Xingyi Zhou* Anurag Arnab* Chen Sun Cordelia Schmid Google DeepMind |
| Pseudocode | Yes | Algorithm 1: Identity assignment from association matrix. This greedy algorithm can be implemented efficiently on accelerators, enabling end-to-end training. |
| Open Source Code | Yes | Our code is available at https://github.com/google-research/scenic. |
| Open Datasets | Yes | We use VidSTG (Zhang et al., 2020) and VLN (Voigtlaender et al., 2023)... We use COCO (Lin et al., 2014) as it is the most popular dataset for this task. Here, we use Visual Genome (Krishna et al., 2017b)... In particular, we use Spoken Moments in Time (SMiT) (Monfort et al., 2021)... BenSMOT (Li et al., 2024b) contains person bounding-box trajectories and their manually-annotated captions for 3292 YouTube videos. |
| Dataset Splits | Yes | The dataset contains 5,436 training videos and 602 validation videos... The dataset contains a total of 5,136 training and 2,451 validation videos. During disjoint pretraining, we sample batches from different datasets with an even ratio (1:1:1:1), thus avoiding additional hyperparameters. |
| Hardware Specification | Yes | We train our model on 32 GPUs, which means we have an effective batch size of 256 images or 32 videos... approximately 20 hours on 32 16GB V100 GPUs... Inference on VidSTG requires 32GB of GPU memory to fit 200 frames. We report the runtime on a 16GB V100 GPU below for a 64-frame video in Tab. 8. |
| Software Dependencies | No | Our implementation is based on the public release of GRiT (Wu et al., 2022a)... We use the AdamW optimizer. |
| Experiment Setup | Yes | We use a local batch size of either 1 video (consisting of 8 sampled frames), or 8 images. As we use 32 GPUs, this means that our global batch size is either 32 videos or 256 images. We use the AdamW optimizer with a learning rate of 2×10⁻⁴, weight decay of 0.05, and a layerwise learning rate decay of 0.7 (Li et al., 2022b; Wu et al., 2022a). We train for 22.5×10³ iterations per dataset, decreasing the learning rate by a factor of 10 after 90% and 97.5% of the training schedule (Wu et al., 2022a). For pretraining on all the 4 datasets in Sec. 3.4, this corresponds to a total of 90×10³ iterations... |
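The Pseudocode row quotes the paper's "Algorithm 1: Identity assignment from association matrix," a greedy procedure. The paper's exact algorithm is not reproduced here; the following is a minimal sketch of one common greedy matching scheme it plausibly resembles, in which the highest-scoring (track, detection) pair is selected repeatedly and its row and column are masked out. The function name, `threshold` parameter, and NumPy implementation are assumptions for illustration, not the authors' code.

```python
import numpy as np

def greedy_identity_assignment(assoc, threshold=0.0):
    """Greedily match rows (existing identities) to columns (new detections).

    assoc: (num_tracks, num_detections) association-score matrix,
           where a higher score means a more likely match.
    Returns a list of (track_idx, det_idx) pairs.
    Hypothetical sketch; not the paper's exact Algorithm 1.
    """
    scores = assoc.astype(float).copy()
    pairs = []
    for _ in range(min(scores.shape)):
        # Pick the globally best remaining pair.
        i, j = np.unravel_index(np.argmax(scores), scores.shape)
        if scores[i, j] <= threshold:
            break  # no sufficiently confident match left
        pairs.append((int(i), int(j)))
        scores[i, :] = -np.inf  # each track matched at most once
        scores[:, j] = -np.inf  # each detection matched at most once
    return pairs
```

Because each step is an argmax plus a masking operation on a dense matrix, a loop like this maps naturally onto accelerator primitives, consistent with the paper's claim that the greedy assignment can run efficiently on accelerators during end-to-end training.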