HiCM²: Hierarchical Compact Memory Modeling for Dense Video Captioning
Authors: Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, Seong Tae Kim
AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comparative experiments demonstrate that this hierarchical memory recall process improves the performance of DVC by achieving state-of-the-art performance on YouCook2 and ViTT datasets. To evaluate the effectiveness of our method, we conducted comparative experiments. Section 4.1 describes the experimental setup used in this study. Section 4.2 highlights the role of memory retrieval in DVC. Section 4.3 presents a comparison with state-of-the-art methods. Section 4.4 provides ablation studies to validate the contribution of each component in our model. We also provide qualitative results. |
| Researcher Affiliation | Academia | 1 Kyung Hee University, Republic of Korea; 2 Electronics and Telecommunications Research Institute (ETRI), Republic of Korea. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology using text and figures, but no explicit pseudocode or algorithm blocks are provided. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We utilized two DVC benchmark datasets, YouCook2 (Zhou, Xu, and Corso 2018) and ViTT (Huang et al. 2020), for both training and evaluation. |
| Dataset Splits | No | The YouCook2 dataset includes 2,000 untrimmed videos depicting cooking procedures, with an average duration of 320 seconds per video and 7.7 temporally localized sentences per annotation. We employed the standard dataset split for training, validation, and testing purposes. The ViTT dataset includes 8,000 untrimmed instructional videos, each with an average length of 250 seconds and annotated with 7.1 temporally localized short tags. It is worth noting that we utilized approximately 10%-20% fewer videos than the original dataset, as we only included those accessible on YouTube (Yang et al. 2023b). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions several models and tools like CLIP ViT-L/14, the T5-Base model, the Llama3 70B model, and the SentencePiece tokenizer, but does not provide specific version numbers for these or other software dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | For both datasets, video frames were extracted at a rate of 1 frame per second. The sequences were then either subsampled or padded to ensure a total of F frames, where F was set to 100. Both the text encoder and decoder are initialized using a pretrained T5-Base model (Raffel et al. 2020). The hierarchical memory is organized into 4 levels for YouCook2 and 5 levels for ViTT. For YouCook2, we use 10,337 sentences to construct compact memory across levels, with {1758, 313, 68, 8} memory units allocated per level. For ViTT, we utilize 34,599 sentences, distributing compact memory across levels as follows: {4487, 741, 114, 12, 3} memory units per level. For YouCook2, the number of anchors is set to 10 and the number of retrieved features is set to 10 for each anchor level. For ViTT, the number of anchors is set to 30 and the number of retrieved features is set to 30 for each anchor level. ... Our approach, based on the sequence-to-sequence structure of the Vid2Seq model (Yang et al. 2023b), fine-tunes a model pre-trained on approximately 1.8 million videos (Yang et al. 2023a). ... we add b time tokens, thus expanding the tokenizer to encompass v + b tokens. These time tokens denote relative timestamps within a video, which we achieve by dividing a video into b evenly spaced timestamps. Specifically, we employ the SentencePiece tokenizer (Kudo and Richardson 2018), which features a vocabulary size of v = 32,128 and includes b = 100 time tokens. |
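The fixed-length frame preprocessing quoted above (1-fps features subsampled or padded to F = 100 frames) can be sketched as follows; the function name and zero-padding choice are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

F = 100  # target number of frames per video, as stated in the setup

def to_fixed_length(frames: np.ndarray) -> np.ndarray:
    """Subsample or zero-pad a (T, D) array of per-second frame
    features to exactly (F, D)."""
    t, d = frames.shape
    if t >= F:
        # Uniformly subsample F frame indices across the sequence.
        idx = np.linspace(0, t - 1, F).round().astype(int)
        return frames[idx]
    # Shorter videos: pad the tail with zero vectors up to F frames.
    pad = np.zeros((F - t, d), dtype=frames.dtype)
    return np.concatenate([frames, pad], axis=0)
```

For the YouCook2 average of 320 seconds, a 320-frame sequence is subsampled down to 100 frames; a 60-second clip is zero-padded from 60 to 100.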
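The time-token scheme described in the setup row (b = 100 relative time tokens appended after the v = 32,128-entry SentencePiece vocabulary) amounts to uniform quantization of timestamps into video-relative bins. A minimal sketch, assuming time tokens occupy ids v..v+b-1 and that helper names are illustrative:

```python
V_TEXT = 32_128  # SentencePiece text vocabulary size v
B_TIME = 100     # number of time tokens b

def time_to_token(t: float, duration: float) -> int:
    """Quantize an absolute timestamp (seconds) into one of B_TIME
    relative time-token ids placed after the text vocabulary."""
    # Map t to a bin in [0, B_TIME - 1], clamping t == duration to the last bin.
    bin_idx = min(int(t / duration * B_TIME), B_TIME - 1)
    return V_TEXT + bin_idx

def token_to_time(token_id: int, duration: float) -> float:
    """Invert the quantization, returning the bin's start time."""
    bin_idx = token_id - V_TEXT
    return bin_idx / B_TIME * duration

# For a 320-second video, second 160 falls in bin 50 -> token id 32178.
tid = time_to_token(160.0, 320.0)
```

Because the bins are relative, the same 100 token ids cover videos of any duration, which is what lets a single tokenizer serve both YouCook2 and ViTT.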