HiCM²: Hierarchical Compact Memory Modeling for Dense Video Captioning
Authors: Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, Seong Tae Kim
AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comparative experiments demonstrate that this hierarchical memory recall process improves the performance of DVC by achieving state-of-the-art performance on YouCook2 and ViTT datasets. To evaluate the effectiveness of our method, we conducted comparative experiments. Section 4.1 describes the experimental setup used in this study. Section 4.2 highlights the role of memory retrieval in DVC. Section 4.3 presents a comparison with state-of-the-art methods. Section 4.4 provides ablation studies to validate the contribution of each component in our model. We also provide qualitative results. |
| Researcher Affiliation | Academia | 1 Kyung Hee University, Republic of Korea; 2 Electronics and Telecommunications Research Institute (ETRI), Republic of Korea. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology using text and figures, but no explicit pseudocode or algorithm blocks are provided. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We utilized two DVC benchmark datasets, YouCook2 (Zhou, Xu, and Corso 2018) and ViTT (Huang et al. 2020), for both training and evaluation. |
| Dataset Splits | No | The YouCook2 dataset includes 2,000 untrimmed videos depicting cooking procedures, with an average duration of 320 seconds per video and 7.7 temporally localized sentences per annotation. We employed the standard dataset split for training, validation, and testing purposes. The ViTT dataset includes 8,000 untrimmed instructional videos, each with an average length of 250 seconds and annotated with 7.1 temporally localized short tags. It is worth noting that we utilized approximately 10%-20% fewer videos than the original dataset, as we only included those accessible on YouTube (Yang et al. 2023b). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions several models and tools like CLIP ViT-L/14, the T5-Base model, the Llama3 70B model, and the SentencePiece tokenizer, but does not provide specific version numbers for these or other software dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | For both datasets, video frames were extracted at a rate of 1 frame per second. The sequences were then either subsampled or padded to ensure a total of F frames, where F was set to 100. Both the text encoder and decoder are initialized using a pretrained T5-Base model (Raffel et al. 2020). The hierarchical memory is organized into 4 levels for YouCook2 and 5 levels for ViTT. For YouCook2, we use 10,337 sentences to construct compact memory across levels, with {1758, 313, 68, 8} memory units allocated per level. For ViTT, we utilize 34,599 sentences, distributing compact memory across levels as follows: {4487, 741, 114, 12, 3} memory units per level. For YouCook2, the number of anchors is set to 10 and the number of retrieved features is set to 10 for each anchor level. For ViTT, the number of anchors is set to 30 and the number of retrieved features is set to 30 for each anchor level. ... Our approach, based on the sequence-to-sequence structure of the Vid2Seq model (Yang et al. 2023b), fine-tunes a model pre-trained on approximately 1.8 million videos (Yang et al. 2023a). ... we add b time tokens, thus expanding the tokenizer to encompass v + b tokens. These time tokens denote relative timestamps within a video, which we achieve by dividing a video into b evenly spaced timestamps. Specifically, we employ the SentencePiece tokenizer (Kudo and Richardson 2018), which features a vocabulary size of v = 32,128 and includes b = 100 time tokens. |
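The fixed-length frame preprocessing quoted above (1-fps features subsampled or padded to F = 100 frames) can be sketched as follows; the function name and zero-padding choice are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

F = 100  # target number of frames per video, as stated in the setup

def to_fixed_length(frames: np.ndarray) -> np.ndarray:
    """Subsample or zero-pad a (T, D) array of per-second frame
    features to exactly (F, D)."""
    t, d = frames.shape
    if t >= F:
        # Uniformly subsample F frame indices across the sequence.
        idx = np.linspace(0, t - 1, F).round().astype(int)
        return frames[idx]
    # Shorter videos: pad the tail with zero vectors up to F frames.
    pad = np.zeros((F - t, d), dtype=frames.dtype)
    return np.concatenate([frames, pad], axis=0)
```

For the YouCook2 average of 320 seconds, a 320-frame sequence is subsampled down to 100 frames; a 60-second clip is zero-padded from 60 to 100.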
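The time-token scheme described in the setup row (b = 100 relative time tokens appended after the v = 32,128-entry SentencePiece vocabulary) amounts to uniform quantization of timestamps into video-relative bins. A minimal sketch, assuming time tokens occupy ids v..v+b-1 and that helper names are illustrative:

```python
V_TEXT = 32_128  # SentencePiece text vocabulary size v
B_TIME = 100     # number of time tokens b

def time_to_token(t: float, duration: float) -> int:
    """Quantize an absolute timestamp (seconds) into one of B_TIME
    relative time-token ids placed after the text vocabulary."""
    # Map t to a bin in [0, B_TIME - 1], clamping t == duration to the last bin.
    bin_idx = min(int(t / duration * B_TIME), B_TIME - 1)
    return V_TEXT + bin_idx

def token_to_time(token_id: int, duration: float) -> float:
    """Invert the quantization, returning the bin's start time."""
    bin_idx = token_id - V_TEXT
    return bin_idx / B_TIME * duration

# For a 320-second video, second 160 falls in bin 50 -> token id 32178.
tid = time_to_token(160.0, 320.0)
```

Because the bins are relative, the same 100 token ids cover videos of any duration, which is what lets a single tokenizer serve both YouCook2 and ViTT.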