Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning
Authors: Yunbin Tu, Liang Li, Li Su, Qingming Huang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show QUAG achieves the SOTA results on HIREST. Further, we test QUAG on the query-based video summarization task and verify its good generalization. |
| Researcher Affiliation | Academia | Yunbin Tu1, Liang Li2,1*, Li Su1,3*, Qingming Huang1. 1 School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China; 2 Key Laboratory of AI Safety of CAS, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; 3 Peng Cheng Laboratory, Shenzhen, China. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology using narrative text and mathematical formulations (e.g., equations 1-15) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/tuyunbin/QUAG |
| Open Datasets | Yes | HIREST consists of the tasks of video retrieval, moment retrieval, moment segmentation, and step-captioning. It is comprised of 3.4K text-video pairs, 1.8K moments, and 8.6K step captions. We use the official split with 1,507 video-query pairs for training, 477 video-query pairs for validation and 1,391 video-query pairs for testing. |
| Dataset Splits | Yes | HIREST: We use the official split with 1,507 video-query pairs for training, 477 video-query pairs for validation, and 1,391 video-query pairs for testing. TVSum: For a fair comparison, we follow QD-DETR (Moon et al. 2023) to utilize 80% of the videos for training and the remaining for testing. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running the experiments. It only mentions using pre-trained models and fine-tuning. |
| Software Dependencies | No | The paper lists several software tools and models used (e.g., EVA-CLIP, Whisper, MiniLM, CLIP4Caption, AdamW optimizer) but does not provide specific version numbers for these dependencies, which would be required for a reproducible description. |
| Experiment Setup | Yes | The hidden size is set to 768. During training, the batch size is set to 5, the learning rate is set to 1e-5, and the AdamW optimizer (Loshchilov and Hutter 2018) is used to minimize the training loss defined in Eq. (15). |
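The reported experiment setup can be collected into a single configuration sketch. This is a minimal, hypothetical illustration for anyone re-implementing the setup: only the four values (hidden size, batch size, learning rate, optimizer) come from the paper; the dictionary name and structure are assumptions, not part of the QUAG codebase.

```python
# Hyperparameters as reported in the paper's experiment setup.
# The config name/keys are illustrative assumptions; only the values
# (768, 5, 1e-5, AdamW) are stated in the paper.
TRAIN_CONFIG = {
    "hidden_size": 768,        # model hidden dimension
    "batch_size": 5,           # training batch size
    "learning_rate": 1e-5,     # initial learning rate
    "optimizer": "AdamW",      # Loshchilov and Hutter 2018
}

def describe(cfg: dict) -> str:
    """Render the training configuration as a one-line summary."""
    return (f"hidden={cfg['hidden_size']}, batch={cfg['batch_size']}, "
            f"lr={cfg['learning_rate']}, opt={cfg['optimizer']}")

print(describe(TRAIN_CONFIG))
```

In a PyTorch-style re-implementation, these values would typically be passed to `torch.optim.AdamW(model.parameters(), lr=1e-5)` and a `DataLoader(batch_size=5)`; the missing pieces (epoch count, weight decay, scheduler) are not specified in the paper and would have to be recovered from the released code.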