Identity-Text Video Corpus Grounding

Authors: Bin Huang, Xin Wang, Hong Chen, Houlun Chen, Yaofei Wu, Wenwu Zhu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate the effectiveness of the proposed Video-Locator model and highlight the importance of identity-generalization capability for ITVCG. Main results: We selected several open-source VCG models, including XML (Lei et al. 2020), ReLoCLNet (Zhang et al. 2021), and SQuiDNet (Yoon et al. 2022), and trained them on the TVR-IT dataset using their official GitHub repositories. We used the same settings as Video-Locator: visual features (CLIP + Antelopev2), no subtitles, and a modified input module (with added facial adapters) so that each model accepts face inputs as Video-Locator does. Metrics are reported in Table 2; our Video-Locator significantly outperforms the other models.
Researcher Affiliation | Academia | Bin Huang (1), Xin Wang* (1,2), Hong Chen (1), Houlun Chen (1), Yaofei Wu (3), Wenwu Zhu* (1,2). (1) Department of Computer Science and Technology, Tsinghua University; (2) BNRIST, Tsinghua University; (3) Beijing University of Technology.
Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations (Equations 1–22), but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks or figures.
Open Source Code | Yes | Code: https://github.com/huangb23/Identity-Text-Video-Corpus-Grounding
Open Datasets | Yes | To solve this issue, we propose the TVR-IT dataset and the Video-Locator model for Identity-Text Video Corpus Grounding... Specifically, we construct the TVR-IT dataset based on the TVR (Lei et al. 2020) dataset. Lei, J.; Yu, L.; Berg, T. L.; and Bansal, M. 2020. TVR: A large-scale dataset for video-subtitle moment retrieval. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI, 447–463. Springer.
Dataset Splits | Yes | Dataset split: To better assess the generalization capability of the model, we provide (identity-)seen and (identity-)unseen splits for both the validation and test sets. Specifically, four TV shows are utilized to construct the train, val-seen, and test-seen sets, while one TV show is used for the val-unseen set and the remaining TV show is used for the test-unseen set. Table 1: Information of the TVR-IT dataset. # ID denotes the number of identities with corresponding images. BBT = The Big Bang Theory, Grey = Grey's Anatomy, HIMYM = How I Met Your Mother.

Split        # Videos  # Queries  Shows                        # ID
train           15060      56735  BBT, Friends, House, Castle   374
val-seen          555       2084
test-seen        1371       5142
val-unseen       1257       4157  Grey                           52
test-unseen      1371       4722  HIMYM                          37
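The split composition above can be encoded as a quick sanity check (an illustrative sketch; the dictionary structure and names are my own, only the counts and show lists come from Table 1):

```python
# TVR-IT split statistics transcribed from Table 1 of the paper.
# val-seen / test-seen use the same four shows as train.
SPLITS = {
    "train":       {"videos": 15060, "queries": 56735,
                    "shows": ["BBT", "Friends", "House", "Castle"]},
    "val-seen":    {"videos": 555,   "queries": 2084,
                    "shows": ["BBT", "Friends", "House", "Castle"]},
    "test-seen":   {"videos": 1371,  "queries": 5142,
                    "shows": ["BBT", "Friends", "House", "Castle"]},
    "val-unseen":  {"videos": 1257,  "queries": 4157, "shows": ["Grey"]},
    "test-unseen": {"videos": 1371,  "queries": 4722, "shows": ["HIMYM"]},
}

# The unseen splits share no shows with training -- the property the
# identity-generalization evaluation relies on.
train_shows = set(SPLITS["train"]["shows"])
for name in ("val-unseen", "test-unseen"):
    assert train_shows.isdisjoint(SPLITS[name]["shows"])
```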
Hardware Specification | No | The paper discusses the use of LoRA for training LLM decoders to minimize resource consumption, but does not specify any particular hardware details such as GPU models, CPU types, or memory used for the experiments.
Software Dependencies | Yes | For the Video-Locator model, we employ Vicuna-v1.5-7B (Chiang et al. 2023) as the LLM, which comprises L = 32 layers.
Experiment Setup | Yes | For the Video-Locator model, we employ Vicuna-v1.5-7B (Chiang et al. 2023) as the LLM, which comprises L = 32 layers. Lu = 16 layers are dedicated to the video-identity-text alignment module, while the remaining 16 layers are utilized for the multimodal fine-grained fusion module. For each video, we uniformly sample nv = 100 frames, retaining a maximum of nf = 3 faces per frame. The temperature parameter τ is set to 0.07. We balance the losses using parameters λcl = 1, λl1 = 10, λgiou = 1, λbce = 4, respectively. We utilize LoRA (Hu et al. 2022) for training the LLM decoders of our Video-Locator to minimize the consumption of training resources. The LoRA rank parameter is set to r = 64.
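The reported hyperparameters can be collected in a short sketch. This is illustrative only: the function name and the placeholder loss values are hypothetical, while the weights, sampling budget, temperature, and LoRA rank are the values quoted above.

```python
# Loss weights reported for Video-Locator training.
LAMBDA_CL, LAMBDA_L1, LAMBDA_GIOU, LAMBDA_BCE = 1.0, 10.0, 1.0, 4.0

# Sampling / model hyperparameters from the setup description.
N_FRAMES = 100            # frames uniformly sampled per video (nv)
MAX_FACES_PER_FRAME = 3   # faces retained per frame (nf)
TEMPERATURE = 0.07        # contrastive temperature tau
LORA_RANK = 64            # LoRA rank r

def total_loss(l_cl, l_l1, l_giou, l_bce):
    """Weighted sum of the four training losses (hypothetical helper)."""
    return (LAMBDA_CL * l_cl + LAMBDA_L1 * l_l1
            + LAMBDA_GIOU * l_giou + LAMBDA_BCE * l_bce)

# Example with placeholder per-term values:
print(total_loss(0.5, 0.2, 0.3, 0.1))  # 0.5 + 2.0 + 0.3 + 0.4 = 3.2
```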