Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

Authors: Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results demonstrate that Video-STaR-augmented LMMs achieve notable improvements in (I) general video QA, where TempCompass performance improved by 6.1%, and (II) downstream tasks, with a 9.9% increase in Kinetics700-QA accuracy and a 4.0% improvement in action quality assessment on FineDiving, while also exhibiting better interpretability.
Researcher Affiliation | Collaboration | 1Stanford University, 2Google Research
Pseudocode | No | The paper describes the Video-STaR process using flowcharts (Figure 1 and Figure 2) and detailed textual descriptions of steps such as 'Answer Generation' and 'Label Rationalization', but it does not present structured pseudocode or algorithm blocks.
Open Source Code | Yes | All code, source files, and generated dataset text instructions can be found in our supporting information and will be made publicly available.
Open Datasets | Yes | In selecting source datasets, we selected datasets that contain diverse video content and label types; please see Tab. 1. These include Kinetics700 (Smaira et al., 2020), which has action recognition annotations and is particularly large and diverse. FineDiving (Xu et al., 2022b) is an action quality assessment dataset of Olympic diving events and has both an overall score and action sequence annotations. Finally, STAR-benchmark (Wu et al., 2021), a video reasoning dataset, also contains bounding box and temporal action localization annotations.
Dataset Splits | No | The paper mentions using 'test sets (not included in training)' for adapted datasets and that '1000 videos were randomly selected from each dataset for Gemini evaluation.' However, it does not provide specific percentages or counts for training, validation, and test splits for its overall experimental setup or the VSTAR-1M dataset, nor does it cite a standard split for its use of combined datasets.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, such as specific GPU or CPU models.
Software Dependencies | Yes | We initialize from the Video-LLaVA (Lin et al., 2023) model, which utilizes Vicuna-7B v1.5 (Chiang et al., 2023). ... All evaluations utilize the same GPT model (Wu, 2024) (gpt-3.5-turbo) to ensure consistent comparisons.
Experiment Setup | Yes | We train for one epoch using a 128 batch size, the AdamW optimizer, and a cosine learning rate schedule. The learning rate is 2e-5 with a 0.03 warmup ratio. In combination with the generated Video-STaR instruction tuning dataset, we additionally utilized the Video Instruct-100K (Maaz et al., 2023) and the LLaVA v1.5 instruction tuning datasets (Liu et al., 2023a).
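The schedule quoted in the experiment setup (peak learning rate 2e-5, warmup ratio 0.03, cosine decay) can be sketched as a standalone function; this is a minimal illustration of the stated hyperparameters, not the authors' code, and the step counts in the usage are placeholder assumptions.

```python
import math

# Hyperparameters quoted in the paper's experiment setup.
PEAK_LR = 2e-5
WARMUP_RATIO = 0.03

def learning_rate(step: int, total_steps: int) -> float:
    """Linear warmup to PEAK_LR over the first WARMUP_RATIO of steps,
    then cosine decay to zero over the remaining steps."""
    warmup_steps = max(1, int(WARMUP_RATIO * total_steps))
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

# Example with an assumed 1000-step epoch: lr ramps up for the first
# 30 steps, peaks at 2e-5, and decays to zero by the final step.
```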