Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

Authors: Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results demonstrate that Video-STaR-augmented LMMs achieve notable improvements in (I) general video QA, where TempCompass performance improved by 6.1%, and (II) downstream tasks, with a 9.9% increase in Kinetics700-QA accuracy and a 4.0% improvement in action quality assessment on FineDiving, while also exhibiting better interpretability.
Researcher Affiliation | Collaboration | 1Stanford University, 2Google Research
Pseudocode | No | The paper describes the Video-STaR process using flowcharts (Figure 1 and Figure 2) and detailed textual descriptions of steps such as 'Answer Generation' and 'Label Rationalization', but it does not present structured pseudocode or algorithm blocks.
Open Source Code | Yes | All code, source files, and generated dataset text instructions can be found in our supporting information and will be made publicly available.
Open Datasets | Yes | In selecting source datasets, we selected datasets that contain diverse video content and label types; please see Tab. 1. These include Kinetics700 (Smaira et al., 2020), which has action recognition annotations and is particularly large and diverse. FineDiving (Xu et al., 2022b) is an action quality assessment dataset of Olympic diving events and has both an overall score and action sequence annotations. Finally, STAR-benchmark (Wu et al., 2021), a video reasoning dataset, also contains bounding box and temporal action localization annotations.
Dataset Splits | No | The paper mentions using 'test sets (not included in training)' for adapted datasets and that '1000 videos were randomly selected from each dataset for Gemini evaluation.' However, it does not provide specific percentages or counts for training, validation, and test splits for its overall experimental setup or the VSTAR-1M dataset, nor does it cite a standard split for its use of combined datasets.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, such as specific GPU or CPU models.
Software Dependencies | Yes | We initialize from the Video-LLaVA (Lin et al., 2023) model, which utilizes Vicuna-7B v1.5 (Chiang et al., 2023). ... All evaluations utilize the same GPT model (Wu, 2024) (gpt-3.5-turbo) to ensure consistent comparisons.
Experiment Setup | Yes | We train for one epoch using a 128 batch size, the AdamW optimizer, and a cosine learning rate schedule. The learning rate is 2e-5 with a 0.03 warmup ratio. In combination with the generated Video-STaR instruction tuning dataset, we additionally utilized the Video Instruct-100K (Maaz et al., 2023) and the LLaVA v1.5 instruction tuning datasets (Liu et al., 2023a).
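The schedule quoted in the experiment setup (peak learning rate 2e-5, warmup ratio 0.03, cosine decay) can be sketched as a standalone function; this is a minimal illustration of the stated hyperparameters, not the authors' code, and the step counts in the usage are placeholder assumptions.

```python
import math

# Hyperparameters quoted in the paper's experiment setup.
PEAK_LR = 2e-5
WARMUP_RATIO = 0.03

def learning_rate(step: int, total_steps: int) -> float:
    """Linear warmup to PEAK_LR over the first WARMUP_RATIO of steps,
    then cosine decay to zero over the remaining steps."""
    warmup_steps = max(1, int(WARMUP_RATIO * total_steps))
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

# Example with an assumed 1000-step epoch: lr ramps up for the first
# 30 steps, peaks at 2e-5, and decays to zero by the final step.
```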