Video In-context Learning: Autoregressive Transformers are Zero-Shot Video Imitators
Authors: Wentao Zhang, Junliang Guo, Tianyu He, Li Zhao, Linli Xu, Jiang Bian
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To comprehensively and accurately evaluate the model performance in Vid IT, we develop both objective and subjective metrics to assess the generated videos in terms of visual quality, semantic accuracy, and consistency with the prompted demonstrations. Our extensive experiments demonstrate that the model not only produces high-quality video clips but also successfully adheres to the semantic guidance provided by the demonstration examples. In addition, we show that the zero-shot imitation capacity also follows the scaling law (Kaplan et al., 2020) of large models, illustrating the potential of future works. |
| Researcher Affiliation | Collaboration | Wentao Zhang1,2, Junliang Guo3, Tianyu He3, Li Zhao3, Linli Xu1,2, Jiang Bian3 1School of Computer Science and Technology, University of Science and Technology of China 2State Key Laboratory of Cognitive Intelligence 3Microsoft Research Asia |
| Pseudocode | No | The paper describes the methods, training, and inference pipelines in Section 3 and details implementation in Section 4.2, but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and models have been open-sourced. |
| Open Datasets | Yes | As a result, among various public video datasets, we focus on those that accomplish embodied tasks and select two primary datasets as our main training data sources: 1) Ego4d (Grauman et al., 2022), an egocentric video dataset featuring abundant first-person activities; and 2) Kinetics-600 (Carreira et al., 2018), a comprehensive video dataset comprising diverse human activities. Additionally, we incorporate self-collected videos that contain a large amount of general real-world videos, to augment the variety of video content. To validate Vid IT's imitation capability, we choose Something-Something v2 (SSv2) as the main evaluation dataset (Goyal et al., 2017)... In addition, we include the Robotics Transformer-1 (RT-1) (Brohan et al., 2022) dataset and MineRL (https://github.com/minerllabs/minerl) to demonstrate Vid IT's effectiveness on embodied AI and interactive tasks. |
| Dataset Splits | Yes | We utilize the evaluation split of SSv2 as the evaluation set for all experiments. |
| Hardware Specification | Yes | Our largest variant, Vid IT 1.1B, is trained on 2×8 H100 nodes, with the PyTorch DDP parallel strategy integrated in the PyTorch Lightning trainer. |
| Software Dependencies | No | The paper mentions 'PyTorch DDP' and the 'PyTorch Lightning trainer' for implementation but does not specify their version numbers or other software dependencies with versions. |
| Experiment Setup | Yes | The hyperparameters used to train the Vid IT model are presented in Table 8. We utilize an inverse square root scheduler and start model training with 10,000 warmup steps. Hyperparameters: learning rate scheduler: inverse sqrt; learning rate: 5e-4; warmup steps: 10,000; weight decay: 0.01; optimizer: AdamW; AdamW betas: (0.9, 0.95); context length: 4096. |
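The reported schedule (inverse square root with 10,000 warmup steps, peak learning rate 5e-4) can be sketched as a plain function. This is a minimal sketch assuming the common convention of linear warmup followed by 1/sqrt(step) decay; the paper states the scheduler name and hyperparameters but not the exact formula, so the shape below is an assumption. In practice one would pass such a function (divided by the base rate) to `torch.optim.lr_scheduler.LambdaLR` alongside `torch.optim.AdamW(betas=(0.9, 0.95), weight_decay=0.01)`.

```python
import math

# Values from Table 8 of the paper.
BASE_LR = 5e-4
WARMUP_STEPS = 10_000

def inverse_sqrt_lr(step: int) -> float:
    """Assumed schedule: linear warmup to BASE_LR over WARMUP_STEPS,
    then decay proportional to 1/sqrt(step) thereafter."""
    step = max(step, 1)  # avoid division by zero at step 0
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    return BASE_LR * math.sqrt(WARMUP_STEPS / step)
```

With this convention the learning rate peaks at exactly 5e-4 at step 10,000 and falls to half the peak by step 40,000.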