Video In-context Learning: Autoregressive Transformers are Zero-Shot Video Imitators
Authors: Wentao Zhang, Junliang Guo, Tianyu He, Li Zhao, Linli Xu, Jiang Bian
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To comprehensively and accurately evaluate the model performance in Vid IT, we develop both objective and subjective metrics to assess the generated videos in terms of visual quality, semantic accuracy, and consistency with the prompted demonstrations. Our extensive experiments demonstrate that the model not only produces high-quality video clips but also successfully adheres to the semantic guidance provided by the demonstration examples. In addition, we show that the zero-shot imitation capacity also follows the scaling law (Kaplan et al., 2020) of large models, illustrating the potential of future works. |
| Researcher Affiliation | Collaboration | Wentao Zhang1,2, Junliang Guo3, Tianyu He3, Li Zhao3, Linli Xu1,2, Jiang Bian3 1School of Computer Science and Technology, University of Science and Technology of China 2State Key Laboratory of Cognitive Intelligence 3Microsoft Research Asia |
| Pseudocode | No | The paper describes the methods, training, and inference pipelines in Section 3 and details implementation in Section 4.2, but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and models have been open-sourced. |
| Open Datasets | Yes | As a result, among various public video datasets, we focus on those that accomplish embodied tasks and select two primary datasets as our main training data sources: 1) Ego4d (Grauman et al., 2022), an egocentric video dataset featuring abundant first-person activities; and 2) Kinetics-600 (Carreira et al., 2018), a comprehensive video dataset comprising diverse human activities. Additionally, we incorporate self-collected videos that contain a large amount of general real-world videos, to augment the variety of video content. To validate Vid IT's imitation capability, we choose Something-Something v2 (SSv2) as the main evaluation dataset (Goyal et al., 2017)... In addition, we include the Robotics Transformer-1 (RT-1) (Brohan et al., 2022) dataset and MineRL (https://github.com/minerllabs/minerl) to demonstrate Vid IT's effectiveness on embodied AI and interactive tasks. |
| Dataset Splits | Yes | We utilize the evaluation split of SSv2 as the evaluation set for all experiments. |
| Hardware Specification | Yes | Our largest variant, Vid IT 1.1B, is trained on 2×8 H100 nodes, with the PyTorch DDP parallel strategy integrated in the PyTorch Lightning trainer. |
| Software Dependencies | No | The paper mentions 'PyTorch DDP' and the 'PyTorch Lightning trainer' for implementation but does not specify their version numbers or other software dependencies with versions. |
| Experiment Setup | Yes | The hyperparameters used to train the Vid IT model are presented in Table 8. We utilize an inverse square root scheduler and start model training with 10,000 warmup steps. Hyperparameters: learning rate scheduler: inverse sqrt; learning rate: 5e-4; warmup steps: 10,000; weight decay: 0.01; optimizer: AdamW; AdamW betas: (0.9, 0.95); context length: 4096. |
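The reported schedule (inverse square root with 10,000 warmup steps, peak learning rate 5e-4) can be sketched as a plain function. This is a minimal sketch assuming the common convention of linear warmup followed by 1/sqrt(step) decay; the paper states the scheduler name and hyperparameters but not the exact formula, so the shape below is an assumption. In practice one would pass such a function (divided by the base rate) to `torch.optim.lr_scheduler.LambdaLR` alongside `torch.optim.AdamW(betas=(0.9, 0.95), weight_decay=0.01)`.

```python
import math

# Values from Table 8 of the paper.
BASE_LR = 5e-4
WARMUP_STEPS = 10_000

def inverse_sqrt_lr(step: int) -> float:
    """Assumed schedule: linear warmup to BASE_LR over WARMUP_STEPS,
    then decay proportional to 1/sqrt(step) thereafter."""
    step = max(step, 1)  # avoid division by zero at step 0
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    return BASE_LR * math.sqrt(WARMUP_STEPS / step)
```

With this convention the learning rate peaks at exactly 5e-4 at step 10,000 and falls to half the peak by step 40,000.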