Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Solving New Tasks by Adapting Internet Video Knowledge

Authors: Calvin Luo, Zilai Zeng, Yilun Du, Chen Sun

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform standardized evaluations across both robotic manipulation tasks (Yu et al., 2020) and continuous control (Tassa et al., 2018), and demonstrate that adapted video generative models are able to successfully act as accurate video planners for novel text-conditioned specifications across a variety of robotic tasks, and can also supervise the learning of novel text-conditioned policies.
Researcher Affiliation | Academia | ¹Brown University, ²Harvard University
Pseudocode | No | The paper describes methods using mathematical equations and textual explanations, but it does not include any explicitly labeled pseudocode blocks or algorithm listings.
Open Source Code | Yes | Visualizations and code are provided at diffusion-supervision.github.io/adapt2act/. We also commit to open-sourcing our code, to support further reproducibility efforts in the community.
Open Datasets | Yes | Benchmarks: We evaluate to what degree adapted video models can facilitate downstream robotic behavior generalization across a variety of environments and tasks, spanning robotic manipulation to continuous control. We focus the bulk of our explorations on Meta-World-v2 (Yu et al., 2020), which offers a suite of robotic manipulation tasks with different levels of complexity. Additionally, we extend our evaluation to Humanoid and Dog environments from the DeepMind Control Suite (Tassa et al., 2018).
Dataset Splits | Yes | To study the effectiveness of adaptation techniques in a low-data regime, we curate a small dataset of in-domain examples from 7 Meta-World tasks (denoted with an asterisk in Table A1) to adapt pretrained video models. For each task, we utilize 25 expert videos for direct finetuning and probabilistic adaptation, while sampling a small set of non-consecutive observations for subject customization. During inference, we evaluate the adapted video models on 9 tasks, 7 of which are novel tasks that are not exposed during adaptation (denoted with no asterisk in Table A1).
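As a concrete reading of the split described in this row, here is a minimal sketch. Only the counts (7 adaptation tasks with 25 expert videos each, 9 evaluation tasks, 7 of them novel) come from the paper; the task names, the `make_split` helper, and its signature are hypothetical placeholders, not the authors' actual task list or code.

```python
def make_split(eval_tasks, adaptation_tasks, videos_per_task=25):
    """Illustrative split: which tasks get adaptation data, and which
    evaluation tasks are novel (never seen during adaptation)."""
    # Each adaptation task contributes a fixed budget of expert videos.
    adaptation = {t: videos_per_task for t in adaptation_tasks}
    # Novel tasks are evaluation tasks absent from the adaptation set.
    novel = [t for t in eval_tasks if t not in adaptation_tasks]
    return adaptation, list(eval_tasks), novel

# Hypothetical task names; the paper's real list is in its Table A1.
adapt_tasks = [f"adapt-task-{i}" for i in range(7)]          # 7 tasks
eval_tasks = adapt_tasks[:2] + [f"novel-task-{i}" for i in range(7)]  # 9 tasks
adaptation, evaluation, novel = make_split(eval_tasks, adapt_tasks)
```

Under these assumptions, `adaptation` holds 7 tasks at 25 videos each, `evaluation` holds 9 tasks, and `novel` holds the 7 evaluation tasks not exposed during adaptation.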
Hardware Specification | No | Our research was conducted using computational resources at the Center for Computation and Visualization at Brown University. This statement is too general and does not provide specific hardware models (e.g., GPU, CPU, or memory details).
Software Dependencies | No | All adaptation techniques in our work are implemented using available open-sourced components. As mentioned in Section 4.1, we use publicly available checkpoints for pretrained large models, such as AnimateDiff (Guo et al., 2023) as well as Stable Diffusion (Rombach et al., 2022). We utilize DreamBooth (Ruiz et al., 2023) for subject customization. Furthermore, we reuse the codebase provided by the authors of AVDC (Ko et al., 2024) for in-domain model training, with minimal adjustments to enable latent diffusion. For policy learning, we follow the Video-TADPoLe (Luo et al., 2024) framework, which itself is built off of the publicly available AnimateDiff and TD-MPC (Hansen et al., 2022). While specific frameworks and models are mentioned with citations, no version numbers for the software libraries themselves (e.g., PyTorch, TensorFlow, or specific versions of the listed frameworks) are provided.
Experiment Setup | Yes | We include detailed hyperparameters for in-domain model training in Appendix D. Appendix D contains tables such as "Table A4: Hyperparameters for In-Domain Model Training", "Table A5: Video-TADPoLe Noise Levels for DeepMind Control", "Table A6: Hyperparameters of Inverse Dynamics Model Training", and "Table A9: TD-MPC hyperparameters", which list specific values for various experimental settings.