Video-Enhanced Offline Reinforcement Learning: A Model-Based Approach
Authors: Minting Pan, Yitao Zheng, Jiajian Li, Yunbo Wang, Xiaokang Yang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate VeoRL on a diverse set of visual control benchmarks. For the Meta-World robotic manipulation tasks, we use the BridgeData-V2 dataset... For the CARLA autonomous driving tasks, we employ the nuScenes dataset... For the MineDojo open-world gaming environment... As summarized in Figure 1b, VeoRL achieves substantial performance improvements over existing RL approaches. 4. Experiments In this section, we present (i) quantitative comparisons with existing RL methods on a diverse set of visual control benchmarks, (ii) offline-to-online transfer learning results on novel interactive control tasks, (iii) ablation studies for each proposed model component in VeoRL, and (iv) hyperparameter analyses along with visualizations of the learned latent behavior abstractions. |
| Researcher Affiliation | Academia | 1MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University. Correspondence to: Yunbo Wang <EMAIL>. |
| Pseudocode | Yes | A. Algorithm of VeoRL We provide the overall training scheme of VeoRL in Algorithm 1. Algorithm 1 The training scheme of VeoRL |
| Open Source Code | Yes | Project page: https://panmt.github.io/VeoRL.github.io. |
| Open Datasets | Yes | For the Meta-World robotic manipulation tasks, we use the BridgeData-V2 dataset (Walke et al., 2023) as the source of auxiliary real-world videos. For the CARLA autonomous driving tasks, we employ the nuScenes dataset (Caesar et al., 2019), a collection of 1,000 diverse real-world driving scenes, as the auxiliary source domain. For the MineDojo open-world gaming environment, where the agent must navigate a vast state space, we use online Minecraft videos created by human players to provide unlabeled video demonstrations. Like D4RL (Fu et al., 2020), we collect the offline RL datasets of medium-quality trajectories using a partially-trained DreamerV2 agent (Hafner et al., 2020). |
| Dataset Splits | No | Meta-World (Yu et al., 2019): ...Each dataset consists of 200 trajectories, each with 500 time steps. CARLA (Dosovitskiy et al., 2017): ...The offline dataset includes approximately 1,000 episodes. MineDojo (Fan et al., 2022): ...We set the maximum time steps of each episode as 1,000... This text describes the overall dataset sizes and trajectory lengths but does not specify how these datasets are partitioned into training, validation, or test sets for the experiments presented. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers like Python 3.8, CPLEX 12.4) needed to replicate the experiment. |
| Experiment Setup | Yes | B. Implementation Details The model configurations and hyperparameter settings are detailed in Table 2 (layers and hyperparameters used in VeoRL; values listed as Meta-World / CARLA / MineDojo). World model: image encoder Conv3-32 / Conv3-32 / Conv3-96; GRU hidden size 200 / 200 / 4096; RSSM number of units 200 / 200 / 1024; stochastic latent dimension 50 / 50 / 32; discrete latent classes 0 / 0 / 32; weighting factor α = 1 / 1 / 1; world model learning rate 3×10⁻⁴ / 3×10⁻⁴ / 1×10⁻⁴; BANs update iterations 40K / 40K / 30K. Behavior learning: imagination horizon L = 15; λ-target 0.95; discount γ in Eq. (6) 0.99; ω in Eq. (6) 0.05 / 0.1 / 1×10⁻⁵; ρ in Eq. (7) 0 / 0 / 1; η in Eq. (7) 1×10⁻⁴ / 1×10⁻⁴ / 3×10⁻⁴; MLP layers of policy network 4 / 4 / 5; MLP layers of value network 3 / 3 / 5; policy network learning rate 8×10⁻⁵ / 8×10⁻⁵ / 3×10⁻⁵; value network learning rate 8×10⁻⁵ / 8×10⁻⁵ / 3×10⁻⁵. Environment setting: time limit 500 / 1000 / 1000; action repeat 1 / 4 / 1; image size 64×64. |
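The behavior-learning settings above (imagination horizon L = 15, λ-target 0.95, discount γ = 0.99) match the λ-return targets used by Dreamer-style agents, which the paper builds on. As a minimal sketch, not the paper's code, the standard TD(λ) recursion with these hyperparameters would look like:

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Compute TD(lambda) value targets over an imagined rollout.

    rewards: list of H rewards r_1..r_H from imagined steps
    values:  list of H+1 value estimates v(s_1)..v(s_{H+1});
             the final entry bootstraps beyond the horizon
    """
    H = len(rewards)
    returns = [0.0] * H
    next_return = values[H]  # bootstrap from the last value estimate
    # Work backwards: each target mixes the one-step value (weight 1-lam)
    # with the recursively computed lambda-return (weight lam).
    for t in reversed(range(H)):
        returns[t] = rewards[t] + gamma * (
            (1 - lam) * values[t + 1] + lam * next_return
        )
        next_return = returns[t]
    return returns
```

With λ = 1 this reduces to the full Monte Carlo return over the imagined horizon; with λ = 0 it reduces to one-step TD targets, which is why an intermediate value such as 0.95 is typically chosen.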