Unsupervised Zero-Shot Reinforcement Learning via Dual-Value Forward-Backward Representation

Authors: Jingbo Sun, Songjun Tu, Qichao Zhang, Haoran Li, Xin Liu, Yaran Chen, Ke Chen, Dongbin Zhao

ICLR 2025

Reproducibility Assessment: Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, DVFB demonstrates both superior zero-shot generalization (outperforming on all 12 tasks) and fine-tuning adaptation (leading on 10 of 12 tasks), surpassing state-of-the-art (SOTA) URL methods. Our code is available at https://github.com/bofusun/DVFB.
Researcher Affiliation | Academia | 1 Institute of Automation, Chinese Academy of Sciences; 2 Pengcheng Laboratory; 3 University of Chinese Academy of Sciences; 4 Xi'an Jiaotong-Liverpool University
Pseudocode | Yes | We provide the complete pseudocode for DVFB: Algorithm 1 (unsupervised pre-training phase), Algorithm 2 (downstream task fine-tuning phase), and Algorithm 3 (reward mapping mechanism).
Open Source Code | Yes | Our code is available at https://github.com/bofusun/DVFB.
Open Datasets | Yes | Following the latest advancements (Yang et al., 2023; Bai et al., 2024), we evaluate task generalization performance using 12 downstream tasks across 3 domains in URLB (Laskin et al., 2021) and the DeepMind Control Suite (DMC) (Tassa et al., 2018).
Dataset Splits | No | The paper specifies interaction budgets for its online reinforcement learning phases (2 million steps of pre-training and 10,000 steps of skill inference). These define how data is generated and consumed in each phase, but the paper does not report train/validation/test splits for a fixed, pre-existing dataset, as would typically be given in supervised learning settings.
Hardware Specification | No | The paper does not describe the hardware used to run its experiments, such as specific GPU or CPU models, memory, or cloud computing specifications.
Software Dependencies | No | The paper mentions the 'RL backbone algorithm DDPG' in Table 4, but it does not provide version numbers for DDPG or for any other software libraries, frameworks, or programming languages used.
Experiment Setup | Yes | Table 4 (hyper-parameter settings) lists pre-training frames, fine-tuning frames, zero-shot selection frames, RL replay buffer size, batch size, optimizer, learning rate, network architectures, and the coefficients α, β, and η.
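To illustrate how the hyper-parameter categories reported in Table 4 could be organized for a reproduction attempt, here is a minimal configuration sketch. The field names follow the categories named above; the values are placeholders except where the report itself states them (2 million pre-training steps, 10,000 skill-inference steps), and the class name `DVFBConfig` is our own invention, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class DVFBConfig:
    """Sketch of Table 4's hyper-parameter categories; values are illustrative."""
    pretraining_frames: int = 2_000_000       # stated: 2 million pre-training steps
    finetuning_frames: int = 100_000          # placeholder value
    zero_shot_selection_frames: int = 10_000  # stated: 10,000 skill-inference steps
    replay_buffer_size: int = 1_000_000       # placeholder value
    batch_size: int = 256                     # placeholder value
    optimizer: str = "Adam"                   # placeholder choice
    learning_rate: float = 1e-4               # placeholder value
    alpha: float = 1.0                        # coefficient α (placeholder)
    beta: float = 1.0                         # coefficient β (placeholder)
    eta: float = 1.0                          # coefficient η (placeholder)

cfg = DVFBConfig()
print(cfg.pretraining_frames, cfg.zero_shot_selection_frames)
```

A structured config like this makes it easy to diff a reproduction's settings against the paper's Table 4 entry by entry.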