Meta-DT: Offline Meta-RL as Conditional Sequence Modeling with World Model Disentanglement

Authors: Zhi Wang, Li Zhang, Wenhao Wu, Yuanheng Zhu, Dongbin Zhao, Chunlin Chen

NeurIPS 2024

Reproducibility variables, each with the assessed result and the supporting LLM response:

Research Type: Experimental
"Experimental results on MuJoCo and Meta-World benchmarks across various dataset types show that Meta-DT exhibits superior few- and zero-shot generalization capacity compared to strong baselines while being more practical with fewer prerequisites."

Researcher Affiliation: Academia
"1 Nanjing University, 2 Institute of Automation, Chinese Academy of Sciences"

Pseudocode: Yes
"Appendix A. Algorithm Pseudocodes. Based on the implementations in Sec. 4, this section gives the brief procedures of our method. First, Algorithm 1 presents the pretraining of the context-aware world model. Then, Algorithm 2 shows the pipeline of training Meta-DT, where the sub-procedure of generating the complementary prompt is given in Algorithm 3. Finally, Algorithm 4 and Algorithm 5 show the few-shot and zero-shot evaluations on test tasks, respectively."

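To make the flow of Algorithms 1-3 concrete, here is a minimal structural sketch reconstructed from the quoted description. It is not the authors' code: every function name, the dummy trajectories, and the stand-in error scores are hypothetical placeholders (the real implementation is at https://github.com/NJU-RL/Meta-DT).

```python
import random

def pretrain_world_model(offline_trajs, steps=10):
    """Algorithm 1 (sketch): pretrain a context-aware world model on offline
    data, i.e. a context encoder plus dynamics/reward prediction heads."""
    model = {"trained_steps": steps}  # stand-in for learned parameters
    for _ in range(steps):
        _batch = random.sample(offline_trajs, k=min(4, len(offline_trajs)))
        # ... minimize prediction error of (s', r) given (s, a, context) ...
    return model

def build_prompt(world_model, trajectory, prompt_len=3):
    """Algorithm 3 (sketch): keep the segments where the world model's
    prediction error is largest, so the prompt carries information the
    context embedding misses (the 'complementary' prompt)."""
    scored = [(random.random(), seg) for seg in trajectory]  # stand-in errors
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seg for _, seg in scored[:prompt_len]]

def train_meta_dt(world_model, offline_trajs, steps=10):
    """Algorithm 2 (sketch): train the decision transformer on sequences
    conditioned on the complementary prompt and the inferred task context."""
    for _ in range(steps):
        traj = random.choice(offline_trajs)
        prompt = build_prompt(world_model, traj)
        _sequence = prompt + traj
        # ... supervised action prediction on the conditioned sequence ...

offline_trajs = [[f"t{i}-s{j}" for j in range(8)] for i in range(6)]  # dummy data
wm = pretrain_world_model(offline_trajs)
train_meta_dt(wm, offline_trajs)
```
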
Open Source Code: Yes
"Our code is available at https://github.com/NJU-RL/Meta-DT."

Open Datasets: Yes
"We evaluate all tested methods on three classical benchmarks in meta-RL: i) the 2D navigation environment Point-Robot [25]; ii) the multi-task MuJoCo control [55, 36], containing Cheetah-Vel, Cheetah-Dir, Ant-Dir, Hopper-Param, and Walker-Param; and iii) the Meta-World manipulation platform [56], including Reach, Sweep, and Door-Lock."

Dataset Splits: No
"For each environment, we randomly sample a distribution of tasks and divide them into the training set T_train and test set T_test. ... For the Point-Robot and MuJoCo environments, we sample 45 tasks for training and another 5 held-out tasks for testing. For Meta-World environments, we sample 15 tasks for training and 5 held-out tasks for testing." (No explicit validation set is mentioned; only a train/test split of tasks is described.)

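The quoted split protocol can be pinned down in a few lines. The sketch below assumes tasks are first sampled from the task distribution and then partitioned; `split_tasks` and the placeholder task ids are hypothetical, not from the paper.

```python
import random

def split_tasks(candidate_tasks, n_train, n_test, seed=0):
    """Sample n_train + n_test tasks, then partition into T_train / T_test."""
    rng = random.Random(seed)
    sampled = rng.sample(candidate_tasks, n_train + n_test)
    return sampled[:n_train], sampled[n_train:]

candidate_tasks = list(range(200))  # placeholder task ids
train_tasks, test_tasks = split_tasks(candidate_tasks, 45, 5)  # Point-Robot / MuJoCo
mw_train, mw_test = split_tasks(candidate_tasks, 15, 5)        # Meta-World
```
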
Hardware Specification: Yes
"We train our models on one NVIDIA RTX 4080 GPU with an Intel Core i9-10900X CPU and 256 GB RAM."

Software Dependencies: No
The paper mentions implementing Meta-DT on top of the official Decision Transformer codebase and notes the optimizer (Adam) and other training parameters, but it does not specify versions for key software dependencies such as Python, PyTorch/TensorFlow, or CUDA.

Experiment Setup: Yes
"Some common hyperparameters across all reported settings are set as: optimizer Adam, weight decay 1e-4, linear warmup steps for learning rate decay 10000, gradient norm clip 0.25, dropout 0.1, and batch size 128. Table 7 presents the detailed hyperparameters of Meta-DT trained on the Point-Robot and MuJoCo domains with the Medium, Expert, and Mixed datasets. Table 8 presents the detailed hyperparameters of Meta-DT trained on Meta-World environments with the Medium datasets."

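For readers re-implementing the setup, the quoted common hyperparameters map onto standard PyTorch calls as sketched below. This is an assumption-laden sketch, not the authors' configuration: `make_optimizer_and_scheduler` and `training_step` are hypothetical helpers, and the learning rate is only a placeholder since per-domain values live in Tables 7-8.

```python
import torch

# Common hyperparameters quoted above, collected into one place.
COMMON = {
    "weight_decay": 1e-4,
    "warmup_steps": 10_000,   # linear learning-rate warmup
    "grad_norm_clip": 0.25,
    "dropout": 0.1,
    "batch_size": 128,
}

def make_optimizer_and_scheduler(model: torch.nn.Module, lr: float = 1e-4):
    # Adam with the quoted weight decay; lr itself is a placeholder here.
    optimizer = torch.optim.Adam(
        model.parameters(), lr=lr, weight_decay=COMMON["weight_decay"])
    # Linear warmup over the first 10k steps, as quoted.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: min((step + 1) / COMMON["warmup_steps"], 1.0))
    return optimizer, scheduler

def training_step(model, loss, optimizer, scheduler):
    optimizer.zero_grad()
    loss.backward()
    # Gradient norm clipping at 0.25, as quoted.
    torch.nn.utils.clip_grad_norm_(model.parameters(), COMMON["grad_norm_clip"])
    optimizer.step()
    scheduler.step()

# Usage on a toy model:
model = torch.nn.Linear(4, 2)
optimizer, scheduler = make_optimizer_and_scheduler(model)
```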