Conservative State Value Estimation for Offline Reinforcement Learning

Authors: Liting Chen, Jie Yan, Zhengdao Shao, Lu Wang, Qingwei Lin, Saravanakumar Rajmohan, Thomas Moscibroda, Dongmei Zhang

NeurIPS 2023

Reproducibility assessment (variable: result, with the supporting LLM response):

Research Type: Experimental
  Evidence: "We evaluate in classic continuous control tasks of D4RL, showing that our method performs better than the conservative Q-function learning methods and is strongly competitive among recent SOTA methods." ... "Experimental evaluation on continuous control tasks of Gym [7] and Adroit [8] in D4RL [9] benchmarks, showing that CSVE performs better than prior methods based on conservative Q-value estimation, and is strongly competitive among main SOTA algorithms."

Researcher Affiliation: Collaboration
  Evidence: Liting Chen (McGill University, Montreal, Canada), Jie Yan (Microsoft, Beijing, China), Zhengdao Shao (University of Science and Technology of China, Hefei, China), Lu Wang (Microsoft, Beijing, China), Qingwei Lin (Microsoft, Beijing, China), Saravan Rajmohan (Microsoft 365, Seattle, USA), Thomas Moscibroda (Microsoft, Redmond, USA), Dongmei Zhang (Microsoft, Beijing, China)

Pseudocode: Yes
  Evidence: "Algorithm 1: CSVE based Offline RL Algorithm"

Open Source Code: Yes
  Evidence: "We implement our method based on an offline deep reinforcement learning library d3rlpy [34]. The code is available at: https://github.com/2023AnnonymousAuthor/csve"

Open Datasets: Yes
  Evidence: "We conduct experimental evaluations on a variety of classic continuous control tasks of Gym [7] and Adroit [8] in the D4RL [9] benchmark." ... "D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020."

Dataset Splits: No
  The paper mentions 'train' and 'test' in the context of its experiments but does not explicitly describe a validation split or a methodology for one (e.g., percentages, sample counts, or a cross-validation setup).

Hardware Specification: No
  The paper does not provide hardware details such as the GPU models, CPU types, or memory specifications used to run the experiments.

Software Dependencies: No
  The paper states "We implement our method based on an offline deep reinforcement learning library d3rlpy [34]" but does not give a version number for this library or for any other software dependency used in the experiments.

Experiment Setup: Yes
  Evidence: Table 3 (hyper-parameters of the CSVE evaluation):
    B = 5, number of ensembles in the dynamics model
    α = 10, controls the penalty on OOD states
    τ = 10, budget parameter in Eq. 8
    β: in the Gym domain, 3 for random and medium tasks, 0.1 for the other tasks; in the Adroit domain, 30 for human and cloned tasks, 0.01 for expert tasks
    γ = 0.99, discount factor
    H = 1 million steps for MuJoCo tasks, 0.1 million for Adroit tasks
    w = 0.005, target-network smoothing coefficient
    actor learning rate = 3e-4; critic learning rate = 1e-4