Robust and Data-efficient Q-learning by Composite Value-estimation

Authors: Gabriel Kalweit, Maria Kalweit, Joschka Boedecker

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show the efficacy of Composite Q-learning in the tabular case and furthermore employ Composite Q-learning within TD3. We compare Composite TD3 with TD3 and TD3(Δ), which we introduce as an off-policy variant of TD(Δ). Moreover, we show that Composite TD3 significantly outperforms TD3 as well as TD3(Δ) in terms of data-efficiency in multiple simulated robot tasks, and that Composite Q-learning is robust to stochastic immediate rewards.
Researcher Affiliation | Academia | Gabriel Kalweit EMAIL, Maria Kalweit EMAIL, Joschka Boedecker EMAIL, Neurorobotics Lab and BrainLinks-BrainTools, University of Freiburg, Germany
Pseudocode | Yes | Algorithm 1: Composite Q-learning; Algorithm 2: Deep Deterministic Continuous Composite Q-learning; Algorithm 3: Deep Deterministic Continuous Off-policy TD(Δ)
Open Source Code | Yes | Code based on the implementation of TD3 (https://github.com/sfujim/TD3) can be found in the supplementary (https://github.com/NrLabFreiburg/composite-q-learning).
Open Datasets | Yes | We apply Composite Q-learning within TD3 and compare against TD3 and TD3(Δ) on three robot simulation tasks of OpenAI Gym (Brockman et al., 2016) based on MuJoCo (Todorov et al., 2012): Walker2d-v2, Hopper-v2 and Humanoid-v2.
Dataset Splits | No | The paper uses dynamic OpenAI Gym environments for its simulated robot tasks. It mentions "8 training runs" and "mean evaluation performance over 100 initial states," which describes the experimental protocol and evaluation methodology, but it does not provide training/test/validation splits (e.g., percentages or sample counts) for a static dataset, which is what the question implies.
Hardware Specification | No | The paper mentions "simulated robot tasks" but does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the simulations or train the models.
Software Dependencies | No | The paper states that its code is based on TD3's implementation and provides a GitHub link, but it does not specify exact versions for critical software components or libraries (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | For all approaches, we use Gaussian noise with σ = 0.15 for exploration and the optimized learning rate of 10^-3 for the full Q-function. The target update rate (5·10^-3) and actor setting (two hidden layers with 400 and 300 neurons and ReLU activation) are set as in (Fujimoto et al., 2018). For Humanoid-v2, we use a slightly changed parameter setting with a learning rate of 10^-4 for both actor and critic, as suggested in (Dorka et al., 2020). Table C.1: Configuration space of the hyperparameter optimization.
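The abstract quoted above describes chaining short-horizon value estimates into a long-horizon target. As a rough illustration of that idea (our own toy construction, not the paper's algorithm: the environment, horizon, and learning rate below are arbitrary), the following tabular sketch learns a stack of Q-functions where Q_i predicts the i-step truncated return via the one-step recursion Q_i(s, a) ← r + γ · Q_{i-1}(s', a'), with Q_0 ≡ 0, so each estimator can be trained from single transitions:

```python
import numpy as np

def truncated_q_chain(n_steps=3, n_states=6, gamma=0.5, alpha=0.5, iters=2000):
    """Tabular recursion on a deterministic chain: state s -> s + 1, reward 1.

    Q[i, s] estimates the i-step truncated return from state s; Q[0] is
    fixed to zero, so Q[1] learns the immediate reward, Q[2] the 2-step
    return, and so on -- each from single sampled transitions.
    """
    rng = np.random.default_rng(0)
    Q = np.zeros((n_steps + 1, n_states))
    for _ in range(iters):
        s = rng.integers(0, n_states - 1)        # sample a transition (s, r, s')
        s_next, r = s + 1, 1.0
        for i in range(1, n_steps + 1):          # chained one-step targets
            target = r + gamma * Q[i - 1, s_next]
            Q[i, s] += alpha * (target - Q[i, s])
    return Q

Q = truncated_q_chain()
# On this chain the i-step truncated return from state 0 is sum_{k<i} gamma^k,
# e.g. 1 + 0.5 + 0.25 = 1.75 for i = 3.
print(round(Q[3, 0], 4))
```

With γ = 0.5 the learned Q[3, 0] converges to the analytic 3-step return 1.75, which is the property the chained targets are meant to deliver; see the paper's Algorithms 1-2 and the released code for the actual Composite Q-learning update.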
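Two of the quoted setup details can be made concrete in a minimal numpy sketch (the constants σ = 0.15 and τ = 5·10^-3 come from the setup above; the function names, action dimensionality, and the ±1 action bounds are our own illustrative assumptions): Gaussian exploration noise added to the actor's action, and the soft (Polyak) target update of Fujimoto et al. (2018).

```python
import numpy as np

SIGMA = 0.15   # exploration noise std, as quoted in the setup
TAU = 5e-3     # soft target update rate, as quoted in the setup

def explore(action, rng, low=-1.0, high=1.0):
    """Add N(0, SIGMA^2) exploration noise and clip to the action bounds."""
    noise = rng.normal(0.0, SIGMA, size=action.shape)
    return np.clip(action + noise, low, high)

def soft_update(target_params, online_params):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1 - TAU) * t + TAU * o for t, o in zip(target_params, online_params)]

rng = np.random.default_rng(0)
a = explore(np.zeros(6), rng)                        # a 6-dim toy action vector
tgt = soft_update([np.zeros((400, 300))], [np.ones((400, 300))])
print(a.shape, float(tgt[0][0, 0]))                  # target moved by tau toward online
```

The 400x300 array mirrors the quoted actor layer sizes only to give the update a realistic shape; in the actual training loop the same averaging is applied to every actor and critic parameter after each gradient step.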