Robust and Data-efficient Q-learning by Composite Value-estimation
Authors: Gabriel Kalweit, Maria Kalweit, Joschka Boedecker
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show the efficacy of Composite Q-learning in the tabular case and furthermore employ Composite Q-learning within TD3. We compare Composite TD3 with TD3 and TD3(Δ), which we introduce as an off-policy variant of TD(Δ). Moreover, we show that Composite TD3 outperforms TD3 as well as TD3(Δ) significantly in terms of data-efficiency in multiple simulated robot tasks and that Composite Q-learning is robust to stochastic immediate rewards. |
| Researcher Affiliation | Academia | Gabriel Kalweit EMAIL Maria Kalweit EMAIL Joschka Boedecker EMAIL Neurorobotics Lab and BrainLinks-BrainTools, University of Freiburg, Germany |
| Pseudocode | Yes | Algorithm 1: Composite Q-learning; Algorithm 2: Deep Deterministic Continuous Composite Q-learning; Algorithm 3: Deep Deterministic Continuous Off-policy TD(Δ) |
| Open Source Code | Yes | Code based on the implementation of TD3¹ can be found in the supplementary². ¹https://github.com/sfujim/TD3 ²https://github.com/NrLabFreiburg/composite-q-learning |
| Open Datasets | Yes | We apply Composite Q-learning within TD3 and compare against TD3 and TD3(Δ) on three robot simulation tasks of OpenAI Gym (Brockman et al., 2016) based on MuJoCo (Todorov et al., 2012): Walker2d-v2, Hopper-v2 and Humanoid-v2. |
| Dataset Splits | No | The paper uses OpenAI Gym environments for simulated robot tasks, which are dynamic. It mentions "8 training runs" and "mean evaluation performance over 100 initial states," which describes the experimental protocol and evaluation methodology, but does not provide specific training/test/validation dataset splits (e.g., percentages or sample counts) for a static dataset, which is what the question implies. |
| Hardware Specification | No | The paper mentions "simulated robot tasks" but does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run these simulations or train the models. |
| Software Dependencies | No | The paper mentions code is based on TD3's implementation and provides a GitHub link, but it does not specify exact versions for critical software components or libraries (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | For all approaches, we use Gaussian noise with σ = 0.15 for exploration and the optimized learning rate of 10⁻³ for the full Q-function. Target update (5·10⁻³) and actor setting (two hidden layers with 400 and 300 neurons and ReLU activation) are set as in (Fujimoto et al., 2018). For Humanoid-v2, we use a slightly changed parameter setting with a learning rate of 10⁻⁴ for both actor and critic as suggested in (Dorka et al., 2020). Table C.1: Configuration space of the hyperparameter optimization. |
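The experiment-setup cell names a few concrete design choices: a TD3-style actor with two hidden layers (400 and 300 ReLU units), Gaussian exploration noise with σ = 0.15, and a soft target-network update of 5·10⁻³. As a minimal NumPy sketch of those three pieces (an illustrative reconstruction, not the authors' code; `init_actor`, `actor_forward`, `explore`, and `soft_update` are hypothetical names, and the weight initialization is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_actor(obs_dim, act_dim):
    # Two hidden layers of 400 and 300 units, as in Fujimoto et al. (2018).
    sizes = [obs_dim, 400, 300, act_dim]
    return [(rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def actor_forward(params, obs, max_action=1.0):
    h = obs
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)          # ReLU activation
    W, b = params[-1]
    return max_action * np.tanh(h @ W + b)      # bounded actions

def explore(action, sigma=0.15, max_action=1.0):
    # Gaussian exploration noise with sigma = 0.15, clipped to the action bounds.
    noise = rng.normal(0.0, sigma, action.shape)
    return np.clip(action + noise, -max_action, max_action)

def soft_update(target, online, tau=5e-3):
    # Polyak averaging of target-network parameters with tau = 5e-3.
    return [(tau * W + (1 - tau) * Wt, tau * b + (1 - tau) * bt)
            for (W, b), (Wt, bt) in zip(online, target)]
```

The deep networks of the paper would replace these NumPy arrays, but the exploration-noise and target-update arithmetic is the same.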