Robust and Data-efficient Q-learning by Composite Value-estimation

Authors: Gabriel Kalweit, Maria Kalweit, Joschka Boedecker

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show the efficacy of Composite Q-learning in the tabular case and furthermore employ Composite Q-learning within TD3. We compare Composite TD3 with TD3 and TD3(Δ), which we introduce as an off-policy variant of TD(Δ). Moreover, we show that Composite TD3 significantly outperforms TD3 as well as TD3(Δ) in terms of data-efficiency in multiple simulated robot tasks, and that Composite Q-learning is robust to stochastic immediate rewards.
Researcher Affiliation | Academia | Gabriel Kalweit EMAIL, Maria Kalweit EMAIL, Joschka Boedecker EMAIL, Neurorobotics Lab and BrainLinks-BrainTools, University of Freiburg, Germany
Pseudocode | Yes | Algorithm 1: Composite Q-learning; Algorithm 2: Deep Deterministic Continuous Composite Q-learning; Algorithm 3: Deep Deterministic Continuous Off-policy TD(Δ)
Open Source Code | Yes | Code based on the implementation of TD3 (https://github.com/sfujim/TD3) can be found in the supplementary (https://github.com/NrLabFreiburg/composite-q-learning).
Open Datasets | Yes | We apply Composite Q-learning within TD3 and compare against TD3 and TD3(Δ) on three robot simulation tasks of OpenAI Gym (Brockman et al., 2016) based on MuJoCo (Todorov et al., 2012): Walker2d-v2, Hopper-v2 and Humanoid-v2.
Dataset Splits | No | The paper uses dynamic OpenAI Gym environments for its simulated robot tasks. It mentions "8 training runs" and "mean evaluation performance over 100 initial states," which describes the experimental protocol and evaluation methodology, but it does not provide training/test/validation splits (e.g., percentages or sample counts) for a static dataset, which is what the question implies.
Hardware Specification | No | The paper mentions "simulated robot tasks" but does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the simulations or train the models.
Software Dependencies | No | The paper states that its code is based on TD3's implementation and provides a GitHub link, but it does not specify exact versions for critical software components or libraries (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | For all approaches, we use Gaussian noise with σ = 0.15 for exploration and the optimized learning rate of 10^-3 for the full Q-function. The target update rate (5·10^-3) and actor setting (two hidden layers with 400 and 300 neurons and ReLU activation) are set as in (Fujimoto et al., 2018). For Humanoid-v2, we use a slightly changed parameter setting with a learning rate of 10^-4 for both actor and critic, as suggested in (Dorka et al., 2020). Table C.1: Configuration space of the hyperparameter optimization.
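The abstract quoted above describes chaining short-horizon value estimates into a long-horizon target. As a rough illustration of that idea (our own toy construction, not the paper's algorithm: the environment, horizon, and learning rate below are arbitrary), the following tabular sketch learns a stack of Q-functions where Q_i predicts the i-step truncated return via the one-step recursion Q_i(s, a) ← r + γ · Q_{i-1}(s', a'), with Q_0 ≡ 0, so each estimator can be trained from single transitions:

```python
import numpy as np

def truncated_q_chain(n_steps=3, n_states=6, gamma=0.5, alpha=0.5, iters=2000):
    """Tabular recursion on a deterministic chain: state s -> s + 1, reward 1.

    Q[i, s] estimates the i-step truncated return from state s; Q[0] is
    fixed to zero, so Q[1] learns the immediate reward, Q[2] the 2-step
    return, and so on -- each from single sampled transitions.
    """
    rng = np.random.default_rng(0)
    Q = np.zeros((n_steps + 1, n_states))
    for _ in range(iters):
        s = rng.integers(0, n_states - 1)        # sample a transition (s, r, s')
        s_next, r = s + 1, 1.0
        for i in range(1, n_steps + 1):          # chained one-step targets
            target = r + gamma * Q[i - 1, s_next]
            Q[i, s] += alpha * (target - Q[i, s])
    return Q

Q = truncated_q_chain()
# On this chain the i-step truncated return from state 0 is sum_{k<i} gamma^k,
# e.g. 1 + 0.5 + 0.25 = 1.75 for i = 3.
print(round(Q[3, 0], 4))
```

With γ = 0.5 the learned Q[3, 0] converges to the analytic 3-step return 1.75, which is the property the chained targets are meant to deliver; see the paper's Algorithms 1-2 and the released code for the actual Composite Q-learning update.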
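Two of the quoted setup details can be made concrete in a minimal numpy sketch (the constants σ = 0.15 and τ = 5·10^-3 come from the setup above; the function names, action dimensionality, and the ±1 action bounds are our own illustrative assumptions): Gaussian exploration noise added to the actor's action, and the soft (Polyak) target update of Fujimoto et al. (2018).

```python
import numpy as np

SIGMA = 0.15   # exploration noise std, as quoted in the setup
TAU = 5e-3     # soft target update rate, as quoted in the setup

def explore(action, rng, low=-1.0, high=1.0):
    """Add N(0, SIGMA^2) exploration noise and clip to the action bounds."""
    noise = rng.normal(0.0, SIGMA, size=action.shape)
    return np.clip(action + noise, low, high)

def soft_update(target_params, online_params):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1 - TAU) * t + TAU * o for t, o in zip(target_params, online_params)]

rng = np.random.default_rng(0)
a = explore(np.zeros(6), rng)                        # a 6-dim toy action vector
tgt = soft_update([np.zeros((400, 300))], [np.ones((400, 300))])
print(a.shape, float(tgt[0][0, 0]))                  # target moved by tau toward online
```

The 400x300 array mirrors the quoted actor layer sizes only to give the update a realistic shape; in the actual training loop the same averaging is applied to every actor and critic parameter after each gradient step.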