Reinforcement Learning with Parameterized Actions
Authors: Warwick Masson, Pravesh Ranchod, George Konidaris
AAAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce the Q-PAMDP algorithm for learning in these domains, show that it converges to a local optimum, and compare it to direct policy search in the goalscoring and Platform domains. |
| Researcher Affiliation | Academia | Warwick Masson and Pravesh Ranchod, School of Computer Science and Applied Mathematics, University of Witwatersrand, Johannesburg, South Africa (EMAIL, EMAIL); George Konidaris, Department of Computer Science, Duke University, Durham, North Carolina 27708 (EMAIL) |
| Pseudocode | Yes | Algorithm 1 Q-PAMDP(k) |
| Open Source Code | No | No explicit statement or link regarding the public availability of source code for the described methodology was found. |
| Open Datasets | No | The paper describes experiments in the 'goalscoring' and 'Platform' domains, which appear to be simulation environments set up by the authors rather than pre-existing public datasets with explicit access information. It references 'Kitano et al. 1997' for the robot soccer problem, but this is a problem description, not a dataset citation with access details. |
| Dataset Splits | No | The paper does not provide specific train/validation/test dataset splits, percentages, or sample counts. It mentions 'averaged over 20 runs' for evaluation, but not data partitioning. |
| Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, or memory) used for running the experiments were provided in the paper. |
| Software Dependencies | No | The paper mentions algorithms like 'gradient-descent Sarsa(λ)' and 'eNAC' but does not provide specific software or library names with version numbers (e.g., Python 3.x, PyTorch 1.x) that are required to reproduce the experiments. |
| Experiment Setup | Yes | At each step we perform one eNAC update based on 50 episodes and then refit Qω using 50 gradient descent Sarsa(λ) episodes. |
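The alternation described in the Pseudocode and Experiment Setup rows (refit Q for the discrete actions, then update the continuous-parameter policy) can be sketched on a toy problem. This is an illustrative sketch only, not the paper's implementation: the one-step parameterized-action bandit, the `TARGET`/`BONUS` constants, the Monte Carlo Q refit (standing in for gradient-descent Sarsa(λ)), and the analytic gradient step (standing in for the eNAC update) are all assumptions made for this example.

```python
import random

# Hypothetical toy domain: a one-step parameterized-action bandit.
# Discrete actions 0/1 each take a continuous parameter x; reward peaks
# at a per-action target. (Not the paper's goal-scoring or Platform domains.)
TARGET = [0.3, 0.8]
BONUS = [0.0, 0.5]

def reward(a, x):
    return BONUS[a] - (x - TARGET[a]) ** 2

def q_pamdp(k=1, outer_iters=200, mc_episodes=50, lr=0.1, seed=0):
    """Sketch of the Q-PAMDP(k) alternation: refit Q under the current
    parameter policy, then take k update steps on the parameter policy."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]   # parameter policy: one parameter mean per action
    q = [0.0, 0.0]       # Q-values of the discrete actions
    for _ in range(outer_iters):
        # Q-step: Monte Carlo refit of Q with small exploration noise
        # (stands in for the paper's 50 Sarsa(lambda) episodes).
        for a in (0, 1):
            q[a] = sum(reward(a, theta[a] + 0.05 * rng.gauss(0, 1))
                       for _ in range(mc_episodes)) / mc_episodes
        # P-step: k parameter-policy updates per action
        # (analytic gradient stands in for the paper's eNAC update).
        for a in (0, 1):
            for _ in range(k):
                theta[a] += lr * 2.0 * (TARGET[a] - theta[a])
    greedy = 0 if q[0] >= q[1] else 1
    return q, theta, greedy
```

Under this setup the parameter means converge to their per-action optima and the final greedy discrete action is the one with the higher peak reward, mirroring the local-optimum convergence the paper claims for Q-PAMDP.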