Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Policy Search with High-Dimensional Context Variables
Authors: Voot Tangkaratt, Herke van Hoof, Simone Parisi, Gerhard Neumann, Jan Peters, Masashi Sugiyama
AAAI 2017 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed method on three problems. We start by studying C-MORE behavior in a scenario where we know the true reward model and the true low-dimensional context. Subsequently, we focus our attention on two simulated robotic ball hitting tasks. In the first task, a toy 2-DoF planar robot arm has to hit a ball placed on a plane. In the second task, a simulated 6-DoF robot arm has to hit a ball placed in a three-dimensional space. |
| Researcher Affiliation | Academia | Voot Tangkaratt The University of Tokyo, 113-0033 Tokyo, Japan EMAIL Herke van Hoof McGill University, 3480 Rue University, Montreal, Canada Technical University of Darmstadt, 64289 Darmstadt, Germany Simone Parisi Technical University of Darmstadt, 64289 Darmstadt, Germany EMAIL Gerhard Neumann University of Lincoln, LN6 7TS Lincoln, United Kingdom Technical University of Darmstadt, 64289 Darmstadt, Germany EMAIL Jan Peters MPI for Intelligent Systems, 72076 Tuebingen, Germany Technical University of Darmstadt, 64289 Darmstadt, Germany EMAIL Masashi Sugiyama The University of Tokyo, 277-8561 Chiba, Japan RIKEN AIP Center, 351-0198 Saitama, Japan EMAIL |
| Pseudocode | Yes | Algorithm 1: C-MORE |
| Open Source Code | No | The paper does not provide any specific links or statements about the availability of its source code. |
| Open Datasets | No | The paper uses a "synthetic task with known ground truth" and "robotic ball hitting tasks based on camera images" where the images were collected or generated by the authors. No concrete access information (link, DOI, formal citation to a public dataset) is provided for these datasets. |
| Dataset Splits | Yes | For C-MORE Nuc. Norm, C-MORE LASSO and C-MORE PCA, we perform 5-fold cross-validation every 100 policy updates to choose the values of regularization parameter for nuclear norm, regularization parameter for ℓ1 norm, and dimension dz, respectively. |
| Hardware Specification | No | The paper describes simulated robot arms and tasks but does not specify the hardware (e.g., CPU, GPU models) on which these simulations were run. |
| Software Dependencies | No | The paper mentions software like IPOPT and APG but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We set γ = 0.99 and H0 = 150. The sampling Gaussian distribution is initialized with random mean and covariance Q = 10,000I. For learning, we collect 35 new samples and keep track of the samples collected during the last 20 iterations to stabilize the policy update. The learning is performed for a maximum of 100 iterations. If the KL divergence is lower than 0.1, then the learning is considered to be converged and the policy is not updated anymore. |
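
The loop structure quoted in the Experiment Setup row can be illustrated with a minimal sketch. It is not the authors' implementation: only the four constants come from the quoted text. The rest is assumed for illustration, namely the univariate Gaussian, the reading of "the KL divergence" as the divergence between successive search distributions, and the caller-supplied `update` and `sample` functions, which are hypothetical stand-ins for the C-MORE policy update and the task simulator.

```python
import math
import random
from collections import deque

NEW_SAMPLES = 35   # new samples collected per iteration (quoted setup)
WINDOW = 20        # samples from the last 20 iterations are reused
MAX_ITERS = 100    # maximum number of learning iterations
KL_TOL = 0.1       # learning counts as converged below this KL value


def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between two univariate Gaussians (closed form)."""
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)


def run(update, sample, mu, var, seed=0):
    """Iterate until the KL between successive distributions drops below KL_TOL.

    `update` and `sample` are hypothetical callables supplied by the caller;
    they stand in for the policy update and the simulator, respectively.
    Returns (mu, var, iterations_used, number_of_pooled_samples).
    """
    rng = random.Random(seed)
    buffer = deque(maxlen=WINDOW)  # drops the oldest batch automatically
    pooled = []
    for it in range(1, MAX_ITERS + 1):
        buffer.append([sample(rng, mu, var) for _ in range(NEW_SAMPLES)])
        pooled = [s for batch in buffer for s in batch]
        new_mu, new_var = update(pooled, mu, var)
        # Assumption: convergence is measured between new and old distribution.
        kl = gaussian_kl(new_mu, new_var, mu, var)
        mu, var = new_mu, new_var
        if kl < KL_TOL:
            return mu, var, it, len(pooled)
    return mu, var, MAX_ITERS, len(pooled)


def draw(rng, mu, var):
    """Toy sampler: one draw from N(mu, var)."""
    return rng.gauss(mu, math.sqrt(var))
```

With an update that never converges, for example `run(lambda s, m, v: (m + 10.0, v), draw, 0.0, 1.0)`, the loop runs for all 100 iterations and the final update sees 20 × 35 = 700 pooled samples, which is the stabilizing window described in the setup.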