Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Efficient Multi-Goal Reinforcement Learning via Value Consistency Prioritization
Authors: Jiawei Xu, Shuxing Li, Rui Yang, Chun Yuan, Lei Han
JAIR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate that VCP achieves significantly higher sample efficiency than existing algorithms on a range of challenging goal-conditioned manipulation tasks. |
| Researcher Affiliation | Collaboration | Jiawei Xu EMAIL Shuxing Li EMAIL Tsinghua Shenzhen International Graduate School Shenzhen, Guangdong, China Rui Yang EMAIL The Hong Kong University of Science and Technology Hong Kong, China Chun Yuan EMAIL Tsinghua Shenzhen International Graduate School Shenzhen, Guangdong, China Lei Han EMAIL Tencent Robotics X Shenzhen, Guangdong, China |
| Pseudocode | Yes | Algorithm 1 Updating The HER Buffer 1: Input: the original buffer Bo and the HER buffer Bh; 2: while training is not ending do 3: Sample a mini-batch of transitions from Bo; 4: Replace the desired goals with some achieved goals following the future strategy in HER; 5: Calculate the variance of Q-values for each transition in the mini-batch; 6: Store the mini-batch in Bh; 7: Update the priority of each transition in Bh according to Eq. (4); 8: end while |
| Open Source Code | Yes | Our implementation is available at https://github.com/jiawei415/VCP. |
| Open Datasets | Yes | We evaluate the VCP algorithm on 16 challenging goal-conditioned manipulation tasks, including 9 Hand environments, 4 Fetch environments, 2 Point environments and Reacher-v2. All environments are described here. ... For a detailed introduction to each environment, please refer to (Plappert et al., 2018). |
| Dataset Splits | Yes | We set batch size to 64 in the Fetch Reach-v1 and Reacher-v2 environments, and set batch size to 256 for the rest of the environments. We trained for 50 epochs per environment and did not use MPI for parallel data generation like (Plappert et al., 2018). In each epoch, we collect data for n times, and each time we generate 16 trajectories with length 50. We set n to 5 for Fetch Reach-v1, 15 for Reacher-v2, and 50 for the rest of the environments. After each training epoch, we evaluate each algorithm for 10 episodes, and calculate the average success rate of the 10 episodes. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud services used for running experiments. It mentions using 'vectorized environments' and '16 heads' which are software/architectural concepts. |
| Software Dependencies | No | The paper refers to using 'openai/baselines' for the HER implementation and indicates common hyperparameters, but it does not specify concrete version numbers for any software libraries, frameworks, or languages (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The hyper-parameters of the HER part of our algorithm are set according to the open source code. The learning rate of actor and critic is 0.001, buffer size is 1e6, polyak averaging coefficient for target network updating is 0.95, and number of additional goals used for replay is 4 with the future pattern. Each actor and each critic consists of a 3-layer fully connected network with 256 units in each layer. All actors share the first two layers of the network, as do the critics. We set 16 heads in our implementation. The priority temperature coefficient T in our algorithm was chosen among [1, 5, 9, 11, 15] for each environment for best performance. We set batch size to 64 in the Fetch Reach-v1 and Reacher-v2 environments, and set batch size to 256 for the rest of the environments. We trained for 50 epochs per environment and did not use MPI for parallel data generation like (Plappert et al., 2018). In each epoch, we collect data for n times, and each time we generate 16 trajectories with length 50. We set n to 5 for Fetch Reach-v1, 15 for Reacher-v2, and 50 for the rest of the environments. |
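The "future" relabeling step in Algorithm 1 (replacing desired goals with achieved goals from later in the same trajectory) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the `relabel_future` name and the per-transition dict structure with `achieved_goal`/`desired_goal` keys are assumptions for the example.

```python
import numpy as np

def relabel_future(trajectory, rng=None):
    """Hypothetical sketch of HER 'future' relabeling: for each
    transition at step t, replace the desired goal with an achieved
    goal sampled from a step >= t of the same trajectory."""
    rng = rng or np.random.default_rng(0)
    T = len(trajectory)
    relabeled = []
    for t, transition in enumerate(trajectory):
        future_step = int(rng.integers(t, T))  # uniform over [t, T-1]
        new = dict(transition)                 # copy, keep original intact
        new["desired_goal"] = trajectory[future_step]["achieved_goal"]
        relabeled.append(new)
    return relabeled
```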
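The paper's critic is a 3-layer MLP with 256 units per layer, where the 16 heads share the first two layers. A minimal numpy forward-pass sketch of that shared-trunk ensemble, under the assumption of ReLU activations (the activation function is not quoted in the report), might look like this; `make_ensemble_critic` and `forward` are hypothetical names:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_ensemble_critic(in_dim, hidden=256, n_heads=16):
    # Two shared hidden layers (as in the paper), then one linear
    # output layer per head producing a scalar Q-estimate.
    shared = [
        (rng.standard_normal((in_dim, hidden)) * 0.05, np.zeros(hidden)),
        (rng.standard_normal((hidden, hidden)) * 0.05, np.zeros(hidden)),
    ]
    heads = [
        (rng.standard_normal((hidden, 1)) * 0.05, np.zeros(1))
        for _ in range(n_heads)
    ]
    return shared, heads

def forward(params, x):
    shared, heads = params
    h = x
    for W, b in shared:
        h = np.maximum(h @ W + b, 0.0)  # ReLU (assumed activation)
    # Stack the per-head Q-estimates -> shape (batch, n_heads)
    return np.concatenate([h @ W + b for W, b in heads], axis=1)
```

The per-transition variance across the `n_heads` columns of this output is the quantity Algorithm 1 stores with each transition.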