Efficient Language-instructed Skill Acquisition via Reward-Policy Co-Evolution

Authors: Changxin Huang, Yanbin Chang, Junfan Lin, Junyang Liang, Runhao Zeng, Jianqiang Li

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results demonstrate that our approach utilizes only 89% of the data and achieves an average normalized improvement of 95.3% across various high-dimensional robotic skill-learning tasks, highlighting its effectiveness in enhancing the adaptability and precision of robots in complex environments. We conducted experimental evaluations of the proposed method within the Isaac Gym (Makoviychuk et al. 2021) RL benchmark and performed comparative analyses against the sparse reward method, human-designed reward methods, and traditional LLM-designed reward function methods.
Researcher Affiliation | Academia | (1) National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, China; (2) Peng Cheng Laboratory, Shenzhen, China; (3) Artificial Intelligence Research Institute, Shenzhen MSU-BIT University, Shenzhen, China. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the method using mathematical formulations and detailed textual explanations of the framework components (Reward Evolution, Policy Evolution), but does not include a distinct block labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | No | The paper does not contain any explicit statements or links indicating that source code for the described methodology is publicly available.
Open Datasets | Yes | We conducted experimental evaluations of the proposed method within the Isaac Gym (Makoviychuk et al. 2021) RL benchmark and performed comparative analyses against the sparse reward method, human-designed reward methods, and traditional LLM-designed reward function methods. Environments and Tasks: We validated our approach on six robotic tasks within Isaac Gym, including Ant, Humanoid, Shadow Hand, Allegro Hand, Franka Cabinet, and Shadow Hand Upside Down, as shown in Fig. 3.
Dataset Splits | No | The paper mentions using Isaac Gym for tasks and discusses the number of rounds and policy evaluations. However, it does not provide specific details on how datasets for these tasks were split into training, validation, or test sets, or reference standard splits from the Isaac Gym benchmark.
Hardware Specification | No | The paper does not provide specific hardware details, such as GPU or CPU models or memory specifications, used for running the experiments.
Software Dependencies | No | In our experiments, we employed the large language model GPT-4o to generate reward functions. The RL method used to validate our proposed approach was Proximal Policy Optimization (PPO) (Schulman et al. 2017). The paper mentions these software components but does not provide specific version numbers for them or for any other libraries.
Experiment Setup | Yes | In our experiments, we employed the large language model GPT-4o to generate reward functions. Testing revealed that this model outperformed GPT-4 on most tasks (with the exception of the Franka Cabinet task) in terms of average performance. The RL method used to validate our proposed approach was Proximal Policy Optimization (PPO) (Schulman et al. 2017). In all experimental methods, the LLM conducted a total of N = 5 rounds of reward design for each robotic task, generating K = 6 reward functions in each round. For the algorithm proposed in this paper, each reward function underwent policy evolution, where the Gaussian Process was initialized with fusion-ratio points α_initial = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]. Each reward function underwent a total of J = 12 policy-model evaluations, with T_BO = 200 for each evaluation. The other settings of our experiments can be found in the Appendix.
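To make the quoted hyperparameters concrete, here is a minimal Python sketch of the reported evolution budget (N = 5 rounds, K = 6 reward functions, J = 12 evaluations seeded from the six initial fusion-ratio points). This is an illustrative assumption, not the authors' code: the simple best-point random search stands in for the paper's Gaussian-Process Bayesian optimization, and evaluate_policy is a toy placeholder for T_BO = 200 training iterations of PPO.

```python
import random

N_ROUNDS = 5      # rounds of LLM reward design per task
K_REWARDS = 6     # reward functions generated per round
J_EVALS = 12      # policy-model evaluations per reward function
T_BO = 200        # training iterations per evaluation
ALPHA_INIT = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]  # initial fusion-ratio points

def evaluate_policy(alpha, t_bo):
    """Toy placeholder: would train a PPO policy for t_bo iterations with
    fusion ratio alpha and return its task score."""
    return 1.0 - (alpha - 0.4) ** 2

def policy_evolution():
    # Seed with the initial fusion-ratio points (the paper seeds a Gaussian
    # Process here), then spend the remaining budget refining near the best.
    scores = {a: evaluate_policy(a, T_BO) for a in ALPHA_INIT}
    for _ in range(J_EVALS - len(ALPHA_INIT)):
        best = max(scores, key=scores.get)
        cand = min(1.0, max(0.0, best + random.uniform(-0.1, 0.1)))
        scores[cand] = evaluate_policy(cand, T_BO)
    best = max(scores, key=scores.get)
    return best, scores[best]

best_alpha, best_score = policy_evolution()
print(best_alpha, best_score)
```

Under this sketch, each task would consume at most N_ROUNDS x K_REWARDS x J_EVALS = 360 policy evaluations; the quoted 89%-data figure suggests the actual method terminates well short of that budget.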