Efficient Language-instructed Skill Acquisition via Reward-Policy Co-Evolution

Authors: Changxin Huang, Yanbin Chang, Junfan Lin, Junyang Liang, Runhao Zeng, Jianqiang Li

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results demonstrate that our approach utilizes only 89% of the data and achieves an average normalized improvement of 95.3% across various high-dimensional robotic skill-learning tasks, highlighting its effectiveness in enhancing the adaptability and precision of robots in complex environments. We conducted experimental evaluations of the proposed method within the Isaac Gym (Makoviychuk et al. 2021) RL benchmark and performed comparative analyses against the sparse reward method, human-designed reward methods, and traditional LLM-designed reward function methods.
Researcher Affiliation | Academia | (1) National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, China; (2) Peng Cheng Laboratory, Shenzhen, China; (3) Artificial Intelligence Research Institute, Shenzhen MSU-BIT University, Shenzhen, China. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the method using mathematical formulations and detailed textual explanations of the framework components (Reward Evolution, Policy Evolution), but does not include a distinct block labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | No | The paper does not contain any explicit statements or links indicating that source code for the described methodology is publicly available.
Open Datasets | Yes | We conducted experimental evaluations of the proposed method within the Isaac Gym (Makoviychuk et al. 2021) RL benchmark and performed comparative analyses against the sparse reward method, human-designed reward methods, and traditional LLM-designed reward function methods. Environments and Tasks: We validated our approach on six robotic tasks within Isaac Gym, including Ant, Humanoid, Shadow Hand, Allegro Hand, Franka Cabinet, and Shadow Hand Upside Down, as shown in Fig. 3.
Dataset Splits | No | The paper mentions using Isaac Gym for tasks and discusses the number of rounds and policy evaluations. However, it does not provide specific details on how datasets for these tasks were split into training, validation, or test sets, or reference standard splits from the Isaac Gym benchmark.
Hardware Specification | No | The paper does not provide specific hardware details, such as GPU or CPU models or memory specifications, used for running the experiments.
Software Dependencies | No | In our experiments, we employed the large language model GPT-4o to generate reward functions. The RL method used to validate our proposed approach was Proximal Policy Optimization (PPO) (Schulman et al. 2017). The paper mentions these software components but does not provide specific version numbers for them or for any other libraries.
Experiment Setup | Yes | In our experiments, we employed the large language model GPT-4o to generate reward functions. Testing revealed that this model outperformed GPT-4 on most tasks (with the exception of the Franka Cabinet task) in terms of average performance. The RL method used to validate our proposed approach was Proximal Policy Optimization (PPO) (Schulman et al. 2017). In all experimental methods, the LLM conducted a total of N = 5 rounds of reward design for each robotic task, generating K = 6 reward functions in each round. For the algorithm proposed in this paper, each reward function underwent policy evolution, where the Gaussian Process was initialized with fusion-ratio points α_initial = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]. Each reward function underwent a total of J = 12 policy-model evaluations, with T_BO = 200 for each evaluation. The other settings of our experiments can be found in the Appendix.
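To make the quoted hyperparameters concrete, here is a minimal Python sketch of the reported evolution budget (N = 5 rounds, K = 6 reward functions, J = 12 evaluations seeded from the six initial fusion-ratio points). This is an illustrative assumption, not the authors' code: the simple best-point random search stands in for the paper's Gaussian-Process Bayesian optimization, and evaluate_policy is a toy placeholder for T_BO = 200 training iterations of PPO.

```python
import random

N_ROUNDS = 5      # rounds of LLM reward design per task
K_REWARDS = 6     # reward functions generated per round
J_EVALS = 12      # policy-model evaluations per reward function
T_BO = 200        # training iterations per evaluation
ALPHA_INIT = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]  # initial fusion-ratio points

def evaluate_policy(alpha, t_bo):
    """Toy placeholder: would train a PPO policy for t_bo iterations with
    fusion ratio alpha and return its task score."""
    return 1.0 - (alpha - 0.4) ** 2

def policy_evolution():
    # Seed with the initial fusion-ratio points (the paper seeds a Gaussian
    # Process here), then spend the remaining budget refining near the best.
    scores = {a: evaluate_policy(a, T_BO) for a in ALPHA_INIT}
    for _ in range(J_EVALS - len(ALPHA_INIT)):
        best = max(scores, key=scores.get)
        cand = min(1.0, max(0.0, best + random.uniform(-0.1, 0.1)))
        scores[cand] = evaluate_policy(cand, T_BO)
    best = max(scores, key=scores.get)
    return best, scores[best]

best_alpha, best_score = policy_evolution()
print(best_alpha, best_score)
```

Under this sketch, each task would consume at most N_ROUNDS x K_REWARDS x J_EVALS = 360 policy evaluations; the quoted 89%-data figure suggests the actual method terminates well short of that budget.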