Synthesizing Programmatic Reinforcement Learning Policies with Large Language Model Guided Search

Authors: Max Liu, Chan-Hung Yu, Wei-Hsu Lee, Cheng-Wei Hung, Yen-Chun Chen, Shao-Hua Sun

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare our proposed LLM-GS framework in the Karel domain to various existing PRL methods (Trivedi et al., 2021; Liu et al., 2023; Carvalho et al., 2024). The experimental results demonstrate that LLM-GS is significantly more effective and efficient than the existing methods. Extensive ablation studies further verify the critical role of our Pythonic DSL strategy and Scheduled Hill Climbing algorithm. Moreover, we conduct experiments with two novel tasks, showing that LLM-GS enables users without programming skills and knowledge of the domain or DSL to describe the tasks in natural language to obtain performant programs.
Researcher Affiliation | Collaboration | Max Liu (1), Chan-Hung Yu (1), Wei-Hsu Lee (1), Cheng-Wei Hung (1), Yen-Chun Chen (2), Shao-Hua Sun (1). Affiliations: (1) National Taiwan University; (2) Microsoft.
Pseudocode | Yes | The pseudo-code of CEBS is described in Algorithm 1, and the pseudo-code of HC is described in Algorithm 2.
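Algorithms 1 and 2 themselves are not reproduced in this report. As a rough illustration of the hill-climbing (HC) loop the row refers to, the sketch below samples K mutated neighbors per step and moves to the best one when it improves the score. The names `evaluate` and `mutate` are hypothetical stand-ins, not the paper's DSL operations:

```python
import random

def hill_climb(program, evaluate, mutate, k=32, budget=1000):
    """Minimal hill-climbing sketch (an assumption, not the authors'
    implementation): repeatedly sample k mutated neighbors of the
    current best program and move to the highest-scoring neighbor
    if it improves on the current score."""
    best, best_score = program, evaluate(program)
    evals = 1
    while evals < budget:
        neighbors = [mutate(best) for _ in range(k)]
        scores = [evaluate(p) for p in neighbors]
        evals += len(neighbors)
        top = max(range(len(neighbors)), key=lambda i: scores[i])
        if scores[top] > best_score:
            best, best_score = neighbors[top], scores[top]
    return best, best_score
```

On a toy objective (e.g. integers scored by distance to a target, with mutation adding ±1), this loop climbs to the optimum within the evaluation budget.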
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology, nor does it include a direct link to a code repository. It mentions using GPT-4 as an LLM module and refers to re-implementations of baselines, but not their own code.
Open Datasets | Yes | We evaluate our proposed framework LLM-guided search (LLM-GS) using the Karel tasks from the two problem sets: Karel (Trivedi et al., 2021) (STAIRCLIMBER, MAZE, FOURCORNERS, TOPOFF, HARVESTER, and CLEANHOUSE) and Karel-Hard (Liu et al., 2023) (DOORKEY, ONESTROKE, SEEDER, and SNAKE). More task details can be found in Appendix C.
Dataset Splits | Yes | Specifically, for each task, the number of task variances is C. Task variances arise from differences in the environment's initial states. Each program evaluation means executing the program on all C task variances to obtain an average return. We set the number of task variances C = 32 and the maximum number of program evaluations N = 10^6, i.e., the interaction budget. Programs achieving an average return of 1.0 are considered optimal.
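The evaluation protocol quoted above (execute each candidate program on all C task variants, average the returns, and count 1.0 as optimal) can be sketched as follows; `run_episode` is a hypothetical stand-in for rolling out a program in one Karel environment instance:

```python
def average_return(program, variants, run_episode):
    """Evaluate a program on every task variant (distinct initial
    state) and return the mean episodic return, as described in
    the quoted setup. `run_episode` is a hypothetical stand-in."""
    returns = [run_episode(program, v) for v in variants]
    return sum(returns) / len(returns)

def is_optimal(avg_return, tol=1e-9):
    """A program counts as optimal when its average return is 1.0."""
    return abs(avg_return - 1.0) < tol
```

Each call to `average_return` corresponds to one "program evaluation" against the interaction budget N, since it executes the program on all C variants.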
Hardware Specification | No | The paper mentions using GPT-4 and GPT-4o as LLM modules, but does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments or training models.
Software Dependencies | Yes | We use GPT-4 (Achiam et al., 2023) (gpt-4-turbo-2024-04-09 with temperature=1.0, top_p=1.0) as our LLM module to generate the initial search population. The scheduler of our proposed Scheduled HC starts from Kstart = 32 to Kend = 2048. We evaluate our LLM-GS and HC with 32 random seeds and 5 seeds for LEAPS, HPRL, and CEBS.
Experiment Setup | Yes | We set the number of task variances C = 32 and the maximum number of program evaluations N = 10^6, i.e., the interaction budget. Programs achieving an average return of 1.0 are considered optimal. We use GPT-4 (Achiam et al., 2023) (gpt-4-turbo-2024-04-09 with temperature=1.0, top_p=1.0) as our LLM module to generate the initial search population. The scheduler of our proposed Scheduled HC starts from Kstart = 32 to Kend = 2048. We evaluate our LLM-GS and HC with 32 random seeds and 5 seeds for LEAPS, HPRL, and CEBS.
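The quoted setup gives only the endpoints of the Scheduled Hill Climbing neighborhood size (Kstart = 32 to Kend = 2048); the shape of the schedule is not stated in this report. As an illustrative assumption only, a geometric interpolation over the evaluation budget could look like this (the function name and schedule shape are hypothetical, not the authors' implementation):

```python
def scheduled_k(evals_used, budget, k_start=32, k_end=2048):
    """Hypothetical schedule for the neighborhood size K: interpolate
    geometrically from k_start to k_end as the evaluation budget is
    consumed. The actual schedule used by LLM-GS may differ."""
    frac = min(max(evals_used / budget, 0.0), 1.0)
    return round(k_start * (k_end / k_start) ** frac)
```

Under this assumption, K starts at 32, reaches 2048 when the budget N is exhausted, and grows monotonically in between (e.g. 256 at the halfway point).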