Synthesizing Programmatic Reinforcement Learning Policies with Large Language Model Guided Search
Authors: Max Liu, Chan-Hung Yu, Wei-Hsu Lee, Cheng-Wei Hung, Yen-Chun Chen, Shao-Hua Sun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare our proposed LLM-GS framework in the Karel domain to various existing PRL methods (Trivedi et al., 2021; Liu et al., 2023; Carvalho et al., 2024). The experimental results demonstrate that LLM-GS is significantly more effective and efficient than the existing methods. Extensive ablation studies further verify the critical role of our Pythonic DSL strategy and Scheduled Hill Climbing algorithm. Moreover, we conduct experiments with two novel tasks, showing that LLM-GS enables users without programming skills and knowledge of the domain or DSL to describe the tasks in natural language to obtain performant programs. |
| Researcher Affiliation | Collaboration | Max Liu1 Chan-Hung Yu1 Wei-Hsu Lee1 Cheng-Wei Hung1 Yen-Chun Chen2 Shao-Hua Sun1 1National Taiwan University 2Microsoft |
| Pseudocode | Yes | The pseudo-code of CEBS is described in Algorithm 1, and the pseudo-code of HC is described in Algorithm 2. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology, nor does it include a direct link to a code repository. It mentions using GPT-4 as an LLM module and refers to re-implementations of baselines, but not their own code. |
| Open Datasets | Yes | We evaluate our proposed framework LLM-guided search (LLM-GS) using the Karel tasks from the two problem sets: Karel (Trivedi et al., 2021) (STAIRCLIMBER, MAZE, FOURCORNERS, TOPOFF, HARVESTER, and CLEANHOUSE) and Karel-Hard (Liu et al., 2023) (DOORKEY, ONESTROKE, SEEDER, and SNAKE). More task details can be found in Appendix C. |
| Dataset Splits | Yes | Specifically, for each task, the number of task variants is C. Task variants arise from differences in the environment's initial states. Each program evaluation means executing the program on all C task variants to obtain an average return. We set the number of task variants C = 32 and the maximum number of program evaluations N = 10^6, i.e., the interaction budget. Programs achieving an average return of 1.0 are considered optimal. |
| Hardware Specification | No | The paper mentions using GPT-4 and GPT-4o as LLM modules, but does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments or training models. |
| Software Dependencies | Yes | We use GPT-4 (Achiam et al., 2023) (gpt-4-turbo-2024-04-09 with temperature=1.0, top_p=1.0) as our LLM module to generate the initial search population. The scheduler of our proposed Scheduled HC starts at K_start = 32 and increases to K_end = 2048. We evaluate our LLM-GS and HC with 32 random seeds, and LEAPS, HPRL, and CEBS with 5 seeds. |
| Experiment Setup | Yes | We set the number of task variants C = 32 and the maximum number of program evaluations N = 10^6, i.e., the interaction budget. Programs achieving an average return of 1.0 are considered optimal. We use GPT-4 (Achiam et al., 2023) (gpt-4-turbo-2024-04-09 with temperature=1.0, top_p=1.0) as our LLM module to generate the initial search population. The scheduler of our proposed Scheduled HC starts at K_start = 32 and increases to K_end = 2048. We evaluate our LLM-GS and HC with 32 random seeds, and LEAPS, HPRL, and CEBS with 5 seeds. |
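The setup rows describe a scheduled hill-climbing search whose per-iteration neighborhood size grows from K_start = 32 to K_end = 2048 under a fixed evaluation budget. A minimal sketch of that pattern is below; the doubling schedule, the helper names (`neighbor_fn`, `score_fn`), and the toy numeric objective are illustrative assumptions, not the paper's actual Scheduled HC over Karel programs:

```python
import random

def scheduled_hill_climbing(init, neighbor_fn, score_fn,
                            k_start=32, k_end=2048, budget=10_000):
    """Hill climbing whose neighborhood size K grows over time.

    NOTE: the doubling-on-stagnation schedule used here is an assumption
    for illustration; the paper's exact schedule may differ.
    """
    best, best_score = init, score_fn(init)
    evals, k = 1, k_start
    while evals < budget:
        # Sample K neighbors of the current best candidate.
        neighbors = [neighbor_fn(best) for _ in range(k)]
        scored = [(score_fn(n), n) for n in neighbors]
        evals += len(neighbors)
        top_score, top = max(scored, key=lambda t: t[0])
        if top_score > best_score:
            best, best_score = top, top_score   # greedy uphill move
        else:
            k = min(2 * k, k_end)               # widen the search when stuck
    return best, best_score

# Toy usage: maximize -(x - 3)^2 by perturbing a float.
random.seed(0)
x, s = scheduled_hill_climbing(
    init=0.0,
    neighbor_fn=lambda v: v + random.uniform(-0.5, 0.5),
    score_fn=lambda v: -(v - 3.0) ** 2,
    k_start=4, k_end=64, budget=2000,
)
```

In the paper's setting, a candidate would be a DSL program, a neighbor a mutated program, and each call to `score_fn` one of the N = 10^6 budgeted evaluations averaged over the C = 32 task variants.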