Synthesizing Programmatic Reinforcement Learning Policies with Large Language Model Guided Search
Authors: Max Liu, Chan-Hung Yu, Wei-Hsu Lee, Cheng-Wei Hung, Yen-Chun Chen, Shao-Hua Sun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare our proposed LLM-GS framework in the Karel domain to various existing PRL methods (Trivedi et al., 2021; Liu et al., 2023; Carvalho et al., 2024). The experimental results demonstrate that LLM-GS is significantly more effective and efficient than the existing methods. Extensive ablation studies further verify the critical role of our Pythonic DSL strategy and Scheduled Hill Climbing algorithm. Moreover, we conduct experiments with two novel tasks, showing that LLM-GS enables users without programming skills and knowledge of the domain or DSL to describe the tasks in natural language to obtain performant programs. |
| Researcher Affiliation | Collaboration | Max Liu1 Chan-Hung Yu1 Wei-Hsu Lee1 Cheng-Wei Hung1 Yen-Chun Chen2 Shao-Hua Sun1 1National Taiwan University 2Microsoft |
| Pseudocode | Yes | The pseudo-code of CEBS is described in Algorithm 1, and the pseudo-code of HC is described in Algorithm 2. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology, nor does it include a direct link to a code repository. It mentions using GPT-4 as an LLM module and refers to re-implementations of baselines, but not their own code. |
| Open Datasets | Yes | We evaluate our proposed framework LLM-guided search (LLM-GS) using the Karel tasks from the two problem sets: Karel (Trivedi et al., 2021) (STAIRCLIMBER, MAZE, FOURCORNERS, TOPOFF, HARVESTER, and CLEANHOUSE) and Karel-Hard (Liu et al., 2023) (DOORKEY, ONESTROKE, SEEDER, and SNAKE). More task details can be found in Appendix C. |
| Dataset Splits | Yes | Specifically, for each task, the number of task variants is C. Task variants arise from differences in the environment's initial states. Each program evaluation means executing the program on all C task variants to obtain an average return. We set the number of task variants C = 32 and the maximum number of program evaluations N = 10^6, i.e., the interaction budget. Programs achieving an average return of 1.0 are considered optimal. |
| Hardware Specification | No | The paper mentions using GPT-4 and GPT-4o as LLM modules, but does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments or training models. |
| Software Dependencies | Yes | We use GPT-4 (Achiam et al., 2023) (gpt-4-turbo-2024-04-09 with temperature=1.0, top_p=1.0) as our LLM module to generate the initial search population. The scheduler of our proposed Scheduled HC starts at K_start = 32 and increases to K_end = 2048. We evaluate our LLM-GS and HC with 32 random seeds, and LEAPS, HPRL, and CEBS with 5 seeds. |
| Experiment Setup | Yes | We set the number of task variants C = 32 and the maximum number of program evaluations N = 10^6, i.e., the interaction budget. Programs achieving an average return of 1.0 are considered optimal. We use GPT-4 (Achiam et al., 2023) (gpt-4-turbo-2024-04-09 with temperature=1.0, top_p=1.0) as our LLM module to generate the initial search population. The scheduler of our proposed Scheduled HC starts at K_start = 32 and increases to K_end = 2048. We evaluate our LLM-GS and HC with 32 random seeds, and LEAPS, HPRL, and CEBS with 5 seeds. |
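The setup rows describe a scheduled hill-climbing search whose per-iteration neighborhood size grows from K_start = 32 to K_end = 2048 under a fixed evaluation budget. A minimal sketch of that pattern is below; the doubling schedule, the helper names (`neighbor_fn`, `score_fn`), and the toy numeric objective are illustrative assumptions, not the paper's actual Scheduled HC over Karel programs:

```python
import random

def scheduled_hill_climbing(init, neighbor_fn, score_fn,
                            k_start=32, k_end=2048, budget=10_000):
    """Hill climbing whose neighborhood size K grows over time.

    NOTE: the doubling-on-stagnation schedule used here is an assumption
    for illustration; the paper's exact schedule may differ.
    """
    best, best_score = init, score_fn(init)
    evals, k = 1, k_start
    while evals < budget:
        # Sample K neighbors of the current best candidate.
        neighbors = [neighbor_fn(best) for _ in range(k)]
        scored = [(score_fn(n), n) for n in neighbors]
        evals += len(neighbors)
        top_score, top = max(scored, key=lambda t: t[0])
        if top_score > best_score:
            best, best_score = top, top_score   # greedy uphill move
        else:
            k = min(2 * k, k_end)               # widen the search when stuck
    return best, best_score

# Toy usage: maximize -(x - 3)^2 by perturbing a float.
random.seed(0)
x, s = scheduled_hill_climbing(
    init=0.0,
    neighbor_fn=lambda v: v + random.uniform(-0.5, 0.5),
    score_fn=lambda v: -(v - 3.0) ** 2,
    k_start=4, k_end=64, budget=2000,
)
```

In the paper's setting, a candidate would be a DSL program, a neighbor a mutated program, and each call to `score_fn` one of the N = 10^6 budgeted evaluations averaged over the C = 32 task variants.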