LAPP: Large Language Model Feedback for Preference-Driven Reinforcement Learning

Authors: Pingcheng Jian, Xiao Wei, Yanbaihui Liu, Samuel A. Moore, Michael M. Zavlanos, Boyuan Chen

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "We evaluate LAPP on a diverse set of quadruped locomotion and dexterous manipulation tasks and show that it achieves efficient learning, higher final performance, faster adaptation, and precise control of high-level behaviors. Notably, LAPP enables robots to master highly dynamic and expressive tasks such as quadruped backflips, which remain out of reach for standard LLM-generated or handcrafted rewards. Our results highlight LAPP as a promising direction for scalable preference-driven robot learning."
Researcher Affiliation: Academia. Pingcheng Jian (EMAIL, Duke University); Xiao Wei (EMAIL, Duke University); Yanbaihui Liu (EMAIL, Duke University); Samuel A. Moore (EMAIL, Duke University); Michael M. Zavlanos (EMAIL, Duke University); Boyuan Chen (EMAIL, Duke University).
Pseudocode: Yes. Algorithm 1: LAPP Preference Predictor Training; Algorithm 2: LAPP.
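The paper's Algorithm 1 is not reproduced in this report, but preference predictors of this kind are typically trained with the Bradley-Terry cross-entropy objective common in preference-based RL. The sketch below is an assumption, not the paper's exact loss: it supposes the predictor emits a scalar score per trajectory segment and that labels are binary.

```python
import math

def bradley_terry_loss(score0, score1, y):
    """Cross-entropy preference loss for one trajectory pair (sigma0, sigma1).

    score0/score1: the predictor's scalar scores for each segment;
    y = 1 means sigma1 is preferred, y = 0 means sigma0 is preferred.
    """
    # P(sigma1 preferred) via a logistic on the score gap (Bradley-Terry model)
    p1 = 1.0 / (1.0 + math.exp(score0 - score1))
    p = p1 if y == 1 else 1.0 - p1
    # clamp to avoid log(0) on confident mispredictions
    return -math.log(max(p, 1e-12))
```

With equal scores the loss is log 2 (maximal uncertainty), and it falls toward zero as the preferred segment's score grows, which is the gradient signal that trains the predictor.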
Open Source Code: No. Project Website: www.generalroboticslab.com/LAPP
Open Datasets: No. The paper uses data generated by its own experimental setup (state-action trajectories and preference labels) rather than pre-existing public datasets: "LAPP leverages large language models (LLMs) to automatically generate preference labels from raw state-action trajectories collected during reinforcement learning (RL)." Although it mentions standard benchmarks such as "Gym-Mujoco" and the "Bidexterous Manipulation (Dexterity) benchmark", it provides no concrete access information for a dataset used in the experiments themselves.
Dataset Splits: Yes. The preference dataset D_p = {(σ0, σ1, y)} is split into training (D_p^train) and validation (D_p^val) sets at a 9:1 ratio.
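The 9:1 split can be sketched in a few lines; this is a minimal illustration (the function name and shuffling scheme are assumptions, not taken from the paper):

```python
import random

def split_preference_dataset(pairs, train_frac=0.9, seed=0):
    """Shuffle (sigma0, sigma1, y) preference tuples and split train/val 9:1."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # deterministic shuffle for reproducibility
    cut = int(len(pairs) * train_frac)
    return pairs[:cut], pairs[cut:]

# 100 placeholder preference tuples standing in for trajectory pairs
data = [(("s0", i), ("s1", i), i % 2) for i in range(100)]
train, val = split_preference_dataset(data)
```

With 100 pairs this yields 90 training and 10 validation tuples.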
Hardware Specification: Yes. The training process runs on an NVIDIA RTX A6000 GPU.
Software Dependencies: Yes. "We use GPT-4o mini (Achiam et al., 2023) (gpt-4o-mini-2024-07-18 variant) as the LLM backbone for LAPP. For the Eureka baseline, we use GPT-4o (gpt-4o-2024-08-06 variant) to ensure a faithful reproduction of its full capabilities from the original work. ... The policy is optimized using PPO (Schulman et al., 2017)..."
Experiment Setup: Yes. "In practice, we set P = 9, C = 3, K_min = 30, K_max = 90, and α = 1.3. ... To mitigate noisy outputs, which could pose potential risks to training stability, we sample 15 preference labels for each trajectory pair and take their mode as the final selected preference labels {y_i}. ... r = β r_p + r_E (7), where β balances their contributions. We set β to 1.0 in all tasks except Backflip; the Backflip task has some reward terms with large scales, so β is set to 50.0 to ensure the effective influence of the preference rewards. ... The preference predictor is a transformer network (Vaswani et al., 2017) based on the GPT architecture (Radford, 2018) with 6 masked self-attention layers. Inputs are embedded into a 128-dimensional space with sinusoidal positional encodings and processed by 8-headed attention layers. ... For quadruped tasks, the policy is an MLP with layers [512, 256, 128] and ELU activations (Clevert, 2015), outputting 12 target joint angles. A PD controller computes the torque commands."
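Two of the quoted design choices are directly executable: the mode over 15 sampled LLM labels per pair, and the combined reward r = β r_p + r_E from Eq. (7) with β = 1.0 (or 50.0 for Backflip). A minimal sketch, with helper names that are illustrative rather than from the paper:

```python
from statistics import mode

def aggregate_labels(samples):
    """Majority vote over repeated LLM preference samples for one pair."""
    return mode(samples)  # most common label; ties resolve to first seen (Python 3.8+)

def combined_reward(r_pref, r_env, beta=1.0):
    """Eq. (7): r = beta * r_p + r_E; beta = 50.0 for Backflip, else 1.0."""
    return beta * r_pref + r_env

votes = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1]  # 15 sampled labels
y = aggregate_labels(votes)
r = combined_reward(0.2, 1.5, beta=1.0)
```

The large β on Backflip keeps the preference reward comparable in magnitude to that task's large-scale environment reward terms, e.g. combined_reward(0.2, 1.5, beta=50.0) scales the preference component to 10.0 before adding the environment reward.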