LAPP: Large Language Model Feedback for Preference-Driven Reinforcement Learning
Authors: Pingcheng Jian, Xiao Wei, Yanbaihui Liu, Samuel A. Moore, Michael M. Zavlanos, Boyuan Chen
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate LAPP on a diverse set of quadruped locomotion and dexterous manipulation tasks and show that it achieves efficient learning, higher final performance, faster adaptation, and precise control of high-level behaviors. Notably, LAPP enables robots to master highly dynamic and expressive tasks such as quadruped backflips, which remain out of reach for standard LLM-generated or handcrafted rewards. Our results highlight LAPP as a promising direction for scalable preference-driven robot learning. |
| Researcher Affiliation | Academia | Pingcheng Jian, Xiao Wei, Yanbaihui Liu, Samuel A. Moore, Michael M. Zavlanos, Boyuan Chen (Duke University) |
| Pseudocode | Yes | Algorithm 1: LAPP Preference Predictor Training Algorithm 2: LAPP |
| Open Source Code | No | Project Website: www.generalroboticslab.com/LAPP |
| Open Datasets | No | The paper uses generated data (state-action trajectories and preference labels) from the experimental setup, rather than pre-existing publicly available datasets. For instance, it states, "LAPP leverages large language models (LLMs) to automatically generate preference labels from raw state-action trajectories collected during reinforcement learning (RL)." While it mentions standard benchmarks like "Gym-Mujoco" and "Bidexterous Manipulation (Dexterity) benchmark", it does not provide concrete access information for a specific dataset used for the experiments themselves. |
| Dataset Splits | Yes | The preference dataset D_p = {(σ0, σ1, y)} is split into training (D_p^train) and validation (D_p^val) sets at a 9:1 ratio. |
| Hardware Specification | Yes | The training process runs on an NVIDIA RTX A6000 GPU. |
| Software Dependencies | Yes | We use GPT-4o mini (Achiam et al., 2023) (gpt-4o-mini-2024-07-18 variant) as the LLM backbone for LAPP. For the Eureka baseline, we use GPT-4o (gpt-4o-2024-08-06 variant) to ensure a faithful reproduction of its full capabilities from the original work. ... The policy is optimized using PPO (Schulman et al., 2017)... |
| Experiment Setup | Yes | In practice, we set P = 9, C = 3, Kmin = 30, Kmax = 90, and α = 1.3. ... To mitigate noisy outputs, which could pose risks to training stability, we sample 15 preference labels for each trajectory pair and take their mode as the final selected preference labels {yi}. ... r = β·r_p + r_E (Eq. 7), where β balances their contributions. We set β to 1.0 in all tasks except Backflip; the Backflip task has some reward terms with large scales, so β is set to 50.0 to ensure the preference rewards remain influential. ... The preference predictor is a transformer network (Vaswani et al., 2017) based on the GPT architecture (Radford et al., 2018) with 6 masked self-attention layers. Inputs are embedded into a 128-dimensional space with sinusoidal positional encodings and processed by 8-headed attention layers. ... For quadruped tasks, the policy is an MLP with layers [512, 256, 128] and ELU activations (Clevert et al., 2015), outputting 12 target joint angles. A PD controller computes the torque commands. |
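The experiment-setup row describes two mechanisms that are easy to sketch: sampling 15 LLM preference labels per trajectory pair and taking their mode, and blending the preference reward with the environment reward as r = β·r_p + r_E. Below is a minimal Python sketch of both; `query_llm_preference` is a hypothetical stand-in for the paper's GPT-4o mini labeling call, not an API from the released code.

```python
from statistics import mode

def aggregate_preference(query_llm_preference, pair, n_samples=15):
    """Sample n_samples preference labels from the LLM for one trajectory
    pair and return their mode, mitigating noisy individual outputs.
    query_llm_preference is a hypothetical callable returning a label in {0, 1}."""
    labels = [query_llm_preference(pair) for _ in range(n_samples)]
    return mode(labels)

def combined_reward(r_preference, r_env, beta=1.0):
    """Blend the preference reward r_p with the environment reward r_E
    as r = beta * r_p + r_E (Eq. 7). The paper uses beta = 1.0 for all
    tasks except Backflip, where beta = 50.0."""
    return beta * r_preference + r_env
```

With beta = 1.0 the two rewards contribute on equal footing; the larger beta = 50.0 for Backflip compensates for that task's large-scale handcrafted reward terms.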
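The dataset-splits row states that the preference dataset of (σ0, σ1, y) triples is divided 9:1 into training and validation sets. A plain Python sketch of such a split (a shuffled fractional cut; the paper does not specify the exact splitting code, so the function name and seeding are illustrative):

```python
import random

def split_preference_dataset(dataset, train_frac=0.9, seed=0):
    """Shuffle (sigma0, sigma1, y) triples and split them into training and
    validation sets at the given ratio (9:1 by default, as in the paper)."""
    data = list(dataset)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_frac)
    return data[:cut], data[cut:]
```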