LAPP: Large Language Model Feedback for Preference-Driven Reinforcement Learning
Authors: Pingcheng Jian, Xiao Wei, Yanbaihui Liu, Samuel A. Moore, Michael M. Zavlanos, Boyuan Chen
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate LAPP on a diverse set of quadruped locomotion and dexterous manipulation tasks and show that it achieves efficient learning, higher final performance, faster adaptation, and precise control of high-level behaviors. Notably, LAPP enables robots to master highly dynamic and expressive tasks such as quadruped backflips, which remain out of reach for standard LLM-generated or handcrafted rewards. Our results highlight LAPP as a promising direction for scalable preference-driven robot learning. |
| Researcher Affiliation | Academia | Pingcheng Jian, Xiao Wei, Yanbaihui Liu, Samuel A. Moore, Michael M. Zavlanos, Boyuan Chen (Duke University) |
| Pseudocode | Yes | Algorithm 1: LAPP Preference Predictor Training Algorithm 2: LAPP |
| Open Source Code | No | Project Website: www.generalroboticslab.com/LAPP |
| Open Datasets | No | The paper uses generated data (state-action trajectories and preference labels) from the experimental setup, rather than pre-existing publicly available datasets. For instance, it states, "LAPP leverages large language models (LLMs) to automatically generate preference labels from raw state-action trajectories collected during reinforcement learning (RL)." While it mentions standard benchmarks like "Gym-Mujoco" and "Bidexterous Manipulation (Dexterity) benchmark", it does not provide concrete access information for a specific dataset used for the experiments themselves. |
| Dataset Splits | Yes | The preference dataset D_p = {(σ0, σ1, y)} is split into training (D_p^train) and validation (D_p^val) sets at a 9:1 ratio. |
| Hardware Specification | Yes | The training process runs on an NVIDIA RTX A6000 GPU. |
| Software Dependencies | Yes | We use GPT-4o mini (Achiam et al., 2023) (gpt-4o-mini-2024-07-18 variant) as the LLM backbone for LAPP. For the Eureka baseline, we use GPT-4o (gpt-4o-2024-08-06 variant) to ensure a faithful reproduction of its full capabilities from the original work. ... The policy is optimized using PPO (Schulman et al., 2017)... |
| Experiment Setup | Yes | In practice, we set P = 9, C = 3, Kmin = 30, Kmax = 90, and α = 1.3. ... To mitigate noisy outputs, which could pose risks to training stability, we sample 15 preference labels for each trajectory pair and take their mode as the final selected preference labels {yi}. ... r = β·r_p + r_E (Eq. 7), where β balances their contributions. We set β to 1.0 in all tasks except Backflip; the Backflip task has some reward terms with large scales, so β is set to 50.0 to ensure the preference rewards remain influential. ... The preference predictor is a transformer network (Vaswani et al., 2017) based on the GPT architecture (Radford et al., 2018) with 6 masked self-attention layers. Inputs are embedded into a 128-dimensional space with sinusoidal positional encodings and processed by 8-headed attention layers. ... For quadruped tasks, the policy is an MLP with layers [512, 256, 128] and ELU activations (Clevert et al., 2015), outputting 12 target joint angles. A PD controller computes the torque commands. |
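The experiment-setup row describes two mechanisms that are easy to sketch: sampling 15 LLM preference labels per trajectory pair and taking their mode, and blending the preference reward with the environment reward as r = β·r_p + r_E. Below is a minimal Python sketch of both; `query_llm_preference` is a hypothetical stand-in for the paper's GPT-4o mini labeling call, not an API from the released code.

```python
from statistics import mode

def aggregate_preference(query_llm_preference, pair, n_samples=15):
    """Sample n_samples preference labels from the LLM for one trajectory
    pair and return their mode, mitigating noisy individual outputs.
    query_llm_preference is a hypothetical callable returning a label in {0, 1}."""
    labels = [query_llm_preference(pair) for _ in range(n_samples)]
    return mode(labels)

def combined_reward(r_preference, r_env, beta=1.0):
    """Blend the preference reward r_p with the environment reward r_E
    as r = beta * r_p + r_E (Eq. 7). The paper uses beta = 1.0 for all
    tasks except Backflip, where beta = 50.0."""
    return beta * r_preference + r_env
```

With beta = 1.0 the two rewards contribute on equal footing; the larger beta = 50.0 for Backflip compensates for that task's large-scale handcrafted reward terms.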
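The dataset-splits row states that the preference dataset of (σ0, σ1, y) triples is divided 9:1 into training and validation sets. A plain Python sketch of such a split (a shuffled fractional cut; the paper does not specify the exact splitting code, so the function name and seeding are illustrative):

```python
import random

def split_preference_dataset(dataset, train_frac=0.9, seed=0):
    """Shuffle (sigma0, sigma1, y) triples and split them into training and
    validation sets at the given ratio (9:1 by default, as in the paper)."""
    data = list(dataset)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_frac)
    return data[:cut], data[cut:]
```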