Policy Optimization under Imperfect Human Interactions with Agent-Gated Shared Autonomy
Authors: Zhenghai Xue, Bo An, Shuicheng Yan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments are conducted with both simulated and real human participants at different skill levels in challenging continuous control environments. Comparative results highlight that AGSA achieves significant improvements over previous human-in-the-loop learning methods in terms of training safety, policy performance, and user-friendliness. Project webpage is at https://agsa4rl.github.io/. For empirical evaluations, we select two challenging continuous control tasks of robotic locomotion and autonomous driving, using the MuJoCo (Todorov et al., 2012) and MetaDrive (Li et al., 2023) simulators. We employ neural policies with varying performance levels, along with human participants inexperienced in evaluation tasks, to provide imperfect human involvement. Comparative results demonstrate that AGSA learns efficiently from imperfect data while maintaining overall training safety. Our contributions in this paper can be summarized as follows: (1) We identify the challenges posed by imperfect low-level human control and propose to utilize high-level human feedback instead. (2) We design a novel framework for agent-gated shared autonomy, where the gating agent is trained with human feedback and the learning agent is trained with intervention decisions from the gating agent. (3) We provide both theoretical and empirical evidence to support the efficiency and safety of the proposed framework. |
| Researcher Affiliation | Collaboration | 1Nanyang Technological University, Singapore 2Skywork AI 3National University of Singapore |
| Pseudocode | Yes | Summarizing the previous analysis, we present the detailed workflow of AGSA in Alg. 1. Line 5 and Line 10 construct the replay buffer Dl for training the learning agent, assigning rewards based on gating agent outputs. Line 6 corresponds to three kinds of human interactions: human demonstrations, human evaluative feedback on the intervention decision, and human preference feedback on the demonstrations. Line 7 constructs the replay buffer Dg for training the gating value Qg and the dataset Dp for training the reward model rψ. Line 8 denotes that the preference pair (σ, σ′) is constructed from the current and the previous human-generated trajectory. In Line 13, πl and Qg can be trained with any value-based RL algorithm, such as TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018), and rψ is trained with Eq. 3. Algorithm 1 The practical workflow of AGSA. 1: Input: Gating value function Qg; Learning agent policy πl; Human policy πh; Human preference model Pψ; Reward model rψ; Learning agent replay buffer Dl; Preference replay buffer Dp; Gating agent replay buffer Dg; Preference reward ratio λ; Human intervention steps T. 2: for epoch i = 0, 1, 2, . . . do 3: for timestep t = 1, 2, . . . do 4: if Qg(st, 1) > Qg(st, 0) and not previous_intervene then 5: Append (st−1, at−1, st, 1) to Dl. 6: Apply human policy πh for T steps, getting trajectory segment σ; Query human for intervention evaluation I(st) and preference feedback pt = Pψ(σ ≻ σ′). 7: Append (st, 1, st+1, I(st) + λ Σ_{n=0}^{T} rψ(st+n, at+n)) to Dg. Append (σ, σ′, pt) to Dp. 8: Set σ′ = σ, previous_intervene = True, t = t + T − 1. 9: else 10: Append (st−1, at−1, st, 0) to Dl and (st, 0, st+1, λ rψ(st, at)) to Dg. 11: Apply learning agent policy πl for 1 step; Set previous_intervene = False. 12: Train πl, rψ, Qg on Dl, Dp, Dg, respectively. |
| Open Source Code | No | Project webpage is at https://agsa4rl.github.io/. This is a project webpage, which is considered a high-level overview page rather than a specific code repository link or an explicit statement of code release. |
| Open Datasets | Yes | Numbers are normalized scores according to D4RL (Fu et al., 2020). |
| Dataset Splits | Yes | For more accurate algorithm evaluation, we utilize the procedural generation feature of MetaDrive and split the training and test environments with different maps and traffic. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions several software tools and algorithms such as MuJoCo, MetaDrive, SAC, and TD3, but does not specify their version numbers or other ancillary software dependencies with versions. |
| Experiment Setup | Yes | Table 7: Hyperparameters for the training algorithms. Common — Batch Size: 256; Learning Rate: 3e-4; Weight Decay: 1e-3; Discount Factor γ: 0.99; Hidden Dims: (256, 256); τ for Target Network Update: 0.005; DAgger Pretrain Steps: 60,000; Steps Per Iteration: 2,500. Ensemble DAgger — Uncertainty Threshold: 0.03 (Hopper), 0.1 (Walker2d), 0.05 (HalfCheetah), 0.01 (MetaDrive). AGSA — Reward Balancing Ratio λ: 0.03; Human Intervention Steps T: 4. |
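The gating loop quoted in the Pseudocode row can be sketched as a short data-collection routine. This is a minimal illustration under stated assumptions, not the authors' implementation: `agsa_epoch`, `query_human`, and the flat-list buffers are hypothetical stand-ins, and `q_g`, `pi_l`, `pi_h`, `r_psi` abstract the learned gating value, learning policy, human policy, and reward model.

```python
T = 4       # human intervention steps (Table 7)
LAM = 0.03  # preference reward ratio lambda (Table 7)

def agsa_epoch(env, q_g, pi_l, pi_h, r_psi, query_human, n_steps=2500):
    """One AGSA data-collection epoch; returns the three replay buffers."""
    d_l, d_g, d_p = [], [], []        # learning / gating / preference buffers
    s = env.reset()
    prev_sigma, prev_intervene = None, False
    prev_s, prev_a = None, None
    t = 0
    while t < n_steps:
        if q_g(s, 1) > q_g(s, 0) and not prev_intervene:
            # Gating agent hands control to the human for T steps (Alg. 1, L5-8).
            if prev_s is not None:
                d_l.append((prev_s, prev_a, s, 1))  # gate label 1: intervened
            sigma, shaped, s_start = [], 0.0, s
            for _ in range(T):
                a = pi_h(s)
                s2 = env.step(a)
                sigma.append((s, a))
                shaped += r_psi(s, a)
                prev_s, prev_a, s = s, a, s2
            # Human evaluates the intervention and compares the current
            # segment against the previous human segment.
            i_eval, pref = query_human(s_start, sigma, prev_sigma)
            d_g.append((s_start, 1, s, i_eval + LAM * shaped))
            if prev_sigma is not None:
                d_p.append((sigma, prev_sigma, pref))
            prev_sigma, prev_intervene = sigma, True
            t += T
        else:
            # Learning agent acts for one step (Alg. 1, L10-11).
            if prev_s is not None:
                d_l.append((prev_s, prev_a, s, 0))  # gate label 0: no intervention
            a = pi_l(s)
            s2 = env.step(a)
            d_g.append((s, 0, s2, LAM * r_psi(s, a)))
            prev_s, prev_a, s = s, a, s2
            prev_intervene = False
            t += 1
    return d_l, d_g, d_p
```

In a real run, πl, Qg, and rψ would then be updated on these buffers at the end of each iteration (Alg. 1, Line 12), e.g. with SAC or TD3 for the value-based components.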