Policy Optimization under Imperfect Human Interactions with Agent-Gated Shared Autonomy
Authors: Zhenghai Xue, Bo An, Shuicheng Yan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments are conducted with both simulated and real human participants at different skill levels in challenging continuous control environments. Comparative results highlight that AGSA achieves significant improvements over previous human-in-the-loop learning methods in terms of training safety, policy performance, and user-friendliness. Project webpage is at https://agsa4rl.github.io/. For empirical evaluations, we select two challenging continuous control tasks of robotic locomotion and autonomous driving, using the MuJoCo (Todorov et al., 2012) and MetaDrive (Li et al., 2023) simulators. We employ neural policies with varying performance levels, along with human participants inexperienced in evaluation tasks, to provide imperfect human involvement. Comparative results demonstrate that AGSA learns efficiently from imperfect data while maintaining overall training safety. Our contributions in this paper can be summarized as follows: (1) We identify the challenges posed by imperfect low-level human control and propose to utilize high-level human feedback instead. (2) We design a novel framework for agent-gated shared autonomy, where the gating agent is trained with human feedback and the learning agent is trained with intervention decisions from the gating agent. (3) We provide both theoretical and empirical evidence to support the efficiency and safety of the proposed framework. |
| Researcher Affiliation | Collaboration | 1Nanyang Technological University, Singapore 2Skywork AI 3National University of Singapore |
| Pseudocode | Yes | Summarizing the previous analysis, we present the detailed workflow of AGSA in Alg. 1. Line 5 and Line 10 construct the replay buffer Dl for training the learning agent, assigning rewards based on gating agent outputs. Line 6 corresponds to three kinds of human interactions: human demonstrations, human evaluative feedback on the intervention decision, and human preference feedback on the demonstrations. Line 7 constructs the replay buffer Dg for training the gating value Qg and the dataset Dp for training the reward model rψ. Line 8 denotes that the preference pair (σ, σ′) is constructed from the current and the previous human-generated trajectory. In Line 13, πl and Qg can be trained with any value-based RL algorithm, such as TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018), and rψ is trained with Eq. 3. Algorithm 1 The practical workflow of AGSA. 1: Input: Gating value function Qg; Learning agent policy πl; Human policy πh; Human preference model Pψ; Reward model rψ; Learning agent replay buffer Dl; Preference replay buffer Dp; Gating agent replay buffer Dg; Preference reward ratio λ; Human intervention steps T. 2: for epoch i = 0, 1, 2, . . . do 3: for timestep t = 1, 2, . . . do 4: if Qg(st, 1) > Qg(st, 0) and not previous_intervene then 5: Append (st−1, at−1, st, 1) to Dl. 6: Apply human policy πh for T steps, getting trajectory segment σ; Query human for intervention evaluation I(st) and preference feedback pt = Pψ(σ ≻ σ′). 7: Append (st, 1, st+1, I(st) + λ Σ_{n=0}^{T} rψ(st+n, at+n)) to Dg. Append (σ, σ′, pt) to Dp. 8: Set σ′ = σ, previous_intervene = True, t = t + T − 1. 9: else 10: Append (st−1, at−1, st, 0) to Dl and (st, 0, st+1, λ rψ(st, at)) to Dg. 11: Apply learning agent policy πl for 1 step; Set previous_intervene = False. 12: Train πl, rψ, Qg on Dl, Dp, Dg, respectively. |
| Open Source Code | No | Project webpage is at https://agsa4rl.github.io/. This is a project webpage, which is considered a high-level overview page rather than a specific code repository link or an explicit statement of code release. |
| Open Datasets | Yes | Numbers are normalized scores according to D4RL (Fu et al., 2020). |
| Dataset Splits | Yes | For more accurate algorithm evaluation, we utilize the procedural generation feature of MetaDrive and split the training and test environments with different maps and traffic. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions several software tools and algorithms such as MuJoCo, MetaDrive, SAC, and TD3, but does not specify their version numbers or other ancillary software dependencies with versions. |
| Experiment Setup | Yes | Table 7: Hyperparameters for the training algorithms. Common — Batch Size: 256; Learning Rate: 3e-4; Weight Decay: 1e-3; Discount Factor γ: 0.99; Hidden Dims: (256, 256); τ for Target Network Update: 0.005; DAgger Pretrain Steps: 60,000; Steps Per Iteration: 2,500. Ensemble DAgger — Uncertainty Threshold: 0.03 (Hopper), 0.1 (Walker2d), 0.05 (HalfCheetah), 0.01 (MetaDrive). AGSA — Reward Balancing Ratio λ: 0.03; Human Intervention Steps T: 4. |
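The gating loop quoted in the Pseudocode row can be sketched as a short data-collection routine. This is a minimal illustration under stated assumptions, not the authors' implementation: `agsa_epoch`, `query_human`, and the flat-list buffers are hypothetical stand-ins, and `q_g`, `pi_l`, `pi_h`, `r_psi` abstract the learned gating value, learning policy, human policy, and reward model.

```python
T = 4       # human intervention steps (Table 7)
LAM = 0.03  # preference reward ratio lambda (Table 7)

def agsa_epoch(env, q_g, pi_l, pi_h, r_psi, query_human, n_steps=2500):
    """One AGSA data-collection epoch; returns the three replay buffers."""
    d_l, d_g, d_p = [], [], []        # learning / gating / preference buffers
    s = env.reset()
    prev_sigma, prev_intervene = None, False
    prev_s, prev_a = None, None
    t = 0
    while t < n_steps:
        if q_g(s, 1) > q_g(s, 0) and not prev_intervene:
            # Gating agent hands control to the human for T steps (Alg. 1, L5-8).
            if prev_s is not None:
                d_l.append((prev_s, prev_a, s, 1))  # gate label 1: intervened
            sigma, shaped, s_start = [], 0.0, s
            for _ in range(T):
                a = pi_h(s)
                s2 = env.step(a)
                sigma.append((s, a))
                shaped += r_psi(s, a)
                prev_s, prev_a, s = s, a, s2
            # Human evaluates the intervention and compares the current
            # segment against the previous human segment.
            i_eval, pref = query_human(s_start, sigma, prev_sigma)
            d_g.append((s_start, 1, s, i_eval + LAM * shaped))
            if prev_sigma is not None:
                d_p.append((sigma, prev_sigma, pref))
            prev_sigma, prev_intervene = sigma, True
            t += T
        else:
            # Learning agent acts for one step (Alg. 1, L10-11).
            if prev_s is not None:
                d_l.append((prev_s, prev_a, s, 0))  # gate label 0: no intervention
            a = pi_l(s)
            s2 = env.step(a)
            d_g.append((s, 0, s2, LAM * r_psi(s, a)))
            prev_s, prev_a, s = s, a, s2
            prev_intervene = False
            t += 1
    return d_l, d_g, d_p
```

In a real run, πl, Qg, and rψ would then be updated on these buffers at the end of each iteration (Alg. 1, Line 12), e.g. with SAC or TD3 for the value-based components.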