Action-Constrained Imitation Learning

Authors: Chia-Han Yeh, Tse-Sheng Nan, Risto Vuorio, Wei Hung, Hung Yen Wu, Shao-Hua Sun, Ping-Chun Hsieh

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our proposed method in various continuous control domains, including navigation, locomotion, and robot arm manipulation, subject to a variety of action constraints. Experiments cover four benchmark tasks: Maze2d, Half Cheetah, Hopper, and Table-Wiping. The results in Table 2 indicate that online algorithms such as GAIL and OPOLO face two primary challenges, poor sample efficiency and action constraints, leading to consistently suboptimal performance on all tasks. An ablation study varies the number of expert demonstrations used for training; specifically, it investigates whether DTWIL can still learn effective policies when provided with fewer expert trajectories, which limits coverage of the initial state distribution. As shown in Table 12, DTWIL consistently achieves strong performance across demonstration counts, indicating that it generalizes well even with limited access to expert data.
Researcher Affiliation | Collaboration | 1 National Yang Ming Chiao Tung University, Hsinchu, Taiwan; 2 University of Illinois at Urbana-Champaign, Illinois, United States; 3 Reflection AI; 4 National Taiwan University, Taipei, Taiwan.
Pseudocode | Yes | An overview of DTWIL is provided in Figure 2, and the pseudocode can be found in Algorithm 1 and Algorithm 2.
Open Source Code | Yes | Our code is publicly available at https://github.com/NYCU-RL-Bandits-Lab/ACRL-Baselines.
Open Datasets | Yes | Maze2d-Medium-v1 (Fu et al., 2020): a point-mass agent navigates a 2D maze from a random start location to a goal, with a 2-dimensional action space [a1, a2], each ai ∈ [−1.0, 1.0]. Half Cheetah (Brockman et al., 2016): a bipedal cheetah runs forward by applying torques through a 6-dimensional action space [a1, a2, ..., a6], each ai ∈ [−1.0, 1.0]. Hopper (Brockman et al., 2016): a robot hops forward by controlling a 3-dimensional action space [a1, a2, a3], each ai ∈ [−1.0, 1.0]. Table-Wiping from Robosuite (Zhu et al., 2025): a robot arm controlled through a 6-dimensional action space aims to wipe a stained table.
Dataset Splits | No | The paper reports the amount of expert demonstration data used for each task (e.g., "100 demonstrations, yielding 18,525 state-action pairs" for Maze2d and "5 expert demonstrations of 1000 steps each" for Half Cheetah) and conducts an ablation varying the "Number of Expert Demonstrations" (Table 12: 5 Demos, 3 Demos, 1 Demo). However, it does not explicitly provide training, validation, and test splits with percentages or fixed counts for any single dataset, nor does it refer to standard predefined splits for the imitation learning algorithms themselves. Evaluation is performed with "randomly initialized starting states" after training on the provided expert data.
Hardware Specification | No | The paper states: "We also thank the National Center for High-performance Computing (NCHC) for providing computational and storage resources." This acknowledges computing resources in general but does not specify any particular hardware (e.g., GPU or CPU models, or memory details) used for the experiments.
Software Dependencies | No | The paper mentions environments such as OpenAI Gym and Robosuite but does not provide version numbers for any software, libraries, or programming languages used in the implementation or experiments.
Experiment Setup | Yes | We only allow 50K environment steps during training for all online methods, including ours, on all tasks. All results are evaluated with randomly initialized starting states. The best-performing model from each algorithm during these interactions was selected for final evaluation. We explore the influence of the hyperparameter β, which regulates the balance between expert actions and MPC-sampled actions in the ERC method. Additionally, we examine the effect of the horizon length h_erc, which determines for how many steps to blend MPC-sampled actions with expert actions. We conducted experiments on Hopper with H+M constraints, varying β from 0 to 0.2 and h_erc from 0 to 20, while keeping all other hyperparameters fixed at their optimal values identified in prior tuning. As shown in Table 9, setting β to 0.05 yields the highest performance; for h_erc, a value of 5 gives the best results.
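The rows above repeatedly refer to learning "subject to action constraints." As a minimal illustration of what enforcing such a constraint can look like, the sketch below maps a raw policy action into a per-dimension box and, optionally, an L1 budget. The function name and the specific constraint shapes are illustrative assumptions, not the authors' implementation or the paper's exact constraint sets (e.g., the "H+M" constraints on Hopper).

```python
import numpy as np

def project_action(a, box=1.0, max_l1=None):
    """Map a raw policy action into an assumed feasible set.

    Illustrative only: the paper's exact constraint sets are not
    reproduced here. `box` bounds each dimension; `max_l1`, if given,
    caps the L1 norm via simple radial scaling (note: this is not the
    exact Euclidean projection onto the L1 ball).
    """
    a = np.clip(np.asarray(a, dtype=float), -box, box)  # box constraint
    if max_l1 is not None:
        norm = np.abs(a).sum()
        if norm > max_l1:
            a = a * (max_l1 / norm)  # scale into the L1 budget
    return a
```

For the box-only case this reduces to element-wise clipping into [−1.0, 1.0], matching the per-dimension action ranges listed for the benchmark tasks.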
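The demo-count ablation noted in the Dataset Splits row (Table 12: 5, 3, or 1 demonstrations) can be mimicked with a small helper that keeps k of the expert trajectories and reports the resulting number of state-action pairs. This is a hypothetical utility for illustration, not the authors' code, and the trajectory format is assumed.

```python
import numpy as np

def subsample_demos(demos, k, seed=0):
    """Keep k of the expert trajectories for a demo-count ablation.

    Assumes `demos` is a list of trajectories, each a sequence of
    (state, action) pairs; returns the kept trajectories and the
    total number of state-action pairs they contain.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(demos), size=k, replace=False)
    kept = [demos[i] for i in idx]
    n_pairs = sum(len(traj) for traj in kept)
    return kept, n_pairs
```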
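As a sketch of the roles that β and h_erc play in ERC per the Experiment Setup row, the snippet below takes a convex combination of the expert and MPC-sampled actions for the first h_erc steps and falls back to the expert action afterwards. The convex-combination rule and the function name are assumptions; the paper's exact ERC update may differ. The defaults use the best values reported in Table 9 (β = 0.05, h_erc = 5).

```python
import numpy as np

def blend_action(a_expert, a_mpc, t, beta=0.05, h_erc=5):
    """Blend expert and MPC-sampled actions for the first h_erc steps.

    Assumed convex-combination form for illustration; the defaults
    follow the best-performing values reported in Table 9.
    """
    if t < h_erc:
        # beta weights the MPC-sampled action against the expert action
        return (1.0 - beta) * np.asarray(a_expert) + beta * np.asarray(a_mpc)
    return np.asarray(a_expert, dtype=float)
```

With β = 0 or t ≥ h_erc this returns the expert action unchanged, which matches the β = 0 and h_erc = 0 endpoints of the sweep described above.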