RIZE: Adaptive Regularization for Imitation Learning

Authors: Adib Karimi, Mohammad Mehdi Ebadzadeh

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, our method achieves expert-level performance on complex MuJoCo and Adroit environments, surpassing baseline methods on the Humanoid-v2 task with limited expert demonstrations. Extensive experiments and ablation studies further validate the effectiveness of the approach and provide insights into reward dynamics in imitation learning. Our source code is available at https://github.com/adibka/RIZE.
Researcher Affiliation | Academia | Adib Karimi (EMAIL), Department of Computer Engineering, Amirkabir University of Technology; Mohammad Mehdi Ebadzadeh (EMAIL), Department of Computer Engineering, Amirkabir University of Technology
Pseudocode | Yes | Algorithm 1 RIZE
1: Initialize Zϕ, πθ, λ^{πE}, and λ^π
2: for step t in {1, ..., N} do
3:   Calculate Q(s, a) = E[Zϕ(s, a)] using Eq. 5
4:   Update Zϕ using Eq. 9:
5:     ϕ_{t+1} ← ϕ_t − β_Z ∇_ϕ[L(ϕ)]
6:   Update πθ (as in SAC):
7:     θ_{t+1} ← θ_t + β_π ∇_θ E_{s∼D, a∼πθ(·|s)}[min_{k=1,2} Q_k(s, a) − α log πθ(a|s)]
8:   Update λ^π and λ^{πE} using Eq. 8:
9:     λ^π_{t+1} ← λ^π_t − β_{λ^π} ∇_{λ^π} Γ(R_Q, λ)
10:    λ^{πE}_{t+1} ← λ^{πE}_t − β_{λ^{πE}} ∇_{λ^{πE}} Γ(R_Q, λ)
11: end for
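The adaptive target-reward updates in Algorithm 1 can be sketched in plain Python. This is a toy scalar illustration, not the paper's implementation: Eqs. 5, 8, and 9 are not reproduced in this summary, so the critic/policy updates are elided and the regularizer `Γ` and reward estimates `R_Q` below are hypothetical stand-ins (here Γ(R_Q, λ) = (R_Q − λ)², so ∇_λ Γ = −2(R_Q − λ)).

```python
# Toy sketch of Algorithm 1 (RIZE): gradient descent on the target-reward
# parameters lambda^{piE} and lambda^{pi}. Gamma and R_Q are assumptions,
# not the paper's Eq. 8.

def grad_gamma(r_q: float, lam: float) -> float:
    # Assumed regularizer Gamma(R_Q, lam) = (R_Q - lam)**2,
    # so dGamma/dlam = -2 * (R_Q - lam).
    return -2.0 * (r_q - lam)

def train(num_steps: int = 100, beta_lam: float = 0.1):
    lam_pi_e, lam_pi = 10.0, 5.0   # initial target rewards (values from Table 1)
    for t in range(num_steps):
        # ... update critic Z_phi (Eq. 9) and policy pi_theta (SAC-style) ...
        r_q_expert, r_q_policy = 9.0, 4.0  # placeholder reward estimates
        # Descend Gamma w.r.t. each target-reward parameter (lines 9-10):
        lam_pi_e -= beta_lam * grad_gamma(r_q_expert, lam_pi_e)
        lam_pi   -= beta_lam * grad_gamma(r_q_policy, lam_pi)
    return lam_pi_e, lam_pi
```

With the placeholder rewards held fixed, each λ contracts toward its corresponding R_Q, which is the intended adaptive-regularization behavior.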
Open Source Code | Yes | Our source code and all implementation details used in our experiments are publicly available at https://github.com/adibka/RIZE.
Open Datasets | Yes | We study continuous-control imitation learning from state–action expert samples, evaluating our algorithm on five MuJoCo (Todorov et al., 2012) benchmarks (HalfCheetah-v2, Walker2d-v2, Ant-v2, Humanoid-v2, Hopper-v2) and one Adroit Hand task (Rajeswaran et al., 2018). ... For Hammer-v1 from the Adroit suite (Rajeswaran et al., 2018), we use the D4RL dataset (Fu et al., 2021) and filter the top 100 episodes from the original 5,000.
Dataset Splits | Yes | We assess each method with three and ten expert trajectories. ... Expert trajectories for these tasks are taken from IQ-Learn (Garg et al., 2021) and were generated with Soft Actor-Critic (Haarnoja et al., 2018); each trajectory contains 1,000 state–action transitions. ... For Hammer-v1 from the Adroit suite (Rajeswaran et al., 2018), we use the D4RL dataset (Fu et al., 2021) and filter the top 100 episodes from the original 5,000.
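The Hammer-v1 preprocessing (keeping the top 100 of 5,000 episodes) amounts to ranking episodes by return and truncating. A minimal sketch, assuming episodes are already segmented into (return, transitions) pairs; real D4RL data arrives as flat arrays and would need episode segmentation first:

```python
def filter_top_episodes(episodes, k=100):
    """Keep the k episodes with the highest return.

    `episodes` is assumed to be a list of (episode_return, transitions)
    tuples -- a hypothetical intermediate format, not D4RL's raw layout.
    """
    ranked = sorted(episodes, key=lambda ep: ep[0], reverse=True)
    return ranked[:k]
```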
Hardware Specification | No | The paper does not provide specific hardware details such as CPU, GPU models, or cloud instance types used for running the experiments.
Software Dependencies | No | Our architecture integrates components from Distributional SAC (DSAC) (Ma et al., 2020) and IQ-Learn (Garg et al., 2021), with hyperparameters tuned through search and ablation studies. ... The paper mentions using DSAC and IQ-Learn components but does not specify version numbers for general software dependencies (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | Our architecture integrates components from Distributional SAC (DSAC) (Ma et al., 2020) and IQ-Learn (Garg et al., 2021), with hyperparameters tuned through search and ablation studies. Key configurations for experiments involving three and ten demonstrations are summarized in Table 1. ... The critic network is implemented as a three-layer multilayer perceptron (MLP) with 256 units per layer, trained using a learning rate of 3e-4. The policy network is a four-layer MLP, also with 256 units per layer. ... replay buffer size 10^6, batch size 256, 24 quantile levels, and 10,000 pretraining steps. ... Across all tasks, we set initial target reward parameters as λ^{πE} = 10 and λ^π = 5.
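The stated architecture can be sketched with NumPy. Input/output dimensions are hypothetical (the summary gives only layer counts and widths), "three-layer"/"four-layer" is read here as the number of hidden layers, and the 24-output quantile head is inferred from the 24 quantile levels listed in the setup:

```python
import numpy as np

def mlp(sizes, rng):
    """Build (weight, bias) pairs for a fully connected net (He init)."""
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers
    return x

rng = np.random.default_rng(0)
obs_dim, act_dim = 17, 6          # hypothetical dimensions for illustration
n_quantiles = 24                  # 24 quantile levels (Table 1)

# Critic Z_phi: three hidden layers of 256 units, one output per quantile.
critic = mlp([obs_dim + act_dim, 256, 256, 256, n_quantiles], rng)
# Policy pi_theta: four hidden layers of 256 units, outputting action means.
policy = mlp([obs_dim, 256, 256, 256, 256, act_dim], rng)

z = forward(critic, np.zeros(obs_dim + act_dim))  # quantile estimates of Z
q = z.mean()                      # Q(s, a) = E[Z_phi(s, a)] over quantiles
```

Averaging the 24 quantile outputs recovers the scalar Q-value used in the policy update, mirroring line 3 of Algorithm 1.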