IL-SOAR : Imitation Learning with Soft Optimistic Actor cRitic
Authors: Stefano Viel, Luca Viano, Volkan Cevher
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Practical Contribution: We apply an ensemble-based exploration technique, SOAR, to boost the performance of deep imitation learning algorithms built on SAC, demonstrating its effectiveness on MuJoCo environments. Specifically, we show that incorporating SOAR consistently boosts the performance of base methods such as Coherent Soft Imitation Learning (CSIL) (Watson et al., 2023), Maximum Likelihood IRL (ML-IRL) (Zeng et al., 2022), and RKL (Ni et al., 2021). As shown in Figure 1, our approach consistently outperforms the base algorithms across all MuJoCo environments. Notably, SOAR reaches the best baseline performance while requiring only about half the number of learning episodes. |
| Researcher Affiliation | Academia | Stefano Viel * 1 Luca Viano * 1 Volkan Cevher 1 1EPFL, Lausanne. Correspondence to: Stefano Viel <EMAIL>, Luca Viano <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 SOAR-Imitation Learning; Algorithm 2 Tabular SOAR-IL; Algorithm 3 COSTUPDATETABULAR; Algorithm 4 OPTIMISTICQTABULAR (OPTQTAB); Algorithm 5 OPTIMISTICQ-NN; Algorithm 6 Base Method + SOAR pseudocode; Algorithm 7 UPDATECRITICS; Algorithm 8 UPDATECOST for RKL (f-IRL for reversed KL divergence); Algorithm 9 UPDATECOST for ML-IRL (State-Action version); Algorithm 10 UPDATECOST for CSIL |
| Open Source Code | Yes | Project code available at https://github.com/stefanoviel/SOAR-IL/tree/master |
| Open Datasets | Yes | We perform experiments for both state-only and state-action IL on the following MuJoCo (Todorov et al., 2012) environments: Ant, Hopper, Walker2d, and Humanoid. Environment: We use the Hopper-v5, Ant-v5, HalfCheetah-v5, and Walker2d-v5 environments from OpenAI Gym. |
| Dataset Splits | Yes | Each plot compares the average normalized return across 4 MuJoCo environments with 16 expert trajectories for a base algorithm and its SOAR-enhanced version. For the state-only IL setting, we showcase the improvement on RKL (Ni et al., 2021) and ML-IRL (State-Only) (Zeng et al., 2022). In both cases, we found that using L = 4 critic networks and an appropriately chosen value for the standard deviation clipping threshold σ consistently improves upon the baseline. Expert Samples: The expert policy is trained using SAC. The training configuration uses 3000 epochs. ... After training, 64 expert trajectories are collected to be used later for the agent training. Figure 2: Experiments from State-Only Expert Trajectories. 16 expert trajectories, average over 5 seeds, L = 4. Figure 3: Experiments from State-Action Expert Trajectories. 16 expert trajectories, average over 5 seeds, L = 4. Figure 4: Ablation for L on hard exploration task. State-only imitation experiment in a hard exploration environment (used in the lower bound from (Moulin et al., 2025, Theorem 19)). Results averaged over 5 seeds, for a dataset of 100 states sampled from the expert occupancy measure. Figure 7: Experiments from State-Only Expert Trajectories. 1 expert trajectory, average over 3 seeds, L = 4. Figure 8: Experiments from State-Action Expert Trajectories. 1 expert trajectory, average over 3 seeds, L = 4. |
| Hardware Specification | No | The paper does not explicitly state the specific hardware used for running its experiments (e.g., GPU models, CPU types, or cloud instances). |
| Software Dependencies | No | Our starting code base is taken from the repository of f-IRL (Ni et al., 2021), and the implementations of the other algorithms are based on it. GAIL: we used the implementation available from Stable-Baselines3 (Raffin et al., 2021). Update policy weights to ψ_{k+1} using Adam (Kingma & Ba, 2015) on the loss L^k_π. |
| Experiment Setup | Yes | Expert Samples: The expert policy is trained using SAC. The training configuration uses 3000 epochs. The agent explores randomly for the first 10 episodes before starting policy learning. A replay buffer of 1 million experiences is used, with a batch size of 100 and a learning rate of 1e-3. The temperature parameter (α) is set to 0.2. The policy updates occur every 50 steps, with 1 update per interval. Table 2: Core Hyperparameters Across Environments. Number of Iterations: Walker2d 1.5M, Humanoid 1M, Hopper 1M, Ant 1.2M; Reward Network size: [64, 64] for Walker2d, Humanoid, and Hopper, [128, 128] for Ant; Policy Network size: [256, 256] for all; Reward Learning Rate: 1e-4 for all; SAC Learning Rate: 1e-3 for all. Figure 5: Mean return of ML-IRL Ant-v5 with a different number of neural networks. The grid search for the clipping values was performed over the following values: [0.1, 0.5, 1, 5, 10, 50]. We used α = 0.5, η = 4, and we scaled the standard deviation bonus by 0.001. |
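The responses above describe SOAR's core mechanism: an ensemble of L critics (L = 4 in the experiments) whose disagreement, measured as a per-state standard deviation, is clipped at a threshold σ and added as an optimism bonus scaled by η. A minimal NumPy sketch of that aggregation step, assuming the bonus is simply mean plus clipped, scaled standard deviation (the function name, shapes, and defaults here are illustrative, not the paper's Algorithm 5 implementation):

```python
import numpy as np

def optimistic_q(q_ensemble, eta=4.0, sigma_clip=0.5):
    """Optimistic aggregation of an ensemble of critic estimates.

    q_ensemble: array of shape (L, batch) with Q-value estimates from
    L independently initialized critics. The exploration bonus is the
    per-state standard deviation across the ensemble, clipped at
    sigma_clip and scaled by eta.
    """
    mean = q_ensemble.mean(axis=0)               # ensemble mean per state
    std = q_ensemble.std(axis=0)                 # ensemble disagreement
    bonus = eta * np.minimum(std, sigma_clip)    # clipped optimism bonus
    return mean + bonus

# Example: L = 4 critics evaluated on a batch of 3 states
q = np.array([
    [1.0, 2.0, 3.0],
    [1.2, 2.1, 2.9],
    [0.8, 1.9, 3.1],
    [1.0, 2.0, 3.0],
])
opt_q = optimistic_q(q)  # one optimistic value per state, >= ensemble mean
```

Clipping the bonus at σ (grid-searched over [0.1, 0.5, 1, 5, 10, 50] in the paper) keeps the optimism bounded when the critics disagree strongly early in training.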