Robust Offline Imitation Learning from Diverse Auxiliary Data

Authors: Udita Ghosh, Dripta S. Raychaudhuri, Jiachen Li, Konstantinos Karydis, Amit Roy-Chowdhury

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | Extensive experiments validate that ROIDA achieves robust and consistent performance across multiple auxiliary datasets with diverse ratios of expert and non-expert demonstrations. ROIDA effectively leverages unlabeled auxiliary data, outperforming prior methods reliant on specific data assumptions. Our code is available at https://github.com/uditaghosh/roida. Experiments on the D4RL benchmark (Fu et al., 2020) show that ROIDA consistently achieves high performance across seven environments, using auxiliary datasets with varying proportions of expert data.
Researcher Affiliation | Collaboration | Udita Ghosh (EMAIL), University of California, Riverside; Dripta S. Raychaudhuri (EMAIL), AWS AI Labs; Jiachen Li (EMAIL), University of California, Riverside; Konstantinos Karydis (EMAIL), University of California, Riverside; Amit K. Roy-Chowdhury (EMAIL), University of California, Riverside
Pseudocode | Yes | The pseudo-code for the overall framework is presented in Algorithm 1. Algorithm 1: Robust Offline Imitation from Diverse Auxiliary Data (ROIDA). Require: Datasets DE and DO, hyperparameters η, α, β, γ
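The control flow implied by Algorithm 1 and the stated actor update frequency (t_freq = 3) can be sketched as follows. This is a minimal, dependency-free Python skeleton, not the authors' implementation: the three update callables are hypothetical placeholders standing in for the PU-learning discriminator step on DE vs. DO, the Q-function step, and the weighted-BC policy step.

```python
def train_roida(num_steps, t_freq, update_discriminator, update_q, update_policy):
    """Hypothetical training skeleton for Algorithm 1 (ROIDA).

    The discriminator and Q-function are updated at every step, while the
    policy (actor) is updated only once every t_freq steps, matching the
    t_freq = 3 reported in the experiment setup.
    """
    for step in range(1, num_steps + 1):
        update_discriminator(step)  # PU-learning discriminator on DE vs. DO
        update_q(step)              # Q-function update with DICE-style rewards
        if step % t_freq == 0:
            update_policy(step)     # weighted BC update of the policy
```

With t_freq = 3 and 9 steps, the policy is updated at steps 3, 6, and 9 while the discriminator and Q-function are updated at all 9 steps.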
Open Source Code | Yes | Our code is available at https://github.com/uditaghosh/roida.
Open Datasets | Yes | Experiments on the D4RL benchmark (Fu et al., 2020) show that ROIDA consistently achieves high performance across seven environments, using auxiliary datasets with varying proportions of expert data. All datasets are from D4RL (Fu et al., 2020), an offline IL benchmark.
Dataset Splits | No | The paper describes how the auxiliary datasets DE and DO are created by sampling trajectories from D4RL datasets, but it does not specify how these constructed datasets are further split into training/test/validation sets for the policy learning task itself. Evaluation is based on the learned policy's performance in the environment, not on held-out demonstration data splits.
Hardware Specification | Yes | All experiments are conducted using PyTorch on a single RTX 3090 GPU.
Software Dependencies | No | The paper mentions PyTorch as a software dependency but does not provide a version number, and it lists no other software with version numbers.
Experiment Setup | Yes | We largely follow the architecture and hyperparameters from DWBC (Xu et al., 2022) for a fair comparison. The policy network is a 3-layer MLP with 256 hidden units and tanh outputs. The discriminator is a 4-layer MLP with 128 hidden units, with sigmoid outputs clipped to [0.1, 0.9]. In the PU learning objective (Eq. 1), we replace the non-differentiable max with the softplus function to make the loss differentiable. The Q-function network is a 3-layer MLP with 256 units. All networks use ReLU activations and the Adam optimizer. The discriminator learning rate is 1e-4, with a cosine annealing scheduler. The policy and Q-function learning rate is 3e-4, with a policy weight decay of 0.005. The balancing factors α and β are set dynamically from loss ratios: with λ1 the batch-wise BC loss on expert data, λ2 the batch-wise weighted BC loss on auxiliary data, and λ3 the batch-wise Q-function loss, α = 0.01 · (λ1 / λ2) · (1 / 7.5) and β = 0.01 · (λ1 / λ3) · (1 / 7.5). The additional factor of 1/7.5, adopted from previous work, emphasizes the BC loss on expert data. The discount factor γ is 0.5. The actor update frequency t_freq is set to 3 for all environments. The DICE reward function bounds r(s, a) to [-2.2, 2.2]. For filtering high-quality data, we use τ = 1.
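Two details of the setup above lend themselves to a short sketch: the softplus surrogate for the non-differentiable max in the PU learning objective, and the dynamic computation of the balancing factors α and β from batch-wise loss ratios. The snippet below is a minimal, framework-free illustration (in practice these would operate on PyTorch tensors); the function names and signatures are our own, not from the released code.

```python
import math

def softplus(x):
    """Numerically stable softplus, log(1 + exp(x)): the smooth,
    differentiable surrogate used in place of max(0, x) in the
    PU learning objective."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def balance_factors(bc_expert_loss, bc_aux_loss, q_loss, emphasis=1.0 / 7.5):
    """Dynamic balancing factors per the stated formulas:
        alpha = 0.01 * (lambda1 / lambda2) * (1 / 7.5)
        beta  = 0.01 * (lambda1 / lambda3) * (1 / 7.5)
    where lambda1 is the batch-wise BC loss on expert data, lambda2 the
    batch-wise weighted BC loss on auxiliary data, and lambda3 the
    batch-wise Q-function loss. The 1/7.5 factor emphasizes the expert
    BC loss, as in the paper."""
    alpha = 0.01 * (bc_expert_loss / bc_aux_loss) * emphasis
    beta = 0.01 * (bc_expert_loss / q_loss) * emphasis
    return alpha, beta
```

Because α and β are recomputed from the current batch losses, the expert BC term keeps a fixed relative weight as the loss magnitudes drift during training.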