Markov Balance Satisfaction Improves Performance in Strictly Batch Offline Imitation Learning
Authors: Rishabh Agrawal, Nathan Dahlin, Rahul Jain, Ashutosh Nayyar
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a series of numerical experiments on Classic Control and MuJoCo environments, we demonstrate consistently superior empirical performance compared to many SOTA IL algorithms. |
| Researcher Affiliation | Academia | 1 University of Southern California 2 University at Albany, SUNY |
| Pseudocode | Yes | Algorithm 1: Markov Balance-based Imitation Learning (MBIL) |
| Open Source Code | Yes | Extended version https://github.com/rishabh-1086/MBIL |
| Open Datasets | Yes | We evaluate the empirical performance of our MBIL algorithm using the MuJoCo locomotion suite (Todorov, Erez, and Tassa 2012) and the classic control suite from OpenAI Gym (Brockman et al. 2016). For constructing the demonstration dataset D, we use data from (Kostrikov, Nachum, and Tompson 2020b), where the Generative Adversarial Imitation Learning algorithm (Ho and Ermon 2016) was applied for MuJoCo tasks. For classic control tasks, we utilize pretrained and hyperparameter-optimized agents from the RL Baselines Zoo (Raffin 2020), employing a PPO agent for LunarLander-v2, a DQN agent for CartPole-v1, and an A2C agent for Acrobot-v1. |
| Dataset Splits | Yes | Inspired by (Jarrett, Bica, and van der Schaar 2020), we trained algorithms until convergence on datasets of 1, 3, 7, 10, or 15 trajectories sampled from a pool of 1000 expert trajectories and recorded the average scores over 300 episodes for each algorithm, repeating this process 10 times with varied initializations and trajectories. MuJoCo Tasks. Following the methodology outlined in (Kostrikov, Nachum, and Tompson 2020a; Sun et al. 2021), we use a single demonstration trajectory, validate performance every 500 training iterations across 10 episodes, and report means and standard deviations from 5 random seeds. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | Yes | The transition dynamics models Pη and Tψ are implemented using Real NVPs (Dinh, Sohl-Dickstein, and Bengio 2017). We employ the publicly available framework version 0.2 (Ardizzone et al. 2018-2022), utilizing their GLOWCouplingBlock implementation. |
| Experiment Setup | Yes | A neural network (NN) with two hidden layers, each using the ReLU activation function, is used for representing the policy. For tasks with discrete actions (e.g., Classic Control tasks), the output layer has a dimension equal to the number of action dimensions and uses a softmax function to generate a probability distribution over actions given a state. In contrast, for MuJoCo tasks with continuous actions, the output consists of two separate layers: one for the mean and another for the standard deviations, each with a size equal to the action dimension. The policy is then modeled as a Gaussian distribution, with these parameters generated by the neural network. Training is performed with the Adam optimizer (Kingma and Ba 2015). Detailed implementation, including hyperparameters for MBIL and benchmark algorithms, is provided in the Appendix. The transition dynamics models Pη and Tψ are implemented using Real NVPs (Dinh, Sohl-Dickstein, and Bengio 2017). We employ the publicly available framework version 0.2 (Ardizzone et al. 2018-2022), utilizing their GLOWCouplingBlock implementation. Detailed parameters for these models are provided in Table 1. A central aspect of our approach is performing imitation learning in the low data regime. To facilitate this, we introduce Gaussian noise as a regularizer for training the expert MC and transition MDP, which enhances training stability with limited data. |
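For concreteness, the discrete-action policy architecture quoted above (two hidden ReLU layers, softmax output over actions) can be sketched in pure Python as below. The layer width, initialization scheme, and the class name `DiscretePolicy` are illustrative assumptions, not the paper's code; the actual hyperparameters are in the paper's Appendix.

```python
import math
import random


def relu(v):
    return [max(0.0, x) for x in v]


def linear(W, b, v):
    # W is an (out x in) weight matrix, b an out-dimensional bias vector.
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def softmax(v):
    # Shift by the max for numerical stability before exponentiating.
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def init_layer(n_out, n_in, rng):
    scale = 1.0 / math.sqrt(n_in)
    W = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


class DiscretePolicy:
    """Two hidden ReLU layers; softmax head gives P(action | state)."""

    def __init__(self, obs_dim, n_actions, hidden=64, seed=0):
        rng = random.Random(seed)
        self.l1 = init_layer(hidden, obs_dim, rng)
        self.l2 = init_layer(hidden, hidden, rng)
        self.head = init_layer(n_actions, hidden, rng)

    def action_probs(self, state):
        h = relu(linear(*self.l1, state))
        h = relu(linear(*self.l2, h))
        return softmax(linear(*self.head, h))
```

For the continuous-action (MuJoCo) case the paper replaces the softmax head with two parallel output layers producing the mean and standard deviation of a Gaussian over actions; the body of the network is the same.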
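The GLOWCouplingBlock the authors use is a GLOW-style variant of the Real NVP affine coupling layer. As an illustration only, here is a minimal pure-Python sketch of the basic affine coupling transform and its inverse; the toy `s_fn`/`t_fn` lambdas stand in for the learned conditioner networks and are not the paper's models.

```python
import math


def affine_coupling_forward(x, s_fn, t_fn):
    """Real NVP-style affine coupling: pass the first half of x through
    unchanged, and scale/shift the second half conditioned on the first.
    Returns (y, log|det J|), where the log-determinant is simply sum(s)."""
    d = len(x) // 2
    x1, x2 = x[:d], x[d:]
    s, t = s_fn(x1), t_fn(x1)
    y2 = [x2i * math.exp(si) + ti for x2i, si, ti in zip(x2, s, t)]
    return x1 + y2, sum(s)


def affine_coupling_inverse(y, s_fn, t_fn):
    """Exact inverse: y1 is untouched, so s and t can be recomputed from it."""
    d = len(y) // 2
    y1, y2 = y[:d], y[d:]
    s, t = s_fn(y1), t_fn(y1)
    x2 = [(y2i - ti) * math.exp(-si) for y2i, si, ti in zip(y2, s, t)]
    return y1 + x2


# Toy conditioners standing in for learned subnetworks.
s_fn = lambda h: [0.5 * hi for hi in h]
t_fn = lambda h: [hi - 1.0 for hi in h]

x = [0.3, -1.2, 0.7, 2.0]
y, logdet = affine_coupling_forward(x, s_fn, t_fn)
x_rec = affine_coupling_inverse(y, s_fn, t_fn)
```

The cheap, exact inverse and triangular Jacobian are what make such coupling blocks suitable for the density-modeling role the transition models Pη and Tψ play here.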