An Optimal Discriminator Weighted Imitation Perspective for Reinforcement Learning

Authors: Haoran Xu, Shuozhe Li, Harshit Sikchi, Scott Niekum, Amy Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify the effectiveness of IDRL on various kinds of offline datasets, including D4RL datasets and more realistic corrupted demonstrations. IDRL beats strong Primal-RL and Dual-RL baselines in terms of both performance and stability, on all datasets. Project page at https://ryanxhr.github.io/IDRL/.
Researcher Affiliation | Collaboration | Haoran Xu¹, Shuozhe Li¹, Harshit Sikchi¹, Scott Niekum², Amy Zhang¹,³ (¹University of Texas at Austin, ²UMass Amherst, ³Meta AI)
Pseudocode | Yes | Algorithm 1 Iterative Dual-RL (IDRL)
1: Initialize value functions Q_ϕ1, V_ϕ2, U_ψ1, W_ψ2, policy network π_θ; require α and dataset D_1 = D
2: for k = 1, 2, ..., M do
3:   for t = 1, 2, ..., N1 do
4:     Sample transitions (s, a, r, s′) ∼ D_k
5:     Update Q_ϕ1 and V_ϕ2 by Eq. (4) and Eq. (3)
6:   end for
7:   Get action ratio w_k(a|s) by Eq. (5)
8:   for t = 1, 2, ..., N2 do
9:     Sample transitions (s, a, s′) ∼ D_k
10:    Update U_ψ1 and W_ψ2 by Eq. (11) and Eq. (12)
11:   end for
12:   Get state-action ratio w_k(s, a) by Eq. (12) and set D_{k+1} = {(s, a, r, s′) ∈ D_k | w_k(s, a) > 0}
13: end for
14: Learn π_θ by Eq. (6) using D_M and w_M(s, a)
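The control flow of Algorithm 1 can be sketched in Python. The concrete update rules (Eqs. 3-6, 11-12) are not reproduced in this report, so the four callables below are hypothetical placeholders standing in for the paper's actual objectives, not the authors' code.

```python
# Sketch of the IDRL outer loop (Algorithm 1). `fit_value_fns`, `action_ratio`,
# `fit_dual_fns`, and `state_action_ratio` are hypothetical placeholders for
# the paper's update rules (Eqs. 3-6, 11-12), which are not given here.

def idrl_outer_loop(dataset, num_iters, fit_value_fns, action_ratio,
                    fit_dual_fns, state_action_ratio):
    """Iteratively shrink the dataset to the support where w_k(s, a) > 0."""
    d_k = list(dataset)
    for k in range(num_iters):
        fit_value_fns(d_k)                 # lines 3-6: fit Q, V on D_k
        w_action = action_ratio(d_k)       # line 7: w_k(a|s) via Eq. (5)
        fit_dual_fns(d_k, w_action)        # lines 8-11: fit U, W
        w_sa = state_action_ratio(d_k)     # line 12: w_k(s, a) via Eq. (12)
        # line 12 (cont.): D_{k+1} keeps only transitions with positive ratio
        d_k = [t for t, w in zip(d_k, w_sa) if w > 0]
    return d_k  # line 14: learn pi_theta on the final filtered dataset
```

The essential structural point is that each iteration refits the value and dual networks on the current dataset, then discards out-of-support transitions before the next iteration.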
Open Source Code | No | Project page at https://ryanxhr.github.io/IDRL/. The text provides a project page URL, but it does not explicitly state that the source code for the methodology is available at this URL, nor is the URL itself a direct link to a code repository.
Open Datasets | Yes | We verify the effectiveness of IDRL on various kinds of offline datasets, including D4RL datasets and more realistic corrupted demonstrations. ... D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint, 2020.
Dataset Splits | No | The paper describes how corrupted datasets are created by mixing expert and random transitions in specific proportions (e.g., 1%, 5%, and 10% expert ratios with 1,000,000 total transitions in Table 4). However, it does not explicitly provide standard training/validation/test splits for these or the D4RL datasets, which are required for reproducibility in typical model evaluation.
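The corrupted-dataset construction described above (mixing expert and random transitions at a fixed ratio) can be sketched as follows. The function name and record layout are illustrative assumptions; the paper does not publish its mixing code.

```python
import random

def make_corrupted_dataset(expert, rand, total, expert_ratio, seed=0):
    """Mix expert and random transitions at a fixed expert ratio
    (e.g. 0.01, 0.05, or 0.10, as in the paper's Table 4 setup)."""
    rng = random.Random(seed)
    n_expert = int(total * expert_ratio)
    n_random = total - n_expert
    # Subsample each pool without replacement, then shuffle the mixture.
    mixed = rng.sample(expert, n_expert) + rng.sample(rand, n_random)
    rng.shuffle(mixed)
    return mixed
```

For the paper's settings, `total` would be 1,000,000 and `expert_ratio` one of 0.01, 0.05, or 0.10.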
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, or memory specifications).
Software Dependencies | No | We implemented IDRL using PyTorch and ran it on all datasets. We used the Adam optimizer (Kingma & Ba, 2015). The paper mentions PyTorch and the Adam optimizer, but does not specify version numbers for any software libraries, which are crucial for reproducibility.
Experiment Setup | Yes | Both the policy and value networks are 3-layer MLPs with 256 hidden units. We used the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 1e-4. λ was set to 0.6; IDRL was run for 2 iterations, each for 10^6 steps. The first 500k steps of each iteration are dedicated to learning the action ratio, followed by the remaining steps to optimize for the state-action distribution ratio. ... In Mujoco locomotion tasks, we computed the average mean returns over 10 evaluations every 5e4 training steps, across 7 different seeds. For Antmaze and Kitchen tasks, we calculated the average over 50 evaluations every 5e4 training steps, also across 7 seeds. ... We also use a target network with soft update weight 5e-3 for the Q-function. ... The values of λ for all datasets are listed in Table 3.
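The target-network update reported above can be illustrated in a few lines of plain Python. Parameter lists stand in for actual PyTorch tensors; this is a sketch of Polyak averaging with the paper's soft update weight 5e-3, not the authors' implementation.

```python
TAU = 5e-3  # soft update weight for the target Q-network, as reported

def soft_update(target_params, online_params, tau=TAU):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * p + (1.0 - tau) * t
            for t, p in zip(target_params, online_params)]
```

With tau = 5e-3, the target network tracks the online Q-network slowly, which is the standard stabilization choice the reported setup follows.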