An Optimal Discriminator Weighted Imitation Perspective for Reinforcement Learning
Authors: Haoran Xu, Shuozhe Li, Harshit Sikchi, Scott Niekum, Amy Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify the effectiveness of IDRL on various kinds of offline datasets, including D4RL datasets and more realistic corrupted demonstrations. IDRL beats strong Primal-RL and Dual-RL baselines in terms of both performance and stability, on all datasets. Project page at https://ryanxhr.github.io/IDRL/. |
| Researcher Affiliation | Collaboration | Haoran Xu¹, Shuozhe Li¹, Harshit Sikchi¹, Scott Niekum², Amy Zhang¹˒³ — ¹University of Texas at Austin, ²UMass Amherst, ³Meta AI |
| Pseudocode | Yes | Algorithm 1 Iterative Dual-RL (IDRL). 1: Initialize value functions Q_ϕ1, V_ϕ2, U_ψ1, W_ψ2, policy network π_θ; require α and dataset D_1 = D. 2: for k = 1, 2, ..., M do 3: for t = 1, 2, ..., N_1 do 4: Sample transitions (s, a, r, s′) ∼ D_k 5: Update Q_ϕ1 and V_ϕ2 by (4) and (3) 6: end for 7: Get action ratio w_k(a\|s) by Eq. (5) 8: for t = 1, 2, ..., N_2 do 9: Sample transitions (s, a, s′) ∼ D_k 10: Update U_ψ1 and W_ψ2 by (11) and (12) 11: end for 12: Get state-action ratio w_k(s, a) by Eq. (12) and set D_{k+1} = {(s, a, r, s′) ∈ D_k \| w_k(s, a) > 0} 13: end for 14: Learn π_θ by Eq. (6) using D_M and w_M(s, a) |
| Open Source Code | No | Project page at https://ryanxhr.github.io/IDRL/. The text provides a project page URL, but it does not explicitly state that the source code for the methodology is available at this URL, nor is the URL itself a direct link to a code repository. |
| Open Datasets | Yes | We verify the effectiveness of IDRL on various kinds of offline datasets, including D4RL datasets and more realistic corrupted demonstrations. ... D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint, 2020. |
| Dataset Splits | No | The paper describes how corrupted datasets are created by mixing expert and random transitions with specific proportions (e.g., 1%, 5%, 10% expert ratios with total transitions of 1,000,000 in Table 4). However, it does not explicitly provide standard training/test/validation splits for these or the D4RL datasets, which are required for reproducibility in the context of typical model evaluation. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, or memory specifications). |
| Software Dependencies | No | We implemented IDRL using PyTorch and ran it on all datasets. We used the Adam optimizer (Kingma & Ba, 2015). The paper mentions PyTorch and the Adam optimizer, but does not specify version numbers for any software libraries, which are crucial for reproducibility. |
| Experiment Setup | Yes | Both the policy and value networks are 3-layer MLPs with 256 hidden units. We used the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 1e-4. The λ was set to 0.6. We ran 2 iterations, each for 10^6 steps. The first 500k steps of each iteration are dedicated to learning the action ratio, followed by the remaining steps to optimize for the state-action distribution ratio. ... In Mujoco locomotion tasks, we computed the average mean returns over 10 evaluations every 5e4 training steps, across 7 different seeds. For Antmaze and Kitchen tasks, we calculated the average over 50 evaluations every 5e4 training steps, also across 7 seeds. ... We also use a target network with soft update weight 5e-3 for Q-function. ... The values of λ for all datasets are listed in Table 3. |
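The dataset-filtering step in line 12 of Algorithm 1 (keep only transitions whose state-action ratio w_k(s, a) is positive) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `filter_dataset`, `ratio_fn`, and the toy arrays are hypothetical stand-ins for the learned ratio and the offline dataset D_k.

```python
import numpy as np

def filter_dataset(dataset, ratio_fn):
    """Line 12 of Algorithm 1: D_{k+1} = {(s, a, r, s') in D_k | w_k(s, a) > 0}.
    `dataset` is a dict of arrays aligned along the first (transition) axis;
    `ratio_fn` stands in for the learned state-action ratio w_k."""
    w = ratio_fn(dataset["s"], dataset["a"])
    mask = w > 0  # boolean mask over transitions
    return {key: val[mask] for key, val in dataset.items()}

# Toy dataset D_k with 8 transitions (hypothetical shapes).
rng = np.random.default_rng(0)
D = {
    "s": rng.normal(size=(8, 3)),
    "a": rng.normal(size=(8, 2)),
    "r": rng.normal(size=8),
    "s_next": rng.normal(size=(8, 3)),
}
# Hypothetical stand-in for w_k(s, a); in IDRL this is derived from Eq. (12).
ratio = lambda s, a: s.sum(axis=1) + a.sum(axis=1)
D_next = filter_dataset(D, ratio)  # the pruned dataset D_{k+1}
```

Every array in the returned dict is filtered by the same mask, so the transition tuples stay aligned across keys.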
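The setup row mentions a target network for the Q-function with soft update weight 5e-3. A minimal sketch of such a Polyak (soft) update, with parameters represented as plain numpy arrays rather than the paper's PyTorch modules; the parameter names here are illustrative only:

```python
import numpy as np

def soft_update(target_params, online_params, tau=5e-3):
    """Polyak-average the online Q-network parameters into the target network:
    target <- (1 - tau) * target + tau * online, with tau = 5e-3 as in the paper."""
    return {name: (1.0 - tau) * target_params[name] + tau * online_params[name]
            for name in target_params}

# Toy parameter dicts standing in for network weights.
target = {"w": np.zeros((2, 2)), "b": np.zeros(2)}
online = {"w": np.ones((2, 2)), "b": np.ones(2)}
target = soft_update(target, online)  # each entry moves 0.5% toward online
```

With tau this small, the target network tracks the online Q-function slowly, which stabilizes the bootstrapped value targets.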