C2IQL: Constraint-Conditioned Implicit Q-learning for Safe Offline Reinforcement Learning
Authors: Zifan Liu, Xinran Li, Jun Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results on DSRL benchmarks demonstrate the superiority of C2IQL over baseline methods, achieving higher rewards while satisfying safety constraints. "We evaluate C2IQL in Bullet Safety-Gym (Gronauer, 2022) and Safety Gymnasium (Ji et al., 2023) with DSRL datasets under different threshold conditions." |
| Researcher Affiliation | Academia | Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China. Correspondence to: Jun Zhang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 (Cost Reconstruction Model) and Algorithm 2 (C2IQL) are presented in the paper, outlining structured steps for the proposed methods. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository. |
| Open Datasets | Yes | Environments and Datasets. We evaluate C2IQL in Bullet Safety-Gym (Gronauer, 2022)... We use the DSRL (Liu et al., 2023a) dataset, which follows the D4RL (Fu et al., 2020) benchmark format. |
| Dataset Splits | No | The paper mentions using DSRL datasets but does not explicitly provide details about specific training, validation, or test splits (e.g., percentages, sample counts, or references to predefined splits). |
| Hardware Specification | Yes | Experiments are carried out on NVIDIA GeForce RTX 3080 GPUs. |
| Software Dependencies | No | The paper does not provide specific software dependency details, such as library names with version numbers (e.g., Python 3.8, PyTorch 1.9), which are necessary for replication. |
| Experiment Setup | Yes | For C2IQL, the structure and most hyperparameters follow IQL (Kostrikov et al., 2022). The discount factor of the reward is fixed at 0.99 and the number of discount factors for the cost is 3. ... For the cost reconstruction model, we use a 5-layer MLP with hidden dimensions of 512 for each layer. ... We pre-train the reconstruction model for 1e6 epochs for each environment. Table 4 hyperparameters: κ1 = 0.7; κ2 = 0.9; γ (reward) = 0.99; m = 3; batch size = 512; learning rate of V = 1e-3; learning rate of Q = 1e-3; learning rate of π = 3e-4; training steps = 4e5; testing frequency = 5e3. |
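Since the source code is not released, the hyperparameters reported in Table 4 can be transcribed into a configuration sketch as a starting point for reproduction attempts. The dictionary keys below are illustrative names chosen here, not identifiers from the authors' implementation:

```python
# Hyperparameters transcribed from Table 4 of the paper.
# Key names are illustrative; the authors' code is not publicly available.
C2IQL_HPARAMS = {
    "kappa_1": 0.7,
    "kappa_2": 0.9,
    "gamma_reward": 0.99,      # discount factor for the reward
    "num_cost_discounts": 3,   # m: number of discount factors for the cost
    "batch_size": 512,
    "lr_v": 1e-3,              # learning rate of the value network V
    "lr_q": 1e-3,              # learning rate of the Q network
    "lr_pi": 3e-4,             # learning rate of the policy π
    "training_steps": int(4e5),
    "testing_frequency": int(5e3),
}

# With 4e5 training steps and evaluation every 5e3 steps,
# a run produces 80 evaluation points.
num_evaluations = (
    C2IQL_HPARAMS["training_steps"] // C2IQL_HPARAMS["testing_frequency"]
)
print(num_evaluations)  # → 80
```

A reproduction would additionally need the 5-layer, 512-unit MLP for the cost reconstruction model described in the setup; its exact activation functions and optimizer settings are not specified in the excerpt.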