State-Constrained Offline Reinforcement Learning
Authors: Charles Alexander Hepburn, Yue Jin, Giovanni Montana
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Additionally, we introduce StaCQ, a deep learning algorithm that achieves state-of-the-art performance on the D4RL benchmark datasets and aligns with our theoretical propositions. We evaluate StaCQ against several model-free and model-based baselines on the D4RL benchmarking datasets from the OpenAI MuJoCo tasks (Todorov et al., 2012; Fu et al., 2020). |
| Researcher Affiliation | Academia | 1Mathematics Institute, University of Warwick, Coventry, UK 2Warwick Manufacturing Group, University of Warwick, Coventry, UK 3Department of Statistics, University of Warwick, Coventry, UK 4Alan Turing Institute, London, UK EMAIL |
| Pseudocode | Yes | Algorithm 1 StaCQ |
| Open Source Code | Yes | The code can be found at https://github.com/CharlesHepburn1/State-Constrained-Offline-Reinforcement-Learning. |
| Open Datasets | Yes | We evaluate StaCQ against several model-free and model-based baselines on the D4RL benchmarking datasets from the OpenAI MuJoCo tasks (Todorov et al., 2012; Fu et al., 2020). |
| Dataset Splits | Yes | For the MuJoCo locomotion tasks we evaluate our method over 5 seeds each with 10 evaluation trajectories; whereas for the Antmaze tasks we also evaluate over 5 seeds but with 100 evaluation trajectories. |
| Hardware Specification | Yes | Our experiments were performed with a single GeForce GTX 3090 GPU and an Intel Core i9-11900K CPU at 3.50GHz. |
| Software Dependencies | No | The actor, critic and reward model are represented as neural networks with two hidden layers of size 256 and ReLU activation. They are trained using the ADAM optimiser (Kingma & Ba, 2014) and have learning rates 3e-4; the actor also has a cosine scheduler. |
| Experiment Setup | Yes | For our experiments we use ϵ = 0.1 for the state reachability criteria... The actor, critic and reward model are represented as neural networks with two hidden layers of size 256 and ReLU activation. They are trained using the ADAM optimiser (Kingma & Ba, 2014) and have learning rates 3e-4; the actor also has a cosine scheduler. We use an ensemble of 4 critic networks and take the minimum value across the networks. Also we use soft parameter updates for the target critic network with parameter τ = 0.005, and we use a discount factor of γ = 0.99. For the locomotion tasks we use a shared target value to update the critic towards, whereas for the Antmaze tasks we use independent target values for each critic value. Both the inverse and forward dynamics models are represented as neural networks with three hidden layers of size 256 and ReLU activation. They are trained using the ADAM optimiser with a learning rate of 4e-3 and a batch size of 256. We use an ensemble of 7 forward models and 3 inverse models and then take a final prediction as an average across the ensemble. |
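Two details from the reported setup are easy to misread, so the sketch below spells them out: taking the minimum value across the 4-critic ensemble, and the soft (Polyak) target update with τ = 0.005. This is a minimal NumPy illustration under our own function names (`min_ensemble_value`, `soft_update`), not the authors' StaCQ implementation.

```python
import numpy as np

TAU = 0.005    # soft-update coefficient reported in the paper
GAMMA = 0.99   # discount factor reported in the paper (unused in this toy sketch)

def min_ensemble_value(q_values):
    """q_values: array of shape (n_critics, batch); elementwise min across critics."""
    return np.min(q_values, axis=0)

def soft_update(target_params, online_params, tau=TAU):
    """Polyak-average each target parameter array toward its online counterpart."""
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, online_params)]

# Toy usage: 4 critics scoring a batch of 3 state-action pairs.
q = np.array([[1.0, 2.0, 3.0],
              [0.5, 2.5, 2.0],
              [1.5, 1.0, 4.0],
              [0.9, 3.0, 2.5]])
q_min = min_ensemble_value(q)          # -> [0.5, 1.0, 2.0]

# One soft update step moves a zero-initialised target slightly toward the online net.
target = soft_update([np.zeros(2)], [np.ones(2)])   # each entry becomes 0.005
```

Taking the pessimistic minimum over the critic ensemble is a standard overestimation-control device in offline RL; the τ = 0.005 soft update keeps the bootstrapping target slowly moving and stable.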