State-Constrained Offline Reinforcement Learning

Authors: Charles Alexander Hepburn, Yue Jin, Giovanni Montana

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Additionally, we introduce StaCQ, a deep learning algorithm that achieves state-of-the-art performance on the D4RL benchmark datasets and aligns with our theoretical propositions. We evaluate StaCQ against several model-free and model-based baselines on the D4RL benchmarking datasets from the OpenAI MuJoCo tasks (Todorov et al., 2012; Fu et al., 2020)."
Researcher Affiliation | Academia | "1Mathematics Institute, University of Warwick, Coventry, UK; 2Warwick Manufacturing Group, University of Warwick, Coventry, UK; 3Department of Statistics, University of Warwick, Coventry, UK; 4Alan Turing Institute, London, UK. EMAIL"
Pseudocode | Yes | "Algorithm 1 StaCQ"
Open Source Code | Yes | "The code can be found at https://github.com/CharlesHepburn1/State-Constrained-Offline-Reinforcement-Learning."
Open Datasets | Yes | "We evaluate StaCQ against several model-free and model-based baselines on the D4RL benchmarking datasets from the OpenAI MuJoCo tasks (Todorov et al., 2012; Fu et al., 2020)."
Dataset Splits | Yes | "For the MuJoCo locomotion tasks we evaluate our method over 5 seeds, each with 10 evaluation trajectories; for the Antmaze tasks we also evaluate over 5 seeds, but with 100 evaluation trajectories."
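The evaluation protocol quoted above (5 seeds, with 10 or 100 evaluation trajectories per seed) amounts to a two-level average: mean return within each seed, then mean across seeds. A minimal sketch, with a hypothetical function name not taken from the paper:

```python
def evaluate(per_seed_returns):
    """Two-level average over evaluation runs.

    per_seed_returns: list with one entry per seed, where each entry is a
    list of episode returns (e.g. 5 seeds x 10 trajectories for locomotion,
    5 seeds x 100 trajectories for Antmaze).
    """
    # Mean return within each seed.
    seed_means = [sum(returns) / len(returns) for returns in per_seed_returns]
    # Mean across seeds is the reported score.
    return sum(seed_means) / len(seed_means)

# Toy example with made-up returns (two seeds, two trajectories each):
score = evaluate([[1.0, 3.0], [2.0, 4.0]])
# score = mean([2.0, 3.0]) = 2.5
```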
Hardware Specification | Yes | "Our experiments were performed with a single GeForce GTX 3090 GPU and an Intel Core i9-11900K CPU at 3.50GHz."
Software Dependencies | No | "The actor, critic and reward model are represented as neural networks with two hidden layers of size 256 and ReLU activation. They are trained using the Adam optimiser (Kingma & Ba, 2014) with learning rates of 3e-4; the actor also uses a cosine scheduler."
Experiment Setup | Yes | "For our experiments we use ϵ = 0.1 for the state reachability criteria... The actor, critic and reward model are represented as neural networks with two hidden layers of size 256 and ReLU activation. They are trained using the Adam optimiser (Kingma & Ba, 2014) with learning rates of 3e-4; the actor also uses a cosine scheduler. We use an ensemble of 4 critic networks and take the minimum value across the networks. We also use soft parameter updates for the target critic network with parameter τ = 0.005, and a discount factor of γ = 0.99. For the locomotion tasks we use a shared target value to update the critic towards, whereas for the Antmaze tasks we use independent target values for each critic. Both the inverse and forward dynamics models are represented as neural networks with three hidden layers of size 256 and ReLU activation. They are trained using the Adam optimiser with a learning rate of 4e-3 and a batch size of 256. We use an ensemble of 7 forward models and 3 inverse models and take the final prediction as the average across the ensemble."
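Two of the quoted setup details lend themselves to a short sketch: the TD target that takes the minimum over the 4-critic ensemble (with discount γ = 0.99), and the soft (Polyak) target-network update with τ = 0.005. This is an illustrative sketch using scalar stand-ins for Q-values and parameters, not the paper's implementation:

```python
GAMMA = 0.99  # discount factor quoted above
TAU = 0.005   # soft-update coefficient quoted above

def td_target(reward, next_q_ensemble, gamma=GAMMA):
    """TD target using the minimum over the critic ensemble's next-state values."""
    return reward + gamma * min(next_q_ensemble)

def soft_update(target_params, online_params, tau=TAU):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * p + (1 - tau) * t for p, t in zip(online_params, target_params)]

# Toy example with made-up Q-values from a 4-critic ensemble:
target = td_target(reward=1.0, next_q_ensemble=[10.0, 9.5, 11.0, 9.8])
# target = 1.0 + 0.99 * 9.5 = 10.405
```

Taking the minimum over the ensemble is the standard pessimistic (clipped double-Q style) target; the small τ means the target network trails the online critic slowly, which stabilises bootstrapping.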