State-Constrained Offline Reinforcement Learning

Authors: Charles Alexander Hepburn, Yue Jin, Giovanni Montana

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Additionally, we introduce StaCQ, a deep learning algorithm that achieves state-of-the-art performance on the D4RL benchmark datasets and aligns with our theoretical propositions. We evaluate StaCQ against several model-free and model-based baselines on the D4RL benchmarking datasets from the OpenAI MuJoCo tasks (Todorov et al., 2012; Fu et al., 2020)."
Researcher Affiliation | Academia | "1Mathematics Institute, University of Warwick, Coventry, UK; 2Warwick Manufacturing Group, University of Warwick, Coventry, UK; 3Department of Statistics, University of Warwick, Coventry, UK; 4Alan Turing Institute, London, UK. EMAIL"
Pseudocode | Yes | "Algorithm 1 StaCQ"
Open Source Code | Yes | "The code can be found at https://github.com/CharlesHepburn1/State-Constrained-Offline-Reinforcement-Learning."
Open Datasets | Yes | "We evaluate StaCQ against several model-free and model-based baselines on the D4RL benchmarking datasets from the OpenAI MuJoCo tasks (Todorov et al., 2012; Fu et al., 2020)."
Dataset Splits | Yes | "For the MuJoCo locomotion tasks we evaluate our method over 5 seeds, each with 10 evaluation trajectories; for the Antmaze tasks we also evaluate over 5 seeds, but with 100 evaluation trajectories."
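The evaluation protocol quoted above (5 seeds, with 10 or 100 evaluation trajectories per seed) amounts to a two-level average: mean return within each seed, then mean across seeds. A minimal sketch, with a hypothetical function name not taken from the paper:

```python
def evaluate(per_seed_returns):
    """Two-level average over evaluation runs.

    per_seed_returns: list with one entry per seed, where each entry is a
    list of episode returns (e.g. 5 seeds x 10 trajectories for locomotion,
    5 seeds x 100 trajectories for Antmaze).
    """
    # Mean return within each seed.
    seed_means = [sum(returns) / len(returns) for returns in per_seed_returns]
    # Mean across seeds is the reported score.
    return sum(seed_means) / len(seed_means)

# Toy example with made-up returns (two seeds, two trajectories each):
score = evaluate([[1.0, 3.0], [2.0, 4.0]])
# score = mean([2.0, 3.0]) = 2.5
```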
Hardware Specification | Yes | "Our experiments were performed with a single GeForce GTX 3090 GPU and an Intel Core i9-11900K CPU at 3.50GHz."
Software Dependencies | No | "The actor, critic and reward model are represented as neural networks with two hidden layers of size 256 and ReLU activation. They are trained using the Adam optimiser (Kingma & Ba, 2014) with learning rates of 3e-4; the actor also uses a cosine scheduler."
Experiment Setup | Yes | "For our experiments we use ϵ = 0.1 for the state reachability criteria... The actor, critic and reward model are represented as neural networks with two hidden layers of size 256 and ReLU activation. They are trained using the Adam optimiser (Kingma & Ba, 2014) with learning rates of 3e-4; the actor also uses a cosine scheduler. We use an ensemble of 4 critic networks and take the minimum value across the networks. We also use soft parameter updates for the target critic network with parameter τ = 0.005, and a discount factor of γ = 0.99. For the locomotion tasks we use a shared target value to update the critic towards, whereas for the Antmaze tasks we use independent target values for each critic. Both the inverse and forward dynamics models are represented as neural networks with three hidden layers of size 256 and ReLU activation. They are trained using the Adam optimiser with a learning rate of 4e-3 and a batch size of 256. We use an ensemble of 7 forward models and 3 inverse models and take the final prediction as the average across the ensemble."
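Two of the quoted setup details lend themselves to a short sketch: the TD target that takes the minimum over the 4-critic ensemble (with discount γ = 0.99), and the soft (Polyak) target-network update with τ = 0.005. This is an illustrative sketch using scalar stand-ins for Q-values and parameters, not the paper's implementation:

```python
GAMMA = 0.99  # discount factor quoted above
TAU = 0.005   # soft-update coefficient quoted above

def td_target(reward, next_q_ensemble, gamma=GAMMA):
    """TD target using the minimum over the critic ensemble's next-state values."""
    return reward + gamma * min(next_q_ensemble)

def soft_update(target_params, online_params, tau=TAU):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * p + (1 - tau) * t for p, t in zip(online_params, target_params)]

# Toy example with made-up Q-values from a 4-critic ensemble:
target = td_target(reward=1.0, next_q_ensemble=[10.0, 9.5, 11.0, 9.8])
# target = 1.0 + 0.99 * 9.5 = 10.405
```

Taking the minimum over the ensemble is the standard pessimistic (clipped double-Q style) target; the small τ means the target network trails the online critic slowly, which stabilises bootstrapping.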