Neural Stochastic Differential Equations for Uncertainty-Aware Offline RL

Authors: Cevahir Koprulu, Franck Djeumou, Ufuk Topcu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results in D4RL and NeoRL MuJoCo benchmarks evidence that NUNO outperforms state-of-the-art methods in low-quality datasets by up to 93% while matching or surpassing their performance by up to 55% in some high-quality counterparts. We empirically evaluate NUNO against state-of-the-art (SOTA) offline model-based and model-free approaches in continuous control benchmarks, namely MuJoCo datasets in D4RL (Fu et al., 2020) and NeoRL (Qin et al., 2022).
Researcher Affiliation | Academia | The University of Texas at Austin; Rensselaer Polytechnic Institute
Pseudocode | No | The paper describes the methodology using textual descriptions, mathematical equations, and figures. It does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Correspondence to: Cevahir Koprulu (EMAIL), Franck Djeumou (EMAIL). REPRODUCIBILITY STATEMENT: Lastly, we put the core code of our approach in the supplementary details. The code includes dataloaders, execution code, and links to download all the datasets and models used.
Open Datasets | Yes | Our empirical results in D4RL and NeoRL MuJoCo benchmarks evidence that NUNO outperforms state-of-the-art methods in low-quality datasets by up to 93% while matching or surpassing their performance by up to 55% in some high-quality counterparts. We run experiments on 12 D4RL tasks, combining three MuJoCo environments (halfcheetah, hopper, and walker2d) and four datasets (random, medium, medium-replay, and medium-expert) per environment. We further evaluate NUNO in NeoRL (Qin et al., 2022).
Dataset Splits | No | The paper specifies using D4RL and NeoRL datasets and mentions types like 'random', 'medium', 'medium-replay', 'medium-expert', and 'low', 'medium', 'high' datasets with '1000 trajectories each' for NeoRL. However, it does not explicitly provide information on how these datasets were split into training, validation, or test sets for the paper's experiments, or refer to standard splits with specific details (e.g., 80/10/10 percentages or counts).
Hardware Specification | Yes | We train RL agents on a cluster with NVIDIA RTX A5000 GPUs and an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz. We train all our models on a laptop computer with an Intel i9-9900 3.1 GHz CPU with 32 GB of RAM and a GeForce RTX 2060 (TU106).
Software Dependencies | Yes | We implement all the numerical experiments using the Python library JAX (Bradbury et al., 2018), in order to take advantage of its automatic differentiation and just-in-time compilation features. We use Python 3.8.5 for the experiments.
Experiment Setup | Yes | NUNO has four hyperparameters: real ratio β, rollout length h, CVaR coefficient α to set a truncation threshold, and uncertainty penalization threshold λpen. The real ratio parameter β refers to the ratio of samples from the real dataset in a mini-batch used to update the SAC policy. We set β to 0.05, as in TATU+MOPO, for all tasks in our experiments. For the rest of the parameters, we run a search over the following set of values: h ∈ {5, 10, 15, 20}, α ∈ {0.9, 0.95, 0.98, 0.99, 1.0}, and λpen ∈ {0.001, 0.1, 1}. Our hyperparameter search procedure starts by tuning the rollout length h with α = 0.9 and λpen = 0.001. Using the best-performing rollout length, namely the one yielding the highest human-normalized score, we tune α. Finally, we run a search for λpen. Table 3 reports the best-performing values for each task in our experiments. We use the Adam optimizer (Kingma & Ba, 2014) for all optimization problems. We use the default hyperparameters for the optimizer, except for the learning rate, which we linearly decay from 0.01 to 0.001 over the first 5000 gradient steps. We use early stopping criteria for all our experiments. We use a batch size of 128 for the neural SDE training. For the neural SDE architecture, we parameterize ηϕ as a neural network with two hidden layers of size 64 with swish activation functions. We parameterize the uncertainty term σϕ as a neural network with two hidden layers of size 256 with tanh activation functions. The reward drift term f_θ^reward is parameterized as a neural network with three hidden layers of size 64 with swish activation functions, while the other drift terms are parameterized with three hidden layers of size 256 and swish activation functions. Finally, the strong convexity neural network is parameterized with two hidden layers of size 32 with swish activation functions.
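The network parameterizations quoted above can be sketched as stand-alone MLPs. This is a minimal NumPy illustration, not the paper's implementation (which is in JAX): the layer widths and activations follow the quoted setup, while the input/output dimensions, initialization scale, and output size of the reward and convexity heads are assumptions made for the example.

```python
import numpy as np

def swish(x):
    # swish(x) = x * sigmoid(x), as used for the drift networks
    return x * (1.0 / (1.0 + np.exp(-x)))

def init_mlp(sizes, rng):
    # Small random weights; training details (Adam, LR decay) are omitted.
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x, act):
    for W, b in params[:-1]:
        x = act(x @ W + b)
    W, b = params[-1]
    return x @ W + b  # linear output layer

rng = np.random.default_rng(0)
state_dim, out_dim = 17, 17  # illustrative dimensions, not from the paper

# eta_phi: two hidden layers of size 64, swish activations
eta_phi = init_mlp([state_dim, 64, 64, out_dim], rng)
# sigma_phi (uncertainty term): two hidden layers of size 256, tanh activations
sigma_phi = init_mlp([state_dim, 256, 256, out_dim], rng)
# reward drift term: three hidden layers of size 64, swish (scalar output assumed)
f_reward = init_mlp([state_dim, 64, 64, 64, 1], rng)
# other drift terms: three hidden layers of size 256, swish
f_drift = init_mlp([state_dim, 256, 256, 256, out_dim], rng)
# strong convexity network: two hidden layers of size 32, swish (output size assumed)
f_convex = init_mlp([state_dim, 32, 32, 1], rng)

x = rng.standard_normal(state_dim)
print(mlp(eta_phi, x, swish).shape)    # (17,)
print(mlp(sigma_phi, x, np.tanh).shape)  # (17,)
print(mlp(f_reward, x, swish).shape)   # (1,)
```

How these heads enter the SDE drift and diffusion, and how rollouts are truncated via the CVaR threshold, is specified in the paper itself; this sketch only fixes the shapes.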