Preserving Expert-Level Privacy in Offline Reinforcement Learning

Authors: Navodita Sharma, Vishnu Vinod, Abhradeep Guha Thakurta, Alekh Agarwal, Borja Balle, Christoph Dann, Aravindan Raghuveer

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To provably protect the privacy of such experts, we propose a novel consensus-based expert-level differentially private offline RL training approach compatible with any existing offline RL algorithm. We prove rigorous differential privacy guarantees, while maintaining strong empirical performance. Unlike existing work in differentially private RL, we supplement the theory with proof-of-concept experiments on classic RL environments featuring large continuous state spaces, demonstrating substantial improvements over a natural baseline across multiple tasks.
Researcher Affiliation | Collaboration | Navodita Sharma* EMAIL Google DeepMind; Vishnu Vinod* EMAIL CeRAI, IIT Madras; Abhradeep Thakurta EMAIL Google DeepMind; Alekh Agarwal EMAIL Google Research; Borja Balle EMAIL Google DeepMind; Christoph Dann EMAIL Google Research; Aravindan Raghuveer EMAIL Google DeepMind
Pseudocode | Yes | Algorithm 1: Prefix Query (A_PQ) — tests whether the count of experts expected to execute a trajectory is large enough. Algorithm 2: Data Release (A_DR) — releasing the public dataset after privatisation. Algorithm 3: Selective DP-SGD for Offline RL. Algorithm 4: Expert-Level DP-SGD (outline).
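The prefix-query step described for Algorithm 1 resembles the standard differentially private above-threshold pattern: count the experts consistent with a trajectory prefix, perturb the count with Laplace noise, and compare against a threshold. A minimal pure-Python sketch under that assumption follows; the names `prefix_query` and `agrees_with_prefix` are hypothetical, and this is a generic illustration rather than the paper's exact A_PQ.

```python
import math
import random

def agrees_with_prefix(expert_policy, prefix):
    """Hypothetical check: does this expert's policy pick the same
    action as the trajectory prefix in every visited state?"""
    return all(expert_policy(state) == action for state, action in prefix)

def prefix_query(expert_policies, prefix, threshold, epsilon, rng=None):
    """Noisy-threshold test (sketch): count consistent experts, add
    Laplace(1/epsilon) noise, and compare against the threshold."""
    rng = rng or random.Random(0)
    count = sum(agrees_with_prefix(pi, prefix) for pi in expert_policies)
    # Sample Laplace noise with scale 1/epsilon via the inverse CDF.
    u = rng.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)
    return count + noise >= threshold
```

Here each expert policy is modelled as a callable from state to action; in the paper the consensus check operates over the trained expert policies themselves.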
Open Source Code | No | The paper describes the implementation details of their method and the DP-SGD baseline but does not provide any explicit statement about releasing their source code, nor does it include a link to a code repository. It only mentions using a third-party library: "We use the PLD accountant from the tensorflow_privacy library."
Open Datasets | Yes | Dataset Generation: We train 3000 experts each on the Cartpole, Acrobot, Lunar Lander and HIV Treatment environments. Cartpole, Acrobot and Lunar Lander are from the Gymnasium package (Towers et al., 2023). The HIV Treatment simulator is based on the implementation by Geramifard et al. (2015) of the model described in Ernst et al. (2006).
Dataset Splits | No | The paper describes how the overall offline dataset is generated by sampling 20 trajectories from modified experts in default environments. It also mentions a data filtering stage (Algorithm 2) that creates D^Π_stable and D^Π_unst subsets for training. However, it does not specify conventional train/validation/test splits (e.g., 80/10/10%) for the entire dataset used for evaluation of the final learned policy.
Hardware Specification | No | The paper details the neural network architectures used for Q-value functions and agents (e.g., "neural network with 2 hidden layers of 256 units each"). However, it does not provide any specific information regarding the hardware (e.g., CPU, GPU models, memory) used to run the experiments.
Software Dependencies | Yes | Cartpole, Acrobot and Lunar Lander are from the Gymnasium package (Towers et al., 2023). We use the PLD accountant from the tensorflow_privacy library.
Experiment Setup | Yes | We tune the following hyperparameters for our method: learning rate η, batch size b, probability of sampling from D^Π_unst during training (p), and DP-SGD noise multiplier (s), which is the standard deviation of the Gaussian noise applied to the gradients divided by the clipping threshold. We perform a grid search over all hyperparameters. The search spaces are: η ∈ {0.0001, 0.0005, 0.001, 0.005, 0.01}, b ∈ {64, 128, 256}, p ∈ {0.9, 0.8, 0.5, 0.0}, and s ∈ {10.0, 20.0, 30.0, 40.0, 50.0, 80.0}. For all the environments, we assume that pmin = 0.02. For DP-SGD updates, we clip all gradient norms to 1.0. Table 3: Best hyper-parameter settings across different environments, offline-RL algorithms and ε values.
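The DP-SGD update quoted above (clip each per-example gradient to L2 norm 1.0, then add Gaussian noise whose standard deviation is the noise multiplier s times the clipping threshold) can be sketched in plain Python. The function names and list-based gradients are illustrative only, not the paper's implementation.

```python
import math
import random

def clip_gradient(grad, clip_norm=1.0):
    """Rescale a per-example gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(params, per_example_grads, lr=0.001, clip_norm=1.0,
                noise_multiplier=20.0, rng=None):
    """One DP-SGD update (sketch): clip each gradient, sum, add
    N(0, (s * clip_norm)^2) noise per coordinate, average, and step."""
    rng = rng or random.Random(0)
    b = len(per_example_grads)
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    summed = [sum(gs) for gs in zip(*clipped)]
    noisy_mean = [(g_sum + rng.gauss(0.0, noise_multiplier * clip_norm)) / b
                  for g_sum in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]
```

With the paper's reported grid, e.g. b = 256 and s = 20.0, the noise added to the averaged gradient has per-coordinate standard deviation s · clip_norm / b; the privacy cost of composing such steps would then be tracked by an accountant such as the PLD accountant the authors cite.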