Preserving Expert-Level Privacy in Offline Reinforcement Learning
Authors: Navodita Sharma, Vishnu Vinod, Abhradeep Guha Thakurta, Alekh Agarwal, Borja Balle, Christoph Dann, Aravindan Raghuveer
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To provably protect the privacy of such experts, we propose a novel consensus-based expert-level differentially private offline RL training approach compatible with any existing offline RL algorithm. We prove rigorous differential privacy guarantees, while maintaining strong empirical performance. Unlike existing work in differentially private RL, we supplement the theory with proof-of-concept experiments on classic RL environments featuring large continuous state spaces, demonstrating substantial improvements over a natural baseline across multiple tasks. |
| Researcher Affiliation | Collaboration | Navodita Sharma* (Google DeepMind), Vishnu Vinod* (CeRAI, IIT Madras), Abhradeep Thakurta (Google DeepMind), Alekh Agarwal (Google Research), Borja Balle (Google DeepMind), Christoph Dann (Google Research), Aravindan Raghuveer (Google DeepMind) |
| Pseudocode | Yes | Algorithm 1 Prefix Query (A_PQ): tests if the count of experts expected to execute a trajectory is large enough; Algorithm 2 Data Release (A_DR): releasing a public dataset after privatisation; Algorithm 3 Selective DP-SGD for Offline RL; Algorithm 4 Expert-Level DP-SGD (outline) |
| Open Source Code | No | The paper describes the implementation details of their method and the DP-SGD baseline but does not provide any explicit statement about releasing their source code, nor does it include a link to a code repository. It only mentions using a third-party library: "We use the PLD accountant from the tensorflow_privacy library." |
| Open Datasets | Yes | Dataset Generation: We train 3000 experts each on the Cartpole, Acrobot, Lunar Lander and HIV Treatment environments. Cartpole, Acrobot and Lunar Lander are from the Gymnasium package (Towers et al., 2023). The HIV Treatment simulator is based on the implementation by Geramifard et al. (2015) of the model described in Ernst et al. (2006). |
| Dataset Splits | No | The paper describes how the overall offline dataset is generated by sampling 20 trajectories from modified experts in default environments. It also mentions a data filtering stage (Algorithm 2) that creates D^Π_stable and D^Π_unst subsets for training. However, it does not specify conventional train/validation/test splits (e.g., 80/10/10%) for the entire dataset used for evaluation of the final learned policy. |
| Hardware Specification | No | The paper details the neural network architectures used for Q-value functions and agents (e.g., "neural network with 2 hidden layers of 256 units each"). However, it does not provide any specific information regarding the hardware (e.g., CPU, GPU models, memory) used to run the experiments. |
| Software Dependencies | Yes | Cartpole, Acrobot and Lunar Lander are from the Gymnasium package (Towers et al., 2023). We use the PLD accountant from the tensorflow_privacy library. |
| Experiment Setup | Yes | We tune the following hyperparameters for our method: learning rate η, batch size b, probability of sampling from D^Π_unst during training (p), and DP-SGD noise multiplier (s), which is the standard deviation of the Gaussian noise applied to the gradients divided by the clipping threshold. We perform a grid search over all hyperparameters. The search spaces are: η ∈ {0.0001, 0.0005, 0.001, 0.005, 0.01}, b ∈ {64, 128, 256}, p ∈ {0.9, 0.8, 0.5, 0.0}, and s ∈ {10.0, 20.0, 30.0, 40.0, 50.0, 80.0}. For all the environments, we assume that pmin = 0.02. For DP-SGD updates, we clip all gradient norms to 1.0. Table 3: Best hyper-parameter settings across different environments, offline-RL algorithms and ε values. |
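The DP-SGD ingredients quoted above (per-gradient clipping to norm 1.0, Gaussian noise whose standard deviation is the noise multiplier s times the clipping threshold, and a grid search over η, b, p, s) can be sketched as follows. This is a minimal NumPy illustration of a generic DP-SGD step and the reported search grid, not the paper's implementation; the function name `dp_sgd_update` and the use of plain NumPy arrays are assumptions for the sketch.

```python
import itertools
import numpy as np

def dp_sgd_update(per_example_grads, clip_norm=1.0, noise_multiplier=10.0,
                  lr=0.001, rng=None):
    """One generic DP-SGD step (sketch).

    Each per-example gradient is clipped to L2 norm `clip_norm`; Gaussian
    noise with std = noise_multiplier * clip_norm (the paper's definition of
    s) is added to the summed gradient before averaging.
    Returns the parameter update -lr * noisy_mean_gradient.
    """
    rng = np.random.default_rng() if rng is None else rng
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    summed = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
    return -lr * (summed + noise) / len(per_example_grads)

# Hyperparameter grid matching the reported search spaces:
# learning rate η, batch size b, unstable-set sampling probability p,
# and noise multiplier s.
grid = list(itertools.product(
    [0.0001, 0.0005, 0.001, 0.005, 0.01],  # η
    [64, 128, 256],                         # b
    [0.9, 0.8, 0.5, 0.0],                   # p
    [10.0, 20.0, 30.0, 40.0, 50.0, 80.0],   # s
))
```

With noise_multiplier set to 0 the update reduces to a plain clipped-gradient SGD step, which makes the clipping behaviour easy to check in isolation; privacy accounting (e.g., via the PLD accountant in tensorflow_privacy, as the paper mentions) would be layered on top of such a loop.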