Preserving Expert-Level Privacy in Offline Reinforcement Learning

Authors: Navodita Sharma, Vishnu Vinod, Abhradeep Guha Thakurta, Alekh Agarwal, Borja Balle, Christoph Dann, Aravindan Raghuveer

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To provably protect the privacy of such experts, we propose a novel consensus-based expert-level differentially private offline RL training approach compatible with any existing offline RL algorithm. We prove rigorous differential privacy guarantees, while maintaining strong empirical performance. Unlike existing work in differentially private RL, we supplement the theory with proof-of-concept experiments on classic RL environments featuring large continuous state spaces, demonstrating substantial improvements over a natural baseline across multiple tasks.
Researcher Affiliation | Collaboration | Navodita Sharma* EMAIL Google DeepMind; Vishnu Vinod* EMAIL CeRAI, IIT Madras; Abhradeep Thakurta EMAIL Google DeepMind; Alekh Agarwal EMAIL Google Research; Borja Balle EMAIL Google DeepMind; Christoph Dann EMAIL Google Research; Aravindan Raghuveer EMAIL Google DeepMind
Pseudocode | Yes | Algorithm 1: Prefix Query (A_PQ) — tests whether the count of experts expected to execute a trajectory is large enough. Algorithm 2: Data Release (A_DR) — releasing the public dataset after privatisation. Algorithm 3: Selective DP-SGD for Offline RL. Algorithm 4: Expert-Level DP-SGD (outline).
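The prefix-query step described for Algorithm 1 resembles the standard differentially private above-threshold pattern: count the experts consistent with a trajectory prefix, perturb the count with Laplace noise, and compare against a threshold. A minimal pure-Python sketch under that assumption follows; the names `prefix_query` and `agrees_with_prefix` are hypothetical, and this is a generic illustration rather than the paper's exact A_PQ.

```python
import math
import random

def agrees_with_prefix(expert_policy, prefix):
    """Hypothetical check: does this expert's policy pick the same
    action as the trajectory prefix in every visited state?"""
    return all(expert_policy(state) == action for state, action in prefix)

def prefix_query(expert_policies, prefix, threshold, epsilon, rng=None):
    """Noisy-threshold test (sketch): count consistent experts, add
    Laplace(1/epsilon) noise, and compare against the threshold."""
    rng = rng or random.Random(0)
    count = sum(agrees_with_prefix(pi, prefix) for pi in expert_policies)
    # Sample Laplace noise with scale 1/epsilon via the inverse CDF.
    u = rng.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)
    return count + noise >= threshold
```

Here each expert policy is modelled as a callable from state to action; in the paper the consensus check operates over the trained expert policies themselves.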
Open Source Code | No | The paper describes the implementation details of their method and the DP-SGD baseline but does not provide any explicit statement about releasing their source code, nor does it include a link to a code repository. It only mentions using a third-party library: "We use the PLD accountant from the tensorflow_privacy library."
Open Datasets | Yes | Dataset Generation: We train 3000 experts each on the Cartpole, Acrobot, Lunar Lander and HIV Treatment environments. Cartpole, Acrobot and Lunar Lander are from the Gymnasium package (Towers et al., 2023). The HIV Treatment simulator is based on the implementation by Geramifard et al. (2015) of the model described in Ernst et al. (2006).
Dataset Splits | No | The paper describes how the overall offline dataset is generated by sampling 20 trajectories from modified experts in default environments. It also mentions a data filtering stage (Algorithm 2) that creates D^Π_stable and D^Π_unst subsets for training. However, it does not specify conventional train/validation/test splits (e.g., 80/10/10%) for the entire dataset used for evaluation of the final learned policy.
Hardware Specification | No | The paper details the neural network architectures used for Q-value functions and agents (e.g., "neural network with 2 hidden layers of 256 units each"). However, it does not provide any specific information regarding the hardware (e.g., CPU, GPU models, memory) used to run the experiments.
Software Dependencies | Yes | Cartpole, Acrobot and Lunar Lander are from the Gymnasium package (Towers et al., 2023). We use the PLD accountant from the tensorflow_privacy library.
Experiment Setup | Yes | We tune the following hyperparameters for our method: learning rate η, batch size b, probability of sampling from D^Π_unst during training (p), and DP-SGD noise multiplier (s), which is the standard deviation of the Gaussian noise applied to the gradients divided by the clipping threshold. We perform a grid search over all hyperparameters. The search spaces are: η ∈ {0.0001, 0.0005, 0.001, 0.005, 0.01}, b ∈ {64, 128, 256}, p ∈ {0.9, 0.8, 0.5, 0.0}, and s ∈ {10.0, 20.0, 30.0, 40.0, 50.0, 80.0}. For all the environments, we assume that pmin = 0.02. For DP-SGD updates, we clip all gradient norms to 1.0. Table 3: Best hyper-parameter settings across different environments, offline-RL algorithms and ε values.
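The DP-SGD update quoted above (clip each per-example gradient to L2 norm 1.0, then add Gaussian noise whose standard deviation is the noise multiplier s times the clipping threshold) can be sketched in plain Python. The function names and list-based gradients are illustrative only, not the paper's implementation.

```python
import math
import random

def clip_gradient(grad, clip_norm=1.0):
    """Rescale a per-example gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(params, per_example_grads, lr=0.001, clip_norm=1.0,
                noise_multiplier=20.0, rng=None):
    """One DP-SGD update (sketch): clip each gradient, sum, add
    N(0, (s * clip_norm)^2) noise per coordinate, average, and step."""
    rng = rng or random.Random(0)
    b = len(per_example_grads)
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    summed = [sum(gs) for gs in zip(*clipped)]
    noisy_mean = [(g_sum + rng.gauss(0.0, noise_multiplier * clip_norm)) / b
                  for g_sum in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]
```

With the paper's reported grid, e.g. b = 256 and s = 20.0, the noise added to the averaged gradient has per-coordinate standard deviation s · clip_norm / b; the privacy cost of composing such steps would then be tracked by an accountant such as the PLD accountant the authors cite.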