Reward Poisoning on Federated Reinforcement Learning

Authors: Evelyn Ma, S. Rasoul Etesami, Praneet Rathi

TMLR 2024

Reproducibility assessment (Variable: Result — LLM response):
Research Type: Experimental — "We verify the effectiveness of our poisoning approach through comprehensive experiments, supported by mainstream RL algorithms, across various OpenAI Gym environments covering a wide range of difficulty levels. Within these experiments, we assess our proposed attack by comparing it to various baselines, including standard, poisoned, and robust FRL methods."
Researcher Affiliation: Academia — Evelyn Ma (EMAIL), Department of Industrial and Systems Engineering, University of Illinois Urbana-Champaign; S. Rasoul Etesami (EMAIL), Department of Industrial and Systems Engineering and Coordinated Science Lab, University of Illinois Urbana-Champaign; Praneet Rathi (EMAIL), Department of Computer Science, University of Illinois Urbana-Champaign.
Pseudocode: Yes — Algorithm 1: Poisoned Local Train for Actor-Critic-based FRL; Algorithm 2: Reward Poisoning for Actor-Critic-based FRL; Algorithm 3: Reward Poisoning for Policy Gradient-based FRL; Algorithm 4: Standard Actor-Critic-based FRL; Algorithm 5: Standard Policy-Gradient-based FRL; Algorithm 6: FRL Defense Aggregation.
Open Source Code: No — The paper does not provide concrete access to source code for its methodology (e.g., a repository link, an explicit code-release statement, or code in the supplementary materials).
Open Datasets: Yes — "Our method is evaluated through extensive experiments on OpenAI Gym environments (Brockman et al., 2016), which represent standard RL tasks across various difficulty levels such as Cart Pole, Inverted Pendulum, Lunar Lander, Hopper, Walker2d, and Half Cheetah."
Dataset Splits: No — "For untargeted poisoning, we evaluate the performance of these methods by measuring the mean-episode reward of the central model, which is calculated based on 100 test episodes at the end of each federated round. For targeted poisoning, we measure the similarity between the learned policy and the targeted policy." The paper mentions evaluating on 100 test episodes but does not provide conventional training/validation/test splits of the underlying environments, nor does it specify how episodes are partitioned between training and evaluation.
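The 100-test-episode evaluation protocol quoted above can be sketched as follows. This is a minimal stand-in, not the paper's code: the `policy`, `env_step`, and `env_reset` interfaces are hypothetical assumptions used only to illustrate how a mean-episode reward is computed at the end of a federated round.

```python
def run_episode(policy, env_step, env_reset, max_steps=500):
    """Roll out one episode and return its total (undiscounted) reward.

    Interfaces are hypothetical: env_reset() -> state,
    env_step(state, action) -> (next_state, reward, done).
    """
    state = env_reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env_step(state, action)
        total += reward
        if done:
            break
    return total


def mean_episode_reward(policy, env_step, env_reset, n_episodes=100):
    """Mean reward over n_episodes test episodes (100 in the paper's protocol)."""
    return sum(run_episode(policy, env_step, env_reset)
               for _ in range(n_episodes)) / n_episodes
```

In an actual Gym setting, `env_reset`/`env_step` would wrap the environment's `reset()` and `step()` calls, and the policy would be the central model broadcast at the end of the communication round.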
Hardware Specification: No — The paper does not report the hardware (GPU/CPU models, memory, or other machine specifications) used to run its experiments.
Software Dependencies: No — The paper does not list the software dependencies (library or solver names with version numbers) needed to replicate the experiments. It mentions algorithms such as VPG and PPO but not their implementations or versions.
Experiment Setup: Yes — "The learning rate is set to 0.001, and the discount parameter is set to γ = 0.99. There are 200 total communication rounds, and all agents run 5 local steps in each communication round."
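The reported setup (200 communication rounds, 5 local steps per round, learning rate 0.001) corresponds to a FedAvg-style training loop. Below is a minimal sketch of that loop under stated assumptions: the toy per-agent quadratic objective stands in for local RL training and is not the paper's algorithm, and γ = 0.99 appears only as a constant because the toy objective has no discounting.

```python
LR = 0.001            # learning rate (as reported)
GAMMA = 0.99          # discount factor (as reported; unused by the toy objective)
ROUNDS = 200          # total communication rounds (as reported)
LOCAL_STEPS = 5       # local steps per communication round (as reported)


def local_train(theta, target):
    """Run LOCAL_STEPS gradient steps on a toy objective (theta - target)^2.

    In the paper this would be the agent's local actor-critic or
    policy-gradient update; the quadratic loss is an assumption.
    """
    for _ in range(LOCAL_STEPS):
        grad = 2.0 * (theta - target)   # d/dtheta of (theta - target)^2
        theta -= LR * grad
    return theta


def federated_train(targets):
    """FedAvg-style loop: broadcast the central model, train locally, average."""
    theta = 0.0                          # central model parameter
    for _ in range(ROUNDS):
        local_models = [local_train(theta, t) for t in targets]
        theta = sum(local_models) / len(local_models)   # simple averaging
    return theta
```

With heterogeneous agent targets (e.g., `[1.0, 2.0, 3.0]`), the averaged central parameter drifts toward the mean of the targets over the 200 rounds, which mirrors the role of the aggregation step in standard FRL.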