Global Convergence Guarantees for Federated Policy Gradient Methods with Adversaries

Authors: Swetha Ganesh, Jiayu Chen, Gugan Thoppe, Vaneet Aggarwal

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. From Section 6 (Evaluation): "To show the effectiveness of our algorithm design (i.e., Res-NHARPG), we provide evaluation results on two commonly-used continuous control tasks: CartPole-v1 from OpenAI Gym (Brockman et al., 2016) and InvertedPendulum-v2 from MuJoCo (Todorov et al., 2012). Additional experiments on more demanding MuJoCo tasks, including HalfCheetah, Hopper, InvertedDoublePendulum, and Walker, are provided in Appendix A. For each task on CartPole-v1 and InvertedPendulum-v2, there are ten workers to individually sample trajectories and compute gradients, and three of them are adversaries who would apply attacks to the learning process. Note that we do not know which worker is an adversary, so we cannot simply ignore certain gradient estimates to avoid the attacks. We simulate three types of attacks to the learning process: random noise, random action, and sign flipping. In Figure 1, we present the learning process of the eight algorithms in two environments under three types of attacks. In each subfigure, the x-axis represents the number of sampled trajectories; the y-axis records the acquired trajectory return of the learned policy during evaluation. Each algorithm is repeated five times with different random seeds. The average performance and 95% confidence interval are shown as the solid line and shadow area, respectively. Codes for our experiments have been submitted as supplementary material and will be made public."
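The three attack models named above (random noise, random action, sign flipping) can be sketched as corruptions of a worker's gradient estimate. This is a hypothetical illustration: the noise scale and the random-direction approximation of the "random action" attack are assumptions, not the paper's exact mechanisms.

```python
import numpy as np

def apply_attack(grad, attack, rng):
    """Return a corrupted gradient for an adversarial worker (sketch)."""
    if attack == "random_noise":
        # Replace the estimate with Gaussian noise.
        return rng.normal(size=grad.shape)
    if attack == "random_action":
        # Approximate a gradient computed from random-action trajectories
        # by a random direction rescaled to the honest gradient's norm.
        d = rng.normal(size=grad.shape)
        return d * (np.linalg.norm(grad) / np.linalg.norm(d))
    if attack == "sign_flip":
        # Send the negated gradient to push the update backwards.
        return -grad
    raise ValueError(f"unknown attack: {attack}")

# Ten workers, three of them adversarial, as in the paper's setup.
rng = np.random.default_rng(0)
honest = [np.ones(4) for _ in range(10)]
adversaries = {0, 1, 2}
reported = [apply_attack(g, "sign_flip", rng) if n in adversaries else g
            for n, g in enumerate(honest)]
```

Because the server cannot tell which three of the ten reported gradients are corrupted, a resilient aggregation rule is needed rather than a plain average.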
Researcher Affiliation: Academia.
- Swetha Ganesh: Indian Institute of Science (IISc), Bengaluru 560012, India; Purdue University, West Lafayette, IN 47907, USA
- Jiayu Chen: Carnegie Mellon University (CMU), Pittsburgh, PA, 15289, USA
- Gugan Thoppe: Indian Institute of Science (IISc), Bengaluru 560012, India
- Vaneet Aggarwal: Purdue University, West Lafayette, IN 47907, USA
Pseudocode: Yes.
Algorithm 1 Resilient Normalized Hessian-Aided Recursive Policy Gradient (Res-NHARPG)
 1: Input: θ_0, θ_1, d_0, T, {η_t}_{t≥1}, {γ_t}_{t≥1}
 2: for t = 1, ..., T − 1 do
 3:   Server broadcasts θ_t to all agents
 4:   Agent update:
 5:   for each agent n ∈ [N] do in parallel
 6:     q_t^(n) ∼ U([0, 1])
 7:     θ̂_t^(n) = q_t^(n) θ_t + (1 − q_t^(n)) θ_{t−1}
 8:     τ_t^(n) ∼ p_ρ^H(· | π_{θ_t});  τ̂_t^(n) ∼ p(· | π_{θ̂_t^(n)})
 9:     v_t^(n) = B(τ̂_t^(n), θ̂_t^(n)) (θ_t − θ_{t−1})
10:     d_t^(n) = (1 − η_t)(d_{t−1}^(n) + v_t^(n)) + η_t g(τ_t^(n), θ_t)
11:   end for
12:   Server update:
13:   d_t = F(d_t^(1), ..., d_t^(N))
14:   θ_{t+1} = θ_t + γ_t d_t / ‖d_t‖
15: end for
16: return θ_T
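One round of the algorithm above can be sketched in Python. Here `traj_grad` and `hvp` are caller-supplied stand-ins for the sampled policy gradient g(τ, θ) and the Hessian-vector product B(τ̂, θ̂)(θ_t − θ_{t−1}), and the coordinate-wise median is used as an illustrative resilient aggregator F; the paper's actual estimators and choice of F may differ.

```python
import numpy as np

def res_nharpg_round(theta_t, theta_prev, d_prev, traj_grad, hvp,
                     eta_t, gamma_t, rng):
    """One server round of Algorithm 1 (illustrative sketch)."""
    d_new = []
    for n in range(len(d_prev)):                        # agents, in parallel in the paper
        q = rng.uniform()                               # q_t^(n) ~ U([0, 1])
        theta_hat = q * theta_t + (1 - q) * theta_prev  # interpolated parameter
        v = hvp(theta_hat, theta_t - theta_prev)        # Hessian-aided correction
        d_new.append((1 - eta_t) * (d_prev[n] + v)      # recursive momentum
                     + eta_t * traj_grad(theta_t))
    d_t = np.median(np.stack(d_new), axis=0)            # resilient aggregation F
    theta_next = theta_t + gamma_t * d_t / np.linalg.norm(d_t)  # normalized step
    return theta_next, d_new

# Toy check on J(theta) = -||theta||^2 / 2: gradient -theta, Hessian -I.
rng = np.random.default_rng(1)
theta0, theta1 = np.array([2.0, 2.0]), np.array([1.9, 1.9])
d0 = [-theta1.copy() for _ in range(10)]
theta2, d1 = res_nharpg_round(theta1, theta0, d0,
                              traj_grad=lambda th: -th,
                              hvp=lambda th, delta: -delta,
                              eta_t=1.0, gamma_t=0.1, rng=rng)
```

Note the normalized step on the last line of the update: each round moves the parameters by exactly γ_t in the aggregated direction, which is what makes the method robust to adversaries that inflate gradient magnitudes.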
Open Source Code: Yes. "Codes for our experiments have been submitted as supplementary material and will be made public."
Open Datasets: Yes. "CartPole-v1 from OpenAI Gym (Brockman et al., 2016) and InvertedPendulum-v2 from MuJoCo (Todorov et al., 2012). Additional experiments on more demanding MuJoCo tasks, including HalfCheetah, Hopper, InvertedDoublePendulum, and Walker, are provided in Appendix A."
Dataset Splits: No. The paper describes using environments such as CartPole-v1 and InvertedPendulum-v2 from OpenAI Gym and MuJoCo, where agents sample trajectories, and mentions that each algorithm is repeated five times with different random seeds. However, it does not explicitly provide details about traditional training, validation, or test dataset splits (e.g., specific percentages or sample counts for a static dataset), which is common in reinforcement learning, where interaction with the environment generates data dynamically.
Hardware Specification: Yes. "Experiments were conducted using the Oracle Cloud infrastructure, where each computation instance was equipped with 8 Intel Xeon Platinum CPU cores and 128 GB of memory."
Software Dependencies: No. The paper mentions using OpenAI Gym and MuJoCo environments and references "Tianshou: A highly modularized deep reinforcement learning library" in Appendix A. However, it does not provide specific version numbers for these or for any other libraries or programming languages used, which is required for reproducibility.
Experiment Setup: Yes. "Consider Algorithm 1 with γ_t = 6G_1 / (μ_F (t + 2)), η_t = 1/t, and H = (1 − γ)^{−1} log(T + 1). Let Assumptions 4.1, 4.2, 4.3, and 4.4 hold. Then for every T ≥ 1 the output θ_T satisfies J* − J(θ_T) = O(√ε_bias) + Õ(1/√T), where Õ(·) hides factors of log N and log T."
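The schedule from the theorem can be computed directly. G_1 and μ_F are problem-dependent constants and γ is the discount factor; the numeric values in the example below are placeholders for illustration, not the paper's.

```python
import math

def schedules(t, T, G1, mu_F, discount):
    """Step-size and horizon schedule from the convergence theorem.

    G1 and mu_F are problem-dependent constants (placeholder values
    are used in the example call below).
    """
    gamma_t = 6 * G1 / (mu_F * (t + 2))   # normalized step size, O(1/t)
    eta_t = 1.0 / t                       # momentum weight
    H = math.log(T + 1) / (1 - discount)  # truncated rollout horizon
    return gamma_t, eta_t, H

gamma_1, eta_1, H = schedules(t=1, T=1000, G1=1.0, mu_F=2.0, discount=0.99)
# gamma_1 = 6 / (2 * 3) = 1.0; eta_1 = 1.0
```

The O(1/t) step-size decay combined with the normalized update is what drives the last-iterate convergence rate in the theorem.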