Multi-Agent Off-Policy TDC with Near-Optimal Sample and Communication Complexities
Authors: Ziyi Chen, Yi Zhou, Rong-Rong Chen
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): We simulate a multi-agent MDP with 10 decentralized agents. The shared state space contains 10 states and each agent can take 2 actions. ... We implement and compare two algorithms in these networks: the decentralized TD(0) with batch size N = 1 (used in Sun et al. (2020)) and our decentralized TDC with batch sizes N = 1, 10, 20, 50, 100. ... Figure 1 plots the relative convergence error ∥θ_t − θ*∥/∥θ*∥ vs. the number of samples (tN) and the number of communication rounds (t). For each curve, its upper and lower envelopes denote the 95% and 5% percentiles of the 100 convergence errors, respectively. |
| Researcher Affiliation | Academia | Ziyi Chen (EMAIL), Department of Electrical and Computer Engineering, University of Utah; Yi Zhou (EMAIL), Department of Electrical and Computer Engineering, University of Utah; Rong-Rong Chen (EMAIL), Department of Electrical and Computer Engineering, University of Utah |
| Pseudocode | Yes | Algorithm 1 (Decentralized mini-batch TDC). Input: batch size N, iterations T, T′, learning rates α, β. Initialize θ^(m)_0, w^(m)_0 for all agents m ∈ M. For iteration t = 0, 1, ..., T − 1: each agent collects N Markovian samples and computes their local importance sampling ratios ρ^(m)_i; for communication round ℓ = 0, 1, ..., L − 1, each agent m ∈ M in parallel communicates ρ̃^(m)_{i,ℓ} via eq. (5) for the N samples i = tN, ..., (t + 1)N − 1; then each agent m ∈ M in parallel estimates the global importance sampling ratios ρ̂^(m)_i via eq. (6) for i = tN, ..., (t + 1)N − 1 and performs the updates in eqs. (7) and (8). For iteration t = T, T + 1, ..., T + T′ − 1: each agent m ∈ M in parallel sets θ^(m)_{t+1} = Σ_{m′ ∈ N_m} U_{m,m′} θ^(m′)_t. Output: {θ^(m)_{T+T′}}_{m=1}^M. |
| Open Source Code | No | The paper does not contain an explicit statement or a direct link to the source code for the methodology described. |
| Open Datasets | No | The paper describes simulated environments: a "simulated multi-agent MDP", an adaptation of the "Two-Agent Cliff Navigation problem", and a "multi-agent path finding problem". These are custom simulations or adaptations of existing problems, and the paper does not provide specific access information (links, DOIs, or citations to public repositories) for the datasets or simulation environments used in the experiments. |
| Dataset Splits | No | The paper states, "All algorithms are repeated 100 times using a fixed set of 100 MDP trajectories, each of which has 20k Markovian samples." However, it does not specify how these samples or trajectories are split into training, validation, or test sets for the learning process itself. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, or memory specifications) used to run the experiments. |
| Software Dependencies | No | The paper describes the algorithms and their theoretical analysis but does not specify any software libraries, frameworks, or their version numbers that were used for implementation or experimentation. |
| Experiment Setup | Yes | We set learning rate α = 0.2 for the decentralized TD(0), and α = 0.2N, β = 0.002N for our decentralized TDC with varying batch sizes N = 1, 10, 20, 50, 100. Both algorithms use L = 3 communication rounds for synchronizing ρ̂^(m)_t. All algorithms are repeated 100 times using a fixed set of 100 MDP trajectories, each of which has 20k Markovian samples. ... For the path finding problem: (3) We use α = 0.08N obtained by tuning, while β = 0.002N is the same as aforementioned. |
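The pseudocode and setup quoted above outline three stages: gossip-based estimation of the global importance sampling ratios, a two-timescale mini-batch TDC update of (θ, w), and final consensus rounds on θ. The following is a minimal NumPy sketch of those stages, not the authors' implementation: the function names are hypothetical, the log-ratio gossip is one common realization of the product-of-ratios estimate behind eqs. (5)–(6), and the update rule is the standard linear-feature TDC step that eqs. (7)–(8) build on.

```python
import numpy as np


def gossip_average(values, U, num_rounds):
    """Mix per-agent values through a doubly stochastic matrix U for
    `num_rounds` communication rounds (the inner loop over l in Alg. 1)."""
    v = np.asarray(values, dtype=float)
    for _ in range(num_rounds):
        v = U @ v
    return v


def estimate_global_ratios(local_log_ratios, U, num_rounds):
    """Each agent's estimate of the global importance sampling ratio:
    gossip-average the agents' log-ratios, then rescale by the number of
    agents M, approximating the product of all local ratios (a stand-in
    for eqs. (5)-(6); exact when gossip has fully mixed)."""
    M = local_log_ratios.shape[0]
    avg = gossip_average(local_log_ratios, U, num_rounds)
    return np.exp(M * avg)


def tdc_minibatch_step(theta, w, batch, alpha, beta, gamma):
    """One two-timescale TDC update over a mini-batch of N samples,
    standing in for eqs. (7)-(8). Each sample is a tuple
    (phi_s, phi_s_next, reward, rho) with linear features phi."""
    N = len(batch)
    g_theta = np.zeros_like(theta)
    g_w = np.zeros_like(w)
    for phi, phi_next, r, rho in batch:
        delta = r + gamma * phi_next @ theta - phi @ theta  # TD error
        g_theta += rho * (delta * phi - gamma * (phi @ w) * phi_next)
        g_w += (rho * delta - phi @ w) * phi
    return theta + (alpha / N) * g_theta, w + (beta / N) * g_w


def consensus_rounds(thetas, U, num_rounds):
    """Final averaging stage of Alg. 1: each agent repeatedly sets
    theta^(m) to a U-weighted combination of its neighbors' iterates."""
    return gossip_average(thetas, U, num_rounds)
```

Note the batch-size scaling in the quoted setup (α = 0.2N, β = 0.002N): because the update averages the N per-sample gradients, its variance shrinks with N, which is what permits step sizes that grow with the batch size.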