Multi-Agent Off-Policy TDC with Near-Optimal Sample and Communication Complexities

Authors: Ziyi Chen, Yi Zhou, Rong-Rong Chen

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 Experiments: We simulate a multi-agent MDP with 10 decentralized agents. The shared state space contains 10 states and each agent can take 2 actions. ... We implement and compare two algorithms in these networks: the decentralized TD(0) with batch size N = 1 (used in Sun et al. (2020)) and our decentralized TDC with batch sizes N = 1, 10, 20, 50, 100. ... Figure 1 plots the relative convergence error ∥θ_t − θ∗∥/∥θ∗∥ versus the number of samples (tN) and the number of communication rounds (t). For each curve, its upper and lower envelopes denote the 95% and 5% percentiles of the 100 convergence errors, respectively.
Researcher Affiliation | Academia | Ziyi Chen (EMAIL), Department of Electrical and Computer Engineering, University of Utah; Yi Zhou (EMAIL), Department of Electrical and Computer Engineering, University of Utah; Rong-Rong Chen (EMAIL), Department of Electrical and Computer Engineering, University of Utah
Pseudocode | Yes | Algorithm 1: Decentralized mini-batch TDC.
Input: batch size N, iterations T, T′, learning rates α, β.
Initialize: θ(m)_0, w(m)_0 for all agents m ∈ M.
for iteration t = 0, 1, ..., T − 1 do
    Each agent collects N Markovian samples and computes their local importance sampling ratios ρ(m)_i.
    for communication round ℓ = 0, 1, ..., L − 1 do
        for agent m ∈ M in parallel do
            Communicate ρ̃(m)_{i,ℓ} via eq. (5) for the N samples i = tN, ..., (t + 1)N − 1.
        end
    end
    for agent m ∈ M in parallel do
        Agent m estimates global importance sampling ratios ρ̂(m)_i via eq. (6) for i = tN, ..., (t + 1)N − 1, and then performs the updates in eqs. (7) and (8).
    end
end
for iteration t = T, T + 1, ..., T + T′ − 1 do
    for agent m ∈ M in parallel do
        θ(m)_{t+1} = Σ_{m′ ∈ N_m} U_{m,m′} θ(m′)_t
    end
end
Output: {θ(m)_{T+T′}} for all agents m = 1, ..., M.
Open Source Code | No | The paper does not contain an explicit statement or a direct link to the source code for the methodology described.
Open Datasets | No | The paper describes simulated environments: a "simulated multi-agent MDP", an adaptation of the "Two-Agent Cliff Navigation problem", and a "multi-agent path finding problem". These are either custom simulations or adaptations of existing problems, and the paper provides no specific access information (links, DOIs, or citations to public repositories) for the datasets or simulation environments used in the experiments.
Dataset Splits | No | The paper states, "All algorithms are repeated 100 times using a fixed set of 100 MDP trajectories, each of which has 20k Markovian samples." However, it does not specify how these samples or trajectories are split into training, validation, or test sets for the learning process itself.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, or memory specifications) used to run the experiments.
Software Dependencies | No | The paper describes the algorithms and their theoretical analysis but does not specify any software libraries, frameworks, or their version numbers that were used for implementation or experimentation.
Experiment Setup | Yes | We set learning rate α = 0.2 for the decentralized TD(0), and α = 0.2N, β = 0.002N for our decentralized TDC with varying batch sizes N = 1, 10, 20, 50, 100. Both algorithms use L = 3 communication rounds for synchronizing ρ̂(m)_t. All algorithms are repeated 100 times using a fixed set of 100 MDP trajectories, each of which has 20k Markovian samples. ... For the path finding problem: (3) We use α = 0.08N obtained by tuning, while β = 0.002N is the same as aforementioned.
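The per-agent updates referenced in the pseudocode above (eqs. (5)–(8)) are not reproduced in this summary. The following is a minimal sketch of the corresponding steps, assuming a standard doubly stochastic gossip-averaging consensus round and the usual two-timescale TDC correction with linear function approximation; the function names, the mixing matrix `U`, and the batch layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gossip_round(values, U):
    """One consensus round: each agent mixes neighbours' values via a
    doubly stochastic matrix U (assumed; stands in for eq. (5))."""
    return U @ values

def tdc_update(theta, w, batch, rho, alpha, beta, gamma=0.95):
    """Mini-batch two-timescale TDC update for a single agent (stands in
    for eqs. (7)-(8)). batch: list of (phi, reward, phi_next) feature
    tuples; rho: per-sample global importance-ratio estimates (the
    rho-hat of eq. (6))."""
    d_theta = np.zeros_like(theta)
    d_w = np.zeros_like(w)
    for (phi, r, phi_next), rho_i in zip(batch, rho):
        delta = r + gamma * theta @ phi_next - theta @ phi  # TD error
        # Main iterate: corrected TD direction.
        d_theta += rho_i * (delta * phi - gamma * (w @ phi) * phi_next)
        # Auxiliary iterate: tracks the projected TD error.
        d_w += rho_i * (delta - w @ phi) * phi
    n = len(batch)
    return theta + alpha * d_theta / n, w + beta * d_w / n
```

The final loop of Algorithm 1 (averaging θ(m) over neighbours for T′ extra iterations) is exactly repeated calls to `gossip_round` on the stacked agent parameters.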
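The batch-size-dependent step sizes quoted in the Experiment Setup row can be tabulated as below. This is a sketch assuming the linear-in-N reading of the quoted expressions (α = 0.2·N, β = 0.002·N); the helper name is hypothetical.

```python
def tdc_step_sizes(N, alpha0=0.2, beta0=0.002):
    """Step sizes for decentralized TDC at batch size N, assuming the
    linear-in-N scaling quoted in the experiment setup (an assumption)."""
    return alpha0 * N, beta0 * N

# Batch sizes compared in the experiments.
for N in (1, 10, 20, 50, 100):
    alpha, beta = tdc_step_sizes(N)
```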
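The Figure 1 evaluation described in the Research Type row (relative convergence error with 5%/95% percentile envelopes over 100 repetitions) can be sketched as follows; the array shapes and function names are assumptions for illustration.

```python
import numpy as np

def relative_error(theta_t, theta_star):
    """Relative convergence error ||theta_t - theta*|| / ||theta*||."""
    return np.linalg.norm(theta_t - theta_star) / np.linalg.norm(theta_star)

def percentile_envelopes(errors, low=5, high=95):
    """Given per-run error trajectories of shape (runs, steps), return
    the lower/upper percentile envelopes taken across runs, as plotted
    around each curve in Figure 1."""
    lower = np.percentile(errors, low, axis=0)
    upper = np.percentile(errors, high, axis=0)
    return lower, upper
```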