Multi-Agent Off-Policy TDC with Near-Optimal Sample and Communication Complexities
Authors: Ziyi Chen, Yi Zhou, Rong-Rong Chen
TMLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): We simulate a multi-agent MDP with 10 decentralized agents. The shared state space contains 10 states and each agent can take 2 actions. ... We implement and compare two algorithms in these networks: the decentralized TD(0) with batch size N = 1 (used in Sun et al. (2020)) and our decentralized TDC with batch sizes N = 1, 10, 20, 50, 100. ... Figure 1 plots the relative convergence error ∥θ_t − θ*∥/∥θ*∥ vs. the number of samples (tN) and the number of communication rounds (t). For each curve, its upper and lower envelopes denote the 95% and 5% percentiles of the 100 convergence errors, respectively. |
| Researcher Affiliation | Academia | Ziyi Chen (EMAIL), Department of Electrical and Computer Engineering, University of Utah; Yi Zhou (EMAIL), Department of Electrical and Computer Engineering, University of Utah; Rong-Rong Chen (EMAIL), Department of Electrical and Computer Engineering, University of Utah |
| Pseudocode | Yes | Algorithm 1 (Decentralized mini-batch TDC). Input: batch size N, iterations T, T′, learning rates α, β. Initialize θ^(m)_0, w^(m)_0 for all agents m ∈ M. For iteration t = 0, 1, ..., T − 1: each agent collects N Markovian samples and computes their local importance sampling ratios ρ^(m)_i; for communication round ℓ = 0, 1, ..., L − 1, each agent m ∈ M in parallel communicates ρ̃^(m)_{i,ℓ} via eq. (5) for the N samples i = tN, ..., (t + 1)N − 1; then each agent m ∈ M in parallel estimates the global importance sampling ratios ρ̂^(m)_i via eq. (6) for i = tN, ..., (t + 1)N − 1 and performs the updates in eqs. (7) and (8). For iteration t = T, T + 1, ..., T + T′ − 1: each agent m ∈ M in parallel sets θ^(m)_{t+1} = Σ_{m′ ∈ N_m} U_{m,m′} θ^(m′)_t. Output: {θ^(m)_{T+T′}}_{m=1}^M. |
| Open Source Code | No | The paper does not contain an explicit statement or a direct link to the source code for the methodology described. |
| Open Datasets | No | The paper describes simulated environments: a "simulated multi-agent MDP", an adaptation of the "Two-Agent Cliff Navigation problem", and a "multi-agent path finding problem". These are custom simulations or adaptations of existing problems, and the paper does not provide specific access information (links, DOIs, or citations to public repositories) for the datasets or simulation environments used in the experiments. |
| Dataset Splits | No | The paper states, "All algorithms are repeated 100 times using a fixed set of 100 MDP trajectories, each of which has 20k Markovian samples." However, it does not specify how these samples or trajectories are split into training, validation, or test sets for the learning process itself. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, or memory specifications) used to run the experiments. |
| Software Dependencies | No | The paper describes the algorithms and their theoretical analysis but does not specify any software libraries, frameworks, or their version numbers that were used for implementation or experimentation. |
| Experiment Setup | Yes | We set learning rate α = 0.2 for the decentralized TD(0), and α = 0.2N, β = 0.002N for our decentralized TDC with varying batch sizes N = 1, 10, 20, 50, 100. Both algorithms use L = 3 communication rounds for synchronizing ρ̂^(m)_t. All algorithms are repeated 100 times using a fixed set of 100 MDP trajectories, each of which has 20k Markovian samples. ... For the path finding problem: (3) We use α = 0.08N obtained by tuning, while β = 0.002N is the same as aforementioned. |
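The pseudocode and setup quoted above outline three stages: gossip-based estimation of the global importance sampling ratios, a two-timescale mini-batch TDC update of (θ, w), and final consensus rounds on θ. The following is a minimal NumPy sketch of those stages, not the authors' implementation: the function names are hypothetical, the log-ratio gossip is one common realization of the product-of-ratios estimate behind eqs. (5)–(6), and the update rule is the standard linear-feature TDC step that eqs. (7)–(8) build on.

```python
import numpy as np


def gossip_average(values, U, num_rounds):
    """Mix per-agent values through a doubly stochastic matrix U for
    `num_rounds` communication rounds (the inner loop over l in Alg. 1)."""
    v = np.asarray(values, dtype=float)
    for _ in range(num_rounds):
        v = U @ v
    return v


def estimate_global_ratios(local_log_ratios, U, num_rounds):
    """Each agent's estimate of the global importance sampling ratio:
    gossip-average the agents' log-ratios, then rescale by the number of
    agents M, approximating the product of all local ratios (a stand-in
    for eqs. (5)-(6); exact when gossip has fully mixed)."""
    M = local_log_ratios.shape[0]
    avg = gossip_average(local_log_ratios, U, num_rounds)
    return np.exp(M * avg)


def tdc_minibatch_step(theta, w, batch, alpha, beta, gamma):
    """One two-timescale TDC update over a mini-batch of N samples,
    standing in for eqs. (7)-(8). Each sample is a tuple
    (phi_s, phi_s_next, reward, rho) with linear features phi."""
    N = len(batch)
    g_theta = np.zeros_like(theta)
    g_w = np.zeros_like(w)
    for phi, phi_next, r, rho in batch:
        delta = r + gamma * phi_next @ theta - phi @ theta  # TD error
        g_theta += rho * (delta * phi - gamma * (phi @ w) * phi_next)
        g_w += (rho * delta - phi @ w) * phi
    return theta + (alpha / N) * g_theta, w + (beta / N) * g_w


def consensus_rounds(thetas, U, num_rounds):
    """Final averaging stage of Alg. 1: each agent repeatedly sets
    theta^(m) to a U-weighted combination of its neighbors' iterates."""
    return gossip_average(thetas, U, num_rounds)
```

Note the batch-size scaling in the quoted setup (α = 0.2N, β = 0.002N): because the update averages the N per-sample gradients, its variance shrinks with N, which is what permits step sizes that grow with the batch size.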