A Primal-dual Perspective for Distributed TD-learning
Authors: Han-Dong Lim, Donghwan Lee
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): This section provides the experimental results of Algorithm 1. First, we give an explanation of the MAMDP setup, where the number of states is three and the dimension of the feature is two. An agent can transit to every state with uniform probability. The feature matrix is set as Φ = [1 0; 1 0; 1 0], and rewards are generated uniformly at random in the interval (0, 10). The discount factor is set as 0.8. For each experiment with N ∈ {2^3, 2^5} agents, we construct a cycle, a graph G consisting of V := {1, 2, . . . , N} and E := {(i, i+1)}_{i=1}^{N−1} ∪ {(N, 1)}. The smallest non-zero eigenvalue of the graph Laplacian of a cycle with an even number of vertices decreases as the number of vertices increases, while the maximum eigenvalue remains the same: the smallest non-zero eigenvalue is 2 − 2cos(2π/N), and the largest eigenvalue is four [Mohar, 1997]. As N gets larger, the smallest non-zero eigenvalue gets smaller, becoming 0.59 and 0.04 for N = 2^3 and N = 2^5, respectively. Therefore, as the number of agents increases, the convergence rate becomes slower, as expected from Theorem 5; this can be verified in Figure (1a) and Figure (3) in the Appendix. The plots show results for constant step-sizes α_0 ∈ {2^−3, 2^−4, 2^−5, 2^−6}. Moreover, convergence under a diminishing step-size can be seen in Figure (1b). To investigate the effect of λ_max(L), we construct a star graph, where one vertex has degree N − 1 and the others have degree one. The maximum eigenvalue of a star graph is N and the smallest non-zero eigenvalue is one [Nica, 2016]. Even as N gets larger, Figures (1e) and (1f) show that neither the convergence rate nor the bias term varies. Therefore, we can expect that a tighter bound without λ_max(L) may hold in particular cases. To verify the effect of η, we use a random graph model [Erdős et al., 1960], where, among the N(N−1)/2 possible edges, (N−3)(N−4)/2 edges are randomly selected. Figure (1c) shows the evolution of the mean squared error for N = 32 and step-size 0.1 with different η values. When η = 0.5 or η = 1, the algorithm diverges. Moreover, the bias gets smaller around 2/λ_max(L) ≈ 0.046. This implies that an appropriate choice of η can control the variance when the number of neighbors is large, but if η is too small or too large, Algorithm 1 may diverge or incur a large bias. This matches the bound in Theorem 5. Similar arguments hold for N = 8, with the result given in Figure (1d). Lastly, a comparison with other algorithms is given in Appendix G. In summary, while no single algorithm consistently outperforms the others, the performance of methods that rely on a doubly stochastic matrix is highly sensitive to the choice of that matrix. |
| Researcher Affiliation | Academia | Han-Dong Lim, Donghwan Lee (Department of Electrical Engineering, KAIST) |
| Pseudocode | Yes | Algorithm 1 (Distributed TD-learning): Initialize α_0 ∈ (0, 1), {θ^i_0, w^i_0 ∈ R^q}_{i=1}^N, η ∈ (0, ∞). For k = 1, 2, . . . , T, each agent i = 1, 2, . . . , N observes o^i_k := (s_k, s'_k, r^i_k) and updates: δ(o^i_k; θ^i_k) = r^i_k + γ φ(s'_k)^⊤ θ^i_k − φ(s_k)^⊤ θ^i_k (6); θ^i_{k+1} = θ^i_k + α_k (δ(o^i_k; θ^i_k) φ(s_k) − η(&#124;N_i&#124; θ^i_k − Σ_{j∈N_i} θ^j_k) − η(&#124;N_i&#124; w^i_k − Σ_{j∈N_i} w^j_k)) (7); w^i_{k+1} = w^i_k + α_k η (&#124;N_i&#124; θ^i_k − Σ_{j∈N_i} θ^j_k) (8). |
| Open Source Code | Yes | The code is provided in this link. |
| Open Datasets | No | The paper describes its own MAMDP setup and generates rewards uniformly at random. It does not refer to any publicly available datasets or provide access information for any data used in the experiments. Text: "First, we give an explanation of the MAMDP setup, where the number of states is three and the dimension of the feature is two. An agent can transit to every state with uniform probability. The feature matrix is set as Φ = [1 0; 1 0; 1 0], and rewards are generated uniformly at random in the interval (0, 10). The discount factor is set as 0.8." |
| Dataset Splits | No | The paper uses a simulated multi-agent Markov decision process (MAMDP) setup rather than a traditional dataset with predefined splits. There is no mention of training, validation, or test splits. Text: "First, we give an explanation of the MAMDP setup, where the number of states is three and the dimension of the feature is two. An agent can transit to every state with uniform probability. The feature matrix is set as Φ = [1 0; 1 0; 1 0], and rewards are generated uniformly at random in the interval (0, 10). The discount factor is set as 0.8." |
| Hardware Specification | No | The paper describes the experimental results and setup parameters, but does not provide any specific details about the hardware (e.g., CPU, GPU, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions that code is provided in a link, but does not list any specific software dependencies or their version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Section 5 (Experiments): This section provides the experimental results of Algorithm 1. First, we give an explanation of the MAMDP setup, where the number of states is three and the dimension of the feature is two. An agent can transit to every state with uniform probability. The feature matrix is set as Φ = [1 0; 1 0; 1 0], and rewards are generated uniformly at random in the interval (0, 10). The discount factor is set as 0.8. For each experiment with N ∈ {2^3, 2^5} agents, we construct a cycle, a graph G consisting of V := {1, 2, . . . , N} and E := {(i, i+1)}_{i=1}^{N−1} ∪ {(N, 1)}. The smallest non-zero eigenvalue of the graph Laplacian of a cycle with an even number of vertices decreases as the number of vertices increases, while the maximum eigenvalue remains the same: the smallest non-zero eigenvalue is 2 − 2cos(2π/N), and the largest eigenvalue is four [Mohar, 1997]. As N gets larger, the smallest non-zero eigenvalue gets smaller, becoming 0.59 and 0.04 for N = 2^3 and N = 2^5, respectively. Therefore, as the number of agents increases, the convergence rate becomes slower, as expected from Theorem 5; this can be verified in Figure (1a) and Figure (3) in the Appendix. The plots show results for constant step-sizes α_0 ∈ {2^−3, 2^−4, 2^−5, 2^−6}. Moreover, convergence under a diminishing step-size can be seen in Figure (1b). To investigate the effect of λ_max(L), we construct a star graph, where one vertex has degree N − 1 and the others have degree one. The maximum eigenvalue of a star graph is N and the smallest non-zero eigenvalue is one [Nica, 2016]. Even as N gets larger, Figures (1e) and (1f) show that neither the convergence rate nor the bias term varies. Therefore, we can expect that a tighter bound without λ_max(L) may hold in particular cases. To verify the effect of η, we use a random graph model [Erdős et al., 1960], where, among the N(N−1)/2 possible edges, (N−3)(N−4)/2 edges are randomly selected. Figure (1c) shows the evolution of the mean squared error for N = 32 and step-size 0.1 with different η values. When η = 0.5 or η = 1, the algorithm diverges. Moreover, the bias gets smaller around 2/λ_max(L) ≈ 0.046. This implies that an appropriate choice of η can control the variance when the number of neighbors is large, but if η is too small or too large, Algorithm 1 may diverge or incur a large bias. This matches the bound in Theorem 5. Similar arguments hold for N = 8, with the result given in Figure (1d). Lastly, a comparison with other algorithms is given in Appendix G. In summary, while no single algorithm consistently outperforms the others, the performance of methods that rely on a doubly stochastic matrix is highly sensitive to the choice of that matrix. |
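The quoted experiments lean on two spectral facts: a cycle's Laplacian has smallest non-zero eigenvalue 2 − 2cos(2π/N) and largest eigenvalue 4 (for even N), while a star graph has smallest non-zero eigenvalue 1 and largest eigenvalue N. A minimal NumPy sketch can sanity-check both claims; the helper names (`laplacian`, `cycle_edges`, `star_edges`) are illustrative, not from the paper.

```python
import numpy as np

def laplacian(edges, n):
    """Graph Laplacian L = D - A for an undirected graph on n vertices."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

def cycle_edges(n):
    # Edges {(i, i+1)} for i = 1..N-1 plus the closing edge (N, 1).
    return [(i, (i + 1) % n) for i in range(n)]

def star_edges(n):
    # One hub (vertex 0) of degree n-1; all other vertices have degree one.
    return [(0, i) for i in range(1, n)]

for n in (8, 32):  # N = 2^3 and N = 2^5, as in the experiments
    ev = np.linalg.eigvalsh(laplacian(cycle_edges(n), n))  # ascending order
    # Smallest non-zero eigenvalue should match 2 - 2*cos(2*pi/n);
    # for even n the largest eigenvalue is exactly 4.
    print(f"cycle  N={n}: lambda_2={ev[1]:.3f}, lambda_max={ev[-1]:.3f}")

ev_star = np.linalg.eigvalsh(laplacian(star_edges(32), 32))
print(f"star   N=32: lambda_2={ev_star[1]:.3f}, lambda_max={ev_star[-1]:.3f}")
```

Running this reproduces the quoted values 0.59 and 0.04 for the cycle at N = 8 and N = 32, with the maximum eigenvalue pinned at 4, and confirms the star graph's spectrum grows with N only at the top.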
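The pseudocode row above flattens Algorithm 1's updates (6)-(8). A minimal Python sketch of one synchronous pass is below, assuming every agent sees the same shared state transition (s_k, s'_k) with its own reward, per the MAMDP setup; the constants, cycle topology, and one-hot-style features are illustrative placeholders, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_agents, q = 4, 2          # agents on a cycle; feature dimension q
n_states = 3                # three states, as in the quoted setup
gamma, eta, alpha = 0.8, 0.1, 0.05

# Cycle-graph neighbor lists: agent i talks to i-1 and i+1 (mod N).
neighbors = {i: [(i - 1) % n_agents, (i + 1) % n_agents] for i in range(n_agents)}

theta = [np.zeros(q) for _ in range(n_agents)]  # primal iterates theta^i_k
w = [np.zeros(q) for _ in range(n_agents)]      # dual iterates w^i_k

def step(phi_s, phi_s_next, rewards):
    """One synchronous pass of updates (6)-(8) for all agents."""
    new_theta, new_w = [], []
    for i in range(n_agents):
        # (6) TD error: r^i_k + gamma*phi(s')^T theta^i_k - phi(s)^T theta^i_k
        delta = rewards[i] + gamma * phi_s_next @ theta[i] - phi_s @ theta[i]
        # Local Laplacian terms |N_i| x^i_k - sum_{j in N_i} x^j_k
        lap_theta = len(neighbors[i]) * theta[i] - sum(theta[j] for j in neighbors[i])
        lap_w = len(neighbors[i]) * w[i] - sum(w[j] for j in neighbors[i])
        # (7) primal update and (8) dual update
        new_theta.append(theta[i] + alpha * (delta * phi_s - eta * lap_theta - eta * lap_w))
        new_w.append(w[i] + alpha * eta * lap_theta)
    theta[:], w[:] = new_theta, new_w

# Illustrative driver: uniform transitions, rewards Uniform(0, 10).
phi = np.eye(n_states, q)  # hypothetical 3x2 feature matrix
for _ in range(200):
    s, s_next = rng.integers(n_states), rng.integers(n_states)
    step(phi[s], phi[s_next], rng.uniform(0, 10, size=n_agents))
```

The sketch mirrors the algorithm's structure: the dual variable w^i accumulates the local disagreement η(|N_i|θ^i − Σ_j θ^j), which is what gives the method its primal-dual character.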