On the Convergence Rates of Federated Q-Learning across Heterogeneous Environments

Authors: Muxing Wang, Pengkun Yang, Lili Su

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our numerical results illustrate that when the environments are heterogeneous and E > 1, there exists a sharp phase transition of the error convergence for stepsizes that are not too small: the error decays rapidly in the beginning yet later bounces up and stabilizes. For example, in our lower bound analysis, we picked a threshold λ0 and showed that whether λ is greater or smaller than λ0, the convergence rate is greater than Θ(E/((1 − γ)T)), so we can claim we have covered all the cases. However, for time-varying stepsizes, the number of stepsizes is T, and it is not easy to generalize a similar result by considering just a few cases because each stepsize adds a dimension. Even if we know the sequence is decaying, without specifying a particular family of stepsizes it is not possible to divide the analysis into several cases as we did for the time-invariant stepsize. We conjecture that both approaches lead to comparable residual error levels over extended training. For example, the stepsizes used in Figure 4a and Figure 4b are 1/T and 1/(t + 1), respectively. While the time-decaying stepsize appears to have faster initial convergence due to its larger values, we observe that as t increases, the convergence rates of the two strategies seem to align, suggesting similar asymptotic behavior. 6 Experiments. Description of the setup. In our experiments, we consider K = 20 agents (Jin et al., 2022), each interacting with an independently and randomly generated 5×5 maze environment (S, A, Pk, R, γ) for k ∈ {1, 2, …, 20}. The state set S contains 25 cells, indicating which cell the agent is currently in. The action set contains 4 actions, A = {left, up, right, down}. Thus, |S| × |A| = 100. We choose γ = 0.99. For ease of verifying our theory, each entry of the reward R ∈ R^100 is sampled from Bern(p = 0.05), which slightly departs from a typical maze environment wherein only two state-action pairs have nonzero rewards.
We choose this reward so that ∥Δ0∥∞ ≤ 100 = 1/(1 − γ), which is the coarse upper bound of ∥Δt∥∞ for all t. For each agent k, its state transition probability vectors Pk are constructed on top of the standard state transition probability vectors of maze environments, incorporating a drifting probability of 0.1 for each non-intentional action as in Windy Cliff (Jin et al., 2022; Paul et al., 2019). In this way, the environment heterogeneity lies not only in the differences of the non-zero probability values (Jin et al., 2022; Paul et al., 2019) but also in the probability supports (i.e., the locations of non-zero entries). Our construction is more challenging: the environment heterogeneity κ as per (2) of our construction was calculated to be 1.2, yet the largest environment heterogeneity of the Windy Cliff construction in Jin et al. (2022) is about 0.31. We choose Q0 = 0 ∈ R^100. All numerical results are based on 5 independent runs to capture the variability. The dark lines represent the mean of the runs, while the shaded areas around each line illustrate the range obtained by adding and subtracting one standard deviation from the mean. The maximum time duration is T = 20,000 in our experiment since it is sufficient to capture the characteristics of the training process. Convergence behavior and two-phase phenomenon. We demonstrate through numerical simulations that our analysis aligns with the observed behaviors. For algorithms with a time-invariant stepsize, convergence requires sufficiently small stepsizes and a sufficiently large number of iterations T.
Researcher Affiliation Academia Leo (Muxing) Wang EMAIL Northeastern University Pengkun Yang EMAIL Tsinghua University Lili Su EMAIL Northeastern University
Pseudocode Yes Algorithm 1 Synchronous Federated Q-Learning
Inputs: discount factor γ, synchronization period E, total iteration number T, stepsize λ, initial estimate Q0
1: for k ∈ [K] do
2:     Q^k_0 = Q0
3: end for
4: for t = 0 to T − 1 do
5:     for k ∈ [K] and (s, a) ∈ S × A do
6:         Q^k_{t+1/2}(s, a) = (1 − λ) Q^k_t(s, a) + λ [R(s, a) + γ max_{a′∈A} Q^k_t(s^k_t(s, a), a′)]
7:         if (t + 1) mod E = 0 then
8:             Q^k_{t+1} = (1/K) Σ^K_{k=1} Q^k_{t+1/2}
9:         else
10:            Q^k_{t+1} = Q^k_{t+1/2}
11:        end if
12:    end for
13: end for
14: return Q̄_T = (1/K) Σ^K_{k=1} Q^k_T
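Since no code is released (see the Open Source Code row below), the following is a purely illustrative NumPy sketch of Algorithm 1 under a generative-model assumption (each round, every agent samples one next state per (s, a) pair); the function name, array layout, and sampling details are our own choices, not the authors' implementation:

```python
import numpy as np

def federated_q_learning(P, R, gamma, E, T, lam, seed=0):
    """Sketch of Algorithm 1 (synchronous federated Q-learning).

    Assumed conventions: P has shape (K, S, A, S) holding per-agent
    transition kernels, R has shape (S, A); the K local Q-tables are
    averaged every E rounds (lines 7-8 of the pseudocode).
    """
    rng = np.random.default_rng(seed)
    K, S, A, _ = P.shape
    Q = np.zeros((K, S, A))
    for t in range(T):
        for k in range(K):
            Q_prev = Q[k].copy()  # line 6 bootstraps from Q_t, not the half-updated table
            for s in range(S):
                for a in range(A):
                    s_next = rng.choice(S, p=P[k, s, a])
                    target = R[s, a] + gamma * Q_prev[s_next].max()
                    Q[k, s, a] = (1 - lam) * Q_prev[s, a] + lam * target
        if (t + 1) % E == 0:
            Q[:] = Q.mean(axis=0)  # periodic averaging across agents
    return Q.mean(axis=0)  # line 14: averaged final estimate
```

With rewards in [0, 1] and Q0 = 0, the iterates stay within [0, 1/(1 − γ)], matching the coarse error bound quoted in the setup description.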
Open Source Code No The paper does not contain any explicit statement about releasing code or a link to a code repository.
Open Datasets No In our experiments, we consider K = 20 agents (Jin et al., 2022), each interacting with an independently and randomly generated 5 5 maze environment S, A, Pk, R, γ for k {1, 2, , 20}.
Dataset Splits No The paper describes using a randomly generated maze environment for experiments but does not specify any formal training, validation, or test splits for a static dataset.
Hardware Specification No The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies No The paper does not provide specific software dependencies with version numbers (e.g., library or framework versions).
Experiment Setup Yes Description of the setup. In our experiments, we consider K = 20 agents (Jin et al., 2022), each interacting with an independently and randomly generated 5×5 maze environment (S, A, Pk, R, γ) for k ∈ {1, 2, …, 20}. The state set S contains 25 cells, indicating which cell the agent is currently in. The action set contains 4 actions, A = {left, up, right, down}. Thus, |S| × |A| = 100. We choose γ = 0.99. For ease of verifying our theory, each entry of the reward R ∈ R^100 is sampled from Bern(p = 0.05), which slightly departs from a typical maze environment wherein only two state-action pairs have nonzero rewards. We choose this reward so that ∥Δ0∥∞ ≤ 100 = 1/(1 − γ), which is the coarse upper bound of ∥Δt∥∞ for all t. For each agent k, its state transition probability vectors Pk are constructed on top of the standard state transition probability vectors of maze environments, incorporating a drifting probability of 0.1 for each non-intentional action as in Windy Cliff (Jin et al., 2022; Paul et al., 2019). In this way, the environment heterogeneity lies not only in the differences of the non-zero probability values (Jin et al., 2022; Paul et al., 2019) but also in the probability supports (i.e., the locations of non-zero entries). Our construction is more challenging: the environment heterogeneity κ as per (2) of our construction was calculated to be 1.2, yet the largest environment heterogeneity of the Windy Cliff construction in Jin et al. (2022) is about 0.31. We choose Q0 = 0 ∈ R^100. All numerical results are based on 5 independent runs to capture the variability. The dark lines represent the mean of the runs, while the shaded areas around each line illustrate the range obtained by adding and subtracting one standard deviation from the mean. The maximum time duration is T = 20,000 in our experiment since it is sufficient to capture the characteristics of the training process.
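To make the quoted construction concrete, here is a hypothetical sketch of how such an environment family could be generated; the exact maze layout, drift mechanics, and the agent-specific perturbation used below are our simplifying assumptions, not the authors' construction:

```python
import numpy as np

def build_env(K=20, grid=5, drift=0.1, p_reward=0.05, seed=0):
    """Generate K heterogeneous grid-world kernels plus a shared reward.

    States are the grid*grid cells; actions are {left, up, right, down}.
    The intended direction receives the bulk of the probability mass,
    each non-intentional direction a drifting probability `drift`, and an
    agent-specific mixture perturbs both values and supports.
    """
    rng = np.random.default_rng(seed)
    S, A = grid * grid, 4
    R = rng.binomial(1, p_reward, size=(S, A)).astype(float)  # Bern(0.05) rewards
    moves = [(0, -1), (-1, 0), (0, 1), (1, 0)]  # left, up, right, down
    P = np.zeros((K, S, A, S))
    for k in range(K):
        for s in range(S):
            r, c = divmod(s, grid)
            for a in range(A):
                for a2, (dr, dc) in enumerate(moves):
                    r2 = min(max(r + dr, 0), grid - 1)  # clamp at the walls
                    c2 = min(max(c + dc, 0), grid - 1)
                    w = 1.0 - 3 * drift if a2 == a else drift
                    P[k, s, a, r2 * grid + c2] += w
        # agent-specific perturbation: changes probability values AND supports
        noise = rng.dirichlet(np.ones(S), size=(S, A))
        P[k] = 0.9 * P[k] + 0.1 * noise
    return P, R
```

Because the Dirichlet mixture places mass on cells a pure maze kernel never reaches, the per-agent supports differ, which is the second source of heterogeneity the quote emphasizes.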
Convergence behavior and two-phase phenomenon. We demonstrate through numerical simulations that our analysis aligns with the observed behaviors. For algorithms with a time-invariant stepsize, convergence requires sufficiently small stepsizes and a sufficiently large number of iterations T. To explore the impact of stepsizes on convergence, we use λ ∈ {0.9, 0.5, 0.2, 0.1, 0.05}, spanning a range within (0, 1). As shown in Figure 2a, these stepsizes are not sufficiently small, leading to a two-phase phenomenon: the ℓ∞-norm of Δt = Q* − Qt decays rapidly in the first phase and then bounces back in the second phase. This phenomenon is distinctive to heterogeneous settings. In contrast, Figure 2b indicates that in homogeneous environments no drastic bounce occurs, irrespective of the stepsize. Note that if multiple plots on the same page share the same legend, we display the legend only once for clarity. Figure 4a (light blue curve) demonstrates that with a sufficiently small stepsize, such as λ = 1/T, the error continuously decreases, reaching approximately 24 by iteration 20,000. A useful practical implication of our results is the following: while constant stepsizes are often used in reinforcement learning problems because of their strong performance in applications (Sutton & Barto, 2018), they suffer significant performance degradation in the presence of environment heterogeneity. Figure 2: The ℓ∞ error of different constant stepsizes under the heterogeneous and homogeneous settings. (a) Heterogeneous environments, E = 10. (b) Homogeneous environments, E = 10. Impacts of the synchronization period E. In homogeneous settings (refer to Figure 5 in Appendix G.1), the synchronization period E has negligible impact, consistent with prior findings in the literature (Woo et al., 2023; Khodadadian et al., 2022).
However, under heterogeneous conditions, larger E values lead to increased final error across the five constant stepsizes, as depicted in Figure 3 and Figure 2a. This degradation persists even with time-decaying stepsizes λt = 1/(t + 1), as shown in Figure 6. We hypothesize that larger E values require either smaller or more rapidly decaying stepsizes to mitigate the degradation caused by increased synchronization periods.
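The two stepsize schedules quoted earlier (λ = 1/T in Figure 4a versus λt = 1/(t + 1) in Figure 4b) can be compared with a back-of-the-envelope proxy of our own, not taken from the paper: treating each update as shrinking the deterministic part of the error by a factor (1 − (1 − γ)λt), the cumulative products below crudely track the two convergence curves:

```python
import numpy as np

def contraction(stepsizes, gamma=0.99):
    """Cumulative product of per-step factors (1 - (1 - gamma) * lam_t),
    a crude proxy for how fast the deterministic error component shrinks."""
    return np.cumprod(1.0 - (1.0 - gamma) * np.asarray(stepsizes))

T = 20_000
const = np.full(T, 1.0 / T)            # lambda   = 1/T      (as in Figure 4a)
decay = 1.0 / (np.arange(T) + 1.0)     # lambda_t = 1/(t+1)  (as in Figure 4b)
c_const, c_decay = contraction(const), contraction(decay)
```

Early on the decaying schedule contracts much faster because its stepsizes are larger, while by t = T − 1 the decaying stepsize 1/(t + 1) has fallen to exactly 1/T, so the per-step factors of the two schedules coincide at the end of training, loosely consistent with the observed late-stage alignment.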