Can Temporal-Difference and Q-Learning Learn Representation? A Mean-Field Theory
Authors: Yufeng Zhang, Qi Cai, Zhuoran Yang, Yongxin Chen, Zhaoran Wang
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We prove that, utilizing an overparameterized two-layer neural network, temporal-difference and Q-learning globally minimize the mean-squared projected Bellman error at a sublinear rate. Moreover, the associated feature representation converges to the optimal one, generalizing the previous analysis of [21] in the neural tangent kernel regime, where the associated feature representation stabilizes at the initial one. The key to our analysis is a mean-field perspective |
| Researcher Affiliation | Academia | Yufeng Zhang (Northwestern University, Evanston, IL 60208); Qi Cai (Northwestern University, Evanston, IL 60208); Zhuoran Yang (Princeton University, Princeton, NJ 08544); Yongxin Chen (Georgia Institute of Technology, Atlanta, GA 30332); Zhaoran Wang (Northwestern University, Evanston, IL 60208) |
| Pseudocode | Yes | For an initial distribution ρ₀ ∈ 𝒫(ℝ^D), we initialize {θᵢ} i.i.d. ∼ ρ₀ (i ∈ [m]). See Algorithm 1 in §A for a detailed description. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper is theoretical and does not conduct experiments or use datasets, thus no information about public dataset availability is provided. |
| Dataset Splits | No | The paper is theoretical and does not conduct experiments with datasets, thus no information about training/validation/test splits is provided. |
| Hardware Specification | No | The paper is theoretical and does not describe any experiments that would require hardware specifications. |
| Software Dependencies | No | The paper is theoretical and does not describe an experimental setup that would involve software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe an experimental setup with specific hyperparameters or training configurations. |