Sample-efficient Learning of Infinite-horizon Average-reward MDPs with General Function Approximation
Authors: Jianliang He, Han Zhong, Zhuoran Yang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We study infinite-horizon average-reward Markov decision processes (AMDPs) in the context of general function approximation. Specifically, we propose a novel algorithmic framework named Local-fitted Optimization with OPtimism (LOOP), which incorporates both model-based and value-based incarnations. In particular, LOOP features a novel construction of confidence sets and a low-switching policy updating scheme, which are tailored to the average-reward and function approximation setting. Moreover, for AMDPs, we propose a novel complexity measure, the average-reward generalized eluder coefficient (AGEC), which captures the challenge of exploration in AMDPs with general function approximation. Using AGEC, we prove that LOOP achieves a sublinear Õ(poly(d, sp(V*)) √(Tβ)) regret, where d and β correspond to AGEC and the log-covering number of the hypothesis class respectively, sp(V*) is the span of the optimal state bias function, T denotes the number of steps, and Õ(·) omits logarithmic factors. To the best of our knowledge, this paper presents the first comprehensive theoretical framework capable of handling nearly all AMDPs. |
| Researcher Affiliation | Academia | Jianliang He, Department of Statistics and Data Science, Fudan University; Han Zhong, Center for Data Science, Peking University; Zhuoran Yang, Department of Statistics and Data Science, Yale University |
| Pseudocode | Yes | Algorithm 1: Local-fitted Optimization with Optimism, LOOP(H, G, T, δ); Algorithm 2: MLE-based Local-fitted Optimization with Optimism, MLE-LOOP(H, T, δ); Algorithm 3: Extended Value Iteration (EVI) |
| Open Source Code | No | The paper does not contain any statement or link indicating that source code for the described methodology is publicly available. |
| Open Datasets | No | This is a theoretical paper and does not conduct experiments on a specific dataset. Therefore, there is no mention of a publicly available dataset. |
| Dataset Splits | No | This is a theoretical paper and does not conduct experiments. Therefore, there are no training/test/validation dataset splits mentioned. |
| Hardware Specification | No | This is a theoretical paper that does not report computational experiments. Therefore, no hardware specifications are provided. |
| Software Dependencies | No | This is a theoretical paper that does not report computational experiments. Therefore, no software dependencies with version numbers are listed. |
| Experiment Setup | No | This is a theoretical paper and does not conduct experiments. Therefore, no experimental setup details like hyperparameters or training settings are provided. |
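The abstract highlights a low-switching policy updating scheme, in which the learner recomputes its policy only occasionally rather than at every step. The sketch below is a hypothetical illustration of the general doubling-trigger idea behind such schemes, not the paper's actual LOOP algorithm (which updates based on its confidence-set construction): the policy is replanned only when the amount of collected data has doubled since the last update, so the number of policy switches over T steps grows logarithmically in T.

```python
import math

def low_switching_schedule(T):
    """Return the steps at which the policy would be recomputed under a
    doubling trigger: replan only once the sample count has doubled since
    the last update. This is an illustrative sketch, not LOOP itself."""
    switches = []
    last_update_count = 0  # sample count at the most recent policy update
    for t in range(1, T + 1):
        # Doubling trigger: the data collected so far is at least twice
        # what was available at the last replanning step.
        if last_update_count == 0 or t >= 2 * last_update_count:
            switches.append(t)
            last_update_count = t
    return switches

schedule = low_switching_schedule(1000)
# The schedule contains O(log T) switches rather than O(T),
# which is the key property such lazy-update schemes exploit.
```

Under this trigger, a horizon of T = 1000 steps produces only about log₂(T) policy switches, which is the kind of logarithmic switching cost that makes low-switching schemes attractive in the average-reward setting.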