Multi-Bellman operator for convergence of $Q$-learning with linear function approximation

Authors: Diogo S. Carvalho, Pedro A. Santos, Francisco S. Melo

TMLR 2025

Reproducibility variables, results, and supporting LLM responses:
Research Type: Experimental. "Finally, we empirically validate our theoretical results. [...] In this section, we validate our theoretical findings. All results are averages of five runs with standard deviations. Given the linear function approximation setting and relatively small scale of the environments, all experiments can be performed on standard commercial CPUs, with small memory costs, and lasting less than eight hours."
Researcher Affiliation: Academia. Diogo S. Carvalho, Pedro A. Santos, and Francisco S. Melo are all affiliated with INESC-ID, Instituto Superior Técnico, University of Lisbon.
Pseudocode: Yes. "We provide a pseudo-code of Multi Q-learning in Algorithm 1, which calls a function that builds the target for the update defined in Algorithm 2."
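For readers without access to the paper's Algorithms 1 and 2, the core idea, applying the Bellman operator k times to build the update target, then taking a linear Q-learning step, can be sketched as below. This is a minimal illustration, not the authors' implementation: the explicit `transitions` and `rewards` tables (deterministic model access), the function names, and the signatures are all assumptions made for the sketch.

```python
import numpy as np

def multi_bellman_target(s, a, k, w, features, transitions, rewards, gamma, actions):
    """Build the k-step multi-Bellman target (H^k Q)(s, a) recursively.

    Sketch assumes a deterministic model given as dicts:
      transitions[(s, a)] -> next state, rewards[(s, a)] -> reward.
    """
    r = rewards[(s, a)]
    s_next = transitions[(s, a)]
    if k == 1:
        # One-step Bellman target: r + gamma * max_a' Q(s', a')
        return r + gamma * max(features(s_next, b) @ w for b in actions)
    # k-step target: recurse with k-1 further applications at the successor state
    return r + gamma * max(
        multi_bellman_target(s_next, b, k - 1, w, features, transitions, rewards, gamma, actions)
        for b in actions
    )

def multi_q_update(w, s, a, k, alpha, features, transitions, rewards, gamma, actions):
    """One Multi Q-learning step with linear function approximation."""
    phi = features(s, a)
    target = multi_bellman_target(s, a, k, w, features, transitions, rewards, gamma, actions)
    return w + alpha * phi * (target - phi @ w)
```

With k = 1 this reduces to ordinary Q-learning with linear function approximation; larger k trades extra lookahead computation for the contraction properties the paper establishes for the multi-Bellman operator.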
Open Source Code: No. No explicit statement about releasing the code for the methodology described in the paper, or a link to a repository, was found.
Open Datasets: Yes. "Acrobot is a classic control problem proposed by Sutton (1995) where a joint actuates two links such that one end is fixed and the other is free. The actions of the agent are to apply a negative torque to the joint, apply a positive torque to the joint or do nothing. [...] Cartpole is a classic control problem proposed by Barto et al. (1983), where a cart balances a pole."
Dataset Splits: No. The paper describes experiments in reinforcement learning environments. It mentions a replay buffer, a mechanism for storing experiences, but does not provide training/validation/test splits in the conventional sense for static datasets.
Hardware Specification: No. "All experiments can be performed on standard commercial CPUs, with small memory costs, and lasting less than eight hours."
Software Dependencies: No. The paper does not specify any particular software libraries or version numbers used for the implementation (e.g., Python, PyTorch, TensorFlow).
Experiment Setup: Yes. "We consider a discount factor of 0.9, which is within the interval where Q-learning is originally reported to diverge, and a learning rate of 10^-1, and uniform data distribution. [...] We consider a discount factor of 0.98, within the interval at which Q-learning diverges, a learning rate of 10^-2, and a data distribution that samples the first action one seventh of the times and the second action six sevenths of the times. [...] We use tabular features, with a discount factor of 0.9 and a learning rate of 0.1. [...] We use an ϵ-greedy policy where ϵ decays linearly from 100% to 5% during the first half of interactions and remains constant afterwards. We use a replay buffer with 20% of the total number of timesteps used for the environment. [...] We use a discount factor of 0.99 and a learning rate of 3 * 10^-3. [...] We use a discount factor of 0.99 and a learning rate of 3 * 10^-2."
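The exploration schedule quoted above (ϵ decaying linearly from 100% to 5% over the first half of interactions, then held constant) is simple to reproduce. The sketch below is one straightforward reading of that description; the function name and argument defaults are illustrative, not taken from the paper.

```python
def epsilon_schedule(t, total_steps, eps_start=1.0, eps_end=0.05):
    """Epsilon for an eps-greedy policy at timestep t.

    Decays linearly from eps_start to eps_end over the first half of
    total_steps, then stays at eps_end, matching the quoted setup.
    """
    half = total_steps // 2
    if t >= half:
        return eps_end
    return eps_start + (eps_end - eps_start) * (t / half)
```

For example, with 1,000 total timesteps, ϵ is 1.0 at step 0, halfway between 1.0 and 0.05 at step 250, and 0.05 from step 500 onward.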