Multi-Bellman operator for convergence of $Q$-learning with linear function approximation

Authors: Diogo S. Carvalho, Pedro A. Santos, Francisco S. Melo

TMLR 2025

Reproducibility variables, results, and supporting LLM responses:
Research Type: Experimental. "Finally, we empirically validate our theoretical results. [...] In this section, we validate our theoretical findings. All results are averages of five runs with standard deviations. Given the linear function approximation setting and relatively small scale of the environments, all experiments can be performed on standard commercial CPUs, with small memory costs, and lasting less than eight hours."
Researcher Affiliation: Academia. Diogo S. Carvalho, Pedro A. Santos, and Francisco S. Melo are all affiliated with INESC-ID, Instituto Superior Técnico, University of Lisbon.
Pseudocode: Yes. "We provide a pseudo-code of Multi Q-learning in Algorithm 1, which calls a function that builds the target for the update defined in Algorithm 2."
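For readers without access to the paper's Algorithms 1 and 2, the core idea, applying the Bellman operator k times to build the update target, then taking a linear Q-learning step, can be sketched as below. This is a minimal illustration, not the authors' implementation: the explicit `transitions` and `rewards` tables (deterministic model access), the function names, and the signatures are all assumptions made for the sketch.

```python
import numpy as np

def multi_bellman_target(s, a, k, w, features, transitions, rewards, gamma, actions):
    """Build the k-step multi-Bellman target (H^k Q)(s, a) recursively.

    Sketch assumes a deterministic model given as dicts:
      transitions[(s, a)] -> next state, rewards[(s, a)] -> reward.
    """
    r = rewards[(s, a)]
    s_next = transitions[(s, a)]
    if k == 1:
        # One-step Bellman target: r + gamma * max_a' Q(s', a')
        return r + gamma * max(features(s_next, b) @ w for b in actions)
    # k-step target: recurse with k-1 further applications at the successor state
    return r + gamma * max(
        multi_bellman_target(s_next, b, k - 1, w, features, transitions, rewards, gamma, actions)
        for b in actions
    )

def multi_q_update(w, s, a, k, alpha, features, transitions, rewards, gamma, actions):
    """One Multi Q-learning step with linear function approximation."""
    phi = features(s, a)
    target = multi_bellman_target(s, a, k, w, features, transitions, rewards, gamma, actions)
    return w + alpha * phi * (target - phi @ w)
```

With k = 1 this reduces to ordinary Q-learning with linear function approximation; larger k trades extra lookahead computation for the contraction properties the paper establishes for the multi-Bellman operator.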
Open Source Code: No. No explicit statement about releasing the code for the methodology described in the paper, or a link to a repository, was found.
Open Datasets: Yes. "Acrobot is a classic control problem proposed by Sutton (1995) where a joint actuates two links such that one end is fixed and the other is free. The actions of the agent are to apply a negative torque to the joint, apply a positive torque to the joint or do nothing. [...] Cartpole is a classic control problem proposed by Barto et al. (1983), where a cart balances a pole."
Dataset Splits: No. The paper describes experiments in reinforcement learning environments. It mentions a replay buffer, a mechanism for storing experiences, but does not provide training/validation/test splits in the conventional sense for static datasets.
Hardware Specification: No. "All experiments can be performed on standard commercial CPUs, with small memory costs, and lasting less than eight hours."
Software Dependencies: No. The paper does not specify any particular software libraries or version numbers used for the implementation (e.g., Python, PyTorch, TensorFlow).
Experiment Setup: Yes. "We consider a discount factor of 0.9, which is within the interval where Q-learning is originally reported to diverge, and a learning rate of 10^-1, and uniform data distribution. [...] We consider a discount factor of 0.98, within the interval at which Q-learning diverges, a learning rate of 10^-2, and a data distribution that samples the first action one seventh of the times and the second action six sevenths of the times. [...] We use tabular features, with a discount factor of 0.9 and a learning rate of 0.1. [...] We use an ϵ-greedy policy where ϵ decays linearly from 100% to 5% during the first half of interactions and remains constant afterwards. We use a replay buffer with 20% of the total number of timesteps used for the environment. [...] We use a discount factor of 0.99 and a learning rate of 3 * 10^-3. [...] We use a discount factor of 0.99 and a learning rate of 3 * 10^-2."
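The exploration schedule quoted above (ϵ decaying linearly from 100% to 5% over the first half of interactions, then held constant) is simple to reproduce. The sketch below is one straightforward reading of that description; the function name and argument defaults are illustrative, not taken from the paper.

```python
def epsilon_schedule(t, total_steps, eps_start=1.0, eps_end=0.05):
    """Epsilon for an eps-greedy policy at timestep t.

    Decays linearly from eps_start to eps_end over the first half of
    total_steps, then stays at eps_end, matching the quoted setup.
    """
    half = total_steps // 2
    if t >= half:
        return eps_end
    return eps_start + (eps_end - eps_start) * (t / half)
```

For example, with 1,000 total timesteps, ϵ is 1.0 at step 0, halfway between 1.0 and 0.05 at step 250, and 0.05 from step 500 onward.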