Towards General-Purpose Model-Free Reinforcement Learning

Authors: Scott Fujimoto, Pierluca D'Oro, Amy Zhang, Yuandong Tian, Michael Rabbat

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our algorithm, MR.Q, on a variety of common RL benchmarks with a single set of hyperparameters and show a competitive performance against domain-specific and general baselines, providing a concrete step towards building general-purpose model-free deep RL algorithms.
Researcher Affiliation | Industry | Scott Fujimoto, Pierluca D'Oro, Amy Zhang, Yuandong Tian, Michael Rabbat. Meta FAIR. Correspondence: EMAIL.
Pseudocode | Yes | 4.2 ALGORITHM: We now present the details of MR.Q (Model-based Representations for Q-learning). ... Given the transition (s, a, r, d, s') from the replay buffer. Output: MR.Q. Trained end-to-end: State Encoder zs = fω(s); State-Action Encoder zsa = gω(zs, a); MDP predictor (zs', r, d) = m(zsa). Decoupled RL: Value Qi = Qθ(zsa); Policy aπ = πϕ(zs). Update MR.Q: if t % Ttarget = 0 then update target networks θ', ϕ', ω' ← θ, ϕ, ω and rescale rewards r ← r / mean_D |r|; for Ttarget time steps do: encoder update (Equation 14), value update (Equation 19), policy update (Equation 20).
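The update schedule described in the excerpt (hard-copy the target networks every Ttarget steps, then run Ttarget update steps) can be sketched as follows. This is a hypothetical, simplified skeleton, not the authors' implementation: parameters are plain dicts standing in for the encoder (ω), value (θ), and policy (ϕ) networks, and the placeholder increments stand in for the real gradient updates of Equations 14, 19, and 20.

```python
import copy

class MRQSketch:
    """Minimal sketch of the MR.Q update loop structure (hypothetical)."""

    def __init__(self, target_period=250):
        self.target_period = target_period             # Ttarget
        # Parameters ω (encoder), θ (value), ϕ (policy) as plain dicts.
        self.params = {"omega": [0.0], "theta": [0.0], "phi": [0.0]}
        # Target networks ω', θ', ϕ' start as copies.
        self.target = copy.deepcopy(self.params)

    def train_step(self, t, batch):
        if t % self.target_period == 0:
            # Target networks: θ', ϕ', ω' ← θ, ϕ, ω (hard copy).
            # Reward rescaling would also happen here in the paper.
            self.target = copy.deepcopy(self.params)
        # Placeholder steps standing in for Eqs. 14, 19, 20:
        for key in self.params:
            self.params[key][0] += 0.01
```

The key design point the excerpt highlights is that targets are frozen for a whole block of Ttarget updates rather than updated with a soft (Polyak) average each step.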
Open Source Code Yes Code: https://github.com/facebookresearch/MRQ.
Open Datasets | Yes | We evaluate MR.Q on four widely used RL benchmarks and 118 environments... Gym Locomotion. This subset of the Gym benchmark (Brockman et al., 2016; Towers et al., 2024)... DMC Proprioceptive. The DeepMind Control suite (DMC) (Tassa et al., 2018)... Atari. The Atari benchmark is built on the Arcade Learning Environment (Bellemare et al., 2013).
Dataset Splits | Yes | Evaluations are based on the average performance over 10 episodes, measured every 5k time steps for Gym and DM Control and every 100k time steps for Atari. Gym Locomotion... Agents are trained for 1M time steps... DMC Proprioceptive... Agents are trained for 500k time steps, equivalent to 1M frames... Atari... Agents are trained for 2.5M time steps (equivalent to 10M frames)...
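The evaluation protocol in this row (mean return over 10 episodes, run at a fixed interval of training steps) can be sketched generically. The `env_reset`/`env_step`/`act` callables below are placeholder assumptions, not the Gymnasium API or the authors' code:

```python
def evaluate(env_reset, env_step, act, episodes=10):
    """Mean episodic return over `episodes` rollouts.

    env_reset() -> initial state
    env_step(action) -> (next_state, reward, done)
    act(state) -> action
    (All three are hypothetical stand-ins for a real environment/agent.)
    """
    returns = []
    for _ in range(episodes):
        state, done, total = env_reset(), False, 0.0
        while not done:
            state, reward, done = env_step(act(state))
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)
```

In the paper's setup this routine would be invoked every 5k training steps for Gym/DMC and every 100k steps for Atari, with `episodes=10`.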
Hardware Specification No No specific hardware details (like GPU/CPU models or processor types) are mentioned in the paper.
Software Dependencies | Yes | B.5 SOFTWARE VERSIONS: Gymnasium 0.29.1 (Towers et al., 2024), MuJoCo 3.2.2 (Todorov et al., 2012), NumPy 2.1.1 (Harris et al., 2020), Python 3.11.8 (Van Rossum & Drake Jr, 1995), PyTorch 2.4.1 (Paszke et al., 2019).
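For reproduction, the pinned versions above map onto a requirements file along these lines (the PyPI package names are an assumption; the paper lists versions only, and Python 3.11.8 is fixed separately via the interpreter, not pip):

```
# Pinned versions from Appendix B.5 (PyPI names assumed)
gymnasium==0.29.1
mujoco==3.2.2
numpy==2.1.1
torch==2.4.1
```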
Experiment Setup | Yes | Table 1: Hyperparameter differences between Rainbow (Hessel et al., 2018) and TD3 (Fujimoto et al., 2018). ... Table 3: MR.Q Hyperparameters. Hyperparameter values are kept fixed across all benchmarks.