Actor-Critics Can Achieve Optimal Sample Efficiency

Authors: Kevin Tan, Wei Fan, Yuting Wei

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerical experiments support our theoretical findings. We provide two numerical experiments to empirically verify our findings, with details deferred to Appendix H. The first experiment examines Algorithms 1 and 2 in a linear MDP setting, in order to validate whether they indeed achieve sqrt(T) regret in practice.
Researcher Affiliation | Academia | Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, USA.
Pseudocode | Yes | Algorithm 1: Double Optimistic Updates for Heavily Updating Actor-critics (DOUHUA); Algorithm 2: No-regret Optimistic Rare-switching Actor-critic (NORA); Algorithm 3: Non-Optimistic Actor-critic with Hybrid RL targeting π(t) (NOAH-π); Algorithm 4: Non-Optimistic Actor-critic with Hybrid RL targeting π (NOAH); Algorithm 5: No-regret Optimistic Rare-switching Actor-critic (NORA) for Hybrid RL
Open Source Code | Yes | Figure 2 can be reproduced by running actor_critic.ipynb within the following GitHub repository (https://github.com/hetankevin/hybridcov). Figure 3 can be reproduced by running scripts/run_antmaze.sh within the following GitHub repository (https://github.com/nakamotoo/Cal-QL).
Open Datasets | Yes | Figure 2. Per-episode reward and cumulative regret of Algorithms 1 and 2, compared to a rare-switching version of LSVI-UCB (Jin et al., 2019) on a linear MDP tetris task. ... Figure 3. Comparison of Cal-QL (Nakamoto et al., 2023), Alg. 1H, and Alg. 2H on the antmaze-medium-diverse-v2 task.
Dataset Splits | No | The paper does not explicitly specify exact percentages or sample counts for training, validation, or test sets. It mentions "offline pretraining in the first 1000 steps" in the Figure 3 caption, but this refers to training duration rather than a data-splitting methodology. No citation for standard splits of the mentioned tasks is provided.
Hardware Specification | No | The paper does not explicitly mention any specific hardware components such as CPU, GPU models, or TPU types used for running the experiments.
Software Dependencies | No | The paper refers to running Python scripts and interacting with GitHub repositories in Appendix H, implying the use of Python and related machine learning libraries. However, it does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | Algorithm 2H can be reproduced by adding the flags --enable_calql=False, --use_cql=False, and --online_use_cql=False. Algorithm 1H can be reproduced with the same flags as Algorithm 2H, but additionally setting the config.cql_max_target_backup argument within the ConservativeSAC() object to False.
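The Research Type row notes that the paper's first experiment checks whether Algorithms 1 and 2 empirically achieve sqrt(T) regret. A minimal, generic way to run that kind of check on any cumulative-regret curve is to estimate the growth exponent with a log-log least-squares fit and see whether it lands near 0.5. This is a sketch, not the paper's code: the function name and the simulated noisy sqrt(T) curve below are illustrative.

```python
import numpy as np

def growth_exponent(T, regret):
    """Estimate alpha in regret(T) ~ c * T^alpha via a log-log linear fit."""
    alpha, _intercept = np.polyfit(np.log(T), np.log(regret), 1)
    return alpha

# Illustrative data: a cumulative-regret curve that grows like sqrt(T),
# perturbed by small multiplicative noise (stands in for experimental output).
rng = np.random.default_rng(0)
T = np.arange(100, 10001, 100)
regret = 3.0 * np.sqrt(T) * (1 + 0.02 * rng.standard_normal(T.size))

alpha = growth_exponent(T, regret)
print(f"estimated growth exponent: {alpha:.3f}")  # near 0.5 for sqrt(T) regret
```

An exponent close to 0.5 is consistent with the sqrt(T) rate; a slope near 1.0 would instead indicate linear (no-learning) regret.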