Actor-Critics Can Achieve Optimal Sample Efficiency
Authors: Kevin Tan, Wei Fan, Yuting Wei
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments support our theoretical findings. We provide two numerical experiments to empirically verify our findings, with details deferred to Appendix H. The first experiment examines Algorithms 1 and 2 in a linear MDP setting, in order to validate whether they indeed achieve √T regret in practice. |
| Researcher Affiliation | Academia | 1Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, USA. |
| Pseudocode | Yes | Algorithm 1 Double Optimistic Updates for Heavily Updating Actor-critics (DOUHUA); Algorithm 2 No-regret Optimistic Rare-switching Actor-critic (NORA); Algorithm 3 Non-Optimistic Actor-critic with Hybrid RL targeting π(t) (NOAH-π); Algorithm 4 Non-Optimistic Actor-critic with Hybrid RL targeting π (NOAH); Algorithm 5 No-regret Optimistic Rare-switching Actor-critic (NORA) for Hybrid RL |
| Open Source Code | Yes | Figure 2 can be reproduced by running actor_critic.ipynb within the following GitHub repository (https://github.com/hetankevin/hybridcov). Figure 3 can be reproduced by running scripts/run_antmaze.sh within the following GitHub repository (https://github.com/nakamotoo/Cal-QL). |
| Open Datasets | Yes | Figure 2. Per-episode reward and cumulative regret of Algorithms 1 and 2, compared to a rare-switching version of LSVI-UCB (Jin et al., 2019) on a linear MDP tetris task. ... Figure 3. Comparison of Cal-QL (Nakamoto et al., 2023), Alg. 1H, and Alg. 2H on the antmaze-medium-diverse-v2 task. |
| Dataset Splits | No | The paper does not explicitly specify exact percentages or sample counts for training, validation, or test sets. It mentions "offline pretraining in the first 1000 steps" in Figure 3 caption, but this refers to training duration rather than data splitting methodology. No citation for standard splits for the mentioned tasks is provided. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware components such as CPU, GPU models, or TPU types used for running the experiments. |
| Software Dependencies | No | The paper refers to running Python scripts and interacting with GitHub repositories in Appendix H, implying the use of Python and related machine learning libraries. However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | Algorithm 2H can be reproduced by adding the flags --enable_calql=False, --use_cql=False, and --online_use_cql=False. Algorithm 1H can be reproduced with the same flags as Algorithm 2H, but additionally setting the config.cql_max_target_backup argument within the ConservativeSAC() object to False. |
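The reported setup for Algorithm 2H can be sketched as a shell invocation. Note this is a hedged reconstruction, not a verified command: the flag names (`--enable_calql`, `--use_cql`, `--online_use_cql`) and the script path `scripts/run_antmaze.sh` are taken from the paper's appendix as quoted above, with underscores restored where PDF extraction dropped them; the exact layout of the Cal-QL repository is assumed. The snippet prints the command rather than launching training.

```shell
# Sketch of the Algorithm 2H reproduction command on antmaze-medium-diverse-v2.
# ASSUMPTION: flag names and script path are reconstructed from the paper's
# appendix; verify against the Cal-QL repository before running.
REPO="https://github.com/nakamotoo/Cal-QL"
FLAGS="--enable_calql=False --use_cql=False --online_use_cql=False"

# Dry run: echo the invocation instead of launching a long training job here.
echo "bash scripts/run_antmaze.sh $FLAGS  # run from a clone of $REPO"
```

For Algorithm 1H, the paper additionally sets `config.cql_max_target_backup = False` on the `ConservativeSAC()` object inside the training code, which cannot be expressed as a command-line flag alone.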