Actor-Critics Can Achieve Optimal Sample Efficiency

Authors: Kevin Tan, Wei Fan, Yuting Wei

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerical experiments support our theoretical findings. We provide two numerical experiments to empirically verify our findings, with details deferred to Appendix H. The first experiment examines Algorithms 1 and 2 in a linear MDP setting, in order to validate whether they indeed achieve sqrt(T) regret in practice.
Researcher Affiliation | Academia | Department of Statistics and Data Science, The Wharton School, University of Pennsylvania, USA.
Pseudocode | Yes | Algorithm 1: Double Optimistic Updates for Heavily Updating Actor-critics (DOUHUA); Algorithm 2: No-regret Optimistic Rare-switching Actor-critic (NORA); Algorithm 3: Non-Optimistic Actor-critic with Hybrid RL targeting π(t) (NOAH-π); Algorithm 4: Non-Optimistic Actor-critic with Hybrid RL targeting π (NOAH); Algorithm 5: No-regret Optimistic Rare-switching Actor-critic (NORA) for Hybrid RL
Open Source Code | Yes | Figure 2 can be reproduced by running actor_critic.ipynb within the following GitHub repository (https://github.com/hetankevin/hybridcov). Figure 3 can be reproduced by running scripts/run_antmaze.sh within the following GitHub repository (https://github.com/nakamotoo/Cal-QL).
Open Datasets | Yes | Figure 2. Per-episode reward and cumulative regret of Algorithms 1 and 2, compared to a rare-switching version of LSVI-UCB (Jin et al., 2019) on a linear MDP tetris task. ... Figure 3. Comparison of Cal-QL (Nakamoto et al., 2023), Alg. 1H, and Alg. 2H on the antmaze-medium-diverse-v2 task.
Dataset Splits | No | The paper does not explicitly specify exact percentages or sample counts for training, validation, or test sets. It mentions "offline pretraining in the first 1000 steps" in the Figure 3 caption, but this refers to training duration rather than a data-splitting methodology. No citation for standard splits of the mentioned tasks is provided.
Hardware Specification | No | The paper does not explicitly mention any specific hardware components such as CPU, GPU models, or TPU types used for running the experiments.
Software Dependencies | No | The paper refers to running Python scripts and interacting with GitHub repositories in Appendix H, implying the use of Python and related machine learning libraries. However, it does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | Algorithm 2H can be reproduced by adding the flags --enable_calql=False, --use_cql=False, and --online_use_cql=False. Algorithm 1H can be reproduced with the same flags as Algorithm 2H, but additionally setting the config.cql_max_target_backup argument within the ConservativeSAC() object to False.
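The Research Type row notes that the paper's first experiment checks whether Algorithms 1 and 2 empirically achieve sqrt(T) regret. A minimal, generic way to run that kind of check on any cumulative-regret curve is to estimate the growth exponent with a log-log least-squares fit and see whether it lands near 0.5. This is a sketch, not the paper's code: the function name and the simulated noisy sqrt(T) curve below are illustrative.

```python
import numpy as np

def growth_exponent(T, regret):
    """Estimate alpha in regret(T) ~ c * T^alpha via a log-log linear fit."""
    alpha, _intercept = np.polyfit(np.log(T), np.log(regret), 1)
    return alpha

# Illustrative data: a cumulative-regret curve that grows like sqrt(T),
# perturbed by small multiplicative noise (stands in for experimental output).
rng = np.random.default_rng(0)
T = np.arange(100, 10001, 100)
regret = 3.0 * np.sqrt(T) * (1 + 0.02 * rng.standard_normal(T.size))

alpha = growth_exponent(T, regret)
print(f"estimated growth exponent: {alpha:.3f}")  # near 0.5 for sqrt(T) regret
```

An exponent close to 0.5 is consistent with the sqrt(T) rate; a slope near 1.0 would instead indicate linear (no-learning) regret.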