Test-time Adapted Reinforcement Learning with Action Entropy Regularization

Authors: Shoukai Xu, Zihao Lian, Mingkui Tan, Liu Liu, Zhong Zhang, Peilin Zhao

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate the effectiveness of Test-Time Adapted Reinforcement Learning (TARL), we conduct experiments on both discrete control and continuous control tasks. Extensive experiments on popular Atari game benchmarks and the D4RL dataset demonstrate the superiority of our method. Our method achieved a significant improvement over CQL, with a 13.6% relative increase in episode return on the hopper-expert-v2 task.
Researcher Affiliation | Collaboration | ¹South China University of Technology, ²Tencent AI Lab, ³Pazhou Laboratory, ⁴Key Laboratory of Big Data and Intelligent Robot, Ministry of Education, ⁵Shanghai Jiao Tong University. Correspondence to: Peilin Zhao <EMAIL>, Mingkui Tan <EMAIL>.
Pseudocode | Yes | Algorithm 1: Training Method for TARL
Open Source Code | Yes | The source code for this project is publicly available at https://github.com/xushoukai/TARL.
Open Datasets | Yes | Atari Benchmark: for discrete control tasks, we conduct experiments on Atari games (Bellemare et al., 2013). D4RL Benchmark: for continuous control tasks, we conduct experiments on the D4RL benchmark (Fu et al., 2020).
Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits (e.g., percentages or sample counts) for reproducibility. It mentions using specific D4RL datasets like 'Expert', 'Fully Replay', 'Medium Policy', 'Medium Replay Buffer', and 'Medium Expert', which are pre-defined datasets within the benchmark, but does not describe how these were further split by the authors for their experiments.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | We implement the discrete control experiments on Atari following CQL (Kumar et al., 2020) and the continuous control tasks in the OfflineRL-Kit codebase (Sun, 2023). While this mentions specific frameworks/codebases, it does not provide version numbers for Python, PyTorch, or the mentioned toolkits.
Experiment Setup | Yes | For the D4RL benchmark with continuous control tasks, the hyperparameters used for all tasks were a learning rate of 1e-6, a buffer capacity of 1000, and a selection of the top 10 smallest-entropy samples to update the offline policy. The KL divergence constraint λ was set to 1.0. For the Atari dataset with discrete control tasks, we set the hyperparameters as follows: a learning rate of 1e-9, an entropy threshold E0 of 0.1, and a KL divergence constraint limit λ of 1.5.
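The "top 10 smallest-entropy samples" selection reported above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `action_entropy` and `select_low_entropy` are hypothetical helper names, and the sketch assumes the test-time buffer stores per-sample action probability distributions from which Shannon entropy is computed.

```python
import numpy as np

def action_entropy(probs, eps=1e-12):
    """Shannon entropy of each action distribution (one row per sample)."""
    p = np.clip(probs, eps, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def select_low_entropy(probs, k=10):
    """Indices of the k samples whose action distributions have the
    smallest entropy, i.e. the most confident predictions; these would
    then be used for the test-time adaptation update."""
    ent = action_entropy(probs)
    return np.argsort(ent)[:k]

# Example: a buffer of 1000 action distributions over 4 discrete actions
# (matching the buffer capacity reported above), from which the 10
# lowest-entropy samples are picked.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 4))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
selected = select_low_entropy(probs, k=10)
```

Selecting low-entropy (high-confidence) samples is a common filtering choice in test-time adaptation, since updating on uncertain predictions tends to amplify noise.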