Test-time Adapted Reinforcement Learning with Action Entropy Regularization

Authors: Shoukai Xu, Zihao Lian, Mingkui Tan, Liu Liu, Zhong Zhang, Peilin Zhao

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate the effectiveness of Test-Time Adapted Reinforcement Learning (TARL), we conduct experiments on both discrete control and continuous control tasks. Extensive experiments on popular Atari game benchmarks and the D4RL dataset demonstrate the superiority of our method. Our method achieved a significant improvement over CQL, with a 13.6% relative increase in episode return on the hopper-expert-v2 task.
Researcher Affiliation | Collaboration | ¹South China University of Technology, ²Tencent AI Lab, ³Pazhou Laboratory, ⁴Key Laboratory of Big Data and Intelligent Robot, Ministry of Education, ⁵Shanghai Jiao Tong University. Correspondence to: Peilin Zhao <EMAIL>, Mingkui Tan <EMAIL>.
Pseudocode | Yes | Algorithm 1: Training Method for TARL
Open Source Code | Yes | The source code for this project is publicly available at https://github.com/xushoukai/TARL.
Open Datasets | Yes | Atari Benchmark: for discrete control tasks, we conduct experiments on Atari games (Bellemare et al., 2013). D4RL Benchmark: for continuous control tasks, we conduct experiments on the D4RL benchmark (Fu et al., 2020).
Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits (e.g., percentages or sample counts) for reproducibility. It mentions using specific D4RL datasets like 'Expert', 'Fully Replay', 'Medium Policy', 'Medium Replay Buffer', and 'Medium Expert', which are pre-defined datasets within the benchmark, but does not describe how these were further split by the authors for their experiments.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | We implement the discrete control experiments on Atari following CQL (Kumar et al., 2020) and the continuous control tasks in the OfflineRL-Kit codebase (Sun, 2023). While this mentions specific frameworks/codebases, it does not provide version numbers for Python, PyTorch, or the mentioned toolkits.
Experiment Setup | Yes | For the D4RL benchmark with continuous control tasks, the hyperparameters used for all tasks were a learning rate of 1e-6, a buffer capacity of 1000, and a selection of the top 10 smallest-entropy samples to update the offline policy. The KL divergence constraint λ was set to 1.0. For the Atari dataset with discrete control tasks, we set the hyperparameters as follows: a learning rate of 1e-9, an entropy threshold E0 of 0.1, and a KL divergence constraint limit λ of 1.5.
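The "top 10 smallest-entropy samples" selection reported above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `action_entropy` and `select_low_entropy` are hypothetical helper names, and the sketch assumes the test-time buffer stores per-sample action probability distributions from which Shannon entropy is computed.

```python
import numpy as np

def action_entropy(probs, eps=1e-12):
    """Shannon entropy of each action distribution (one row per sample)."""
    p = np.clip(probs, eps, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def select_low_entropy(probs, k=10):
    """Indices of the k samples whose action distributions have the
    smallest entropy, i.e. the most confident predictions; these would
    then be used for the test-time adaptation update."""
    ent = action_entropy(probs)
    return np.argsort(ent)[:k]

# Example: a buffer of 1000 action distributions over 4 discrete actions
# (matching the buffer capacity reported above), from which the 10
# lowest-entropy samples are picked.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 4))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
selected = select_low_entropy(probs, k=10)
```

Selecting low-entropy (high-confidence) samples is a common filtering choice in test-time adaptation, since updating on uncertain predictions tends to amplify noise.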