Reinforcement Learning with Adaptive Reward Modeling for Expensive-to-Evaluate Systems
Authors: Hongyuan Su, Yu Zheng, Yuan Yuan, Yuming Lin, Depeng Jin, Yong Li
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate the effectiveness of AdaReMo, we conduct extensive experiments across three challenging real-world scenarios: molecular generation, epidemic control, and spatial planning. All these tasks involve expensive-to-evaluate reward functions, typically requiring 1 to 15 seconds per sample, resulting in prohibitively long training times for convergence with traditional methods. Results show that AdaReMo not only achieves state-of-the-art performance with over 14.6% improvements over existing approaches, but more importantly, it enables highly efficient RL training, delivering a remarkable speedup of over 1,000 times. |
| Researcher Affiliation | Academia | 1Department of Electronic Engineering, BNRist, Tsinghua University, Beijing, China 2Zhongguancun Academy, Beijing, China 3Massachusetts Institute of Technology, Cambridge, MA USA. Correspondence to: Yu Zheng <yu EMAIL>, Yong Li <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Training Process of Online and Offline System |
| Open Source Code | Yes | Code and data for the project are provided at https://github.com/tsinghua-fib-lab/AdaReMo. |
| Open Datasets | Yes | We utilize large-scale real-world contact networks CA-GrQc (Rossi & Ahmed, 2015) and SNAP (Leskovec & Krevl, 2014), which are extensively studied in epidemiological research. |
| Dataset Splits | No | The paper mentions collecting 'samples' for the memory pool and fine-tune pool, and performing simulations with multiple seeds, but does not specify explicit training/test/validation splits for any dataset used for model training or evaluation in the traditional sense. For example, it mentions '20 different seeds (3 for synchronous correction)' for SIR model simulations, which relates to experiment repetitions rather than dataset splits. |
| Hardware Specification | No | The paper discusses parallel computation and the use of 'multi-threaded programming' and 'multiple processors or GPUs' in a general sense, but does not provide specific details on the hardware (e.g., GPU model, CPU type, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using specific algorithms and models like PPO, SIR model, AutoDock Vina, and SAND, but does not specify the version numbers of any software libraries, programming languages (e.g., Python, PyTorch), or other ancillary software components used for implementation or experimentation. |
| Experiment Setup | Yes | We first explored different fine-tuning intervals, ranging from 1 to 9 iterations, and trained the RM accordingly. ... The optimal solution was found with 40 fine-tuning epochs, matching the duration of a single policy iteration and demonstrating efficient time utilization. ... the parameters of the SIR model are set with an infectious rate β = 0.08 and a recovery rate γ = 0.2, informed by real-world pandemic propagation (Yu et al., 2021b). |
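The SIR parameters quoted above (infectious rate β = 0.08, recovery rate γ = 0.2) fully specify a basic compartmental update rule. A minimal discrete-time sketch is shown below for reference; the population size, initial infection count, and step count are illustrative assumptions, not values from the paper, and the paper itself runs the SIR model on contact networks rather than this well-mixed approximation.

```python
def simulate_sir(beta=0.08, gamma=0.2, n=10_000, i0=10, steps=100):
    """Well-mixed discrete-time SIR model.

    beta, gamma follow the paper's reported setup; n, i0, and steps
    are hypothetical illustration values.
    """
    s, i, r = n - i0, i0, 0
    history = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i / n  # S -> I transitions this step
        new_recoveries = gamma * i         # I -> R transitions this step
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history

trajectory = simulate_sir()
```

With β < γ the effective reproduction number is below one, so an outbreak under these parameters declines over time in the well-mixed setting; network structure in the paper's experiments can change this picture.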