Reinforcement Learning with Adaptive Reward Modeling for Expensive-to-Evaluate Systems
Authors: Hongyuan Su, Yu Zheng, Yuan Yuan, Yuming Lin, Depeng Jin, Yong Li
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate the effectiveness of AdaReMo, we conduct extensive experiments across three challenging real-world scenarios: molecular generation, epidemic control, and spatial planning. All these tasks involve expensive-to-evaluate reward functions, typically requiring 1 to 15 seconds per sample, resulting in prohibitively long training times for convergence with traditional methods. Results show that AdaReMo not only achieves state-of-the-art performance with over 14.6% improvements over existing approaches, but more importantly, it enables highly efficient RL training, delivering a remarkable speedup of over 1,000 times. |
| Researcher Affiliation | Academia | 1Department of Electronic Engineering, BNRist, Tsinghua University, Beijing, China 2Zhongguancun Academy, Beijing, China 3Massachusetts Institute of Technology, Cambridge, MA USA. Correspondence to: Yu Zheng <yu EMAIL>, Yong Li <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Training Process of Online and Offline System |
| Open Source Code | Yes | Code and data for the project are provided at https://github.com/tsinghua-fib-lab/AdaReMo. |
| Open Datasets | Yes | We utilize large-scale real-world contact networks CA-GrQc (Rossi & Ahmed, 2015) and SNAP (Leskovec & Krevl, 2014), which are extensively studied in epidemiological research. |
| Dataset Splits | No | The paper mentions collecting 'samples' for the memory pool and fine-tune pool, and performing simulations with multiple seeds, but does not specify explicit training/test/validation splits for any dataset used for model training or evaluation in the traditional sense. For example, it mentions '20 different seeds (3 for synchronous correction)' for SIR model simulations, which relates to experiment repetitions rather than dataset splits. |
| Hardware Specification | No | The paper discusses parallel computation and the use of 'multi-threaded programming' and 'multiple processors or GPUs' in a general sense, but does not provide specific details on the hardware (e.g., GPU model, CPU type, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using specific algorithms and models like PPO, SIR model, AutoDock Vina, and SAND, but does not specify the version numbers of any software libraries, programming languages (e.g., Python, PyTorch), or other ancillary software components used for implementation or experimentation. |
| Experiment Setup | Yes | We first explored different fine-tuning intervals, ranging from 1 to 9 iterations, and trained the RM accordingly. ... The optimal solution was found with 40 fine-tuning epochs, matching the duration of a single policy iteration and demonstrating efficient time utilization. ... the parameters of the SIR model are set with an infectious rate β = 0.08 and a recovery rate γ = 0.2, informed by real-world pandemic propagation (Yu et al., 2021b). |
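The SIR parameters quoted above (infectious rate β = 0.08, recovery rate γ = 0.2) fully specify a basic compartmental update rule. A minimal discrete-time sketch is shown below for reference; the population size, initial infection count, and step count are illustrative assumptions, not values from the paper, and the paper itself runs the SIR model on contact networks rather than this well-mixed approximation.

```python
def simulate_sir(beta=0.08, gamma=0.2, n=10_000, i0=10, steps=100):
    """Well-mixed discrete-time SIR model.

    beta, gamma follow the paper's reported setup; n, i0, and steps
    are hypothetical illustration values.
    """
    s, i, r = n - i0, i0, 0
    history = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i / n  # S -> I transitions this step
        new_recoveries = gamma * i         # I -> R transitions this step
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history

trajectory = simulate_sir()
```

With β < γ the effective reproduction number is below one, so an outbreak under these parameters declines over time in the well-mixed setting; network structure in the paper's experiments can change this picture.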