EARL-BO: Reinforcement Learning for Multi-Step Lookahead, High-Dimensional Bayesian Optimization
Authors: Mujin Cheon, Jay H Lee, Dong-Yeun Koh, Calvin Tsay
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed method, EARL-BO (Encoder Augmented RL for BO), on synthetic benchmark functions and hyperparameter tuning problems, finding significantly improved performance compared to existing multi-step lookahead and high-dimensional BO methods. The paper reports comprehensive evaluations of EARL-BO across both synthetic benchmark functions and real-world hyperparameter tuning, against other multi-step lookahead and high-dimensional optimization methods. |
| Researcher Affiliation | Academia | 1Department of Computing, Imperial College London, UK 2Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science & Technology (KAIST), South Korea 3Mork Family Department of Chemical Engineering and Materials Science, University of Southern California, USA. Correspondence to: Calvin Tsay <EMAIL>. |
| Pseudocode | Yes | EARL-BO Summary (Algorithm 1). The EARL-BO algorithm implements a hybrid model-based and model-free RL approach, loosely following the Dyna framework (Silver et al., 2008; Wu et al., 2023). ... Algorithm 1 EARL-BO. Input: data Dk, action bounds [lb, ub]. Parameters: lookahead horizon, max episodes, update episodes, off-policy episodes. Output: next query point x_{k+1}. Initialize RL agent (PPO agent), encoder network, and memory buffer; fit GP to Dk. For k = 1 to max episodes: reset environment state s with Dk; for step = 1 to lookahead horizon: encode state s using the encoder network; if k ≤ off-policy episodes, select action a using the TuRBO acquisition, else select action a using the RL agent; sample y_{k+1} ~ N(µ_k(x; Dk), K_k(x, x; Dk)); compute reward r = R(Dk, x_{k+1}, D_{k+1}) using y_{k+1}; update environment state s′ = D_{k+1}; store transition (s, a, s′, r) in the memory buffer; set s ← s′. If k mod update episodes = 0: if k ≤ off-policy episodes, update the RL agent with the initial policy; else calculate the PPO loss using the memory buffer and train the actor, critic, and encoder networks. Clear the memory buffer. Finally, encode the state using the final encoder network and return the x_{k+1} output from the final actor network. |
| Open Source Code | No | The paper only mentions the use of third-party implementations for baselines (e.g., 'For EI, Random, and Rollout VR, we use implementations from (Lee et al., 2020) found at: https://github.com/erichanslee/lookahead_release. For TuRBO, we use the current implementation from the Uber research group: https://github.com/uber-research/TuRBO.') but does not provide any link or explicit statement about releasing the source code for the proposed EARL-BO method. |
| Open Datasets | Yes | We next evaluate EARL-BO in real-world scenarios using the Hyperparameter Optimization Benchmarks (HPO-B) dataset (Arango et al., 2021). The real-world Hyperparameter Optimization dataset is sourced from the HPO-B dataset (Arango et al., 2021), a collection of HPO datasets grouped by search space and tasks. |
| Dataset Splits | Yes | We initialize the BO algorithms using 30 random points within the search space and evaluate performance using simple regret (y_opt − y*_k). Each method is tested for ten replications by resampling the initial dataset. We initialize the BO algorithms using five random points for the 6- and 8-D problems and 50 for the 19-D problem. |
| Hardware Specification | Yes | We conducted our experiments on a computing server equipped with AMD EPYC 7742 processors. The specific allocation for each job was as follows: 16 CPUs and a maximum memory of 100 GB. |
| Software Dependencies | No | The paper mentions software components and algorithms such as PPO and Adam, but does not provide specific version numbers for any programming languages, libraries, or frameworks used (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | Table 1 displays a comprehensive list of hyperparameters for EARL-BO. We would like to underline that none of these presented hyperparameter values were tuned across problems. In other words, across various dimensions and function forms, we have kept the same hyperparameters with the most basic PPO and encoder values. ... Table 1. EARL-BO hyperparameter values. PPO agent: learning rate 0.001; # epochs 100; epsilon clip ϵ 0.2; Adam β values (0.9, 0.999); discount factor γ 0.95; value function coefficient 0.5; entropy coefficient 0.1; # layers frozen 2; max episodes 4000; update frequency 50; # off-policy episodes 400; no-improvement threshold 15; horizon 5. Encoder: hidden dimension 64; output dimension 16; learning rate 0.01. GP: kernel RBF + White kernel; RBF length-scale bounds (1e-2, 1e2); noise bounds (1e-10, 1e1). |
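The control flow of Algorithm 1 summarized in the table can be sketched in Python. This is a minimal illustration under heavy simplification, not the authors' implementation: the PPO agent and the TuRBO acquisition are replaced by random placeholder policies, the encoder network is omitted, and `gp_posterior` is a toy RBF-kernel GP. All names (`earl_bo_step`, `gp_posterior`) are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def gp_posterior(x, X, y, length_scale=1.0, noise=1e-6):
    """Toy GP posterior mean/variance with an RBF kernel (cf. Table 1)."""
    def k(a, b):
        d = a[:, None, :] - b[None, :, :]
        return np.exp(-0.5 * np.sum(d**2, axis=-1) / length_scale**2)
    K = k(X, X) + noise * np.eye(len(X))
    Ks = k(x, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, np.maximum(var, 1e-12)

def earl_bo_step(X, y, horizon=5, max_episodes=20, off_policy_episodes=5):
    """Simulate multi-step lookahead rollouts on the GP surrogate and
    return a next query point (placeholder policies, unit-cube bounds)."""
    best_x, best_reward = None, -np.inf
    for k in range(1, max_episodes + 1):
        Xk, yk = X.copy(), y.copy()          # reset fantasy dataset D_k
        total_reward, first_x = 0.0, None
        for _ in range(horizon):
            if k <= off_policy_episodes:
                # warm-start phase: a random local proposal stands in
                # for the TuRBO acquisition used in the paper
                a = np.clip(Xk[np.argmax(yk)]
                            + 0.1 * rng.standard_normal(X.shape[1]), 0, 1)
            else:
                # placeholder for the trained PPO policy
                a = rng.uniform(0, 1, X.shape[1])
            mu, var = gp_posterior(a[None, :], Xk, yk)
            y_sim = rng.normal(mu[0], np.sqrt(var[0]))   # sample GP posterior
            total_reward += max(y_sim - yk.max(), 0.0)   # improvement reward
            Xk = np.vstack([Xk, a])                      # fantasy update D_{k+1}
            yk = np.append(yk, y_sim)
            if first_x is None:
                first_x = a
        if total_reward > best_reward:
            best_reward, best_x = total_reward, first_x
    return best_x
```

In the actual method the rollout reward trains the PPO actor/critic and encoder; here we simply keep the first action of the best-scoring rollout, which preserves the model-based "plan on the surrogate, act in the real problem" structure.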