EvoControl: Multi-Frequency Bi-Level Control for High-Frequency Continuous Control
Authors: Samuel Holt, Todor Davchev, Dhruva Tirumala, Ben Moran, Atil Iscen, Antoine Laurens, Yixin Lin, Erik Frey, Markus Wulfmeier, Francesco Romano, Nicolas Heess
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that EvoControl can achieve a higher evaluation reward for continuous-control tasks compared to existing approaches, specifically excelling in tasks where high-frequency control is needed, such as those requiring safety-critical fast reactions. Evaluation. Unless otherwise stated we train each policy (high-level ρ and low-level β) for 1M high-level steps. Post-training, we evaluate performance using 128 rollouts (different random seeds) per trained policy, calculating the return for each 1,000-step episode. We repeat this process for three training seeds per baseline. Results are reported as the mean normalized score R (Yu et al., 2020) across all 384 evaluation rollouts (3 training seeds x 128 evaluation rollouts), scaled from 0 (random policy performance) to 100 (best non-EvoControl baseline), detailed in Appendix H. Table 3: Normalized evaluation returns (R) for benchmarks trained for an equivalent number of 1M high-level (ρ) steps per environment. EvoControl consistently outperforms baseline methods (fixed controllers and direct torque control), with results averaged over 384 random seeds (95% confidence intervals shown). Scores are normalized between 0 (random policy) and 100 (best-performing non-EvoControl baseline). |
| Researcher Affiliation | Collaboration | Samuel Holt 1 Todor Davchev 2 Dhruva Tirumala 2 Ben Moran 2 Atil Iscen 2 Antoine Laurens 2 Yixin Lin 2 Erik Frey 2 Markus Wulfmeier 2 Francesco Romano 2 Nicolas Heess 2 Work done during an internship at Google DeepMind. Equal advising. 1University of Cambridge 2Google DeepMind. Correspondence to: Samuel Holt <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Bi-Level Policy Interaction (Single High-Level Step) 1: a_k ← ρ(s_k) {High-level action} 2: for i = 0 to G − 1 do 3: u_{k+i} ← β(s_{k+i}, a_k) {Low-level action} 4: s_{k+i+1} ← f(s_{k+i}, u_{k+i}, Δt) 5: end for Algorithm 2 EvoControl Training Require: Environment f(s_t, u_t), reward function r(s_t, u_t), high-level policy ρ_θ(s_k), initial low-level policy β_PD(s_i, a_k), total training sections K, steps per section N, annealing strategy for α, ES parameters η, population size P, generations per section G_evo, rollouts per individual R_evo. Ensure: Trained high-level policy ρ_θ(s_k), trained low-level policy β_ϕ(s_i, a_k). |
| Open Source Code | No | The paper mentions using and referencing open-source implementations for components like PPO (PureJaxRL, CleanRL, Stable Baselines) and ES (EvoJax, evosax) but does not provide a direct link or explicit statement for the open-sourcing of the 'EvoControl' methodology's own implementation. |
| Open Datasets | Yes | Benchmark Environments. We evaluate performance on thirteen high-dimensional continuous control environments. Ten environments are adapted from standard Gym MuJoCo tasks (Brockman et al., 2016a; Freeman et al., 2021), including locomotion (e.g., Ant, HalfCheetah, Humanoid) and manipulation tasks (e.g., Reacher, Pusher). Crucially, we substantially modify these benchmarks by increasing the control frequency to 500Hz (with episodes lasting 1000 steps or 2 seconds of real-time) and removing the typical control-cost term. We use Brax (Freeman et al., 2021), a differentiable physics engine, which provides efficient implementations of the Ant, HalfCheetah, Hopper, Humanoid, HumanoidStandup, InvertedDoublePendulum, Pusher, Reacher, and Walker2d environments. These environments encompass a range of locomotion and manipulation tasks, providing a diverse testbed for evaluating EvoControl. For each environment, we set the simulation timestep Δt to 0.002 (500Hz operation). High-level policies operate at a frequency of 31.25Hz, achieved by executing each high-level action for G = 16 simulation steps. To ensure a fair comparison across different control modes, we remove the action magnitude penalization from the default reward function of each environment. The low-level policy receives the high-level action concatenated to a subset of the environment observation state as its own observation, and the exact input specification for each EvoControl variation is provided in Table 2. This allows the low-level controller to condition its actions on the target specified by the high-level policy. The low-level action space is the same as the high-level action space. All environments have a fixed episode length of 1,000 low-level environment steps. To increase the realism of the simulation, we run the Brax environments with the MJX backend, a MuJoCo implementation in JAX with XLA.
This enables us even to modify the MuJoCo XML definition file (to create the Safety-Critical Reacher environment). For all MuJoCo environments, we incorporated fixed PD controllers. We tuned the PD gains for each environment individually. Specifically, we set the proportional gain (Kp) to 1.0. This value was chosen as the environments, by default, accept actions with a magnitude of 1, representing a normalized torque input. To determine the optimal derivative gain (Kd), we leveraged MuJoCo's dampratio parameter, setting it to 1.0 (critically damped). We then empirically observed the Kd value that corresponds to this dampratio within the simulation. These tuned Kp and Kd values were used consistently throughout our experiments unless explicitly stated otherwise, providing a standardized and well-tuned PD baseline for comparison with EvoControl. This approach ensured that the PD controllers were appropriately configured for each environment's dynamics, providing a strong benchmark for evaluating the performance of learned low-level policies. The Brax continuous control environments are all publicly available from https://github.com/google/brax. |
| Dataset Splits | Yes | Evaluation. Unless otherwise stated we train each policy (high-level ρ and low-level β) for 1M high-level steps. Post-training, we evaluate performance using 128 rollouts (different random seeds) per trained policy, calculating the return for each 1,000-step episode. We repeat this process for three training seeds per baseline. Results are reported as the mean normalized score R (Yu et al., 2020) across all 384 evaluation rollouts (3 training seeds x 128 evaluation rollouts), scaled from 0 (random policy performance) to 100 (best non-EvoControl baseline), detailed in Appendix H. To reproduce this experiment, we used the Reacher 1D environment, as detailed in Appendix E.2. Specifically, to investigate the efficiency of exploration, we modified the Reacher 1D environment to have a deterministic goal across new random seeds, such that the goal location is q_goal = π/2.0, and the initial starting position is q = 0. In 25% of the episodes, a randomly positioned obstacle is introduced, which the arm must avoid. A contact force sensor is added to the observations, and a penalty is applied to the reward for any contact force exceeding a threshold. This encourages the development of low-level controllers capable of reacting quickly to avoid collisions. |
| Hardware Specification | Yes | All experiments were run on an NVIDIA H100 GPU (80 GB VRAM) with a 40-core CPU and 256 GB of RAM. |
| Software Dependencies | No | We use the standard PPO implementation (Schulman et al., 2017), specifically the implementation from PureJaxRL (Lu et al., 2022), a JAX (Bradbury et al., 2021) implementation of PPO. We used the fixed PPO hyper-parameters from PureJaxRL, which are derived from the PPO continuous-control environment parameters from CleanRL (Huang et al., 2022), which are themselves derived from those of Stable Baselines (Raffin et al., 2021). These hyper-parameters have been determined to provide good performance across a range of continuous-control environments. We use the implementation of PGPE provided by EvoJax (Tang et al., 2022), in JAX, and their recommended hyper-parameters for PGPE, which were empirically found to work well for continuous control tasks. |
| Experiment Setup | Yes | We used the fixed PPO hyper-parameters from PureJaxRL, which are derived from the PPO continuous-control environment parameters from CleanRL (Huang et al., 2022), which are themselves derived from those of Stable Baselines (Raffin et al., 2021). These parameters are specifically learning_rate=3e-4, num_envs=1024, num_steps=10 (number of environment steps per rollout), total_timesteps=1e6, update_epochs=4 (number of PPO update epochs per iteration), num_minibatches=8 (number of minibatches for each PPO update), gamma=0.99 (discount factor), gae_lambda=0.95 (Generalized Advantage Estimation parameter), clip_eps=0.2, ent_coef=0.0, vf_coef=0.5, and max_grad_norm=0.5 (gradient clipping threshold). For ES we use the Policy Gradients with Parameter-Based Exploration (PGPE) (Sehnke et al., 2010) algorithm to optimize the low-level policy β_ϕNN. The neural network's parameter vector ϕ is directly optimized. We use a population size of es_pop_size=512, and each individual is evaluated over es_rollouts=16 rollouts to estimate its fitness (episodic return R). Adam (Kingma & Ba, 2014) is used within PGPE, and we use the PGPE hyper-parameters of a center learning rate of 0.05 and a standard deviation learning rate of 0.1. We use es_sub_generations=8 generations per training section k. The parameter distribution's initial standard deviation is 0.1. We use the implementation of PGPE provided by EvoJax (Tang et al., 2022), in JAX, and their recommended hyper-parameters for PGPE, which were empirically found to work well for continuous control tasks. Furthermore, we set K = 8 per 1M high-level ρ steps used to train the high-level policy; this was empirically determined to work well in practice. |
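
The normalized score R described under Research Type and Dataset Splits (0 = random policy, 100 = best non-EvoControl baseline, following Yu et al., 2020) is a linear rescaling of raw returns. A minimal sketch follows; the function name and the illustrative return values are assumptions, not from the paper.

```python
import numpy as np

def normalized_score(returns, random_return, best_baseline_return):
    """Rescale raw episodic returns so the random policy maps to 0
    and the best non-EvoControl baseline maps to 100."""
    returns = np.asarray(returns, dtype=float)
    return 100.0 * (returns - random_return) / (best_baseline_return - random_return)

# Illustrative values only: 3 training seeds x 128 evaluation rollouts = 384 returns.
rng = np.random.default_rng(0)
raw_returns = rng.normal(loc=900.0, scale=50.0, size=3 * 128)
scores = normalized_score(raw_returns, random_return=0.0, best_baseline_return=1000.0)
mean_R = scores.mean()  # mean normalized score R over all 384 evaluation rollouts
```

Note that scores above 100 are possible under this scheme, since EvoControl can exceed the best non-EvoControl baseline.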
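
Algorithm 1's single high-level step can be sketched in plain Python: the high-level policy ρ emits one action a_k, which the low-level policy β holds fixed while running G = 16 simulation steps at 500 Hz (giving the 31.25 Hz high-level rate). The toy linear policies and dynamics below are assumptions for illustration, not the paper's networks.

```python
import numpy as np

G = 16          # low-level steps per high-level action (31.25 Hz over a 500 Hz sim)
DT = 0.002      # simulation timestep in seconds

def high_level_step(s_k, rho, beta, f):
    """One high-level step (Algorithm 1): rho picks a_k once, then the
    low-level policy beta runs for G simulation steps with a_k held fixed."""
    a_k = rho(s_k)              # high-level action
    s = s_k
    for _ in range(G):
        u = beta(s, a_k)        # low-level action conditioned on a_k
        s = f(s, u)             # advance the simulator by DT
    return s

# Toy stand-ins (illustrative assumptions, not the paper's policies):
rho = lambda s: -0.5 * s            # high-level target
beta = lambda s, a: a - s           # simple tracking controller
f = lambda s, u: s + DT * u         # single-integrator dynamics
s_final = high_level_step(np.array([1.0]), rho, beta, f)
```

The key design point this loop makes concrete is that ρ observes the state only every G steps, while β can react at every simulation step.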
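
The fixed PD baseline described under Open Datasets uses Kp = 1.0 and a Kd read out of MuJoCo's dampratio machinery at a damping ratio of 1.0. As a hedged sketch: for unit-mass joint dynamics, critical damping corresponds to the closed form kd = 2·sqrt(kp), which is an illustrative assumption here rather than the value MuJoCo derives per joint.

```python
import numpy as np

def pd_torque(q, q_dot, q_target, kp=1.0, kd=None):
    """Fixed PD controller: Kp = 1.0 matches the normalized torque range [-1, 1].
    For unit-mass dynamics, dampratio = 1.0 (critically damped) gives
    kd = 2 * sqrt(kp); the paper instead reads Kd from MuJoCo's simulation."""
    if kd is None:
        kd = 2.0 * np.sqrt(kp)
    return kp * (q_target - q) - kd * q_dot

# Usage: drive a unit mass from q = 1 to the origin at 500 Hz.
q, q_dot = 1.0, 0.0
dt = 0.002                              # simulation timestep, as in the benchmarks
for _ in range(5000):                   # 10 seconds of simulated time
    u = pd_torque(q, q_dot, q_target=0.0)
    q_dot += dt * u                     # unit mass: q_ddot = u
    q += dt * q_dot
```

With critical damping the joint converges to the target without oscillation, which is why this makes a strong fixed-controller baseline.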
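
The PPO hyper-parameters listed under Experiment Setup can be collected into a single config; the values are exactly those stated in the table, while the derived batch quantities below are implied arithmetic, not figures from the paper.

```python
# PPO hyper-parameters as listed in the table (PureJaxRL defaults).
ppo_config = dict(
    learning_rate=3e-4, num_envs=1024, num_steps=10, total_timesteps=1_000_000,
    update_epochs=4, num_minibatches=8, gamma=0.99, gae_lambda=0.95,
    clip_eps=0.2, ent_coef=0.0, vf_coef=0.5, max_grad_norm=0.5,
)

# Quantities implied by these values (derived here, not stated in the paper):
batch_size = ppo_config["num_envs"] * ppo_config["num_steps"]   # transitions per rollout batch
minibatch_size = batch_size // ppo_config["num_minibatches"]    # transitions per PPO minibatch
num_iterations = ppo_config["total_timesteps"] // batch_size    # PPO iterations per 1M steps
```

Each iteration thus collects 10,240 transitions across 1,024 parallel environments and splits them into eight minibatches of 1,280 for four update epochs.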
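
The ES side of the setup uses EvoJax's PGPE with a population of 512, 16 rollouts per individual, and learning rates of 0.05 (center) and 0.1 (standard deviation). The sketch below is a simplified, plain-gradient PGPE generation with symmetric (mirrored) sampling; it illustrates the idea under stated assumptions and is not the EvoJax implementation, which additionally uses Adam and its own gradient scaling.

```python
import numpy as np

def pgpe_generation(center, sigma, fitness_fn, pop_size=64,
                    lr_center=0.05, lr_sigma=0.1, rng=None):
    """One simplified PGPE generation (Sehnke et al., 2010 style):
    sample mirrored perturbation pairs, estimate score-function gradients
    for the Gaussian search distribution, and take a plain gradient step."""
    rng = np.random.default_rng(0) if rng is None else rng
    half = pop_size // 2
    eps = rng.normal(size=(half, center.size)) * sigma
    f_pos = np.array([fitness_fn(center + e) for e in eps])
    f_neg = np.array([fitness_fn(center - e) for e in eps])
    # Antithetic gradient estimate for the distribution mean.
    grad_center = (((f_pos - f_neg)[:, None] / 2.0) * eps / sigma**2).mean(axis=0)
    # Baseline-subtracted gradient estimate for the standard deviation.
    baseline = (f_pos + f_neg).mean() / 2.0
    grad_sigma = ((((f_pos + f_neg) / 2.0 - baseline)[:, None])
                  * (eps**2 - sigma**2) / sigma**3).mean(axis=0)
    return center + lr_center * grad_center, sigma + lr_sigma * grad_sigma

# Usage: maximize a toy quadratic fitness with optimum at 3.0 per dimension.
center, sigma = np.zeros(2), np.full(2, 0.1)   # initial std 0.1, as in the paper
fitness = lambda x: -np.sum((x - 3.0) ** 2)
for _ in range(100):
    center, sigma = pgpe_generation(center, sigma, fitness)
```

Because parameters are perturbed (rather than actions), PGPE explores coherently over whole episodes, which is one motivation for using it on the low-level policy.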