Kinetix: Investigating the Training of General Agents through Open-Ended Physics-Based Control Tasks
Authors: Michael Matthews, Michael Beukman, Chris Lu, Jakob Foerster
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we take a step towards this goal by procedurally generating tens of millions of 2D physics-based tasks and using these to train a general reinforcement learning (RL) agent for physical control. ... Our trained agent exhibits strong physical reasoning capabilities in 2D space, being able to zero-shot solve unseen human-designed environments. Furthermore, fine-tuning this general agent on tasks of interest shows significantly stronger performance than training an RL agent tabula rasa. This includes solving some environments that standard RL training completely fails at. |
| Researcher Affiliation | Academia | Michael Matthews Michael Beukman Chris Lu Jakob Foerster FLAIR, University of Oxford |
| Pseudocode | Yes | Algorithm 1 Jax2D main engine loop. 1: while true do 2: Apply gravity 3: Calculate collision manifolds (Appendices A.3.1, A.3.2, A.3.3 and A.3.4) 4: Apply motors (Appendix A.5) 5: Apply thrusters (Appendix A.6) 6: if warm starting then 7: Apply warm starting collision impulses (Appendix A.7) 8: Apply warm starting joint impulses (Appendix A.7) 9: end if 10: for i = 1 to num solver steps do 11: Apply joint constraints (Appendices A.2 and A.4) 12: Apply collision constraints (Appendices A.2 and A.3.5) 13: end for 14: Euler step position and rotation 15: end while |
| Open Source Code | Yes | We provide full code and models at https://kinetix-env.github.io. https://github.com/MichaelTMatthews/Jax2D |
| Open Datasets | Yes | We provide the capability to sample random levels from the vast space of possible physics tasks, as well as providing a large set of 74 interpretable handmade levels. |
| Dataset Splits | Yes | We train on programmatically generated Kinetix levels drawn from the statically defined distribution. We refer to training on sampled levels from this distribution as DR. Our main metric of assessment is the solve rate on the set of handmade holdout levels. The agent does not train on these levels but they do exist inside the support of the training distribution. |
| Hardware Specification | Yes | For all comparisons we use a single NVIDIA L40S GPU, on a server with two AMD EPYC 9554 64-Core CPUs. |
| Software Dependencies | No | The paper mentions software like JAX (Bradbury et al., 2018) and PureJaxRL-style training (Lu et al., 2022), and algorithms like PPO (Schulman et al., 2017). However, it does not provide specific version numbers for these or any other software dependencies crucial for replication. |
| Experiment Setup | Yes | Hyperparameters are detailed in Appendix H. Table 7: Learning Hyperparameters. Env frame skip: 2; PPO γ: 0.995; λ_GAE: 0.9; PPO number of steps: 256; PPO epochs: 8; PPO minibatches per epoch: 32; PPO clip range: 0.02; PPO # parallel environments: 2048; Adam learning rate: 5e-5; anneal LR: no; PPO max gradient norm: 0.5; PPO value clipping: yes; return normalisation: no; value loss coefficient: 0.5; entropy coefficient: 0.01. Model: fully-connected dimension size: 128; fully-connected layers: 5; Transformer layers: 2; Transformer encoder size: 128; Transformer size: 16; number of heads: 8. SFL: batch size N: 12288; rollout length L: 512; update period T: 128; buffer size K: 1024; sample ratio ρ: 0.5. |
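The Jax2D engine loop quoted under "Pseudocode" (Algorithm 1) can be sketched in plain Python. This is a structural illustration only: the function and class names below (`BodyState`, `apply_gravity`, `engine_step`, etc.) are hypothetical placeholders, not the actual Jax2D API, and the collision, motor, thruster, and constraint stages are stubbed out so the control flow stays visible.

```python
from dataclasses import dataclass, replace

# Toy single-body state; real Jax2D states hold many bodies, joints, etc.
@dataclass(frozen=True)
class BodyState:
    y: float   # vertical position
    vy: float  # vertical velocity

GRAVITY = -9.8
DT = 1.0 / 60.0

def apply_gravity(s: BodyState) -> BodyState:
    # Step 2 of Algorithm 1: integrate gravity into velocity.
    return replace(s, vy=s.vy + GRAVITY * DT)

def euler_step(s: BodyState) -> BodyState:
    # Step 14 of Algorithm 1: Euler-integrate position from velocity.
    return replace(s, y=s.y + s.vy * DT)

def engine_step(state: BodyState, num_solver_steps: int = 4,
                warm_starting: bool = True) -> BodyState:
    state = apply_gravity(state)
    # Collision-manifold calculation, motors, and thrusters (steps 3-5)
    # are omitted in this single-body toy example.
    if warm_starting:
        pass  # steps 6-9: warm-start collision/joint impulses
    for _ in range(num_solver_steps):
        pass  # steps 10-13: iterative joint/collision constraint solves
    return euler_step(state)

# One step of the loop: a body starting at rest begins to fall.
state = engine_step(BodyState(y=1.0, vy=0.0))
```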
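For replication purposes, the Table 7 learning hyperparameters quoted above can be collected into a single config mapping. The values are transcribed from the excerpt; the dict keys are illustrative names of my own choosing, not identifiers from the paper's codebase.

```python
# PPO learning hyperparameters from Table 7 of the paper.
PPO_CONFIG = {
    "env_frame_skip": 2,
    "gamma": 0.995,               # PPO discount γ
    "gae_lambda": 0.9,            # λ_GAE
    "num_steps": 256,
    "epochs": 8,
    "minibatches_per_epoch": 32,
    "clip_range": 0.02,
    "num_parallel_envs": 2048,
    "learning_rate": 5e-5,        # Adam
    "anneal_lr": False,
    "max_grad_norm": 0.5,
    "value_clipping": True,
    "return_normalisation": False,
    "value_loss_coef": 0.5,
    "entropy_coef": 0.01,
}
```

A per-update batch under this config spans `num_parallel_envs * num_steps` = 2048 × 256 transitions, split into 32 minibatches for each of 8 epochs.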