Zero-Shot Whole-Body Humanoid Control via Behavioral Foundation Models

Authors: Andrea Tirinzoni, Ahmed Touati, Jesse Farebrother, Mateusz Guzek, Anssi Kanervisto, Yingchen Xu, Alessandro Lazaric, Matteo Pirotta

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of FB-CPR in a challenging humanoid control problem. Training FB-CPR online with observation-only motion capture datasets, we obtain the first humanoid behavioral foundation model that can be prompted to solve a variety of whole-body tasks, including motion tracking, goal reaching, and reward optimization. The resulting model is capable of expressing human-like behaviors and it achieves competitive performance with task-specific methods while outperforming state-of-the-art unsupervised RL and model-based baselines.
Researcher Affiliation | Collaboration | 1 Fundamental AI Research at Meta, 2 Mila, McGill University, 3 UCL
Pseudocode | Yes | In Alg. 1 we provide a detailed pseudo-code of FB-CPR, including how all losses are computed. Following Touati et al. (2023), we add two regularization losses to improve FB training: an orthonormality loss pushing the covariance Σ_B = E[B(s)B(s)ᵀ] of B towards the identity, and a temporal-difference loss pushing F(s, a, z)ᵀz toward the action-value function of the corresponding reward B(s)ᵀΣ_B⁻¹z. The former helps ensure that B is well-conditioned and does not collapse, while the latter makes F spend more capacity on the directions in z-space that matter for policy optimization.
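The orthonormality regularizer quoted above can be illustrated in isolation. Below is a minimal NumPy sketch, assuming the embeddings B(s) for a mini-batch are stacked into a (batch, d) matrix; the function name and setup are illustrative, not the paper's released implementation.

```python
import numpy as np

def orthonormality_loss(B):
    """Squared Frobenius penalty pushing the empirical covariance of the
    embeddings B(s) towards the identity (illustrative helper, not the
    paper's released code)."""
    batch, d = B.shape
    cov = B.T @ B / batch          # empirical Sigma_B = E[B(s) B(s)^T]
    return float(np.sum((cov - np.eye(d)) ** 2))

# A batch whose empirical covariance is exactly the identity incurs
# (numerically) zero loss, while a generic batch is penalized.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(256, 8)))  # orthonormal columns
B_ortho = Q * np.sqrt(256)                      # scaled so B^T B / 256 = I
B_rand = rng.normal(size=(256, 8))
```

Minimizing this penalty keeps B well-conditioned, which is exactly the "does not collapse" property the quote describes: a collapsed B would make Σ_B rank-deficient and the loss large.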
Open Source Code | Yes | Code, models, and an interactive demo are available at https://metamotivo.metademolab.com.
Open Datasets | Yes | we use the AMASS dataset (Mahmood et al., 2019), a large collection of uncurated motion capture data, for regularization.
Dataset Splits | Yes | After a 10% train-test split, we obtained a train dataset M of 8902 motions and a test dataset M_test of 990 motions, with a total duration of approximately 29 hours and 3 hours, respectively (see Tab. 2 in App. C.2).
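The reported split arithmetic is easy to check. The sketch below assumes a uniform random 10% hold-out over the 9892 retained motions; `split_motions` is a hypothetical helper, and the paper's actual split procedure and seed are not specified here.

```python
import math
import random

def split_motions(motion_ids, test_frac=0.10, seed=0):
    # Hypothetical reconstruction of a uniform 10% hold-out split;
    # not the paper's actual procedure.
    ids = list(motion_ids)
    random.Random(seed).shuffle(ids)
    n_test = math.ceil(len(ids) * test_frac)
    return ids[n_test:], ids[:n_test]  # (train, test)

# 9892 retained motions -> 8902 train / 990 test, matching the counts above.
train_ids, test_ids = split_motions(range(9892))
print(len(train_ids), len(test_ids))  # 8902 990
```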
Hardware Specification | No | No specific hardware details (such as GPU/CPU models or counts) are provided in the paper for the experimental setup.
Software Dependencies | No | The paper mentions software like MuJoCo and dm_control, and uses algorithms such as TD3 and the Adam optimizer, but does not provide specific version numbers for these dependencies. For example: 'The simulation is performed using MuJoCo (Todorov et al., 2012) at 450 Hz, while the control frequency is 30 Hz.' and 'Unless otherwise stated we use the Adam optimizer (Kingma & Ba, 2015)'.
Experiment Setup | Yes | We use a replay buffer of capacity 5M transitions and update agents by sampling mini-batches of 1024 transitions. During online training, we interleave a rollout phase, where we collect 500 transitions across 50 parallel environments, with a model update phase, where we update each network 50 times. The paper also includes Table 3, 'Summary of general training parameters,' which specifies 'Number of environment steps 30M' and 'Discount factor 0.98,' and Table 9, 'Hyperparameters used for FB-CPR pretraining,' detailing parameters like 'z dimension d = 256' and 'Learning rate for F 10^-4'.
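The quoted schedule pins down the update-to-environment-step ratio. A back-of-the-envelope sketch using the stated constants (the constant names are illustrative, taken from the reported values):

```python
# Back-of-the-envelope check of the reported training schedule.
NUM_ENV_STEPS = 30_000_000   # Table 3: number of environment steps
ROLLOUT_TRANSITIONS = 500    # transitions collected per rollout phase
NUM_ENVS = 50                # parallel environments
UPDATES_PER_PHASE = 50       # gradient updates per model-update phase

phases = NUM_ENV_STEPS // ROLLOUT_TRANSITIONS              # rollout/update cycles
steps_per_env_per_phase = ROLLOUT_TRANSITIONS // NUM_ENVS  # steps each env takes
total_updates = phases * UPDATES_PER_PHASE                 # gradient updates overall
print(phases, steps_per_env_per_phase, total_updates)  # 60000 10 3000000
```

So each parallel environment advances 10 steps per phase, and over the full 30M-step run the model receives one gradient update per 10 environment steps.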