Latent Action Learning Requires Supervision in the Presence of Distractors
Authors: Alexander Nikulin, Ilya Zisman, Denis Tarasov, Nikita Lyubaykin, Andrei Polubarov, Igor Kiselev, Vladislav Kurenkov
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using Distracting Control Suite (DCS) we empirically investigate the effect of distractors on latent action learning and demonstrate that LAPO struggles in such scenarios. We propose LAOM, a simple LAPO modification that improves the quality of latent actions by 8x, as measured by linear probing. Importantly, we show that providing supervision with ground-truth actions, as few as 2.5% of the full dataset, during latent action learning improves downstream performance by 4.2x on average. Our findings suggest that integrating supervision during Latent Action Model (LAM) training is critical in the presence of distractors, challenging the conventional pipeline of first learning a LAM and only then decoding from latent to ground-truth actions. |
| Researcher Affiliation | Collaboration | 1AIRI 2MIPT 3Skoltech 4Research Center for Trusted Artificial Intelligence, ISP RAS 5Innopolis University 6Accenture. Correspondence to: Alexander Nikulin <EMAIL>. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. It describes methods in text and uses architectural diagrams like Figure 4. |
| Open Source Code | Yes | We open-source the code at https://github.com/dunnolab/laom. |
| Open Datasets | Yes | As currently existing benchmarks with distractors (Stone et al., 2021; Ortiz et al., 2024) are not yet solved, we collect new datasets with custom difficulty, based on Distracting Control Suite (DCS) (Stone et al., 2021). DCS uses dynamic background videos, camera shaking and agent color change as distractors (see Figure 2 for visualization). The complexity is determined by the number of videos as well as the scale of the camera and color-change magnitudes. We empirically found that using 60 videos and a scale of 0.1 is the hardest setting in which BC can still recover expert performance. We collect datasets with five thousand trajectories for four tasks: cheetah-run, walker-run, hopper-hop and humanoid-walk, listed in order of increasing difficulty. See Appendix C for additional details. The datasets will be released together with the main code repository. |
| Dataset Splits | Yes | For each environment, we collected 5k trajectories, with an additional 50 trajectories for evaluation with novel distractor videos (from the evaluation set in the DCS). To assess scaling properties with different budgets of real actions, similar to Schmidt & Jiang (2023), we repeat this process for a variable number of labeled trajectories, from 2 to 128. |
| Hardware Specification | Yes | All experiments were run on H100 GPUs, in single-GPU mode and PyTorch bf16 precision with AMP. |
| Software Dependencies | No | All experiments were run on H100 GPUs, in single-GPU mode and PyTorch bf16 precision with AMP. For the visual encoder, we used ResNets from the open-source LAPO (Schmidt & Jiang, 2023) codebase, which also borrowed from baselines originally provided as part of the ProcGen 2020 competition. For the action decoder, we used a two-layer MLP with 256 hidden dimensions and ReLU activations. In contrast to the commonly used cosine similarity, we used MSE for the temporal consistency loss. We also found that projection heads degraded performance, so we did not use them. We use a slightly non-standard MLP for the latent IDM and FDM: we compose it from multiple MLP blocks inspired by the Transformer architecture (Vaswani, 2017) and condition on the latent action and observation representation at all layers instead of just the first. We have found that this greatly improves prediction, especially for latent actions. We also use ReLU6 activations instead of GELU, as it naturally bounds the activations, which helps with stability during training, similar to target networks in RL (Bhatt et al., 2019). Without supervision, we use an EMA target encoder. With supervision, we find that a simple stop-grad is sufficient to prevent any signs of collapse, a finding also reported by Schwarzer et al. (2020). For all experiments we use a cosine learning rate schedule with warmup. For hyperparameters see Appendix F. We open-source the code at https://github.com/dunnolab/laom. The text mentions libraries and frameworks like PyTorch, CleanRL, and stable-baselines3, but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | On hyperparameter tuning. We tune the hyperparameters based on online performance for BC, on MSE to real actions on the full dataset for IDM, and on final linear probe MSE to real actions for latent action learning. In more practical tasks, we usually do not have this luxury, but since we are interested in estimating the upper-bound performance of each method in a controlled setting, we believe that it is appropriate. For exact hyperparameters see Appendix F. F. Hyperparameters. Table 5. LAPO hyperparameters. We use the same hyperparameters for all experiments and explicitly mention any exceptions. Names exactly follow the configuration files used in the code. Latent actions learning: grad norm None; batch size 512; num epochs 10; frame stack 3; encoder deep False; weight decay None; encoder scale 6; learning rate 0.0001; warmup epochs 3; future obs offset 10; latent action dim 8192; encoder num res blocks 2. Latent behavior cloning: dropout 0.0; use aug False; batch size 512; num epochs 10; frame stack 3; encoder deep False; weight decay None; encoder scale 32; learning rate 0.0001; warmup epochs 0; encoder num res blocks 2. Latent actions decoding: use aug False; batch size 512; hidden dim 256; weight decay None; eval episodes 25; learning rate 0.0003; total updates 2500; warmup epochs 0.0. |
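The report repeatedly uses linear probing of latent actions against ground-truth actions (measured by MSE) as the quality metric for latent action learning. As a toy illustration of that metric only, here is an ordinary-least-squares probe for the one-dimensional case; the actual probe in the paper regresses high-dimensional latent actions onto continuous action vectors, and all names below are hypothetical:

```python
def fit_linear_probe(latents, actions):
    """Fit a 1-D linear probe (slope + intercept) by ordinary least squares."""
    n = len(latents)
    mean_x = sum(latents) / n
    mean_y = sum(actions) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(latents, actions))
    var = sum((x - mean_x) ** 2 for x in latents)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


def probe_mse(latents, actions, slope, intercept):
    """Mean squared error of the probe's predictions against real actions."""
    return sum(
        (slope * x + intercept - y) ** 2 for x, y in zip(latents, actions)
    ) / len(latents)


# Perfectly linear toy data: the probe recovers actions exactly, so MSE is ~0.
slope, intercept = fit_linear_probe([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

A low probe MSE indicates that ground-truth actions are (approximately) linearly decodable from the latent actions, which is the sense in which the paper reports an 8x quality improvement.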
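The software-dependencies cell notes that all experiments use a cosine learning-rate schedule with warmup (warmup epochs and learning rates appear in the Table 5 listing above). A minimal sketch of such a schedule, assuming linear warmup to the base rate followed by cosine decay to zero; the released code may implement the details differently:

```python
import math


def lr_at_step(step, total_steps, warmup_steps, base_lr):
    """Cosine learning-rate schedule with linear warmup.

    Ramps linearly from ~0 to base_lr over warmup_steps, then decays
    from base_lr down to 0 at total_steps following a half cosine.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Example: base_lr 1e-4 (as in the latent-action-learning stage),
# 100 warmup steps out of 1000 total.
peak = lr_at_step(100, 1000, 100, 1e-4)   # == base_lr at the end of warmup
final = lr_at_step(1000, 1000, 100, 1e-4)  # decays to ~0
```

In PyTorch this is typically expressed with `torch.optim.lr_scheduler.LambdaLR`; the standalone function above just makes the shape of the schedule explicit.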
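The report also states that without supervision an EMA (exponential moving average) target encoder is used, while with supervision a stop-gradient suffices. For readers unfamiliar with the former, here is a minimal pure-Python sketch of the EMA update applied to a flat parameter list; the mixing coefficient `tau` is illustrative and not taken from the paper:

```python
def ema_update(target_params, online_params, tau=0.005):
    """One EMA step: target <- (1 - tau) * target + tau * online.

    Small tau means the target encoder changes slowly, providing a
    stable regression target and helping prevent representation collapse.
    """
    return [
        (1.0 - tau) * t + tau * o
        for t, o in zip(target_params, online_params)
    ]


# With tau=0.1, the target moves 10% of the way toward the online params.
updated = ema_update([1.0, 0.0], [0.0, 1.0], tau=0.1)
```

In practice the same update is applied tensor-by-tensor to the target encoder's weights after each optimizer step on the online encoder.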