Control-oriented Clustering of Visual Latent Representation

Authors: Han Qi, Haocheng Yin, Heng Yang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate the prevalent emergence of a similar law of clustering in the visual representation space. Particularly, we pretrain the vision encoder using NC as a regularization to encourage control-oriented clustering of the visual features. Surprisingly, such an NC-pretrained vision encoder, when finetuned end-to-end with the action decoder, boosts the test-time performance by 10% to 35%. Real-world vision-based planar pushing experiments confirmed the surprising advantage of control-oriented visual representation pretraining.
Researcher Affiliation | Academia | ¹School of Engineering and Applied Sciences, Harvard University; ²Department of Computer Science, ETH Zürich
Pseudocode | No | The paper describes methods and processes in detail, but does not present any formal pseudocode or algorithm blocks with structured, code-like formatting.
Open Source Code | No | The paper does not provide explicit links to its own source code repository or a clear statement about releasing its code. While it links to a general lab page ("https://computationalrobotics.seas.harvard.edu/Control_Oriented_NC"), this is not an explicit code repository for the described methodology. It also references third-party libraries like "rl_zoo3 (Raffin, 2020)", but this is not the authors' own implementation code for their methodology.
Open Datasets | Yes | We study the instantiation of this architecture in three tasks: Lunar Lander from OpenAI Gym (Brockman, 2016), Planar Pushing that is popular in robotics (Chi et al., 2023), and Block Stacking from MimicGen (Mandlekar et al., 2023). ... block stacking is implemented as one of the manipulation tasks in the MimicGen dataset using the robosuite framework (Zhu et al., 2020) backed by MuJoCo (Todorov et al., 2012). To train the behavior cloning pipeline, we used the dataset core stack_d0 provided by MimicGen, which contains N = 1000 demos as our expert demonstrations. ... we created another visually challenging setup where 11 distracting objects and 2 well-known paintings are placed in the workspace (see Fig. 10(b)). For this challenging setup, we collected 200 expert demonstrations because 100 demonstrations were insufficient to train robust policies.
Dataset Splits | Yes | We collect 3N samples (a.k.a. expert demonstrations) from the optimal policy π(x); each batch of N samples has label +1, 0, and −1, respectively. We use N = 5000. ... We collect N = 500 expert demonstrations on a push-T setup... This provides M = 55,480 training samples. ... We define our evaluation metric as the ratio of the overlapping area between the object and the target to the total area of the target. ... The four trained models are evaluated on a test push-T dataset with 100 tasks. ... we used the dataset core stack_d0 provided by MimicGen, which contains N = 1000 demos as our expert demonstrations. This provides M = 107,590 training samples. ... We tested both models on a set of 10 new push-T tasks where the T block is initialized at positions not seen during training. ... We trained two policies using only 50 demonstrations. Fig. 26 and 27 show the results for all 10 tests.
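The evaluation metric quoted above (overlap area between object and target, divided by the total target area) can be sketched on rasterized occupancy masks. This is a minimal illustration, not the authors' evaluation code; how the masks are rendered from the workspace is task-specific and omitted here.

```python
import numpy as np

def coverage_ratio(object_mask: np.ndarray, target_mask: np.ndarray) -> float:
    """Ratio of the object/target overlap area to the total target area.

    Both inputs are boolean occupancy masks of the same shape, e.g.
    rendered from a top-down view of the push-T workspace.
    """
    target_area = target_mask.sum()
    if target_area == 0:
        return 0.0
    overlap = np.logical_and(object_mask, target_mask).sum()
    return float(overlap) / float(target_area)

# Toy example: a target region of 8 pixels, half covered by the object.
target = np.zeros((4, 4), dtype=bool)
target[:2, :] = True   # 8 target pixels
obj = np.zeros((4, 4), dtype=bool)
obj[0, :] = True       # covers 4 of the 8
print(coverage_ratio(obj, target))  # 0.5
```

A score of 1.0 means the target is fully covered by the object; the paper uses this ratio to score each of the 100 test push-T tasks.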
Hardware Specification | No | The paper mentions a "Franka Panda robotic arm" and an "Intel RealSense camera" as part of the real-world setup, but these are components of the robotic system or sensing, not the computational hardware used for training or inference. It also mentions a "GPU limitation", but no specific GPU models or other computational hardware specifications are provided.
Software Dependencies | Yes | We used the trained PPO policy to collect 500 demos as our expert training dataset. We applied the default optimal network architecture and training hyperparameters from the popular RL training library rl_zoo3 (Raffin, 2020) to train the PPO. ... computer interface provided by pymunk (Blomqvist, 2024)). This provides M = 55,480 training samples. We train four different instantiations of the image-based control pipeline: using ResNet or DINOv2 as the vision encoder, and Diffusion Model (DM) or LSTM as the action decoder. ... The block stacking is implemented as one of the manipulation tasks in the MimicGen dataset using the robosuite framework (Zhu et al., 2020) backed by MuJoCo (Todorov et al., 2012).
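The rl_zoo3 workflow referenced above can be invoked roughly as follows. This is a sketch based on rl_zoo3's standard command-line interface, not the authors' scripts; the environment id and the rollout flags are assumptions inferred from the Gym task named in the paper.

```shell
# Install the RL training zoo (pulls in stable-baselines3 and Gym).
pip install rl_zoo3

# Train PPO on Lunar Lander using the zoo's tuned default
# hyperparameters, as the paper describes.
python -m rl_zoo3.train --algo ppo --env LunarLander-v2

# Roll out the trained policy; the paper collects 500 expert demos
# from it (the timestep budget here is illustrative, not from the paper).
python -m rl_zoo3.enjoy --algo ppo --env LunarLander-v2 -n 5000
```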
Experiment Setup | Yes | We design a six-layer MLP (2 → 64 → 64 → 64 → 64 → 3 → 3) and train it with the cross-entropy loss. We use N = 5000. For the u = 0 class, we repeat N times the sample x = 0 to balance the dataset. ... We use a ResNet18 model as the vision encoder and use an MLP model to map the 512-dimensional latent embedding to 64 dimensions. We use a sequence of 4 input images as observation input and predict the next action. The action decoder is structured as another MLP network with 6 layers and maps the image latent embeddings to predicted actions, for which the cross-entropy loss is computed against the ground-truth discrete actions. We train the model for 600 epochs. ... The four trained models are evaluated on a test push-T dataset with 100 tasks. ... We first pretrain the same ResNet using NC regularization (i.e., minimize the NC metrics and encourage the visual features to cluster according to control-oriented labels) for 50 epochs, and then end-to-end train the pretrained ResNet+DM pipeline from expert demonstrations for another 50 epochs. During NC-regularized pretraining, because all the data samples cannot fit into one batch due to GPU memory limitations, we use the largest possible batch size and divide the randomly shuffled dataset into 4 batches in every pretraining epoch. We calculate the NC regularization for each batch following this NC loss: 0.1 · CDNV + 10 · (STDNorm + STDAngle).
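The per-batch NC loss quoted above (0.1 · CDNV + 10 · (STDNorm + STDAngle)) can be sketched as follows. The weighting follows the quote; the exact metric definitions used here (class-distance normalized variance, and within-class standard deviations of feature norms and of angles to the class mean) are assumptions drawn from the neural-collapse literature, not the authors' implementation.

```python
import numpy as np

def nc_regularizer(features: np.ndarray, labels: np.ndarray) -> float:
    """Sketch of an NC-style clustering loss: 0.1*CDNV + 10*(STDNorm + STDAngle).

    features: (N, d) latent embeddings from the vision encoder.
    labels:   (N,) control-oriented class ids used for clustering.
    """
    classes = np.unique(labels)
    means, variances = [], []
    std_norm = std_angle = 0.0
    for c in classes:
        f = features[labels == c]
        mu = f.mean(axis=0)
        means.append(mu)
        # Within-class variance: mean squared distance to the class mean.
        variances.append(((f - mu) ** 2).sum(axis=1).mean())
        # Spread of feature norms within the class.
        norms = np.linalg.norm(f, axis=1)
        std_norm += norms.std()
        # Spread of angles between each feature and its class-mean direction.
        cosines = (f / norms[:, None]) @ (mu / np.linalg.norm(mu))
        std_angle += np.arccos(np.clip(cosines, -1.0, 1.0)).std()
    k = len(classes)
    # CDNV averaged over all class pairs: pooled within-class variance
    # normalized by the squared distance between class means.
    cdnv, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            d2 = ((means[i] - means[j]) ** 2).sum()
            cdnv += (variances[i] + variances[j]) / (2.0 * d2)
            pairs += 1
    cdnv /= max(pairs, 1)
    return 0.1 * cdnv + 10.0 * (std_norm / k + std_angle / k)

# Tightly clustered features should score lower than loosely clustered ones.
labels = np.array([0, 0, 0, 1, 1, 1])
tight = np.array([[10.0, 0], [10.1, 0], [9.9, 0], [0, 10.0], [0, 10.1], [0, 9.9]])
loose = np.array([[10.0, 0], [12.0, 0], [8.0, 0], [0, 10.0], [0, 12.0], [0, 8.0]])
```

In the paper this quantity is minimized as a regularizer while pretraining the ResNet encoder, so that features cluster by control-oriented label before end-to-end finetuning with the action decoder.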