Student-Informed Teacher Training

Authors: Nico Messikommer, Jiaxu Xing, Elie Aljalbout, Davide Scaramuzza

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate our approach, we first test it in a maze setting where the teacher can choose a shortcut that is invisible to the student. Our method successfully adjusts the behavior of the teacher to take a sub-optimal route, accounting for the limited observability of the student. Additionally, we apply our framework to train a robot manipulator to open a drawer while minimizing self-occlusion in front of a camera. Finally, we demonstrate the effectiveness of our method in the complex task of quadrotor flight through obstacles. Overall, our method leads to substantially higher student returns, reducing the gap between the teacher and student across tasks, and to notable improvements in the student success rate in all considered environments. We compare our method to multiple baselines introduced below. We evaluate our student-informed teacher training framework on three diverse tasks: maze navigation in a tabular setting, vision-based obstacle avoidance with a quadrotor, and vision-based drawer opening using a robot arm. Finally, we perform ablations to understand the role of different aspects of our method.
Researcher Affiliation | Academia | Nico Messikommer*, Jiaxu Xing*, Elie Aljalbout, Davide Scaramuzza, Robotics and Perception Group, University of Zurich, Switzerland (*Equal contribution)
Pseudocode | Yes | A pseudocode description of the different training phases with the crucial steps is provided in Algorithm 1 ("Algorithm 1: Student-Informed Teacher Training").
Open Source Code | No | The project website is at https://rpg.ifi.uzh.ch/sitt/
Open Datasets | No | The paper describes custom environments built using public frameworks (Gym, Flightmare, Omniverse Isaac Gym) rather than specifying and providing access information for pre-existing, publicly available datasets. For instance, the 'Color Maze' environment was 'implemented... using the Gym framework', and the quadrotor task uses a 'customized RL training environment using Flightmare'.
Dataset Splits | No | The paper describes dynamic data generation through parallel environments and training runs, such as 'The training setup involves 1,000 parallel environments' and '256 runs with random initialization, each running for 1000 steps'. It does not specify fixed training/test/validation splits for a static dataset, as the environments generate data on the fly.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory specifications, or processor types used for running the experiments.
Software Dependencies | No | The paper mentions software frameworks like 'Gym framework (Brockman et al., 2016)', 'Stable Baselines3 (Raffin et al., 2021) based on Pytorch (Paszke et al., 2017)', and 'RL-Games framework (Makoviichuk & Makoviychuk, 2021)'. However, it does not provide specific version numbers for these software components, which is required for a reproducible description.
Experiment Setup | Yes | In the policy update phase, the teacher network is updated using a batch containing all 250,000 collected experiences. In addition to the default PPO gradients, we apply the KL-Divergence gradient with a weighting coefficient of 0.001. The entropy coefficient for PPO is set to 0.3. During each paired alignment step, the observations stored in the roll-out buffer are used as a single batch to update the imitated student and student networks over 20 iterations. The networks are optimized using the L1 loss between corresponding features, with network weights updated via the ADAM optimizer (Kingma & Ba, 2015). For the obstacle avoidance task... During the policy update phase, we use a minibatch size of 12,500. For the manipulation task... During the policy update phase, we use a minibatch size of 8,192. The KL-Divergence loss is weighted by 0.01 in the policy update... For our alignment training, the KL-Divergence is weighted by 0.05 and added to the task reward.
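The paired alignment step quoted above (20 iterations of L1 feature matching with Adam) can be sketched in PyTorch, the framework the paper reports using. This is a minimal illustration, not the authors' implementation: the network architectures, feature sizes, batch size, and learning rate below are placeholder assumptions, and only the iteration count, L1 loss, optimizer choice, and KL weighting coefficient come from the quoted setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for the teacher and imitated-student feature
# extractors; architectures and sizes here are illustrative only.
teacher = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 16))
imitated_student = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 16))

optimizer = torch.optim.Adam(imitated_student.parameters(), lr=1e-3)

# Stand-in for the observations stored in the roll-out buffer.
obs = torch.randn(256, 8)

# Paired alignment step: 20 iterations of L1 loss between corresponding
# features, optimized with Adam, as described in the experiment setup.
losses = []
for _ in range(20):
    with torch.no_grad():
        teacher_feat = teacher(obs)
    loss = F.l1_loss(imitated_student(obs), teacher_feat)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# In the policy update phase, the paper adds a weighted KL-Divergence term to
# the default PPO gradients; with the quoted coefficient this amounts to
#   total_loss = ppo_loss + 0.001 * kl_divergence
# (the PPO loss itself is omitted here for brevity).
```

In the paper's setup, this alignment uses the full roll-out buffer as a single batch; the KL coefficients (0.001 in the maze, 0.01 and 0.05 in the other tasks) would simply scale the extra term added to the PPO loss.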