Student-Informed Teacher Training

Authors: Nico Messikommer, Jiaxu Xing, Elie Aljalbout, Davide Scaramuzza

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate our approach, we first test it in a maze setting where the teacher can choose a shortcut that is invisible to the student. Our method successfully adjusts the behavior of the teacher to take a sub-optimal route, accounting for the limited observability of the student. Additionally, we apply our framework to train a robot manipulator to open a drawer while minimizing self-occlusion in front of a camera. Finally, we demonstrate the effectiveness of our method in the complex task of quadrotor flight through obstacles. Overall, our method leads to substantially higher student returns, reducing the gap between the teacher and student across tasks, and to notable improvements in the student success rate in all considered environments. We compare our method to multiple baselines introduced below. We evaluate our student-informed teacher training framework on three diverse tasks: maze navigation in a tabular setting, vision-based obstacle avoidance with a quadrotor, and vision-based drawer opening using a robot arm. Finally, we perform ablations to understand the role of different aspects of our method.
Researcher Affiliation | Academia | Nico Messikommer*, Jiaxu Xing*, Elie Aljalbout, Davide Scaramuzza, Robotics and Perception Group, University of Zurich, Switzerland (*Equal contribution)
Pseudocode | Yes | A pseudocode description of the different training phases with the crucial steps is provided in Algorithm 1 ("Algorithm 1: Student-Informed Teacher Training").
Open Source Code | No | The project website is at https://rpg.ifi.uzh.ch/sitt/
Open Datasets | No | The paper describes custom environments built using public frameworks (Gym, Flightmare, Omniverse Isaac Gym) rather than specifying and providing access information for pre-existing, publicly available datasets. For instance, the 'Color Maze' environment was 'implemented... using the Gym framework', and the quadrotor task uses a 'customized RL training environment using Flightmare'.
Dataset Splits | No | The paper describes dynamic data generation through parallel environments and training runs, such as 'The training setup involves 1,000 parallel environments' and '256 runs with random initialization, each running for 1000 steps'. It does not specify fixed training/test/validation splits for a static dataset, as the environments generate data on the fly.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory specifications, or processor types used for running the experiments.
Software Dependencies | No | The paper mentions software frameworks like 'Gym framework (Brockman et al., 2016)', 'Stable Baselines3 (Raffin et al., 2021) based on Pytorch (Paszke et al., 2017)', and 'RL-Games framework (Makoviichuk & Makoviychuk, 2021)'. However, it does not provide specific version numbers for these software components, which is required for a reproducible description.
Experiment Setup | Yes | In the policy update phase, the teacher network is updated using a batch containing all 250,000 collected experiences. In addition to the default PPO gradients, we apply the KL-Divergence gradient with a weighting coefficient of 0.001. The entropy coefficient for PPO is set to 0.3. During each paired alignment step, the observations stored in the roll-out buffer are used as a single batch to update the imitated student and student networks over 20 iterations. The networks are optimized using the L1 loss between corresponding features, with network weights updated via the ADAM optimizer (Kingma & Ba, 2015). For the obstacle avoidance task... During the policy update phase, we use a minibatch size of 12,500. For the manipulation task... During the policy update phase, we use a minibatch size of 8,192. The KL-Divergence loss is weighted by 0.01 in the policy update... For our alignment training, the KL-Divergence is weighted by 0.05 and added to the task reward.
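The paired alignment step quoted above (20 iterations of L1 feature matching with Adam) can be sketched in PyTorch, the framework the paper reports using. This is a minimal illustration, not the authors' implementation: the network architectures, feature sizes, batch size, and learning rate below are placeholder assumptions, and only the iteration count, L1 loss, optimizer choice, and KL weighting coefficient come from the quoted setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for the teacher and imitated-student feature
# extractors; architectures and sizes here are illustrative only.
teacher = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 16))
imitated_student = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 16))

optimizer = torch.optim.Adam(imitated_student.parameters(), lr=1e-3)

# Stand-in for the observations stored in the roll-out buffer.
obs = torch.randn(256, 8)

# Paired alignment step: 20 iterations of L1 loss between corresponding
# features, optimized with Adam, as described in the experiment setup.
losses = []
for _ in range(20):
    with torch.no_grad():
        teacher_feat = teacher(obs)
    loss = F.l1_loss(imitated_student(obs), teacher_feat)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# In the policy update phase, the paper adds a weighted KL-Divergence term to
# the default PPO gradients; with the quoted coefficient this amounts to
#   total_loss = ppo_loss + 0.001 * kl_divergence
# (the PPO loss itself is omitted here for brevity).
```

In the paper's setup, this alignment uses the full roll-out buffer as a single batch; the KL coefficients (0.001 in the maze, 0.01 and 0.05 in the other tasks) would simply scale the extra term added to the PPO loss.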