Data Augmentation for Instruction Following Policies via Trajectory Segmentation

Authors: Niklas Hoepner, Ilaria Tiddi, Herke van Hoof

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our results in a game environment and a simulated robotic gripper setting underscore the importance of segmentation: randomly sampled segments diminish performance, while incorporating labelled segments from PS improves policy performance to the level of a policy trained on twice the amount of labelled data. The goal of the evaluation is to assess the capability of the different segmentation models to extract labelled segments from the play trajectories that can be used for data augmentation in an imitation-learning context. In Section 4.1 the two evaluation environments are introduced, and in Section 4.2 the importance of segmentation for successful data augmentation is highlighted. In Section 4.3 we compare the downstream-policy performance resulting from the different segmentation models and investigate the reasons for the performance differences. We measure the quality of the labelled segmentation via the accuracy of the assigned labels as well as the precision and recall of the segmentation points. While extracted labelled segments from TriDet have a negative impact on policy performance, data augmentation via Play Segmentation improves performance beyond the level achieved with twice the amount of labelled data (see Table 3). The segments extracted by UnLoc have little effect on the policies' performance. Table 4 shows the results of the data augmentation.
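The boundary metrics mentioned above (precision and recall of the segmentation points) can be made concrete with a short sketch. This is an illustration, not the authors' code: the match-within-tolerance rule is an assumption, since the review does not quote the exact matching criterion used in the paper.

```python
def boundary_precision_recall(predicted, gold, tolerance=0):
    """Precision and recall of predicted segmentation points against gold ones.

    A predicted boundary counts as a hit if it lies within `tolerance`
    timesteps of a not-yet-matched gold boundary (tolerance=0 means exact
    match). Each gold boundary can be matched at most once.
    """
    gold_left = list(gold)
    hits = 0
    for p in predicted:
        # Find the first unmatched gold boundary close enough to p.
        match = next((g for g in gold_left if abs(p - g) <= tolerance), None)
        if match is not None:
            hits += 1
            gold_left.remove(match)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall
```

Label accuracy would then be computed separately over the segments whose boundaries were matched.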
Researcher Affiliation Academia ¹University of Amsterdam, ²Vrije Universiteit Amsterdam, EMAIL, EMAIL, EMAIL
Pseudocode No The paper describes the approach using a probabilistic model factorisation and dynamic-programming recursion equations, but it does not present these in a clearly labelled 'Pseudocode' or 'Algorithm' block. For example: 'We can use dynamic programming (DP) to find α_{0:T−1} via the recursion: max_{α_{0:T−1}} log p_{θSeg}(α_{0:T−1} | o_{0:T}) = max_{i ∈ {0,…,T−1}} ( max_{α_{0:i}} log p_{θSeg}(α_{0:i} | o_{0:i+1}) + log p_{θSeg}(α_{i+1:T−1} = (0,…,1) | o_{i+1:T}) ).' This is a mathematical description, not a pseudocode block.
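The DP recursion quoted above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `seg_logprob(i, j)` is a hypothetical stand-in for the segmentation model's score log p_{θSeg} that observations o_i..o_j form one complete labelled segment.

```python
import math


def best_segmentation(T, seg_logprob):
    """Most likely segmentation of a trajectory o_0..o_T into labelled segments.

    seg_logprob(i, j) is assumed to return the log-probability that o_i..o_j
    is one complete labelled segment. Returns the best total log-probability
    and the recovered segmentation points.
    """
    best = [-math.inf] * (T + 1)  # best[j]: best score for segmenting o_0..o_j
    back = [0] * (T + 1)          # backpointer to recover the boundaries
    best[0] = 0.0
    for j in range(1, T + 1):
        for i in range(j):        # try o_i..o_j as the last segment
            score = best[i] + seg_logprob(i, j)
            if score > best[j]:
                best[j], back[j] = score, i
    # Walk the backpointers to recover the segmentation points.
    cuts, j = [], T
    while j > 0:
        cuts.append(j)
        j = back[j]
    return best[T], sorted(cuts)
```

With a constant per-segment score the DP prefers as few segments as possible; a score that penalises segment length pushes it toward many short segments.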
Open Source Code Yes Code https://github.com/NikeHop/PlaySegmentationAAAI2025
Open Datasets Yes Environments BabyAI (Chevalier-Boisvert et al. 2019) is a grid-based environment with a range of difficulty levels designed to test instruction-following agents. CALVIN (Mees et al. 2022) is a dataset containing play trajectories of a simulated 7-DOF Franka Emika Panda robot arm acting in a tabletop environment (Figure 3).
Dataset Splits Yes To assess the policy's performance with varying levels of annotated data, we subsample the annotated dataset for both environments into subsets of 10%, 25%, and 50%. Subsequently, the policy is trained on each subset. For the BabyAI environment we start with a subset consisting of 10% of the labelled dataset, and for the CALVIN environment with a subset consisting of 25% of the labelled dataset.
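The subsampling protocol described above amounts to drawing 10%/25%/50% subsets of the annotated data. A minimal sketch, assuming the subsets are nested (one shuffle, then prefixes); the paper may instead sample each subset independently:

```python
import random


def subsample_splits(annotated, fractions=(0.10, 0.25, 0.50), seed=0):
    """Subsample an annotated dataset into subsets of the given fractions.

    Shuffling once and taking prefixes makes smaller subsets nested inside
    the larger ones, so each policy sees a superset of the data seen by the
    policy trained on the next-smaller split.
    """
    rng = random.Random(seed)
    order = list(annotated)
    rng.shuffle(order)
    return {f: order[: int(len(order) * f)] for f in fractions}
```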
Hardware Specification No The paper mentions experiments in a 'game environment' and a 'simulated robotic gripper setting', but it does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run these simulations or train the models.
Software Dependencies No The paper references various models and architectures like 'CLIP embeddings (Radford et al. 2021)', 'I3D architecture', 'UnLoc (Yan et al. 2023)', and 'TriDet (Shi et al. 2023)'. However, it does not specify any software libraries or frameworks with their version numbers (e.g., Python version, PyTorch version, TensorFlow version) that were used for implementation or experimentation.
Experiment Setup No The paper describes evaluation metrics and general training methodologies, such as 'policy is trained via imitation learning' and 'multi-context imitation learning (MCIL)'. It specifies that 'we evaluate policies by measuring the percentage of tasks solved within 25 timesteps over 512 episodes' and 'The final evaluation score is the average number of instructions completed, computed over 1000 sequences'. However, it does not provide specific hyperparameters such as learning rates, batch sizes, number of epochs, or details about the optimizer used in the main text.
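The quoted evaluation protocol ('percentage of tasks solved within 25 timesteps over 512 episodes') reduces to a simple success-rate loop. A sketch, where `run_episode` is a hypothetical stand-in for rolling the policy out on one sampled instruction:

```python
def evaluate_policy(run_episode, n_episodes=512, max_steps=25):
    """Fraction of tasks solved within max_steps, averaged over n_episodes.

    run_episode(max_steps) should roll out the policy on one sampled
    instruction and return True iff the task is solved within the budget.
    """
    solved = sum(bool(run_episode(max_steps)) for _ in range(n_episodes))
    return solved / n_episodes
```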