Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Towards Learning to Imitate from a Single Video Demonstration
Authors: Glen Berseth, Florian Golemo, Christopher Pal
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our approach on simulated humanoid, dog, and raptor agents in 2D and quadruped and humanoid agents in 3D. We show that our method outperforms current state-of-the-art techniques and can learn to imitate behaviours from a single video demonstration. ... We perform experiments for multiple simulated robots in both 2D and 3D, including simulations for training quadruped robots and a humanoid with 38 degrees of freedom (DoF). |
| Researcher Affiliation | Collaboration | Glen Berseth Université de Montréal, Mila Quebec AI Institute, Institut Courtois, and Canada CIFAR AI Chair EMAIL Florian Golemo Mila Quebec AI Institute EMAIL Christopher Pal Polytechnique Montréal, Mila Quebec AI Institute, ServiceNow Research, and Canada CIFAR AI Chair EMAIL |
| Pseudocode | Yes | Algorithm 1 VIRL |
| Open Source Code | No | Additionally, the dog, raptor, zombie walk, run and jumping policy can be found on the project website: https://sites.google.com/view/virl1. ... The resulting behaviours learned in simulation are available at: https://sites.google.com/view/virl1. (The paper only mentions that videos of learned behaviors and policies are available on the project website, not the source code itself.) |
| Open Datasets | Yes | We are using the mocap data from the CMU Graphics Lab Motion Capture Database from 2002 (http://mocap.cs.cmu.edu/). |
| Dataset Splits | No | We collect 2048 samples between training rounds. The batch size for TRPO is 2048. ... The data used to train the Siamese network is a combination of observation trajectories O = {o_0, ..., o_T} generated from simulating the agent in the environment and the demonstration. (The paper describes data collection for RL training and for training the Siamese network, but it does not specify explicit training, validation, or test splits for any dataset.) |
| Hardware Specification | Yes | It takes 5–7 days to train each policy in these results on a 16-core machine with an Nvidia GTX 1080 GPU. |
| Software Dependencies | No | We train the agent's policy using the trust-region policy optimization (TRPO) algorithm (Schulman et al., 2015). ... The image encoder convnet is φ, the image decoder ψ, the recurrent encoder ω, and the recurrent decoder ρ. The weights for λ are found by empirically evaluating VIRL over all environments from section 5. Additional details on the hyperparameter search can be found in subsection 7.8. (The paper mentions the TRPO algorithm and uses neural network components, but does not specify software dependencies with version numbers such as Python, PyTorch, TensorFlow, or specific library versions.) |
| Experiment Setup | Yes | Where the relative weights of the different terms are λ_{1:4} = {0.7, 0.1, 0.1, 0.1}... The batch size for TRPO is 2048. The KL term is 0.5. ... The margin ρ is set to 1... We normalize the distance metric outputs using r = exp(r² w_d), where w_d = 5.0 scales the filtering width. |
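The quoted distance normalization appears garbled by extraction; a minimal sketch of the likely intent is below, *assuming* the formula is the common exponentiated negative squared distance, r = exp(−w_d · d²), with the paper's reported width w_d = 5.0. The function name and sign convention are this sketch's assumptions, not the paper's exact notation.

```python
import math

def distance_to_reward(d: float, w_d: float = 5.0) -> float:
    """Map a distance-metric output d to a bounded imitation reward.

    Assumption: the garbled quote "r = exp(r2 wd)" is read as the
    standard exponentiated negative squared distance,
    r = exp(-w_d * d**2), where w_d scales the filtering width
    (w_d = 5.0 per the quoted experiment setup).
    """
    return math.exp(-w_d * d ** 2)

# Identical observations (d = 0) yield the maximum reward of 1.0;
# the reward decays smoothly toward 0 as the distance grows.
print(distance_to_reward(0.0))  # -> 1.0
print(distance_to_reward(1.0))  # -> exp(-5) ~ 0.0067
```

Under this reading, w_d acts as a bandwidth: larger values make the reward more sharply peaked around exact matches between agent and demonstration observations.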