Agent-to-Sim: Learning Interactive Behavior Models from Casual Longitudinal Videos

Authors: Gengshan Yang, Andrea Bajcsy, Shunsuke Saito, Angjoo Kanazawa

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 EXPERIMENTS. We present Agent-to-Sim (ATS), a framework for learning interactive behavior models of 3D agents from casual longitudinal video collections... We demonstrate results on animals given monocular RGBD videos captured by a smartphone. ... We evaluate camera registration using GT cameras estimated from annotated 2D correspondences. ... Our method outperforms both the multi-video and single-video versions of Total Recon in terms of depth accuracy and LPIPS, due to the ability of leveraging multiple videos.
Researcher Affiliation | Collaboration | 1 Codec Avatar Labs, Meta; 2 Carnegie Mellon University; 3 UC Berkeley
Pseudocode | No | The paper describes the methods in narrative text and mathematical equations, such as Equation 7 for the score function and Equation 8 for training, but does not present any distinct pseudocode or algorithm blocks.
Open Source Code | No | Project page: gengshan-y.github.io/agent2simwww/. This is a project page or high-level overview, not a direct link to a source-code repository for the described methodology.
Open Datasets | No | Dataset. We collect a dataset that emphasizes interactions of an agent with the environment and the observer. As shown in Tab. 2, it contains RGBD iPhone video collections of 4 agents in 3 different scenes... The paper describes collecting its own dataset but provides no information regarding its public availability or access.
Dataset Splits | Yes | We use the cat dataset for quantitative evaluation, where the data are split into a training set of 22 videos and a test set of 1 video.
Hardware Specification | Yes | 8 A100 GPUs are used to optimize 23 videos of the cat data, and 1 A100 GPU is used in a 2-3 video setup (for dog, bunny, and human). ... Training takes 10 hours on a single A100 GPU.
Software Dependencies | No | We extract frames at 10 FPS and compute augmented image measurements, including object segmentation (Yang et al., 2023b), optical flow (Yang & Ramanan, 2019), DINOv2 features (Oquab et al., 2023). We use AdamW to first optimize the environment with feature-metric loss for 30k iterations... The paper mentions several software tools and optimizers such as DINOv2 and AdamW, but does not specify version numbers for any of these components.
Experiment Setup | Yes | We extract frames at 10 FPS... We use AdamW to first optimize the environment with feature-metric loss for 30k iterations, and then jointly optimize the environment and agent for another 30k iterations... We use AdamW to optimize the parameters of the score functions {θZ, θP, θG} and the ego-perception encoders {θψ, θo, θp} for 120k steps with batch size 1024. Each diffusion model is trained with random dropout of the conditioning (Ho & Salimans, 2022).
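The conditioning-dropout trick cited in the last row (Ho & Salimans, 2022) trains a single score network to model both conditional and unconditional distributions, enabling classifier-free guidance at sampling time. A minimal NumPy sketch of the idea follows; the function names, the zero "null token", and the guidance formula's exact form here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_conditioning(cond, p_drop=0.1, null_token=0.0):
    """Training-time trick: with probability p_drop, replace the
    conditioning signal with a null token so one network learns both
    the conditional and unconditional score."""
    if rng.random() < p_drop:
        return np.full_like(cond, null_token)
    return cond

def guided_score(score_fn, x, cond, w=2.0):
    """Sampling-time classifier-free guidance: extrapolate from the
    unconditional score toward the conditional one with weight w."""
    null = np.zeros_like(cond)
    return (1 + w) * score_fn(x, cond) - w * score_fn(x, null)
```

With a toy linear score function `score_fn = lambda x, c: x + c`, `guided_score(score_fn, 1.0, np.array([1.0]), w=2.0)` evaluates the extrapolation (1+w)·s(x, c) − w·s(x, 0); larger `w` pushes samples harder toward the conditioning signal at the cost of diversity.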