Agent-to-Sim: Learning Interactive Behavior Models from Casual Longitudinal Videos
Authors: Gengshan Yang, Andrea Bajcsy, Shunsuke Saito, Angjoo Kanazawa
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 EXPERIMENTS. We present Agent-to-Sim (ATS), a framework for learning interactive behavior models of 3D agents from casual longitudinal video collections... We demonstrate results on animals given monocular RGBD videos captured by a smartphone. ... We evaluate camera registration using GT cameras estimated from annotated 2D correspondences. ... Our method outperforms both the multi-video and single-video versions of Total Recon in terms of depth accuracy and LPIPS, due to the ability of leveraging multiple videos. |
| Researcher Affiliation | Collaboration | ¹Codec Avatar Labs, Meta; ²Carnegie Mellon University; ³UC Berkeley |
| Pseudocode | No | The paper describes the methods in narrative text and mathematical equations, such as Equation 7 for the score function and Equation 8 for training, but does not present any distinct pseudocode or algorithm blocks. |
| Open Source Code | No | Project page: gengshan-y.github.io/agent2simwww/. This is a project page or high-level overview, not a direct link to a source-code repository for the described methodology. |
| Open Datasets | No | Dataset. We collect a dataset that emphasizes interactions of an agent with the environment and the observer. As shown in Tab. 2, it contains RGBD iPhone video collections of 4 agents in 3 different scenes... The paper describes collecting its own dataset but provides no information regarding its public availability or access. |
| Dataset Splits | Yes | We use the cat dataset for quantitative evaluation, where the data are split into a training set of 22 videos and a test set of 1 video. |
| Hardware Specification | Yes | 8 A100 GPUs are used to optimize 23 videos of the cat data, and 1 A100 GPU is used in a 2-3 video setup (for dog, bunny, and human). ... Training takes 10 hours on a single A100 GPU. |
| Software Dependencies | No | We extract frames at 10 FPS and compute augmented image measurements, including object segmentation (Yang et al., 2023b), optical flow (Yang & Ramanan, 2019), DINOv2 features (Oquab et al., 2023). We use AdamW to first optimize the environment with feature-metric loss for 30k iterations... The paper mentions several software tools and libraries, such as DINOv2 and AdamW, but does not specify version numbers for any of these components. |
| Experiment Setup | Yes | We extract frames at 10 FPS... We use AdamW to first optimize the environment with feature-metric loss for 30k iterations, and then jointly optimize the environment and agent for another 30k iterations... We use AdamW to optimize the parameters of the score functions {θZ, θP, θG} and the ego-perception encoders {θψ, θo, θp} for 120k steps with batch size 1024. Each diffusion model is trained with random dropout of the conditioning (Ho & Salimans, 2022). |
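The "random dropout of the conditioning" quoted in the experiment setup refers to the classifier-free guidance training recipe of Ho & Salimans (2022): with some probability each sample's conditioning is replaced by a null embedding, so one network learns both conditional and unconditional score functions. A minimal sketch of that step (all names and shapes here are illustrative assumptions, not taken from the paper's code):

```python
import numpy as np

def drop_conditioning(cond, null_token, p_drop=0.1, rng=None):
    """Randomly replace per-sample conditioning vectors with a null token.

    cond: (batch, dim) array of conditioning embeddings.
    null_token: (dim,) embedding standing in for "no conditioning".
    p_drop: probability of dropping each sample's conditioning.
    Returns the modified batch and the boolean drop mask.
    """
    rng = rng if rng is not None else np.random.default_rng()
    cond = np.asarray(cond, dtype=float)
    mask = rng.random(cond.shape[0]) < p_drop  # which samples lose conditioning
    out = cond.copy()
    out[mask] = null_token                     # broadcast null token into rows
    return out, mask

# Example with the batch size reported in the paper (1024) and an
# assumed 16-dim conditioning embedding; the null token is all zeros.
batch = np.ones((1024, 16))
dropped, mask = drop_conditioning(batch, null_token=np.zeros(16), p_drop=0.1)
```

At sampling time, the two predictions (with and without conditioning) are combined with a guidance weight; the paper does not report that weight, so only the training-time dropout is sketched here.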