Agent-to-Sim: Learning Interactive Behavior Models from Casual Longitudinal Videos
Authors: Gengshan Yang, Andrea Bajcsy, Shunsuke Saito, Angjoo Kanazawa
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 EXPERIMENTS. We present Agent-to-Sim (ATS), a framework for learning interactive behavior models of 3D agents from casual longitudinal video collections... We demonstrate results on animals given monocular RGBD videos captured by a smartphone. ... We evaluate camera registration using GT cameras estimated from annotated 2D correspondences. ... Our method outperforms both the multi-video and single-video versions of Total Recon in terms of depth accuracy and LPIPS, due to the ability of leveraging multiple videos. |
| Researcher Affiliation | Collaboration | ¹Codec Avatar Labs, Meta; ²Carnegie Mellon University; ³UC Berkeley |
| Pseudocode | No | The paper describes the methods in narrative text and mathematical equations, such as Equation 7 for the score function and Equation 8 for training, but does not present any distinct pseudocode or algorithm blocks. |
| Open Source Code | No | Project page: gengshan-y.github.io/agent2simwww/. This is a project page or high-level overview, not a direct link to a source-code repository for the described methodology. |
| Open Datasets | No | Dataset. We collect a dataset that emphasizes interactions of an agent with the environment and the observer. As shown in Tab. 2, it contains RGBD iPhone video collections of 4 agents in 3 different scenes... The paper describes collecting its own dataset but provides no information regarding its public availability or access. |
| Dataset Splits | Yes | We use the cat dataset for quantitative evaluation, where the data are split into a training set of 22 videos and a test set of 1 video. |
| Hardware Specification | Yes | 8 A100 GPUs are used to optimize 23 videos of the cat data, and 1 A100 GPU is used in a 2-3 video setup (for dog, bunny, and human). ... Training takes 10 hours on a single A100 GPU. |
| Software Dependencies | No | We extract frames at 10 FPS and compute augmented image measurements, including object segmentation (Yang et al., 2023b), optical flow (Yang & Ramanan, 2019), DINOv2 features (Oquab et al., 2023). We use AdamW to first optimize the environment with feature-metric loss for 30k iterations... The paper mentions several software tools and libraries, such as DINOv2 and AdamW, but does not specify version numbers for any of these components. |
| Experiment Setup | Yes | We extract frames at 10 FPS... We use AdamW to first optimize the environment with feature-metric loss for 30k iterations, and then jointly optimize the environment and agent for another 30k iterations... We use AdamW to optimize the parameters of the score functions {θZ, θP, θG} and the ego-perception encoders {θψ, θo, θp} for 120k steps with batch size 1024. Each diffusion model is trained with random dropout of the conditioning (Ho & Salimans, 2022). |
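The "random dropout of the conditioning" quoted in the experiment setup refers to the classifier-free guidance training recipe of Ho & Salimans (2022): with some probability each sample's conditioning is replaced by a null embedding, so one network learns both conditional and unconditional score functions. A minimal sketch of that step (all names and shapes here are illustrative assumptions, not taken from the paper's code):

```python
import numpy as np

def drop_conditioning(cond, null_token, p_drop=0.1, rng=None):
    """Randomly replace per-sample conditioning vectors with a null token.

    cond: (batch, dim) array of conditioning embeddings.
    null_token: (dim,) embedding standing in for "no conditioning".
    p_drop: probability of dropping each sample's conditioning.
    Returns the modified batch and the boolean drop mask.
    """
    rng = rng if rng is not None else np.random.default_rng()
    cond = np.asarray(cond, dtype=float)
    mask = rng.random(cond.shape[0]) < p_drop  # which samples lose conditioning
    out = cond.copy()
    out[mask] = null_token                     # broadcast null token into rows
    return out, mask

# Example with the batch size reported in the paper (1024) and an
# assumed 16-dim conditioning embedding; the null token is all zeros.
batch = np.ones((1024, 16))
dropped, mask = drop_conditioning(batch, null_token=np.zeros(16), p_drop=0.1)
```

At sampling time, the two predictions (with and without conditioning) are combined with a guidance weight; the paper does not report that weight, so only the training-time dropout is sketched here.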