CoMotion: Concurrent Multi-person 3D Motion
Authors: Alejandro Newell, Peiyun Hu, Lahav Lipson, Stephan Richter, Vladlen Koltun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our model performs both strong per-frame detection and a learned pose update to track people from frame to frame. Rather than match detections across time, poses are updated directly from a new input image, which enables online tracking through occlusion. We train on numerous image and video datasets leveraging pseudolabeled annotations to produce a model that matches state-of-the-art systems in 3D pose estimation accuracy while being faster and more accurate in tracking multiple people through time. Section 5 (EXPERIMENTS) details quantitative comparisons using metrics like MOTA, IDF1, MPJPE, and PCK on datasets like PoseTrack21 and 3DPW, which are hallmarks of experimental research. |
| Researcher Affiliation | Undetermined | The paper does not provide explicit institutional affiliations, department names, locations, or email addresses for the authors in the main text or abstract. The affiliation type (academia, industry, or collaboration) therefore cannot be determined from the provided text. |
| Pseudocode | Yes | Figure 8: Pseudocode for the image encoder. Figure 10: Pseudocode for the update step. |
| Open Source Code | No | We will release the implementation and the trained model upon publication. |
| Open Datasets | Yes | Specifically, we train on InstaVariety (Kanazawa et al., 2019), COCO (Lin et al., 2014), and MPII (Andriluka et al., 2014)... In addition, we train on videos with ground truth tracks from PoseTrack (Andriluka et al., 2018) and DanceTrack (Sun et al., 2022)... we include BEDLAM (Black et al., 2023), which consists of scenes with many people with sampled motions sourced from AMASS (Mahmood et al., 2019), and WHAC-A-MOLE (Yin et al., 2024)... We evaluate CoMotion on a subset of its capabilities at a time... To evaluate tracking across crowded sequences with interesting poses, we turn to PoseTrack21 (Doering et al., 2022)... and Mean Per-Joint Position Error (MPJPE) on 3DPW (von Marcard et al., 2018). |
| Dataset Splits | No | The paper mentions training on 'short, 8-frame video clips' and 'longer video sequences' of specific lengths (e.g., 96 from DanceTrack, 32 from WHAC-A-MOLE). It also mentions evaluating on the 'PoseTrack21 validation set', '3DPW validation set', and 'a subset of clips from EgoHumans'. While it indicates which parts of the datasets are used for training or evaluation, it does not provide specific overall dataset split percentages (e.g., 80/10/10) or absolute sample counts for train/test/validation, nor does it explicitly cite predefined splits for the full datasets used. |
| Hardware Specification | Yes | The model is trained for 400K iterations on 32 A100 GPUs taking approximately 3 days. All measurements were made on a V100 GPU using the code released by the respective authors. |
| Software Dependencies | No | We use the ConvNeXt V2 (Woo et al., 2023) implementation provided in the timm library (Wightman, 2019). Figure 8 and Figure 10 refer to 'PyTorch-like pseudocode'. However, specific version numbers for the timm library or PyTorch are not provided in the text. |
| Experiment Setup | Yes | We follow a three-stage curriculum to train CoMotion... Stage 1... The model is trained for 400K iterations on 32 A100 GPUs taking approximately 3 days. To optimize the model's SMPL predictions, we employ the following loss functions: an L1-loss to minimize 2D projection error, an L1-loss for the root-normalized 3D joint position error, and an L2-loss for the difference in SMPL joint angles. We supervise the betas on samples from BEDLAM with an L1-loss... We apply a binary cross-entropy loss to supervise the output confidence term... Additionally, we use a keypoint heatmap loss... Stage 2... We train for 200K iterations, which take 3 more days... Stage 3... The model is fine-tuned for 50K iterations over 1.5 days. To reduce GPU memory consumption, we enable gradient checkpointing. Frames are padded and resized to 512x512. |
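The loss description quoted in the Experiment Setup row combines several standard terms: an L1 loss on 2D projections, an L1 loss on root-normalized 3D joint positions, an L2 loss on SMPL joint angles, and a binary cross-entropy loss on detection confidence. A minimal NumPy sketch of such a composite objective is shown below. All function names and the per-term weights (`w2d`, `w3d`, `wang`, `wconf`) are illustrative assumptions; the paper does not report its loss weights, and the keypoint heatmap and beta losses are omitted for brevity.

```python
import numpy as np

def l1(pred, gt):
    """Mean absolute error."""
    return np.abs(pred - gt).mean()

def l2(pred, gt):
    """Mean squared error."""
    return ((pred - gt) ** 2).mean()

def bce(prob, label, eps=1e-7):
    """Binary cross-entropy on probabilities in (0, 1)."""
    p = np.clip(prob, eps, 1.0 - eps)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p)).mean()

def composite_pose_loss(proj2d, gt2d, joints3d, gt3d, root_idx,
                        angles, gt_angles, conf, gt_conf,
                        w2d=1.0, w3d=1.0, wang=1.0, wconf=1.0):
    """Sketch of the supervision terms described in the paper.
    Weights are hypothetical placeholders, not published values."""
    # L1 loss on 2D keypoint projection error.
    loss_2d = l1(proj2d, gt2d)
    # L1 loss on root-normalized 3D joint positions: subtract the
    # root joint so only relative pose is penalized.
    pred_rel = joints3d - joints3d[:, root_idx:root_idx + 1]
    gt_rel = gt3d - gt3d[:, root_idx:root_idx + 1]
    loss_3d = l1(pred_rel, gt_rel)
    # L2 loss on the difference in SMPL joint angles.
    loss_ang = l2(angles, gt_angles)
    # Binary cross-entropy on the output confidence term.
    loss_conf = bce(conf, gt_conf)
    return (w2d * loss_2d + w3d * loss_3d
            + wang * loss_ang + wconf * loss_conf)
```

With perfect predictions every term except the (clipped) cross-entropy vanishes, so the total loss approaches zero, which is a quick sanity check when wiring up such an objective.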