CoMotion: Concurrent Multi-person 3D Motion
Authors: Alejandro Newell, Peiyun Hu, Lahav Lipson, Stephan Richter, Vladlen Koltun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our model performs both strong per-frame detection and a learned pose update to track people from frame to frame. Rather than match detections across time, poses are updated directly from a new input image, which enables online tracking through occlusion. We train on numerous image and video datasets leveraging pseudolabeled annotations to produce a model that matches state-of-the-art systems in 3D pose estimation accuracy while being faster and more accurate in tracking multiple people through time. Section 5 (EXPERIMENTS) details quantitative comparisons using metrics like MOTA, IDF1, MPJPE, and PCK on datasets like PoseTrack21 and 3DPW, which are hallmarks of experimental research. |
| Researcher Affiliation | Undetermined | The paper does not provide explicit institutional affiliations, department names, locations, or email addresses for the authors in the main text or abstract. The affiliation type (academia, industry, or collaboration) therefore cannot be determined from the provided text. |
| Pseudocode | Yes | Figure 8: Pseudocode for the image encoder. Figure 10: Pseudocode for the update step. |
| Open Source Code | No | We will release the implementation and the trained model upon publication. |
| Open Datasets | Yes | Specifically, we train on InstaVariety (Kanazawa et al., 2019), COCO (Lin et al., 2014), and MPII (Andriluka et al., 2014)... In addition, we train on videos with ground truth tracks from PoseTrack (Andriluka et al., 2018) and DanceTrack (Sun et al., 2022)... we include BEDLAM (Black et al., 2023), which consists of scenes with many people with sampled motions sourced from AMASS (Mahmood et al., 2019), and WHAC-A-MOLE (Yin et al., 2024)... We evaluate CoMotion on a subset of its capabilities at a time... To evaluate tracking across crowded sequences with interesting poses, we turn to PoseTrack21 (Doering et al., 2022)... and Mean Per-Joint Position Error (MPJPE) on 3DPW (von Marcard et al., 2018). |
| Dataset Splits | No | The paper mentions training on 'short, 8-frame video clips' and 'longer video sequences' of specific lengths (e.g., 96 from DanceTrack, 32 from WHAC-A-MOLE). It also mentions evaluating on the 'PoseTrack21 validation set', '3DPW validation set', and 'a subset of clips from EgoHumans'. While it indicates which parts of the datasets are used for training or evaluation, it does not provide specific overall dataset split percentages (e.g., 80/10/10) or absolute sample counts for train/test/validation, nor does it explicitly cite predefined splits for the full datasets used. |
| Hardware Specification | Yes | The model is trained for 400K iterations on 32 A100 GPUs taking approximately 3 days. All measurements were made on a V100 GPU using the code released by the respective authors. |
| Software Dependencies | No | We use the ConvNeXt V2 (Woo et al., 2023) implementation provided in the timm library (Wightman, 2019). Figure 8 and Figure 10 refer to 'PyTorch-like pseudocode'. However, specific version numbers for the timm library or PyTorch are not provided in the text. |
| Experiment Setup | Yes | We follow a three-stage curriculum to train CoMotion... Stage 1... The model is trained for 400K iterations on 32 A100 GPUs taking approximately 3 days. To optimize the model's SMPL predictions, we employ the following loss functions: an L1-loss to minimize 2D projection error, an L1-loss for the root-normalized 3D joint position error, and an L2-loss for the difference in SMPL joint angles. We supervise the betas on samples from BEDLAM with an L1-loss... We apply a binary cross-entropy loss to supervise the output confidence term... Additionally, we use a keypoint heatmap loss... Stage 2... We train for 200K iterations, which take 3 more days... Stage 3... The model is fine-tuned for 50K iterations over 1.5 days. To reduce GPU memory consumption, we enable gradient checkpointing. Frames are padded and resized to 512x512. |
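The loss description quoted in the Experiment Setup row combines several standard terms: an L1 loss on 2D projections, an L1 loss on root-normalized 3D joint positions, an L2 loss on SMPL joint angles, and a binary cross-entropy loss on detection confidence. A minimal NumPy sketch of such a composite objective is shown below. All function names and the per-term weights (`w2d`, `w3d`, `wang`, `wconf`) are illustrative assumptions; the paper does not report its loss weights, and the keypoint heatmap and beta losses are omitted for brevity.

```python
import numpy as np

def l1(pred, gt):
    """Mean absolute error."""
    return np.abs(pred - gt).mean()

def l2(pred, gt):
    """Mean squared error."""
    return ((pred - gt) ** 2).mean()

def bce(prob, label, eps=1e-7):
    """Binary cross-entropy on probabilities in (0, 1)."""
    p = np.clip(prob, eps, 1.0 - eps)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p)).mean()

def composite_pose_loss(proj2d, gt2d, joints3d, gt3d, root_idx,
                        angles, gt_angles, conf, gt_conf,
                        w2d=1.0, w3d=1.0, wang=1.0, wconf=1.0):
    """Sketch of the supervision terms described in the paper.
    Weights are hypothetical placeholders, not published values."""
    # L1 loss on 2D keypoint projection error.
    loss_2d = l1(proj2d, gt2d)
    # L1 loss on root-normalized 3D joint positions: subtract the
    # root joint so only relative pose is penalized.
    pred_rel = joints3d - joints3d[:, root_idx:root_idx + 1]
    gt_rel = gt3d - gt3d[:, root_idx:root_idx + 1]
    loss_3d = l1(pred_rel, gt_rel)
    # L2 loss on the difference in SMPL joint angles.
    loss_ang = l2(angles, gt_angles)
    # Binary cross-entropy on the output confidence term.
    loss_conf = bce(conf, gt_conf)
    return (w2d * loss_2d + w3d * loss_3d
            + wang * loss_ang + wconf * loss_conf)
```

With perfect predictions every term except the (clipped) cross-entropy vanishes, so the total loss approaches zero, which is a quick sanity check when wiring up such an objective.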