MagicPose4D: Crafting Articulated Models with Appearance and Motion Control

Authors: Hao Zhang, Di Chang, Fang Li, Mohammad Soleymani, Narendra Ahuja

TMLR 2025

Reproducibility Variables (each listed with its Result and the supporting LLM Response)
Research Type: Experimental
LLM Response: "Through extensive experiments, we demonstrate that MagicPose4D significantly improves the accuracy and consistency of 4D content generation, outperforming existing methods in various benchmarks. https://magicpose4d.github.io/" "4 Experiments. This section primarily covers the experimental results and comparison of the following tasks: (i) 4D generation; (ii) 4D reconstruction; and (iii) motion transfer. Detailed descriptions of the benchmarks (Sec. A.3) and implementation (Sec. A.4) are in the Appendix. Additional visual results can be found on our anonymous webpage: https://magicpose4d.github.io/ and Google Drive: link." "Table 1: User study of MagicPose4D on 4D generation compared to 4Dfy, DMT+Stag4D, and SVD+Stag4D. Criteria for judgment: (1) appearance matches appearance prompts, (2) motion matches the text description, and (3) overall generation quality. S4D is short for Stag4D. Prompts of experiments are shown in Tab. 5."
Researcher Affiliation: Academia
LLM Response: Hao Zhang (EMAIL) — University of Illinois Urbana-Champaign; Di Chang (EMAIL) — University of Southern California; Fang Li (EMAIL) — University of Illinois Urbana-Champaign; Mohammad Soleymani (EMAIL) — University of Southern California; Narendra Ahuja (EMAIL) — University of Illinois Urbana-Champaign
Pseudocode: No
LLM Response: The paper describes the methodology in detail, including equations and module descriptions, but does not present any formal pseudocode or algorithm blocks with explicit labels like "Algorithm 1".
Open Source Code: No
LLM Response: "Through extensive experiments, we demonstrate that MagicPose4D significantly improves the accuracy and consistency of 4D content generation, outperforming existing methods in various benchmarks. https://magicpose4d.github.io/" "Additional visual results can be found on our anonymous webpage: https://magicpose4d.github.io/ and Google Drive: link."
Open Datasets: Yes
LLM Response: "Davis-Camel provides a real animal video in BADJA Biggs et al. (2018) with 2D keypoint and mask annotations, derived from the DAVIS video segmentation dataset Perazzi et al. (2016) and online stock footage. We extract reference motion from the reconstructed mesh sequence and transfer it to other identities. Planet Zoo includes RGB synthetic videos of different animals with around 100 frames each. Planet Zoo covers a 180-degree visual field captured by a moving camera, allowing better evaluation of 3D reconstruction when imaging parameters must also be dynamically estimated due to the moving camera, and over a large visual field. In addition, following BADJA Biggs et al. (2018), we also provide 2D keypoint annotations. DeformingThings4D is a synthetic dataset containing 1,972 animation sequences spanning 31 categories of both humanoids and animals. Each sequence consists of 40 to 120 frames of motion animation. In this dataset, the first frame is the canonical frame, and its triangle mesh is given. From the 2nd to the last frame, the 3D offsets of the mesh vertices are provided, and we export the triangle meshes for all these frames. We use the motions of these animal mesh sequences as pose references and transfer the pose to different identities. Everybody Dance Now consists of full-body videos of five human subjects. We use these monocular videos to generate human motions and transfer them to other identities."
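The DeformingThings4D layout quoted above (a canonical first-frame triangle mesh, then per-vertex 3D offsets for every later frame) makes the mesh-export step plain vertex addition. A minimal NumPy sketch with toy shapes and values (the array layout is an assumption; the dataset's actual file format is not described here):

```python
import numpy as np

# Assumed layout: canonical (V, 3) vertices from frame 1, plus (T-1, V, 3)
# per-vertex offsets for frames 2..T. All values below are toy data.
canonical_vertices = np.zeros((4, 3))      # (V, 3) canonical vertex positions
canonical_vertices[1] = [1.0, 0.0, 0.0]
faces = np.array([[0, 1, 2], [0, 2, 3]])   # (F, 3) triangle indices, fixed topology

# Offsets for frames 2..T (here T = 3).
offsets = np.stack([
    np.full((4, 3), 0.1),
    np.full((4, 3), 0.2),
])

# Exporting the triangle mesh for each later frame adds the frame's offsets
# to the canonical vertices; the face list is shared by the whole sequence.
frame_meshes = [(canonical_vertices + off, faces) for off in offsets]
print(frame_meshes[1][0][1])  # vertex 1 at the last frame: [1.2 0.2 0.2]
```

Because the topology never changes, only the vertex array varies per frame, which is what makes these sequences convenient as pose references for motion transfer.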
Dataset Splits: No
LLM Response: The paper describes the datasets used and their characteristics (e.g., number of sequences, frames per sequence), but does not specify how these datasets were split into training, validation, or test sets (e.g., percentages, absolute counts, or predefined splits from citations).
Hardware Specification: Yes
LLM Response: "For Davis-Camel, Planet Zoo, Everybody Dance Now, and those self-collected data without ground truth mesh, we first train the Dual-Phase 4D Reconstruction Module on 2 NVIDIA L40S GPUs with batch size 16 for 10 epochs with a learning rate of 0.0001."
Software Dependencies: No
LLM Response: The paper mentions using Zero123 Liu et al. (2023) and SF3D Boss et al. (2024) as image-to-3D models, but it does not specify any general software dependencies or library versions (e.g., Python, PyTorch, CUDA versions) used for their own implementation.
Experiment Setup: Yes
LLM Response: "For Davis-Camel, Planet Zoo, Everybody Dance Now, and those self-collected data without ground truth mesh, we first train the Dual-Phase 4D Reconstruction Module on 2 NVIDIA L40S GPUs with batch size 16 for 10 epochs with a learning rate of 0.0001."
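The reported setup (2 NVIDIA L40S GPUs, batch size 16, 10 epochs, learning rate 0.0001) can be captured as a small configuration object. This is only a sketch: the field names are hypothetical, since the paper reports the values but not a configuration API, and the data-parallel batch sharding is an assumption:

```python
from dataclasses import dataclass

# Field names are hypothetical; the values are those reported in the paper.
@dataclass
class ReconTrainConfig:
    num_gpus: int = 2           # 2x NVIDIA L40S, as reported
    batch_size: int = 16        # global batch size
    epochs: int = 10
    learning_rate: float = 1e-4

cfg = ReconTrainConfig()

# Per-GPU batch under simple data parallelism (an assumption; the paper
# does not state how the global batch is sharded across the two GPUs).
per_gpu_batch = cfg.batch_size // cfg.num_gpus
print(per_gpu_batch)  # → 8
```

A dataclass like this makes the reported hyperparameters explicit and easy to log alongside results, which is the kind of detail reproducibility checks look for.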