Harmonious Music-driven Group Choreography with Trajectory-Controllable Diffusion
Authors: Yuqin Dai, Wanlu Zhu, Ronghui Li, Zeping Ren, Xiangzheng Zhou, Jixuan Ying, Jun Li, Jian Yang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate our method's superiority. Experimental results demonstrate the superiority of our approach over existing methods. Comparison to the State of the Art. Qualitative Visual Comparison: the performance of our model is illustrated in Figures 4 and 5, highlighting its ability to generate aesthetically pleasing results across various group sizes; top-view dancer trajectories in Figure 6 further demonstrate our model's superiority in minimizing overlaps. Quantitative Results: Tables 1 and 2 compare our model's performance with baseline methods. Our model consistently outperforms in group-dance metrics and excels in Div and PFC for single-dance metrics. Ablation Study. Effectiveness of Conditional Motion Denoising: Table 3 shows that our Conditional Motion Denoising (CMD) improves GMR, GMC, FID, MMC, and PFC metrics. |
| Researcher Affiliation | Academia | 1) PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; 2) Shenzhen International Graduate School, Tsinghua University, Shenzhen, China |
| Pseudocode | No | The paper describes the methodology in prose and through diagrams (e.g., Figure 2 for the TCDiff framework) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Project Page https://wanluzhu.github.io/TCDiffusion/ The paper provides a project page URL, but it does not explicitly state that source code for the described methodology is available at this URL, nor does it provide a direct link to a code repository. The provided URL is a general project page. |
| Open Datasets | Yes | Dataset. AIOZ-GDance dataset (Le et al. 2023b) is an extensive repository of group dance performances comprising 16.7 hours of synchronized music and 3D multi-dancer motion data. |
| Dataset Splits | Yes | Following the partition setting (Le et al. 2023b), we randomly sample all videos into train, validation and test sets with 80%, 10% and 10% of total videos, respectively. |
| Hardware Specification | Yes | The entire framework was trained on 4 Nvidia 4090 GPUs for 3 days. We use a single 4090 GPU to train the Dance-Trajectory Navigator for 26 hours, utilizing batch sizes of 750, 400, 256, and 170 for 2, 3, 4, and 5 dancers, respectively. Similarly, the TSDiff model was trained on 4 NVIDIA 4090 GPUs for 2 days, employing batch sizes of 60, 53, 32, and 20, in that order. |
| Software Dependencies | No | For the music feature, we follow prior works (Kim et al. 2022; Li et al. 2024c) to utilize Librosa (McFee et al. 2015) to extract a representation M ∈ R^35, comprising a 1-dimensional envelope, 20-dimensional MFCC, 12-dimensional chroma, along with 1-dimensional one-hot peaks and 1-dimensional one-hot beats. The paper mentions using 'Librosa' but does not specify a version number for it or any other key software dependencies. |
| Experiment Setup | Yes | Implementation Details. For our Dance-Trajectory Navigator, λ_v = λ_DC = 2, and the hidden size of all module layers is set to 64. The Trajectory transformer, which is stacked with M = 6 transformer layers, is equipped with 8 heads of attention. λ_RFK = 0.6, λ_vel = 3, λ_joint = 0.6, and λ_contact = 10. Both the LSTM model and the Music-MLP consist of 3 layers each. The Final-MLP processes the information passed to it through 4 layers, utilizing LeakyReLU non-linearity as the activation function. The sequence length L = 120, the hidden dimension is 512, with N = 8 layers and 8 heads of attention. We apply a 3-layer MLP as a Fusion Projection, followed by ReLU activation at each layer. Additionally, we stack W = 3 ConcatSquashLinear layers with a hidden size of d_csl = 128 and d_ctx = 512. |
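The paper's 35-dimensional music feature (1-d envelope + 20-d MFCC + 12-d chroma + 1-d one-hot peaks + 1-d one-hot beats) can be assembled per frame as sketched below. This is a minimal NumPy sketch with a hypothetical helper `build_music_feature`; in the paper the component features are extracted with Librosa, while here random stand-ins are used to show only the layout and concatenation.

```python
import numpy as np

def build_music_feature(envelope, mfcc, chroma, peak_idx, beat_idx):
    """Assemble the 35-dim per-frame music feature:
    1-d envelope + 20-d MFCC + 12-d chroma + 1-d one-hot peaks + 1-d one-hot beats.
    Component arrays would normally come from Librosa; here they are inputs."""
    T = envelope.shape[0]
    peaks = np.zeros((T, 1))
    peaks[peak_idx] = 1.0   # one-hot indicator of onset-peak frames
    beats = np.zeros((T, 1))
    beats[beat_idx] = 1.0   # one-hot indicator of beat frames
    feat = np.concatenate([envelope[:, None], mfcc, chroma, peaks, beats], axis=1)
    assert feat.shape == (T, 35)
    return feat

# Toy usage with random stand-ins for the Librosa outputs (T matches L = 120).
T = 120
rng = np.random.default_rng(0)
M = build_music_feature(
    envelope=rng.random(T),
    mfcc=rng.random((T, 20)),
    chroma=rng.random((T, 12)),
    peak_idx=[10, 40, 90],
    beat_idx=[0, 30, 60, 90],
)
print(M.shape)  # (120, 35)
```

The peak and beat indicators occupy the last two columns (indices 33 and 34), after the envelope (0), MFCC (1–20), and chroma (21–32) blocks.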
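The 80/10/10 random video split quoted under Dataset Splits can be reproduced with a few lines of standard-library Python. This is an illustrative sketch with a hypothetical `split_videos` helper and an arbitrary seed; the paper follows the partition protocol of Le et al. (2023b) but does not specify a seed.

```python
import random

def split_videos(video_ids, seed=0):
    """Randomly partition videos into 80% train / 10% val / 10% test,
    mirroring the AIOZ-GDance partition protocol described in the paper."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n = len(ids)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_videos(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

Fixing the shuffle seed makes the split reproducible across runs, which matters when comparing against baselines trained on the same partition.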