Harmonious Music-driven Group Choreography with Trajectory-Controllable Diffusion

Authors: Yuqin Dai, Wanlu Zhu, Ronghui Li, Zeping Ren, Xiangzheng Zhou, Jixuan Ying, Jun Li, Jian Yang

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments demonstrate our method's superiority. Experimental results demonstrate the superiority of our approach over existing methods.

Experimental Settings. Implementation Details. For our Dance-Trajectory Navigator, λv = λDC = 2, and the hidden size of all module layers is set to 64. The Trajectory Transformer, which is stacked with M = 6 transformer layers, is equipped with 8 attention heads. λRFK = 0.6, λvel = 3, λjoint = 0.6, and λcontact = 10. Both the LSTM model and the Music-MLP consist of 3 layers each. The Final-MLP processes its input through 4 layers, using LeakyReLU as the activation function. The sequence length L = 120, the hidden dimension is 512, with N = 8 layers and 8 attention heads. We apply a 3-layer MLP as a Fusion Projection, followed by ReLU activation at each layer. Additionally, we stack W = 3 ConcatSquashLinear layers with a hidden size of dcsl = 128 and dctx = 512.

The entire framework was trained on 4 NVIDIA 4090 GPUs for 3 days. We use a single 4090 GPU to train the Dance-Trajectory Navigator for 26 hours, utilizing batch sizes of 750, 400, 256, and 170 for 2, 3, 4, and 5 dancers, respectively. Similarly, the TSDiff model was trained on 4 NVIDIA 4090 GPUs for 2 days, employing batch sizes of 60, 53, 32, and 20, in that order.

Dataset. The AIOZ-GDance dataset (Le et al. 2023b) is an extensive repository of group dance performances comprising 16.7 hours of synchronized music and 3D multi-dancer motion data.

Comparison to the State of the Art. Qualitative Visual Comparison. The performance of our model is illustrated in Figures 4 and 5, highlighting its ability to generate aesthetically pleasing results across various group sizes. Top-view dancer trajectories in Figure 6 further demonstrate our model's superiority in minimizing overlaps. Quantitative Results. Tables 1 and 2 compare our model's performance with baseline methods. Our model consistently outperforms in group-dance metrics and excels in Div and PFC among single-dance metrics.

Ablation Study. Effectiveness of Conditional Motion Denoising. Table 3 shows that our Conditional Motion Denoising (CMD) improves the GMR, GMC, FID, MMC, and PFC metrics.
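The loss weights quoted above (λRFK, λvel, λjoint, λcontact) suggest a weighted-sum training objective. A minimal sketch of that combination follows; the additive form and the term names are our assumption, since the paper only lists the weights:

```python
# Loss weights transcribed from the paper's implementation details.
# The additive combination below is an assumption for illustration.
LOSS_WEIGHTS = {"RFK": 0.6, "vel": 3, "joint": 0.6, "contact": 10}

def total_loss(terms):
    """Weighted sum over per-term loss values, e.g. terms = {"RFK": 0.5, ...}."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in terms.items())
```

With all four per-term losses equal to 1.0, the total is 0.6 + 3 + 0.6 + 10 = 14.2, matching the quoted weights.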
Researcher Affiliation Academia 1) PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; 2) Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Pseudocode No The paper describes the methodology in prose and through diagrams (e.g., Figure 2 for the TCDiff framework) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No Project Page https://wanluzhu.github.io/TCDiffusion/ The paper provides a project page URL, but it does not explicitly state that source code for the described methodology is available at this URL, nor does it provide a direct link to a code repository. The provided URL is a general project page.
Open Datasets Yes Dataset. AIOZ-GDance dataset (Le et al. 2023b) is an extensive repository of group dance performances comprising 16.7 hours of synchronized music and 3D multi-dancer motion data.
Dataset Splits Yes Following the partition setting of (Le et al. 2023b), we randomly sample all videos into train, validation, and test sets with 80%, 10%, and 10% of total videos, respectively.
Hardware Specification Yes The entire framework was trained on 4 NVIDIA 4090 GPUs for 3 days. We use a single 4090 GPU to train the Dance-Trajectory Navigator for 26 hours, utilizing batch sizes of 750, 400, 256, and 170 for 2, 3, 4, and 5 dancers, respectively. Similarly, the TSDiff model was trained on 4 NVIDIA 4090 GPUs for 2 days, employing batch sizes of 60, 53, 32, and 20, in that order.
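The quoted batch sizes vary with group size; the mapping can be recorded as a small config sketch for anyone re-running the training (the dict names are our own shorthand, not identifiers from the authors' code):

```python
# Batch size per number of dancers, as quoted in the paper.
# Names are illustrative; larger groups get smaller batches to fit GPU memory.
DTN_BATCH = {2: 750, 3: 400, 4: 256, 5: 170}    # Dance-Trajectory Navigator
TSDIFF_BATCH = {2: 60, 3: 53, 4: 32, 5: 20}     # TSDiff diffusion model
```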
Software Dependencies No For the music feature, we follow prior works (Kim et al. 2022; Li et al. 2024c) and utilize Librosa (McFee et al. 2015) to extract a representation M ∈ R^35, comprising a 1-dimensional envelope, 20-dimensional MFCC, 12-dimensional chroma, along with 1-dimensional one-hot peaks and 1-dimensional one-hot beats. The paper mentions using Librosa but does not specify a version number for it or any other key software dependencies.
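The 35-dimensional feature layout described above (1 + 20 + 12 + 1 + 1) can be sketched as a simple per-frame concatenation. The function below only illustrates the dimension bookkeeping; in practice each slice would come from Librosa calls such as `librosa.feature.mfcc` and `librosa.feature.chroma_stft`:

```python
# Assemble one frame of the 35-dim music feature described in the paper:
# 1-d envelope + 20-d MFCC + 12-d chroma + 1-d peak one-hot + 1-d beat one-hot.
# Placeholder values only; the function name is our own, not the authors'.
def music_feature(envelope, mfcc, chroma, peak, beat):
    assert len(mfcc) == 20, "MFCC slice must be 20-dimensional"
    assert len(chroma) == 12, "chroma slice must be 12-dimensional"
    return [envelope, *mfcc, *chroma, peak, beat]   # length 35
```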
Experiment Setup Yes Implementation Details. For our Dance-Trajectory Navigator, λv = λDC = 2, and the hidden size of all module layers is set to 64. The Trajectory Transformer, which is stacked with M = 6 transformer layers, is equipped with 8 attention heads. λRFK = 0.6, λvel = 3, λjoint = 0.6, and λcontact = 10. Both the LSTM model and the Music-MLP consist of 3 layers each. The Final-MLP processes its input through 4 layers, using LeakyReLU as the activation function. The sequence length L = 120, the hidden dimension is 512, with N = 8 layers and 8 attention heads. We apply a 3-layer MLP as a Fusion Projection, followed by ReLU activation at each layer. Additionally, we stack W = 3 ConcatSquashLinear layers with a hidden size of dcsl = 128 and dctx = 512.
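For reference, the hyperparameters quoted in this row can be collected into a single configuration sketch. The key names below are our own shorthand for the paper's symbols, not identifiers from the authors' code:

```python
# Hyperparameters transcribed from the paper's implementation details.
DTN_CONFIG = {                  # Dance-Trajectory Navigator
    "lambda_v": 2,
    "lambda_DC": 2,
    "hidden_size": 64,
    "transformer_layers": 6,    # M
    "attention_heads": 8,
}

TSDIFF_CONFIG = {               # TSDiff motion model
    "lambda_RFK": 0.6,
    "lambda_vel": 3,
    "lambda_joint": 0.6,
    "lambda_contact": 10,
    "sequence_length": 120,     # L
    "hidden_dim": 512,
    "layers": 8,                # N
    "attention_heads": 8,
    "csl_blocks": 3,            # W ConcatSquashLinear blocks
    "d_csl": 128,
    "d_ctx": 512,
}
```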