VA-AR: Learning Velocity-Aware Action Representations with Mixture of Window Attention

Authors: Jiangning Wei, Lixiong Qin, Bo Yu, Tianjian Zou, Chuhan Yan, Dandan Xiao, Yang Yu, Lan Yang, Ke Li, Jun Liu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments confirm that VA-AR achieves state-of-the-art performance on the same five datasets, demonstrating VA-AR's effectiveness across a broad spectrum of action recognition scenarios. We conducted a comprehensive comparison with 11 skeleton-based action recognition methods. The experiments covered five large datasets, with published results copied into the paper and unpublished results obtained through retraining on these datasets. We conducted rigorous validations on both joint data and multi-modality data.
Researcher Affiliation | Academia | (1) Beijing University of Posts and Telecommunications; (2) Macau University of Science and Technology; (3) China Institute of Sport Science; (4) Beijing Sport University
Pseudocode | No | The paper provides architectural diagrams and mathematical formulas but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: github.com/TrinityNeo99/VA-AR_official
Open Datasets | Yes | We meticulously selected five representative action recognition datasets to comprehensively assess the performance of the proposed method. These datasets comprise NTU RGB+D (NTU-60) (Shahroudy et al. 2016), NTU RGB+D 120 (NTU-120) (Liu et al. 2019), P2A (Bian et al. 2022), Olympic Badminton (Ghosh, Singh, and Jawahar 2018), and FineGym (Shao et al. 2020).
Dataset Splits | No | The paper refers to the 'X-Sub', 'X-View', and 'X-Set' splits when reporting results in Table 1 for specific datasets; these are standard benchmarks. However, it does not explicitly provide the split percentages, sample counts, or the split methodology in the main text. It also states 'We mixed the test sets of the five datasets and carefully distinguished them based on action speed', but this mixing is for analysis, not for the original experimental splits.
Hardware Specification | Yes | During the experimental process, we employed two NVIDIA 3090 GPUs for training, which encompassed a total of 60 training epochs.
Software Dependencies | No | The paper mentions using a Graph Convolutional Network (GCN) and the SGD optimizer but does not specify any software with version numbers (e.g., Python, PyTorch, or TensorFlow versions, or specific library versions).
Experiment Setup | Yes | We selected the Graph Convolutional Network (GCN) as the Spatial Module and configured three STBlocks. Additionally, we employed three different window sizes of 4, 8, and 16. During the experimental process, we employed two NVIDIA 3090 GPUs for training, which encompassed a total of 60 training epochs. For the optimizer, we chose SGD with a momentum of 0.9 and a weight decay of 0.0001, with batch size set to 32. The maximum temporal length for the NTU-60, NTU-120, and FineGym datasets was set to 256, whereas for the P2A and Olympic Badminton datasets it was set to 128.
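The reported hyperparameters can be collected into a single configuration sketch. This is a minimal, hypothetical illustration assuming plain Python (the names `config` and `sgd_momentum_step` are illustrative and not taken from the VA-AR codebase); the update function shows the standard SGD-with-momentum rule implied by the reported momentum of 0.9 and weight decay of 0.0001.

```python
# Hypothetical summary of the training setup reported in the paper.
# Keys and helper names are illustrative, not from the official repository.
config = {
    "spatial_module": "GCN",
    "num_st_blocks": 3,
    "window_sizes": [4, 8, 16],   # mixture-of-window attention branches
    "optimizer": "SGD",
    "momentum": 0.9,
    "weight_decay": 1e-4,
    "batch_size": 32,
    "epochs": 60,
    "max_temporal_length": {
        "NTU-60": 256, "NTU-120": 256, "FineGym": 256,
        "P2A": 128, "Olympic Badminton": 128,
    },
}

def sgd_momentum_step(w, g, v, lr, momentum=0.9, weight_decay=1e-4):
    """One scalar SGD update with momentum and L2 weight decay,
    using the hyperparameter values reported in the paper."""
    g = g + weight_decay * w   # add the L2 regularization gradient
    v = momentum * v + g       # accumulate velocity
    return w - lr * v, v       # updated weight and velocity

# Single illustrative step from w=1.0 with gradient 0.5 and zero velocity.
w, v = sgd_momentum_step(1.0, g=0.5, v=0.0, lr=0.1)
```

The dictionary also makes the per-dataset sequence lengths explicit: 256 frames for NTU-60, NTU-120, and FineGym versus 128 for P2A and Olympic Badminton.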