3D-aware Select, Expand, and Squeeze Token for Aerial Action Recognition

Authors: Luying Peng, Xiangbo Shu, Yazhou Yao, Guo-Sen Xie

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental

Experiments

Datasets. UAV-Human is the largest UAV-based human behavior understanding dataset (Li et al. 2021). This dataset contains 20,728 high-definition videos captured in various indoor and outdoor settings, encompassing a broad range of lighting and weather conditions. The videos cover dynamic backgrounds and UAVs with diverse motions and altitudes, making this dataset highly challenging. A total of 155 unique actions have been annotated, some of which are difficult to differentiate, such as the squeeze and yawn actions. In line with existing works, we use split 1, which contains 15,172 videos for training and 5,556 for testing. Drone-Action is a dataset for human action classification in aerial videos (Perera, Law, and Chahl 2019); it contains 240 aerial videos across 13 different human actions performed by 10 human actors. Drone-Action is an outdoor video dataset captured using a free-flying UAV, with 168 training clips and 72 testing clips. RoCoG-v2 is a dataset that contains real and synthetic videos from air and ground perspectives (Reddy et al. 2022). We use 87 long real videos captured from the air, spanning 7 action categories, for training and the remaining 91 for testing.

Experimental Settings. We uniformly sample 16/8 frames to generate each input video X ∈ R^(16/8 × 224 × 224), and apply the standard Random Resized Crop and Random Horizontal Flip augmentation strategies for data preprocessing, following (Herzig et al. 2022; Patrick et al. 2021). We use a batch size of 4 and an SGD optimizer with a momentum of 0.9 and a weight decay of 5 × 10^(-4). The maximum number of training epochs is set to 100, with an initial learning rate of 10^(-5), decreased by a factor of 10 every 20 epochs. τ in Eq. (10) and Eq. (11) is set to 0.1. α and β in Eq. (12) are set to 0.2 and 0.1, respectively. The whole model is implemented using PyTorch on two NVIDIA RTX 3090 GPUs.

Comparison with State-of-the-art. Results on UAV-Human.
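The uniform sampling of 16 (or 8) frames per clip described in the experimental settings can be sketched in plain Python. The function name and the center-of-segment strategy below are illustrative assumptions; the paper only states that sampling is "uniform" without giving the exact rule.

```python
def uniform_frame_indices(num_video_frames: int, num_samples: int = 16) -> list[int]:
    """Pick `num_samples` evenly spaced frame indices from a clip.

    Illustrative sketch: each sampled index is the center of one of
    `num_samples` equal-length temporal segments of the clip.
    """
    if num_samples >= num_video_frames:
        # Short clip: keep every frame (the paper does not state a padding policy).
        return list(range(num_video_frames))
    segment = num_video_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]
```

For a 160-frame clip this yields indices 5, 15, ..., 155; calling it with `num_samples=8` covers the paper's 8-frame setting.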
Table 1 shows the performance comparison on the UAV-Human dataset. 3D-Tok achieves the best performance, yielding improvements of 7.7%, 8.1%, and 9.5% under different settings of frame number and frame size.

Ablation Study

Effect of Each Module. The framework of the proposed 3D-Tok mainly comprises two modules, i.e., the 3D-Token Selector (3TS) and the Expand-Squeeze Converter (ESC). We conduct an ablation study to validate the effectiveness of each module in terms of recognition accuracy, as shown in Table 3.
Researcher Affiliation Academia Nanjing University of Science and Technology EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode No The paper describes the methodology using textual explanations and mathematical equations (e.g., Eq. 1-13) but does not include any distinct, structured pseudocode or algorithm blocks.
Open Source Code No The paper does not contain any explicit statements about releasing source code, nor does it provide any links to code repositories or mention code in supplementary materials.
Open Datasets Yes UAV-Human is the largest UAV-based human behavior understanding dataset (Li et al. 2021). Drone-Action is a dataset for human action classification in aerial videos (Perera, Law, and Chahl 2019), which contains 240 aerial videos across 13 different human actions performed by 10 human actors. RoCoG-v2 is a dataset that contains real and synthetic videos from air and ground perspectives (Reddy et al. 2022).
Dataset Splits Yes In line with existing works, we use split 1, which contains 15,172 videos for training and 5,556 for testing. Drone-Action is a dataset for human action classification in aerial videos (Perera, Law, and Chahl 2019), which contains 240 aerial videos across 13 different human actions performed by 10 human actors. Drone-Action is an outdoor video dataset captured using a free-flying UAV, with 168 training clips and 72 testing clips. RoCoG-v2 is a dataset that contains real and synthetic videos from air and ground perspectives (Reddy et al. 2022). We use 87 long real videos captured from the air, spanning 7 action categories, for training and the remaining 91 for testing.
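As a quick reproducibility check, the split sizes quoted above can be verified against the reported dataset totals in a few lines of Python; all numbers are taken directly from the paper's text.

```python
# UAV-Human: split 1 should cover all 20,728 reported videos.
uav_total, uav_train, uav_test = 20728, 15172, 5556
assert uav_train + uav_test == uav_total

# Drone-Action: 168 training + 72 testing clips out of 240 videos.
da_total, da_train, da_test = 240, 168, 72
assert da_train + da_test == da_total

print("reported split sizes are self-consistent with the dataset totals")
```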
Hardware Specification Yes The whole model is implemented using PyTorch on two NVIDIA RTX 3090 GPUs.
Software Dependencies No The paper mentions that the model is implemented using "PyTorch" but does not specify a version number for PyTorch or any other software dependencies.
Experiment Setup Yes We uniformly sample 16/8 frames to generate each input video X ∈ R^(16/8 × 224 × 224) and apply the standard Random Resized Crop and Random Horizontal Flip augmentation strategy for data preprocessing, following (Herzig et al. 2022; Patrick et al. 2021). We use a batch size of 4, and an SGD optimizer using a momentum of 0.9 and a weight decay of 5 × 10^(-4). The maximum training epoch is set to 100, with the initial learning rate 10^(-5), decreased by a factor of 10 for every 20 epochs. τ in Eq. (10) and Eq. (11) is set to 0.1. α and β in Eq. (12) are set to 0.2 and 0.1, respectively.
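The step-decay learning-rate schedule quoted above (initial rate 10^(-5), divided by 10 every 20 epochs over 100 epochs) can be sketched in plain Python. This mirrors what PyTorch's `torch.optim.lr_scheduler.StepLR` with `step_size=20, gamma=0.1` on top of `torch.optim.SGD(lr=1e-5, momentum=0.9, weight_decay=5e-4)` would produce, but the helper below is an illustrative stand-in, not the authors' code.

```python
def lr_at_epoch(epoch: int, base_lr: float = 1e-5,
                step_size: int = 20, gamma: float = 0.1) -> float:
    """Step-decay schedule: multiply the base rate by `gamma`
    once per completed `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)
```

Under this schedule, epochs 0-19 train at 1e-5, epochs 20-39 at 1e-6, and so on down to 1e-9 for epochs 80-99.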