3D-aware Select, Expand, and Squeeze Token for Aerial Action Recognition
Authors: Luying Peng, Xiangbo Shu, Yazhou Yao, Guo-Sen Xie
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments — Datasets: UAV-Human is the largest UAV-based human behavior understanding dataset (Li et al. 2021). It contains 20,728 high-definition videos captured in various indoor and outdoor settings, encompassing a broad range of lighting and weather conditions. The videos cover dynamic backgrounds and UAVs with diverse motions and altitudes, making this dataset highly challenging. A total of 155 unique actions have been annotated, some of which are difficult to differentiate, such as the squeeze and yawn actions. Consistent with existing works, we use split 1, which contains 15,172 training and 5,556 testing videos. Drone-Action is a dataset for human action classification in aerial videos (Perera, Law, and Chahl 2019); it contains 240 aerial videos across 13 human actions performed by 10 actors. Drone-Action is an outdoor video dataset captured with a free-flying UAV, comprising 168 training clips and 72 testing clips. RoCoG-v2 is a dataset that contains real and synthetic videos from air and ground perspectives (Reddy et al. 2022). We use 87 long real videos captured from the air, spanning 7 action categories, for training and the remaining 91 for testing. Experimental Settings: We uniformly sample 16/8 frames to generate each input video X ∈ ℝ^{16/8×224×224} and apply the standard RandomResizedCrop and RandomHorizontalFlip augmentation strategies for data preprocessing, following (Herzig et al. 2022; Patrick et al. 2021). We use a batch size of 4 and an SGD optimizer with a momentum of 0.9 and a weight decay of 5×10⁻⁴. The maximum number of training epochs is set to 100, with an initial learning rate of 10⁻⁵, decreased by a factor of 10 every 20 epochs. τ in Eq. (10) and Eq. (11) is set to 0.1. α and β in Eq. (12) are set to 0.2 and 0.1, respectively. The whole model is implemented in PyTorch on two NVIDIA RTX 3090 GPUs. Comparison with State-of-the-art — Results on UAV-Human: Table 1 shows the performance comparison on the UAV-Human dataset. 3D-Tok achieves the best performance, with improvements of 7.7%, 8.1%, and 9.5% under the various frame-number and frame-size settings. Ablation Study — Effect of Each Module: The proposed 3D-Tok framework mainly comprises two modules, i.e., the 3D-Token Selector (3TS) and the Expand-Squeeze Converter (ESC). We conduct an ablation study to validate the contribution of each module in terms of recognition accuracy, as shown in Table 3. |
| Researcher Affiliation | Academia | Nanjing University of Science and Technology EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations (e.g., Eq. 1-13) but does not include any distinct, structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide any links to code repositories or mention code in supplementary materials. |
| Open Datasets | Yes | UAV-Human is the largest UAV-based human behavior understanding dataset (Li et al. 2021). Drone-Action is a dataset for human action classification in aerial videos (Perera, Law, and Chahl 2019), which contains 240 aerial videos across 13 different human actions performed by 10 human actors. RoCoG-v2 is a dataset that contains real and synthetic videos from air and ground perspectives (Reddy et al. 2022). |
| Dataset Splits | Yes | Consistent with existing works, we use split 1, which contains 15,172 training and 5,556 testing videos. Drone-Action is a dataset for human action classification in aerial videos (Perera, Law, and Chahl 2019), which contains 240 aerial videos across 13 different human actions performed by 10 human actors. Drone-Action is an outdoor video dataset captured with a free-flying UAV, comprising 168 training clips and 72 testing clips. RoCoG-v2 is a dataset that contains real and synthetic videos from air and ground perspectives (Reddy et al. 2022). We use 87 long real videos captured from the air, spanning 7 action categories, for training and the remaining 91 for testing. |
| Hardware Specification | Yes | The whole model is implemented in PyTorch on two NVIDIA RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions that the model is implemented using "Pytorch" but does not specify a version number for Pytorch or any other software dependencies. |
| Experiment Setup | Yes | We uniformly sample 16/8 frames to generate each input video X ∈ ℝ^{16/8×224×224} and apply the standard RandomResizedCrop and RandomHorizontalFlip augmentation strategies for data preprocessing, following (Herzig et al. 2022; Patrick et al. 2021). We use a batch size of 4 and an SGD optimizer with a momentum of 0.9 and a weight decay of 5×10⁻⁴. The maximum number of training epochs is set to 100, with an initial learning rate of 10⁻⁵, decreased by a factor of 10 every 20 epochs. τ in Eq. (10) and Eq. (11) is set to 0.1. α and β in Eq. (12) are set to 0.2 and 0.1, respectively. |
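
The Experiment Setup row describes a standard step-decay training recipe: SGD with momentum 0.9 and weight decay 5×10⁻⁴, an initial learning rate of 10⁻⁵ decreased by a factor of 10 every 20 epochs, for at most 100 epochs. The sketch below is an illustrative reading of that schedule, not the authors' code; the function name and constants are our own:

```python
# Step-decay learning-rate schedule as described in the paper's
# experimental settings (hypothetical helper, not from the authors).
INIT_LR = 1e-5        # initial learning rate
DECAY_FACTOR = 10.0   # divide lr by this every STEP epochs
STEP = 20             # epochs between decays
MAX_EPOCHS = 100      # maximum training epochs

def lr_at_epoch(epoch: int) -> float:
    """Learning rate used at a given 0-indexed epoch."""
    return INIT_LR / (DECAY_FACTOR ** (epoch // STEP))

if __name__ == "__main__":
    # Show the lr at the start of each 20-epoch decay interval.
    for epoch in range(0, MAX_EPOCHS, STEP):
        print(f"epoch {epoch:3d}: lr = {lr_at_epoch(epoch):.0e}")
```

In PyTorch this behavior would typically be obtained with `torch.optim.SGD(..., lr=1e-5, momentum=0.9, weight_decay=5e-4)` combined with a step scheduler such as `torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)`.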