Learning Fine-Grained Alignment for Aerial Vision-Dialog Navigation
Authors: Yifei Su, Dong An, Kehan Chen, Weichen Yu, Baiyang Ning, Yonggen Ling, Yan Huang, Liang Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our explicit entity-landmark alignment learning is beneficial for AVDN. As a result, FELA achieves leading performance with 3.2% SR and 4.9% GP improvements over prior arts. The proposed method is evaluated on the ANDH task (Fan et al. 2023a), which is the only available benchmark for Aerial Vision-Dialog Navigation. Evaluation Metrics. We use the standard metrics for evaluation (Fan et al. 2023a), including: 1) Success Rate (SR): the ratio of predicted paths being regarded as successful; 2) Success weighted by inverse Path Length (SPL): SR weighted by the total length of the navigation path; 3) Goal Progress (GP): the distance of the navigation progress towards the destination area. |
| Researcher Affiliation | Collaboration | Yifei Su (1,2), Dong An (3), Kehan Chen (1,2), Weichen Yu (4), Baiyang Ning (1,2), Yonggen Ling (5), Yan Huang (1,2), Liang Wang (1,2). (1) School of Artificial Intelligence, University of Chinese Academy of Sciences; (2) MAIS, Institute of Automation, Chinese Academy of Sciences; (3) Mohamed bin Zayed University of Artificial Intelligence; (4) Electrical and Computer Engineering Department, Carnegie Mellon University; (5) Robotics X, Tencent, Shenzhen, China |
| Pseudocode | No | The paper does not contain an explicit pseudocode block or algorithm section. It describes the methods in paragraph form and through equations. |
| Open Source Code | Yes | Code: https://github.com/yifeisu/FELA |
| Open Datasets | Yes | Aerial Vision-Dialog Navigation (AVDN) is a new task... Fan et al. (Fan et al. 2023a) propose a challenging ANDH task... The proposed method is evaluated on the ANDH task (Fan et al. 2023a), which is the only available benchmark for Aerial Vision-Dialog Navigation. The ANDH task splits the AVDN dataset into 6269 sub-trajectories according to dialog rounds. |
| Dataset Splits | Yes | The ANDH task splits the AVDN dataset into 6269 sub-trajectories according to dialog rounds. These sub-trajectories are further divided into 4 splits via their scene types, including 4591 for training, 370 for seen validation, 411 for unseen validation, and others for unseen testing. |
| Hardware Specification | Yes | Our experiments are conducted on two NVIDIA RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions software components such as the YOLOv5-x, RoBERTa, and Swin-Tiny backbones and the AdamW optimizer, but does not provide specific version numbers for these or other ancillary software (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | All models are optimized for 200,000 iterations (~50 hours) with a batch size of 8 and a learning rate of 1e-5 via the AdamW optimizer. The hidden size of dialog encoding, history encoding, and semantic grid representation D is uniformly set to 768. The number of transformer layers for the text encoder and episodic transformer is set to 9 and 3, respectively. For weight coefficients, we set κ1, κ2 in Formula 10 to 1 and 0.1, respectively. The τ in Formula 8 is set to 0.02 following (Jiang and Ye 2023). |
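The three evaluation metrics quoted in the table (SR, SPL, GP) can be sketched concretely. This is a minimal illustration of the standard VLN-style definitions, not code from the FELA repository; the function names and input conventions are assumptions for clarity.

```python
def success_rate(successes):
    """SR: fraction of episodes whose predicted path is judged successful.

    `successes` is a list of 0/1 flags, one per episode.
    """
    return sum(successes) / len(successes)


def spl(successes, shortest_lengths, path_lengths):
    """SPL: success weighted by inverse (normalized) path length.

    Each successful episode contributes l / max(p, l), where l is the
    shortest-path length and p is the agent's actual path length, so
    longer-than-necessary paths are penalized.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)


def goal_progress(start_dists, end_dists):
    """GP: mean reduction in distance to the destination area per episode."""
    n = len(start_dists)
    return sum(ds - de for ds, de in zip(start_dists, end_dists)) / n
```

For example, an episode that succeeds but takes a path twice the shortest length contributes 0.5 rather than 1.0 to SPL, which is why SPL is always at most SR.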