Learning Fine-Grained Alignment for Aerial Vision-Dialog Navigation

Authors: Yifei Su, Dong An, Kehan Chen, Weichen Yu, Baiyang Ning, Yonggen Ling, Yan Huang, Liang Wang

AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our explicit entity-landmark alignment learning is beneficial for AVDN. As a result, FELA achieves leading performance with 3.2% SR and 4.9% GP improvements over prior arts. The proposed method is evaluated on the ANDH task (Fan et al. 2023a), which is the only available benchmark for Aerial Vision-Dialog Navigation. The ANDH task splits the AVDN dataset into 6269 sub-trajectories according to dialog rounds. These sub-trajectories are further divided into 4 splits via their scene types, including 4591 for training, 370 for seen validation, 411 for unseen validation, and others for unseen testing. Evaluation Metrics. We use the standard metrics for evaluation (Fan et al. 2023a), including: 1) Success Rate (SR): the ratio of predicted paths being regarded as successful; 2) Success weighted by inverse Path Length (SPL): SR weighted by the total length of the navigation path; 3) Goal Progress (GP): the distance of the navigation progress towards the destination area.
Researcher Affiliation | Collaboration | Yifei Su1,2, Dong An3, Kehan Chen1,2, Weichen Yu4, Baiyang Ning1,2, Yonggen Ling5, Yan Huang1,2, Liang Wang1,2; 1School of Artificial Intelligence, University of Chinese Academy of Sciences; 2MAIS, Institute of Automation of Chinese Academy of Sciences; 3Mohamed bin Zayed University of Artificial Intelligence; 4Electrical and Computer Engineering Department, Carnegie Mellon University; 5Robotics X, Tencent, Shenzhen, China
Pseudocode | No | The paper does not contain an explicit pseudocode block or algorithm section; it describes the methods in paragraph form and through equations.
Open Source Code | Yes | Code: https://github.com/yifeisu/FELA
Open Datasets | Yes | Aerial Vision-Dialog Navigation (AVDN) is a new task... Fan et al. (Fan et al. 2023a) propose a challenging ANDH task... The proposed method is evaluated on the ANDH task (Fan et al. 2023a), which is the only available benchmark for Aerial Vision-Dialog Navigation. The ANDH task splits the AVDN dataset into 6269 sub-trajectories according to dialog rounds.
Dataset Splits | Yes | The ANDH task splits the AVDN dataset into 6269 sub-trajectories according to dialog rounds. These sub-trajectories are further divided into 4 splits via their scene types, including 4591 for training, 370 for seen validation, 411 for unseen validation, and others for unseen testing.
Hardware Specification | Yes | Our experiments are conducted on two NVIDIA RTX 3090 GPUs.
Software Dependencies | No | The paper mentions software components like YOLOv5-x, RoBERTa, and Swin-Tiny backbones, and the AdamW optimizer, but does not provide specific version numbers for these or other ancillary software (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | All models are optimized for 200,000 iterations (~50 hours) with a batch size of 8 and a learning rate of 1e-5 via the AdamW optimizer. The hidden size of dialog encoding, history encoding, and semantic grid representation D is uniformly set to 768. The number of transformer layers for the text encoder and episodic transformer is set to 9 and 3, respectively. For weight coefficients, we set κ1, κ2 in Formula 10 to 1, 0.1, respectively. The τ in Formula 8 is set to 0.02 following (Jiang and Ye 2023).
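The SR/SPL/GP metrics named in the Research Type row can be sketched in a few lines. This is a minimal illustration of the standard navigation-metric definitions, not the paper's evaluation code; the episode field names (`path_length`, `shortest_length`, `final_dist`, `start_dist`) and the success radius are our own assumptions.

```python
def navigation_metrics(episodes, success_radius=5.0):
    """Average SR, SPL, and GP over a list of episodes.

    Each episode is a dict with hypothetical fields (in meters):
      path_length     - length of the agent's executed path
      shortest_length - geodesic length of the ground-truth path
      final_dist      - final distance to the destination area
      start_dist      - initial distance to the destination area
    """
    sr = spl = gp = 0.0
    for ep in episodes:
        # SR: an episode succeeds if it ends within the success radius.
        success = 1.0 if ep["final_dist"] <= success_radius else 0.0
        sr += success
        # SPL: success weighted by shortest length over the longer of
        # the executed and shortest path lengths.
        spl += success * ep["shortest_length"] / max(
            ep["path_length"], ep["shortest_length"]
        )
        # GP: distance progressed toward the destination.
        gp += ep["start_dist"] - ep["final_dist"]
    n = len(episodes)
    return sr / n, spl / n, gp / n
```

A longer executed path thus drags SPL below SR even when the episode still counts as a success.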
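The hyperparameters listed in the Experiment Setup row can be collected into a single config fragment, which makes a re-implementation attempt easier to check against the paper. The key names and the loss-combination helper below are our own illustration; only the values come from the paper, and the loss term names are hypothetical.

```python
# Hyperparameters transcribed from the reported experiment setup
# (key names are ours, not the paper's).
CONFIG = {
    "iterations": 200_000,
    "batch_size": 8,
    "learning_rate": 1e-5,
    "optimizer": "AdamW",
    "hidden_size": 768,              # D: dialog/history/semantic-grid size
    "text_encoder_layers": 9,
    "episodic_transformer_layers": 3,
    "kappa1": 1.0,                   # weight coefficient in Formula 10
    "kappa2": 0.1,                   # weight coefficient in Formula 10
    "tau": 0.02,                     # temperature in Formula 8
}


def total_loss(l_main, l_aux, cfg=CONFIG):
    """Weighted sum in the style of Formula 10 (term names hypothetical)."""
    return cfg["kappa1"] * l_main + cfg["kappa2"] * l_aux
```

With κ2 = 0.1, the auxiliary alignment objective contributes at a tenth of the weight of the main term.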