Vision-Language Models for UAV Navigation: The foundation and frontier of vision-language navigation

An overview of the basic paradigm, core architecture, and representative work in VLM-based UAV navigation, covering recent papers such as LogisticsVLN, OmniVLN, and ASMA.


UAV Intelligence Series · Part X

Focus: Basic paradigm, core architecture, and representative work of VLM+UAV


1. Background: From verbal commands to autonomous flight

Traditional UAV path planning relies on precise mathematical objective functions (such as shortest path or minimum energy consumption), but real-world mission instructions are often vague natural-language descriptions.

Such instructions cannot be converted directly into mathematical optimization objectives, but a Vision-Language Model (VLM) can interpret and reason about them. Vision-Language Navigation (VLN) is the core research direction addressing this problem: it lets robots (here, UAVs) navigate three-dimensional physical space by following natural-language instructions.


2. Task Definition: Core Issues of VLN

The VLN task can be formalized as:

Given a natural-language instruction $I$ and an initial visual observation $o_0$, the agent executes a sequence of actions $a_1, \dots, a_T$ and finally reaches the target position described by the instruction.
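The task definition above can be sketched as a simple rollout loop. This is a minimal illustration, not a benchmark implementation: the `Episode` fields, the 3 m success radius, the step budget, and the convention that a zero displacement means "stop" are all assumptions, and the agent's position stands in for the visual observation.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

Position = Tuple[float, float, float]

@dataclass
class Episode:
    instruction: str
    start: Position
    goal: Position
    success_radius: float = 3.0  # assumed success threshold in metres
    max_steps: int = 50          # assumed step budget

def run_episode(ep: Episode,
                policy: Callable[[str, Position], Position]) -> Tuple[bool, List[Position]]:
    """Roll out a policy that maps (instruction, observation) to a 3D displacement."""
    pos = ep.start
    trajectory = [pos]
    for _ in range(ep.max_steps):
        dx, dy, dz = policy(ep.instruction, pos)
        if (dx, dy, dz) == (0.0, 0.0, 0.0):  # zero displacement is treated as STOP
            break
        pos = (pos[0] + dx, pos[1] + dy, pos[2] + dz)
        trajectory.append(pos)
    # Success: the final position lies within the success radius of the goal.
    return math.dist(pos, ep.goal) <= ep.success_radius, trajectory

def toy_policy(instruction: str, obs: Position) -> Position:
    """Stand-in for a VLM agent: fly 1 m per step toward a hard-coded target."""
    target = (10.0, 0.0, 5.0)
    delta = [t - o for t, o in zip(target, obs)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm < 1.0:
        return (0.0, 0.0, 0.0)
    return (delta[0] / norm, delta[1] / norm, delta[2] / norm)

episode = Episode("fly to the rooftop", (0.0, 0.0, 0.0), (10.0, 0.0, 5.0))
success, trajectory = run_episode(episode, toy_policy)
```

In a real system the policy would consume camera frames rather than positions; the loop structure and the distance-based success check are the parts that carry over.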

The key challenges are:

  1. Semantic grounding: Mapping spatial relations in language ("left", "behind", "above") to physical space
  2. Long-horizon reasoning: Instructions often describe complex multi-step tasks
  3. Zero-shot generalization: Handling unseen buildings, environments, and objects
  4. Three-dimensional motion: Unlike ground robots, UAVs can move freely in all three dimensions

3. Representative work

3.1 LogisticsVLN: UAV VLN for terminal delivery (arXiv, 2025)

Paper: LogisticsVLN: Vision-Language Navigation For Low-Altitude Terminal Delivery Based on Agentic UAVs

Authors: Xinyuan Zhang, Yonglin Tian, Fei Lin, Yue Liu, Jing Ma, Kornélia Sára Szatmáry, Fei-Yue Wang

Source: arXiv:2505.03460, May 2025

Core contribution:

Method Framework:

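User instruction: "Deliver the package next to the red gate"
↓
VLM semantic parsing (object detection + spatial relations)
↓
Topological map matching (detected landmarks vs. prior map)
↓
Path planning (coarse global planning + local visual replanning)
↓
MPC controller execution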
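The map-matching stage in the flow above can be illustrated with a small sketch. This is not the LogisticsVLN algorithm: the `Landmark` type, the greedy nearest-neighbour rule, and the 5 m gating distance are assumptions chosen only to show what "detected landmarks vs. prior map" could mean in code.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class Landmark:
    label: str               # semantic class, e.g. "red gate"
    xy: Tuple[float, float]  # position in a shared map frame (metres)

def match_landmarks(detected: List[Landmark],
                    prior_map: List[Landmark],
                    max_dist: float = 5.0) -> Dict[int, int]:
    """Match each detection to the nearest prior-map node with the same label.

    Returns {detection index: map node index}; detections with no same-label
    node within max_dist are left unmatched.
    """
    matches: Dict[int, int] = {}
    for i, det in enumerate(detected):
        best_j, best_d = None, max_dist
        for j, node in enumerate(prior_map):
            if node.label != det.label:
                continue
            d = math.dist(det.xy, node.xy)
            if d <= best_d:
                best_j, best_d = j, d
        if best_j is not None:
            matches[i] = best_j
    return matches

prior = [Landmark("tree", (3.0, 3.0)), Landmark("red gate", (10.0, 2.0))]
seen = [Landmark("red gate", (10.5, 2.2)), Landmark("bench", (1.0, 1.0))]
matches = match_landmarks(seen, prior)
```

Here the detected red gate is associated with map node 1, while the bench has no counterpart in the prior map and stays unmatched.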

Key insight: This is arguably the VLN work closest to real UAV delivery scenarios to date. It integrates a GPT-4V-class vision-language model with the physical control layer end to end.


3.2 OmniVLN: Air-ground cross-platform on-device VLN (arXiv, 2026)

Paper: OmniVLN: Omnidirectional 3D Perception and Token-Efficient LLM Reasoning for Visual-Language Navigation across Air and Ground Platforms

Authors: Zhongyuang Liu, Min He, Shaonan Yu et al.

Source: arXiv, March 2026

Core contribution:

  1. 3D token compression: Encodes 3D spatial information into compact tokens to reduce the number of LLM input tokens
  2. Dynamic field-of-view management: Adaptively adjusts the region of interest according to navigation needs
  3. Lightweight VLM backbone: An on-device variant based on the Qwen-VL or LLaVA architecture
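One simple way to picture 3D token compression is voxel pooling: points that fall into the same voxel are merged into a single averaged token. The sketch below is a generic illustration under that assumption, not OmniVLN's actual encoder; the function name and the 1 m voxel size are hypothetical.

```python
import numpy as np

def voxel_pool_tokens(points: np.ndarray, feats: np.ndarray, voxel_size: float):
    """Merge per-point features into one token per occupied voxel.

    points: (N, 3) positions in metres; feats: (N, D) per-point features.
    Returns (voxel centres (M, 3), pooled features (M, D)) with M <= N.
    """
    keys = np.floor(points / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # flatten; the shape differs across NumPy versions
    sums = np.zeros((len(uniq), feats.shape[1]))
    np.add.at(sums, inverse, feats)  # accumulate features per voxel
    counts = np.bincount(inverse, minlength=len(uniq)).reshape(-1, 1)
    centres = (uniq + 0.5) * voxel_size
    return centres, sums / counts

points = np.array([[0.1, 0.2, 0.3], [0.8, 0.9, 0.1], [1.2, 0.1, 0.4], [1.9, 0.5, 0.5]])
feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
centres, tokens = voxel_pool_tokens(points, feats, voxel_size=1.0)
```

Four input points collapse into two tokens here; a larger voxel size trades spatial detail for a shorter LLM input sequence.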

3.3 ASMA: Safety-margin-aware UAV VLN (arXiv, 2024)

Paper: ASMA: An Adaptive Safety Margin Algorithm for Vision-Language Drone Navigation via Scene-Aware Control Barrier Functions

Source: arXiv, September 2024

Core contribution:

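Why it matters: Most VLN work focuses on navigation accuracy and ignores safety. ASMA addresses this gap: when an instruction is misunderstood or ambiguous, the UAV can trade off following the instruction against avoiding a collision, so that "not understanding the instruction" does not turn into "hitting the wall".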
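The idea of a control barrier function (CBF) safety filter can be shown with a minimal sketch. It assumes single-integrator dynamics (the commanded velocity is applied directly), one point obstacle, and a closed-form projection instead of a QP solver. The `adaptive_margin` rule is purely hypothetical and is not ASMA's scene-aware formulation.

```python
import numpy as np

def adaptive_margin(base_margin: float, clutter: float) -> float:
    """Hypothetical rule: widen the margin as scene clutter (0..1) grows."""
    return base_margin * (1.0 + clutter)

def cbf_filter(x, u_nom, x_obs, margin, alpha=1.0):
    """Minimally modify u_nom so that h(x) = ||x - x_obs||^2 - margin^2 stays >= 0.

    For x_dot = u the CBF condition is grad_h(x) . u + alpha * h(x) >= 0,
    a single half-space constraint whose projection has a closed form.
    """
    diff = x - x_obs
    h = diff @ diff - margin ** 2
    a = 2.0 * diff                   # gradient of h with respect to x
    slack = a @ u_nom + alpha * h
    if slack >= 0.0:
        return u_nom                 # nominal command is already safe
    if a @ a == 0.0:
        return np.zeros_like(u_nom)  # degenerate case: on top of the obstacle
    return u_nom - slack / (a @ a) * a  # project onto the constraint boundary

x = np.zeros(3)
obstacle = np.array([2.0, 0.0, 0.0])
u_nominal = np.array([5.0, 0.0, 0.0])  # planner wants to fly at the obstacle
u_safe = cbf_filter(x, u_nominal, obstacle, margin=adaptive_margin(1.0, 0.0))
u_cluttered = cbf_filter(x, u_nominal, obstacle, margin=adaptive_margin(1.0, 1.0))
```

With a 1 m margin the forward speed is cut from 5 m/s to 0.75 m/s; doubling the margin in a cluttered scene brings the UAV to a stop at the same position.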


3.4 Vision-and-Language Navigation for UAVs: Survey (arXiv, 2026)

Paper: Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap

Authors: Hanxuan Chen, Jie Zheng, Siqi Yang et al.

Source: arXiv:2604.xxxxx, April 2026

Survey coverage:


4. Technical architecture decomposition

4.1 Perception layer

Camera Configuration:

| Type | Advantages | Disadvantages |
|------|------|------|
| Forward-facing RGB | Mature, cheap | Narrow field of view, limited information |
| Omnidirectional camera | 360° perception | Low resolution, large distortion |
| Depth camera | Dense depth | Fails outdoors, limited range |
| Multi-camera | Stereo triangulation | Complex calibration |

Perception module responsibilities:

  1. Object detection + semantic segmentation (Grounding DINO, YOLO-World)
  2. Spatial relation extraction (left/right, above/below, relative distance)
  3. Scene graph construction (objects + relations + topology)
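The step from detections to a scene graph can be sketched as follows. The example assumes each detected object already has a 3D position in a body frame with x forward, y left, and z up; the `Obj` type, the relation names, and the 0.5 m threshold are illustrative assumptions rather than the output of any specific detector.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import List, Tuple

@dataclass(frozen=True)
class Obj:
    name: str
    x: float  # metres forward of the UAV
    y: float  # metres to the left of the UAV
    z: float  # metres above the UAV

def spatial_relations(objs: List[Obj], min_gap: float = 0.5) -> List[Tuple[str, str, str]]:
    """Derive (subject, relation, object) triples from relative positions.

    Only "left of" and "above" are emitted; "right of" and "below" are the
    same edges read in the opposite direction.
    """
    triples = []
    for a, b in permutations(objs, 2):
        if a.y - b.y > min_gap:
            triples.append((a.name, "left of", b.name))
        if a.z - b.z > min_gap:
            triples.append((a.name, "above", b.name))
    return triples

scene = [Obj("gate", 10.0, 2.0, 1.0), Obj("tree", 8.0, -3.0, 4.0)]
graph_edges = spatial_relations(scene)
```

The resulting triples can be serialized as text and placed in the VLM prompt, or stored as edges of a graph whose nodes are the detected objects.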

4.2 Understanding layer

VLM selection comparison:

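| Model | Parameters | Vision capability | Edge deployment | Typical use |
|------|------|------|------|------|
| GPT-4V | ~1.8T (unofficial estimate) | Extremely strong | | Academic research |
| GPT-4o | ~200B (unofficial estimate) | Extremely strong | | Cloud API |
| LLaVA-1.6 | 7B/13B/34B | Strong | ✅ (ONNX) | Local deployment |
| Qwen-VL | 7B/72B | Strong | | Chinese-language scenarios |
| CogVLM | 17B | Strong | ⚠️ | Balanced option |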

4.3 Planning layer

Existing planning paradigm:

  1. LLM as planner: The LLM directly outputs action sequences (e.g., ReAct, Reflexion)
    Instruction → LLM reasoning → Action sequence → Execution
  2. PDDL symbolic planning: The LLM generates a PDDL domain description, which a classical planner then solves
    • Representative: UniPlan
  3. Learned planning: End-to-end imitation learning or reinforcement learning
    • Advantages: Adapts to dynamic environments
    • Disadvantages: Poor generalization
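For the "LLM as planner" paradigm, one practical detail is that free-form model output has to be validated before it reaches the flight stack. The sketch below parses a numbered plan into checked actions. It does not call any LLM, and the action vocabulary and the one-action-per-line text format are assumptions made for illustration.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical discrete action vocabulary for a UAV.
ALLOWED = {"takeoff", "forward", "back", "left", "right", "up", "down", "yaw", "land"}

# Optional list number, an action word, and an optional numeric argument.
LINE = re.compile(r"\s*(?:\d+[.)]\s*)?([a-z_]+)(?:\s+(-?\d+(?:\.\d+)?))?\s*$")

def parse_plan(text: str) -> List[Tuple[str, Optional[float]]]:
    """Turn lines such as '2. forward 5' into (action, argument) pairs.

    Lines that do not look like a single action are skipped; a well-formed
    line with an unknown action raises instead of being executed.
    """
    actions = []
    for line in text.splitlines():
        m = LINE.match(line.lower())
        if not m:
            continue
        name, arg = m.group(1), m.group(2)
        if name not in ALLOWED:
            raise ValueError(f"unknown action: {name}")
        actions.append((name, float(arg) if arg is not None else None))
    return actions

llm_output = "Here is the plan:\n1. takeoff\n2. forward 5\n3. up 2.5\n4. land"
plan = parse_plan(llm_output)
```

Rejecting unknown actions outright, rather than guessing, keeps hallucinated commands away from the controller.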

4.4 Control layer

UAV control features:

  • Requires real-time trajectory tracking (control frequency above 100 Hz)


5. Key Challenges

5.1 Sim2Real Gap

5.2 Inference delay vs real-time control

| VLM | Inference delay | Applicable scenarios |
|------|------|------|
| GPT-4o | 1-3 s | Cloud, offline planning |
| LLaVA-7B | 0.5-1 s | Edge, delay-tolerant planning |
| LLaVA-3B | 0.2-0.5 s | Edge, near real-time |
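To make the gap concrete, the short calculation below converts the upper end of each latency range into the number of control ticks that pass at the 100 Hz rate mentioned in section 4.4, and the distance flown during one inference. The 5 m/s cruise speed is an assumption for illustration only.

```python
def ticks_and_drift(latency_s: float, control_hz: float = 100.0, speed_mps: float = 5.0):
    """Return (control ticks elapsed, metres flown) during one VLM inference."""
    return round(latency_s * control_hz), latency_s * speed_mps

# Upper end of each latency range from the table above.
worst_case = {"GPT-4o": 3.0, "LLaVA-7B": 1.0, "LLaVA-3B": 0.5}
report = {name: ticks_and_drift(t) for name, t in worst_case.items()}
# report["GPT-4o"]   -> (300, 15.0)
# report["LLaVA-7B"] -> (100, 5.0)
# report["LLaVA-3B"] -> (50, 2.5)
```

Even the fastest entry spans tens of control ticks, so the VLM cannot sit inside the control loop; it can only supply waypoints or goals that a faster low-level controller tracks.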

Solution direction:

5.3 Three-dimensional spatial reasoning

Spatial relations expressed in language ("behind the tree", "under the bridge") do not map to simple projections in three-dimensional space: their meaning depends on the viewpoint and the reference frame.
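A small sketch shows why "behind the tree" is viewpoint-dependent. It interprets "behind" as a point a fixed distance beyond the landmark along the horizontal line of sight from the viewer. This is one possible reading among several, and the function name and the 2 m offset are assumptions.

```python
import numpy as np

def behind_point(viewer, landmark, offset: float = 2.0) -> np.ndarray:
    """Point `offset` metres beyond the landmark, along the viewer's horizontal line of sight."""
    v = np.asarray(viewer, dtype=float)
    lm = np.asarray(landmark, dtype=float)
    direction = lm - v
    direction[2] = 0.0  # ignore the altitude difference
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        raise ValueError("viewer is directly above or below the landmark")
    return lm + offset * direction / norm

tree = (10.0, 0.0, 0.0)
from_west = behind_point((0.0, 0.0, 10.0), tree)     # UAV approaching from the west
from_south = behind_point((10.0, -10.0, 10.0), tree)  # UAV approaching from the south
```

The same phrase resolves to (12, 0, 0) for the first UAV pose and (10, 2, 0) for the second, so the instruction cannot be grounded without knowing where the observer is.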

Research Frontiers:


6. Dataset summary

| Dataset | Platform | Scale | Features |
|--------|------|------|------|
| RxR | Ground | 126K instructions | Multilingual, expert annotation |
| VLN-CE | Ground | 61K trajectories | Matterport3D |
| AI-TOD | UAV | ~20K instructions | Aerial perspective, aerial photography |
| UAV-VLN | UAV | ~10K | Urban canyon scenes |
| D3DROU | UAV | ~5K | Dynamic obstacles, real flight |


7. Future research directions

  1. Multimodal fusion: RGB + depth + event camera + LiDAR
  2. Few-shot adaptation: LoRA / QLoRA fine-tuning to adapt to specific urban environments
  3. Multi-UAV collaborative VLN: Multiple UAVs jointly interpret and carry out the same instruction
  4. World-model assistance: Integrating a world model to predict future states
  5. Safety verification: Formal methods for verifying the safety of VLN decisions

📚 References

  1. Zhang et al. LogisticsVLN: Vision-Language Navigation For Low-Altitude Terminal Delivery Based on Agentic UAVs. arXiv:2505.03460, 2025.
  2. Liu et al. OmniVLN: Omnidirectional 3D Perception and Token-Efficient LLM Reasoning for Visual-Language Navigation across Air and Ground Platforms. arXiv, 2026.
  3. Chen et al. Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap. arXiv, 2026.
  4. ASMA: An Adaptive Safety Margin Algorithm for Vision-Language Drone Navigation via Scene-Aware Control Barrier Functions. arXiv, 2024.
  5. Blukis et al. Mapping Navigation Instructions to Continuous Control Actions with Position-Visitation Prediction. CoRL, 2018.
  6. Raychaudhuri et al. Zero-shot Object-Centric Instruction Following: Integrating Foundation Models with Traditional Navigation. arXiv, 2024.