# LLM-Guided UAV Mission Planning: The Frontier from Inference to Execution

An in-depth analysis of the three major paradigms for LLM-based UAV mission planning (LLM as planner, LLM + PDDL symbolic planning, and LLM + RAG), covering cutting-edge work such as VoxPoser, ActiveGAMER, and dual-process architectures.


UAV Intelligence Series · Chapter X+1 Spotlight: LLMs as mission planners, symbolic planning integration, real-time inference architecture


## 1. Why are LLMs suited to UAV mission planning?

The core challenge of UAV mission planning is open-world uncertainty:

**Traditional planning (model-based):**
- Input: precise goal state + precise environment model
- Output: optimal action sequence
- Limitation: breaks down when the model is inaccurate; cannot handle goals stated in natural language

**LLM planning (knowledge-based):**
- Input: natural-language instructions + visual observations + world knowledge
- Output: executable action sequence
- Advantage: strong generalization; zero-shot understanding of new tasks

Advantages of LLMs:


## 2. LLM paradigms for task planning

### 2.1 Paradigm 1: LLM as planner (directly outputs actions)

Representative work:

- **ReAct** (Reasoning + Acting)
- **SayCan** (PaLM-SayCan, 2022)
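
SayCan's selection rule can be illustrated with a small sketch: at each step it scores every skill by multiplying the LLM's estimate that the skill helps with the instruction by a learned affordance (value-function) estimate that the skill can succeed in the current state, then executes the highest-scoring skill. Everything below is an assumption for illustration: the UAV skill names and the `toy_llm` / `toy_affordance` stubs are hypothetical stand-ins for a real LLM and a real value function.

```python
from typing import Callable, Dict, List

# Hypothetical UAV skill library; names are illustrative only.
SKILLS = ["take_off", "fly_to_roof", "inspect_chimney", "land", "done"]


def saycan_select(llm_scores: Dict[str, float], affordances: Dict[str, float]) -> str:
    """Pick the skill maximizing (LLM usefulness) x (affordance / feasibility)."""
    combined = {s: llm_scores[s] * affordances.get(s, 0.0) for s in llm_scores}
    return max(combined, key=combined.__getitem__)


def saycan_plan(instruction: str,
                llm_score_fn: Callable[[str, List[str], str], float],
                affordance_fn: Callable[[List[str], str], float],
                max_steps: int = 10) -> List[str]:
    """Greedy SayCan-style loop: choose one skill per step until 'done'."""
    plan: List[str] = []
    for _ in range(max_steps):
        llm = {s: llm_score_fn(instruction, plan, s) for s in SKILLS}
        aff = {s: affordance_fn(plan, s) for s in SKILLS}
        skill = saycan_select(llm, aff)
        plan.append(skill)
        if skill == "done":
            break
    return plan


def toy_llm(instruction: str, plan: List[str], skill: str) -> float:
    # Hypothetical LLM stub: prefers task-relevant skills and ignores feasibility.
    if "inspect_chimney" not in plan:
        prefs = {"inspect_chimney": 0.5, "fly_to_roof": 0.25, "take_off": 0.15,
                 "land": 0.05, "done": 0.05}
    else:
        prefs = {"land": 0.5, "done": 0.4, "take_off": 0.04,
                 "fly_to_roof": 0.03, "inspect_chimney": 0.03}
    return prefs[skill]


def toy_affordance(plan: List[str], skill: str) -> float:
    # Hypothetical value-function stub: can the skill succeed in the current state?
    airborne = plan.count("take_off") > plan.count("land")
    at_roof = airborne and "fly_to_roof" in plan
    inspected = "inspect_chimney" in plan
    return {
        "take_off": 0.95 if not airborne else 0.0,
        "fly_to_roof": 0.9 if airborne else 0.0,
        "inspect_chimney": 0.9 if at_roof else 0.05,
        "land": 0.9 if airborne else 0.0,
        "done": 1.0 if inspected and not airborne else 0.0,
    }[skill]


print(saycan_plan("inspect the chimney", toy_llm, toy_affordance))
# ['take_off', 'fly_to_roof', 'inspect_chimney', 'land', 'done']
```

The product is what lets affordances veto a preferred but infeasible skill: in the first step the stub LLM prefers `inspect_chimney`, but its low feasibility on the ground makes `take_off` win.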


### 2.2 Paradigm 2: LLM + PDDL symbolic planning

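PDDL (Planning Domain Definition Language) is the classic standard language for AI planning and is widely used for robot task planning. It models a task as a discrete symbolic problem: a domain file defines actions with preconditions and effects, and a problem file defines the objects, initial state, and goal.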

Core idea:

VLM perception → PDDL problem generation → classical planner → UAV action sequence
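
A minimal sketch of the last two stages of this pipeline, under loud assumptions: the VLM's output is replaced by a hard-coded list of hypothetical locations, the PDDL text targets a hypothetical `uav-inspect` domain whose file is not shown, and a breadth-first search over grounded STRIPS actions stands in for a real classical planner such as Fast Downward.

```python
from collections import deque
from typing import FrozenSet, List, Optional, Tuple

Fact = Tuple[str, ...]
Action = Tuple[str, FrozenSet[Fact], FrozenSet[Fact], FrozenSet[Fact]]  # name, pre, add, delete


def make_pddl_problem(locations: List[str], start: str, targets: List[str]) -> str:
    """Render a PDDL problem for a hypothetical 'uav-inspect' domain (domain file not shown)."""
    conns = " ".join(f"(connected {a} {b})" for a in locations for b in locations if a != b)
    goal = " ".join(f"(inspected {t})" for t in targets)
    return (f"(define (problem inspect-run)\n"
            f"  (:domain uav-inspect)\n"
            f"  (:objects {' '.join(locations)} - location)\n"
            f"  (:init (at {start}) {conns})\n"
            f"  (:goal (and {goal})))")


def ground_actions(locations: List[str]) -> List[Action]:
    """Ground two STRIPS operators: fly(a, b) and inspect(l)."""
    actions: List[Action] = []
    for a in locations:
        actions.append((f"inspect {a}", frozenset({("at", a)}),
                        frozenset({("inspected", a)}), frozenset()))
        for b in locations:
            if a != b:
                actions.append((f"fly {a} {b}",
                                frozenset({("at", a), ("connected", a, b)}),
                                frozenset({("at", b)}), frozenset({("at", a)})))
    return actions


def bfs_plan(init: FrozenSet[Fact], goal: FrozenSet[Fact],
             actions: List[Action]) -> Optional[List[str]]:
    """Breadth-first search over STRIPS states; returns a shortest plan or None."""
    frontier = deque([(init, [])])
    seen = {init}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:
            return plan
        for name, pre, add, delete in actions:
            if pre <= state:
                nxt = (state - delete) | add
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, plan + [name]))
    return None


# Hypothetical "VLM detections" -> symbolic problem -> plan.
locs, start, targets = ["base", "tower", "bridge"], "base", ["tower", "bridge"]
print(make_pddl_problem(locs, start, targets))
init = frozenset({("at", start)} | {("connected", a, b) for a in locs for b in locs if a != b})
goal = frozenset(("inspected", t) for t in targets)
print(bfs_plan(init, goal, ground_actions(locs)))
```

Because BFS returns a shortest plan, the result has four actions (fly, inspect, fly, inspect); a real pipeline would hand the PDDL text to the planner rather than the Python facts.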

Advantages:

Challenge:

---

### 2.3 Paradigm 3: LLM + RAG (retrieval-augmented generation)

**GenerativeMPC (arXiv, 2026)**

**Paper:** *GenerativeMPC: VLM-RAG-guided Whole-Body MPC with Virtual Impedance for Bimanual Mobile Manipulation*
**Author:** Marcelino Julio Fernando et al.
**Source:** arXiv, April 2026

Core idea:

VLM perceives the current scene → retrieve from the relevant manipulation knowledge base → RAG generates manipulation suggestions → MPC executes
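
The retrieval step can be sketched as nearest-neighbour search over a small knowledge base. GenerativeMPC's actual encoder, knowledge-base format, and prompt are not described here, so this is a generic illustration: a bag-of-words cosine similarity stands in for a learned embedding, and the knowledge-base entries are invented.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use a learned text/image encoder."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, knowledge_base: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k entries most similar to the scene description."""
    q = embed(query)
    scored = [(cosine(q, embed(doc)), doc) for doc in knowledge_base]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def build_prompt(scene: str, knowledge_base: List[str], k: int = 2) -> str:
    """Ground the LLM prompt in retrieved experience before asking for an action."""
    context = "\n".join(f"- {doc}" for _, doc in retrieve(scene, knowledge_base, k))
    return f"Relevant experience:\n{context}\n\nScene: {scene}\nSuggest the next action."


kb = [  # hypothetical experience entries
    "open door approach the handle slowly and use low stiffness",
    "landing in wind descend slowly and hold position above the pad",
    "narrow gap reduce speed and align with the gap center",
]
print(build_prompt("strong wind near the landing pad", kb, k=1))
```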

Key techniques:

  1. Knowledge retrieval: retrieve the examples most relevant to the current scene from a manipulation knowledge base (including robot control experience data)
  2. Virtual impedance: generate compliance control parameters to avoid rigid collisions
  3. RAG filtering: ensure that the LLM's output is physically executable
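
Virtual impedance is usually modelled as a virtual spring-damper between the commanded and the actual position. The sketch below uses that textbook form (not necessarily GenerativeMPC's formulation) on a 1-D unit mass under a constant external contact force; the gains and the force are hypothetical values.

```python
import math


def simulate_impedance(x_des: float, f_ext: float, k: float, d: float,
                       m: float = 1.0, dt: float = 1e-3, t_end: float = 5.0) -> float:
    """1-D mass pulled toward x_des by a virtual spring-damper, plus a contact force.

    Integrated with semi-implicit Euler. At steady state k * (x_des - x) + f_ext = 0,
    so the body yields by f_ext / k instead of pushing rigidly against the contact.
    """
    x, v = x_des, 0.0
    for _ in range(int(round(t_end / dt))):
        force = k * (x_des - x) - d * v + f_ext  # virtual impedance + external force
        v += force / m * dt
        x += v * dt
    return x


for k in (50.0, 500.0):
    d = 2.0 * math.sqrt(k)  # critical damping for m = 1
    print(f"k={k:>5}: deflection = {simulate_impedance(0.0, -5.0, k, d):+.4f} m")
```

At steady state the spring balances the contact force, so the deflection is f_ext / k: about 0.1 m for k = 50 and 0.01 m for k = 500. Lowering the stiffness is what makes contact compliant rather than rigid.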

Adaptation to UAVs:


## 3. Real-time reasoning architecture

### 3.1 Dual-process architecture (arXiv, 2026)

**Paper:** *A Dual-Process Architecture for Real-Time VLM-Based Indoor Navigation*
**Author:** Joonhee Lee, Hyunseung Shin, Jeonggil Ko
**Source:** arXiv:2601.19401, January 2026

Core design:

**System architecture:**

- **Process 1 (slow): VLM reasoning thread**
  - VLM: "What should I do next?"
  - Frequency: ~0.2-1 Hz
  - Output: navigation goal / decision, passed down to Process 2
- **Process 2 (fast): control execution thread**
  - MPC: track a trajectory to the goal
  - Frequency: ~100 Hz
  - Output: motor control signals
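
The key property of this split is that the fast loop never blocks on the slow one. Below is a minimal sketch, assuming a 1-D position, a waypoint list standing in for VLM answers, a proportional controller standing in for MPC, and a 0.5 s VLM latency (all hypothetical); the ~1 Hz and ~100 Hz rates follow the architecture above. It is a deterministic single-loop simulation rather than two real threads, which keeps it reproducible.

```python
from collections import deque
from typing import Deque, List, Tuple


def run_dual_process(waypoints: List[float], ctrl_hz: int = 100, vlm_period_s: float = 1.0,
                     vlm_latency_s: float = 0.5, duration_s: float = 4.0,
                     kp: float = 2.0) -> Tuple[float, float, int, int]:
    """Simulate a slow VLM loop feeding a fast control loop, one control tick at a time.

    Every vlm_period_s a (hypothetical) VLM query is issued; its answer only arrives
    vlm_latency_s later. The fast loop never waits: it tracks the latest goal received.
    """
    dt = 1.0 / ctrl_hz
    period = round(vlm_period_s * ctrl_hz)
    latency = round(vlm_latency_s * ctrl_hz)
    x = 0.0
    goal = x  # hold position until the first VLM answer arrives
    in_flight: Deque[Tuple[int, float]] = deque()  # (tick when ready, proposed goal)
    vlm_calls = ctrl_steps = 0
    for tick in range(round(duration_s * ctrl_hz)):
        # Slow process: issue a non-blocking query at a low rate (stubbed by waypoints).
        if tick % period == 0 and vlm_calls < len(waypoints):
            in_flight.append((tick + latency, waypoints[vlm_calls]))
            vlm_calls += 1
        # Deliver answers whose latency has elapsed; the newest one wins.
        while in_flight and in_flight[0][0] <= tick:
            goal = in_flight.popleft()[1]
        # Fast process: one control step per tick (P-control stands in for MPC).
        x += kp * (goal - x) * dt
        ctrl_steps += 1
    return x, goal, vlm_calls, ctrl_steps


x, goal, vlm_calls, ctrl_steps = run_dual_process([1.0, 2.0, 3.0])
print(f"x={x:.3f} goal={goal} vlm_calls={vlm_calls} ctrl_steps={ctrl_steps}")
```

Over 4 s the controller runs 400 steps while the VLM answers only three times, which is the rate asymmetry the architecture is designed around.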

Design principles:


### 3.2 Hierarchical planning framework

**High level (LLM/VLM, seconds):**

Task understanding → subgoal decomposition → global path planning → authorize low-level execution

**Middle layer (differentiable optimization, ~100 ms):**

RRT*/MPC → local path replanning → smooth trajectory generation

**Low layer (PID/MPC, milliseconds):**

Attitude control → motor allocation → execution
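
The low layer can be sketched as a PID attitude loop feeding a quadrotor mixer (motor allocation). This is a generic textbook version, not tied to any particular autopilot: the X-frame motor order, the sign conventions, the gains, and the 500 Hz loop rate are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PID:
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, error: float, dt: float) -> float:
        """Textbook PID on the error signal (no anti-windup or derivative filtering)."""
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def mix_quad_x(thrust: float, roll: float, pitch: float, yaw: float) -> List[float]:
    """Map collective thrust and attitude commands to four normalized motor outputs.

    Motor order: front-left, front-right, rear-right, rear-left. Assumed conventions:
    +roll raises the left motors, +pitch raises the front motors, and +yaw raises the
    front-left/rear-right pair (taken to share a spin direction).
    """
    raw = [thrust + roll + pitch + yaw,
           thrust - roll + pitch - yaw,
           thrust - roll - pitch + yaw,
           thrust + roll - pitch - yaw]
    return [min(max(m, 0.0), 1.0) for m in raw]  # clip to the valid command range


dt = 0.002  # hypothetical 500 Hz inner loop
roll_pid = PID(kp=0.8, ki=0.1, kd=0.001)
roll_cmd = roll_pid.update(error=0.05, dt=dt)  # 0.05 rad roll error
print(mix_quad_x(thrust=0.5, roll=roll_cmd, pitch=0.0, yaw=0.0))
```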


---

## 4. Key algorithms in depth

### 4.1 VoxPoser: LLM-composed 3D value maps

**Paper:** *VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models*
**Author:** Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, Li Fei-Fei
**Source:** arXiv:2307.05973, July 2023

**Core contribution:**
- The LLM writes code that composes **3D spatial value maps** (composable 3D value maps) from VLM perception
- The maps encode "where to go" (affordances) and "what to avoid" (constraints)
- The composed map is used directly as the objective function for trajectory optimization

**Extension to UAVs:**
- A VLM outputs a 3D occupancy heat map
- The heat map drives the MPC cost function
- VoxPoser for UAVs = "3D spatial affordance from language"

**Note:** VoxPoser was first released on arXiv and was published at CoRL 2023 (Conference on Robot Learning).
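
The value-map idea can be sketched on a small voxel grid. Assumptions: target and obstacle positions are hard-coded in place of LLM-written perception code, both maps are expressed as costs (lower is better), composition is a weighted sum, and a greedy descent over the 26-neighbourhood stands in for the trajectory optimizer.

```python
import itertools
import math
from typing import Dict, List, Sequence, Tuple

Voxel = Tuple[int, int, int]
CostMap = Dict[Voxel, float]


def voxels(shape: Tuple[int, int, int]) -> List[Voxel]:
    return list(itertools.product(*(range(n) for n in shape)))


def target_map(shape: Tuple[int, int, int], target: Voxel) -> CostMap:
    """'Where to go': cost grows with distance to the target voxel."""
    return {v: math.dist(v, target) for v in voxels(shape)}


def avoid_map(shape: Tuple[int, int, int], obstacle: Voxel, radius: float) -> CostMap:
    """'What to avoid': penalty inside a radius around an obstacle."""
    return {v: max(0.0, radius - math.dist(v, obstacle)) for v in voxels(shape)}


def compose(weighted: Sequence[Tuple[float, CostMap]]) -> CostMap:
    """Weighted sum of maps, mimicking how separate value maps are composed."""
    keys = weighted[0][1].keys()
    return {v: sum(w * m[v] for w, m in weighted) for v in keys}


def greedy_path(cost: CostMap, start: Voxel, max_steps: int = 200) -> List[Voxel]:
    """Steepest descent over the 26-neighbourhood; stops at a local minimum."""
    path = [start]
    for _ in range(max_steps):
        cur = path[-1]
        nbrs = [tuple(c + d for c, d in zip(cur, delta))
                for delta in itertools.product((-1, 0, 1), repeat=3) if any(delta)]
        best = min((n for n in nbrs if n in cost), key=cost.__getitem__)
        if cost[best] >= cost[cur]:
            break
        path.append(best)
    return path


shape, start, target, obstacle = (10, 10, 5), (0, 0, 2), (9, 9, 2), (5, 5, 2)
cost = compose([(1.0, target_map(shape, target)), (5.0, avoid_map(shape, obstacle, 3.0))])
path = greedy_path(cost, start)
print(path[-1], round(cost[path[-1]], 3))
```

Greedy descent can stall in a local minimum near the obstacle; a UAV version would instead feed the composed map into the MPC cost, as described above.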

---

### 4.2 CoNVO (Conditional Neural Value Optimization)

Combines LLM planning with value iteration:
- The LLM provides **prior preferences** (which actions are more reasonable)
- Value iteration provides an **optimality guarantee**
- More robust than pure LLM planning and more flexible than pure classical planning
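
This section does not give CoNVO's equations, so the sketch below is not CoNVO. It shows one generic way to combine a language-model prior with value iteration: KL-regularized ("soft") value iteration on a toy 5-cell corridor, where a hypothetical LLM prior `prior(a|s)` weights each action and the temperature `tau` sets how far the prior can override the values.

```python
import math
from typing import Dict

Policy = Dict[int, Dict[str, float]]
N, GOAL, ACTIONS = 5, 4, ("left", "right")  # toy corridor; cell 4 is the goal


def step(s: int, a: str) -> int:
    if s == GOAL:
        return s  # absorbing goal
    return max(s - 1, 0) if a == "left" else min(s + 1, N - 1)


def reward(s: int, a: str) -> float:
    return 0.0 if s == GOAL else -1.0


def prior_guided_vi(prior: Dict[int, Dict[str, float]], tau: float,
                    gamma: float = 0.95, iters: int = 300) -> Policy:
    """Soft value iteration with a prior policy (a hypothetical LLM's preferences).

    V(s) = tau * log sum_a prior(a|s) * exp(Q(s,a) / tau)
    pi(a|s) is proportional to prior(a|s) * exp(Q(s,a) / tau)
    Small tau -> values dominate (near-optimal); large tau -> prior dominates.
    """
    V = {s: 0.0 for s in range(N)}
    Q: Dict[tuple, float] = {}
    for _ in range(iters):
        Q = {(s, a): reward(s, a) + gamma * V[step(s, a)] for s in range(N) for a in ACTIONS}
        V = {}
        for s in range(N):
            m = max(Q[(s, a)] for a in ACTIONS)  # log-sum-exp shift for stability
            V[s] = m + tau * math.log(sum(prior[s][a] * math.exp((Q[(s, a)] - m) / tau)
                                          for a in ACTIONS))
    policy: Policy = {}
    for s in range(N):
        m = max(Q[(s, a)] for a in ACTIONS)
        w = {a: prior[s][a] * math.exp((Q[(s, a)] - m) / tau) for a in ACTIONS}
        z = sum(w.values())
        policy[s] = {a: wa / z for a, wa in w.items()}
    return policy


misleading = {s: {"left": 0.9, "right": 0.1} for s in range(N)}  # a wrong "LLM" hint
sharp = prior_guided_vi(misleading, tau=0.1)
soft = prior_guided_vi(misleading, tau=10.0)
print(round(sharp[0]["right"], 3), round(soft[0]["left"], 3))
```

With a deliberately misleading prior that favours `left`, a small `tau` still yields a policy that heads right toward the goal, while a large `tau` lets the prior dominate; as `tau` approaches zero the update reduces to ordinary value iteration.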

---

## 5. World-model-assisted planning

### 5.1 Why a world model?

An LLM's knowledge is static, but a UAV's environment is dynamic:
- Wind conditions change
- Obstacles move
- GNSS position estimates can drift

A world model lets the UAV **predict the future**:

Current state + action → world model → predicted sequence of future states

The LLM then plans over the predicted future states (plan over imagined futures).
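
Planning over imagined futures can be sketched as random-shooting MPC: sample action sequences, roll each one through the world model without touching the real vehicle, and execute only the first action of the best one. Assumptions: the world model is a hand-written 1-D double integrator with constant wind (not a learned model), and a random sampler takes the place of the LLM, which in the scheme above would reason over the predicted states instead.

```python
import random
from typing import Callable, List, Tuple

State = Tuple[float, float]  # (position m, velocity m/s), 1-D for brevity
Model = Callable[[State, float], State]


def toy_world_model(state: State, action: float, wind: float = 0.3, dt: float = 0.1) -> State:
    """Hypothetical stand-in for a learned world model: double integrator plus wind drift."""
    pos, vel = state
    vel = vel + (action + wind) * dt
    return pos + vel * dt, vel


def imagine(state: State, actions: List[float], model: Model) -> List[State]:
    """Roll the model forward without touching the real system."""
    traj = []
    for a in actions:
        state = model(state, a)
        traj.append(state)
    return traj


def rollout_cost(traj: List[State], goal: float) -> float:
    return sum((p - goal) ** 2 + 0.1 * v ** 2 for p, v in traj)


def plan_over_imagined_futures(state: State, goal: float, model: Model, horizon: int = 15,
                               samples: int = 200, seed: int = 0) -> Tuple[float, float]:
    """Random-shooting MPC: score sampled action sequences in imagination, keep the best."""
    rng = random.Random(seed)
    candidates = [[0.0] * horizon]  # include a do-nothing baseline
    candidates += [[rng.uniform(-1.0, 1.0) for _ in range(horizon)] for _ in range(samples)]
    best_cost, best_seq = float("inf"), candidates[0]
    for seq in candidates:
        cost = rollout_cost(imagine(state, seq, model), goal)
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq[0], best_cost  # execute only the first action, then replan


state: State = (0.0, 0.0)
for t in range(30):  # receding-horizon loop; the "real" system reuses the toy model here
    action, _ = plan_over_imagined_futures(state, goal=1.0, model=toy_world_model, seed=t)
    state = toy_world_model(state, action)
print(f"position after 3 s: {state[0]:.2f} m")
```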


### 5.2 Representative work

**Dreamer series** (Danijar Hafner et al.)
- Built on the RSSM (Recurrent State-Space Model) latent dynamics model
- Learns behaviors by reinforcement learning on imagined future trajectories
- Validated on real robots (robot arms, unmanned vehicles)

**VMP (Video Motion Planning)**
- Use video generation models for motion planning
- Generate future frames → extract motion vectors → control the UAV

---

## 6. Safety and verification

### 6.1 Why safety is key

When UAVs fly over cities, a poor decision can cause **human casualties**. There is a fundamental tension between the probabilistic outputs of LLMs and the deterministic guarantees required by aviation safety.

### 6.2 Safety frameworks

**CBF (Control Barrier Functions):**
- ASMA brings CBFs to UAV vision-and-language navigation (VLN)
- Keeps the system inside a safe set, so unsafe states are never reached (under the assumed dynamics model)
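
A common CBF formulation, shown here for a single-integrator model ẋ = u with one circular obstacle, filters a nominal command through a small QP: stay as close as possible to u_nom subject to ∇h(x)·u ≥ -α h(x). With a single constraint the QP has a closed-form projection. This is a generic sketch, not ASMA's formulation; the obstacle, goal, α, and step size are hypothetical.

```python
from typing import List, Tuple

Vec = Tuple[float, float]


def barrier(x: Vec, center: Vec, radius: float) -> float:
    """h(x) = ||x - c||^2 - r^2; the safe set is h(x) >= 0 (outside the obstacle disk)."""
    return (x[0] - center[0]) ** 2 + (x[1] - center[1]) ** 2 - radius ** 2


def cbf_filter(x: Vec, u_nom: Vec, center: Vec, radius: float, alpha: float = 1.0) -> Vec:
    """Closed-form CBF-QP for x_dot = u with one constraint.

    min ||u - u_nom||^2  s.t.  grad_h(x) . u >= -alpha * h(x)
    (grad_h is nonzero everywhere except the obstacle centre, which is unsafe anyway.)
    """
    a = (2.0 * (x[0] - center[0]), 2.0 * (x[1] - center[1]))  # grad h
    b = -alpha * barrier(x, center, radius)
    slack = a[0] * u_nom[0] + a[1] * u_nom[1] - b
    if slack >= 0.0:
        return u_nom  # nominal command already satisfies the constraint
    lam = -slack / (a[0] ** 2 + a[1] ** 2)  # project onto the constraint boundary
    return (u_nom[0] + lam * a[0], u_nom[1] + lam * a[1])


center, radius, goal = (2.0, 0.0), 1.0, (4.0, 0.5)
x: Vec = (0.0, 0.0)
dt, hs = 0.01, []  # type: float, List[float]
for _ in range(500):
    u_nom = (goal[0] - x[0], goal[1] - x[1])  # naive "go to goal" command from a planner
    u = cbf_filter(x, u_nom, center, radius)
    x = (x[0] + u[0] * dt, x[1] + u[1] * dt)
    hs.append(barrier(x, center, radius))
print(f"min h along trajectory: {min(hs):.4f}")
```

For this quadratic barrier and a single-integrator step, h(x + u dt) = h(x) + ∇h·u dt + |u|² dt² ≥ (1 - α dt) h(x), so h stays positive whenever α dt ≤ 1. Real multirotor dynamics are higher order and call for higher-order CBFs, and a CBF filter can deadlock when the goal lies directly behind the obstacle.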

**Formal verification:**
- Use TLA+ / NuSMV to verify the state machine
- LLM-generated plans are executed only after passing model checking
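
TLA+ and NuSMV check temporal properties over every reachable state of a model. The core idea can be sketched as an explicit-state breadth-first search that checks one safety invariant and returns a counterexample trace. The battery/flight state machine and its guard parameter `min_cruise` are hypothetical, and this hand-rolled checker is a stand-in, not either tool's API.

```python
from collections import deque
from typing import Dict, Iterator, List, Optional, Tuple

State = Tuple[str, int]  # (mode, battery level)
MAX_BATTERY = 3


def transitions(state: State, min_cruise: int) -> Iterator[Tuple[str, State]]:
    """Hypothetical mission state machine; min_cruise is the battery guard on 'cruise'."""
    mode, battery = state
    if mode == "landed":
        if battery >= 1:
            yield "takeoff", ("flying", battery)
        yield "charge", ("landed", min(battery + 1, MAX_BATTERY))
    else:
        if battery >= min_cruise:
            yield "cruise", ("flying", battery - 1)
        yield "land", ("landed", battery)


def unsafe(state: State) -> bool:
    return state == ("flying", 0)  # airborne with an empty battery


def model_check(init: State, min_cruise: int) -> Optional[List[str]]:
    """Explore every reachable state (BFS); return a shortest counterexample or None."""
    parent: Dict[State, Optional[Tuple[State, str]]] = {init: None}
    queue = deque([init])
    while queue:
        s = queue.popleft()
        if unsafe(s):
            trace: List[str] = []
            while (link := parent[s]) is not None:
                s, action = link
                trace.append(action)
            return trace[::-1]
        for action, nxt in transitions(s, min_cruise):
            if nxt not in parent:
                parent[nxt] = (s, action)
                queue.append(nxt)
    return None


print(model_check(("landed", 0), min_cruise=1))  # ['charge', 'takeoff', 'cruise']
print(model_check(("landed", 0), min_cruise=2))  # None: invariant holds
```

Only a controller (or an LLM-produced plan) whose model passes the check would be released for execution; with the looser guard, the checker finds the three-step path to running out of battery in flight.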

**Shielding:**
- Low-level shield: monitors the LLM's outputs and intercepts unsafe actions
- Upper-level LLM: focuses on task completion and does not need to handle safety details
- A **"guardian angel" architecture**, similar to those used in autonomous driving
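
A minimal shield sketch, assuming velocity commands from the LLM layer, a box-shaped geofence, a speed limit, and hover as the fallback action (all hypothetical choices): a command passes through only if it respects the speed limit and the one-step predicted position stays inside the fence.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]
HOVER: Vec3 = (0.0, 0.0, 0.0)


@dataclass(frozen=True)
class Geofence:
    lower: Vec3  # (x_min, y_min, alt_min) in metres
    upper: Vec3  # (x_max, y_max, alt_max)

    def contains(self, p: Vec3) -> bool:
        return all(lo <= c <= hi for c, lo, hi in zip(p, self.lower, self.upper))


def shield(position: Vec3, proposed_velocity: Vec3, fence: Geofence,
           dt: float = 0.1, max_speed: float = 5.0) -> Tuple[Vec3, bool]:
    """Pass the LLM's velocity command through only if it stays within limits.

    Returns (command to execute, whether the shield intervened). The fallback is to
    hover, assuming the current position is already inside the fence.
    """
    speed = math.sqrt(sum(v * v for v in proposed_velocity))
    predicted = tuple(p + v * dt for p, v in zip(position, proposed_velocity))
    if speed <= max_speed and fence.contains(predicted):
        return proposed_velocity, False
    return HOVER, True


fence = Geofence(lower=(0.0, 0.0, 2.0), upper=(100.0, 100.0, 120.0))
print(shield((50.0, 50.0, 10.0), (1.0, 0.0, 0.0), fence))   # ((1.0, 0.0, 0.0), False)
print(shield((99.95, 50.0, 10.0), (2.0, 0.0, 0.0), fence))  # ((0.0, 0.0, 0.0), True)
```

A production shield would choose the closest safe action rather than always hovering, and would itself need verification.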

---

## 7. Frontier hot spots and future directions

### 7.1 End-to-end VLA (Vision-Language-Action)

**Latest trend:** Skip the hierarchical "perception → planning → control" pipeline and have the VLM output **action tokens** directly.

Representative work:
- **RT-2** (Google DeepMind): fine-tunes a VLM to output robot actions directly as tokens
- **π₀** (Physical Intelligence): a generalist VLA policy for robot manipulation across multiple robot embodiments
- **UAV version** (emerging): similar ideas applied to drones

**Challenge:**
- Continuity of the action space vs. discreteness of language tokens
- Safety verification is difficult (end-to-end black box)
- Data scarcity (requires large-scale robot teleoperation data)
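
RT-2 bridges the continuous/discrete gap by discretizing each continuous action dimension into 256 uniform bins and emitting the bin indices as tokens. The sketch below shows only that binning step; the four-dimensional UAV action layout and its limits are hypothetical.

```python
from typing import List, Sequence

BINS = 256  # RT-2 discretizes each continuous action dimension into 256 bins

# Hypothetical UAV action layout and limits: (vx, vy, vz) in m/s, yaw rate in rad/s.
LOW = (-2.0, -2.0, -1.0, -1.0)
HIGH = (2.0, 2.0, 1.0, 1.0)


def tokenize(action: Sequence[float], low: Sequence[float] = LOW,
             high: Sequence[float] = HIGH, bins: int = BINS) -> List[int]:
    """Map each continuous dimension to an integer bin index in [0, bins - 1]."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)  # clip to the allowed range
        tokens.append(min(int((a - lo) / (hi - lo) * bins), bins - 1))
    return tokens


def detokenize(tokens: Sequence[int], low: Sequence[float] = LOW,
               high: Sequence[float] = HIGH, bins: int = BINS) -> List[float]:
    """Decode each bin index back to the centre of its bin."""
    return [lo + (t + 0.5) * (hi - lo) / bins for t, lo, hi in zip(tokens, low, high)]


cmd = [1.23, -0.40, 0.05, -0.7]
tokens = tokenize(cmd)
print(tokens, [round(v, 3) for v in detokenize(tokens)])
```

The round-trip error is at most half a bin width, e.g. 4 / 512 ≈ 0.0078 m/s for a ±2 m/s velocity range, which illustrates the precision cost of treating actions as language tokens.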

### 7.2 Multi-robot collaborative LLM planning

**SysNav (arXiv, March 2026)**

**Paper:** *SysNav: Multi-Level Systematic Cooperation Enables Real-World, Cross-Embodiment Object Navigation*
**Author:** Haokun Zhu et al.
**Source:** arXiv:2603.xxxxx, March 2026

**Core contribution:**
- Multi-agent collaborative navigation across different robot platforms
- LLM does high-level coordination (who goes to which area)
- Distributed perception fusion (each agent shares its visual observations)

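
SysNav's coordination mechanism is not detailed here. As a hedged illustration of the "who goes to which area" decision, the sketch solves it exactly by brute force over assignments, with an optional capability table for cross-embodiment constraints; in an LLM-coordinated system, such a solver could check or refine the LLM's proposal. Robot names, positions, and areas are hypothetical.

```python
import itertools
import math
from typing import Dict, Optional, Set, Tuple

Point = Tuple[float, float]


def assign_areas(robots: Dict[str, Point], areas: Dict[str, Point],
                 allowed: Optional[Dict[str, Set[str]]] = None) -> Tuple[Dict[str, str], float]:
    """Exhaustively find the robot->area assignment with minimum total travel distance.

    `allowed` restricts embodiments (e.g. a ground robot cannot reach a roof).
    Brute force is fine for a handful of robots; the Hungarian algorithm scales better.
    """
    allowed = allowed or {}
    names = list(robots)
    best: Optional[Tuple[float, Dict[str, str]]] = None
    for perm in itertools.permutations(areas, len(names)):
        if any(r in allowed and a not in allowed[r] for r, a in zip(names, perm)):
            continue  # violates an embodiment constraint
        cost = sum(math.dist(robots[r], areas[a]) for r, a in zip(names, perm))
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, perm)))
    if best is None:
        raise ValueError("no feasible assignment")
    return best[1], best[0]


robots = {"uav1": (10.0, 0.0), "ugv1": (0.0, 0.0)}
areas = {"roof": (1.0, 1.0), "lobby": (9.0, 1.0)}
print(assign_areas(robots, areas))                               # unconstrained
print(assign_areas(robots, areas, allowed={"ugv1": {"lobby"}}))  # UGV cannot reach the roof
```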
### 7.3 Physical Intelligence × UAV

- **Foundation Models for Manipulation** → **Foundation Models for Flight**
- A dedicated pretrained "UAV brain" model may appear in the future
- Similar to LLaVA but specializing in 3D spatial reasoning + flight dynamics

---

## 8. Summary and suggestions

| Dimension | Current best | Future direction |
|------|---------|---------|
| Planning paradigm | Dual-process architecture (real-time feasible) | End-to-end VLA (long-term goal) |
| World knowledge | RAG (reliable but slow) | World model (fast but requires training) |
| Safety | CBF + Shielding | Formal verification (provable guarantees) |
| Edge deployment | 4-bit LLaVA (barely real-time) | Special purpose chips (NPU/TPU) |
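
A back-of-the-envelope estimate shows why 4-bit quantization matters for edge deployment. It assumes a nominal 7 × 10⁹ parameters and counts weights only (no KV cache, activations, or quantization-scale overhead):

$$
7\times10^{9}\ \text{params}\times\frac{4\ \text{bit}}{8\ \text{bit/byte}} = 3.5\times10^{9}\ \text{B}\approx 3.5\ \text{GB}\ (\approx 3.26\ \text{GiB}),
\qquad
7\times10^{9}\times 2\ \text{B (FP16)} = 14\ \text{GB}.
$$

That is about a quarter of the FP16 weight footprint.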

**Advice for you:**
1. **The fastest route to results**: Dual-process architecture + LLaVA-7B + UAV platform
2. **The most room for innovation**: VLM + safety verification framework (still relatively underexplored)
3. **Long-term layout**: Collect your own UAV control data and train a dedicated VLA model

---

## 📚 References

1. Lee et al. *A Dual-Process Architecture for Real-Time VLM-Based Indoor Navigation*. arXiv:2601.19401, 2026.
2. Fernando et al. *GenerativeMPC: VLM-RAG-guided Whole-Body MPC with Virtual Impedance for Bimanual Mobile Manipulation*. arXiv, 2026.
3. Huang et al. *VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models*. arXiv:2307.05973, 2023.
4. Brohan et al. *RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control*. arXiv, 2023.
5. Zhu et al. *SysNav: Multi-Level Systematic Cooperation Enables Real-World, Cross-Embodiment Object Navigation*. arXiv, 2026.
6. Ahn et al. *Do As I Can, Not As I Say: Grounding Language in Robotic Affordances*. arXiv, 2022.