Hierarchical VLM planning: Let the drone understand instructions such as "land on the east side of Building 3"
1. Question: Why doesn't "direct end-to-end" work for drones?
Imagine you give a command to a drone: "Go around Building 2 and land in the open space on the east side of Building 3".
This sentence is simple for humans, but it contains three levels of meaning:
- Semantic Layer: Find the locations of Building 2 and Building 3
- Spatial Reasoning: Starting from the current location, bypass Building 2 and reach the east side of Building 3
- Physical Execution: Generate smooth 3D trajectories and control motors for real-time tracking
If you use pure end-to-end VLA (Vision-Language-Action) to directly output motor control signals from camera images, you will face two fundamental problems:
- Output frequency mismatch: A VLA's language model takes hundreds of milliseconds per inference, but drone control needs real-time signals at 100 Hz or more, that is, a fresh command every 10 ms or less
- Uncontrollable safety: An end-to-end black-box model cannot guarantee that the generated trajectory satisfies physical constraints, so nothing prevents it from steering the drone into a building
Therefore, hierarchical VLM planning has become a common choice for industry and academia.
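To make the frequency mismatch concrete, the small sketch below counts how many control periods pass while a single inference is still running. The 300 ms latency is an assumed value within the "hundreds of milliseconds" range mentioned above, and the function name is purely illustrative.

```python
def control_ticks_per_inference(inference_ms: float, control_hz: float) -> int:
    """Number of whole control periods that elapse during one model inference."""
    control_period_ms = 1000.0 / control_hz
    return int(inference_ms // control_period_ms)


# Assumed numbers: 300 ms per inference, 100 Hz control loop (10 ms period).
ticks = control_ticks_per_inference(inference_ms=300.0, control_hz=100.0)
print(ticks)  # 30 control periods pass without a fresh command
```

With these numbers the controller would have to run 30 cycles on stale output, which is why the slow model is kept out of the fast loop.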
2. Overview of layered architecture
    User instruction: "Go around Building 2, land on the east side of Building 3"
    ┌────────────────────────────────────────────────────────────┐
    │ Layer 1: Semantic understanding layer (VLM/LLM)            │
    │ Input:  text instruction + map/image                       │
    │ Output: "Fly north first, go around Building 2,            │
    │          then fly east..." (subgoal sequence)              │
    │ Models: GPT-4V / LLaVA / Gemini 2.0 Flash                  │
    └────────────────────────────────────────────────────────────┘
                    ↓ subgoals
    ┌────────────────────────────────────────────────────────────┐
    │ Layer 2: Trajectory planning layer (MPC / RRT* / ESDF)     │
    │ Input:  current state + subgoal                            │
    │ Output: sequence of spatial waypoints (x, y, z, t)         │
    │ Traits: theoretical safety guarantees                      │
    │         (reachability, collision detection)                │
    └────────────────────────────────────────────────────────────┘
                    ↓ waypoints
    ┌────────────────────────────────────────────────────────────┐
    │ Layer 3: Control execution layer (PID / nonlinear MPC)     │
    │ Input:  desired trajectory + current state                 │
    │ Output: motor speeds / PWM signals                         │
    │ Rate:   100-400 Hz (real time)                             │
    └────────────────────────────────────────────────────────────┘
The benefit of this layering: the layers are decoupled, so each can be trained or optimized independently with the method best suited to it, without end-to-end training of the full pipeline.
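One consequence of the decoupling is that each layer can run at its own rate. The sketch below is a toy single-threaded scheduler that only counts how often each layer would be invoked; the 2 Hz and 20 Hz rates for Layers 1 and 2 are assumptions chosen for illustration, and a real system would more likely run the layers asynchronously.

```python
from dataclasses import dataclass, field


@dataclass
class LayeredLoop:
    """Toy scheduler: each layer runs every N base ticks (all rates are assumed)."""
    base_hz: int = 100        # Layer 3 control rate
    planner_every: int = 5    # Layer 2 runs every 5 ticks (20 Hz)
    vlm_every: int = 50       # Layer 1 runs every 50 ticks (2 Hz)
    calls: dict = field(
        default_factory=lambda: {"vlm": 0, "planner": 0, "controller": 0}
    )

    def run(self, seconds: float) -> dict:
        for tick in range(int(seconds * self.base_hz)):
            if tick % self.vlm_every == 0:
                self.calls["vlm"] += 1        # refresh subgoals (slow)
            if tick % self.planner_every == 0:
                self.calls["planner"] += 1    # replan trajectory (medium)
            self.calls["controller"] += 1     # track trajectory (fast)
        return self.calls


print(LayeredLoop().run(seconds=1.0))
# {'vlm': 2, 'planner': 20, 'controller': 100}
```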
3. Detailed explanation of each layer
3.1 Layer 1: Semantic understanding layer - let VLM understand the instructions
This is the "smartest" layer in the layered architecture, and where large models are most useful.
Core mission:
- Parse natural language instructions to extract key landmarks and constraints ("Building 2", "East Side", "Bypass")
- Align 2D/3D map information with visual observations
- Output semantic subgoal sequence (subgoal list)
Key Technology: Visual Grounding
Map an abstract reference in the language ("East side of Building 3") to a concrete location in the map/image:
    # Pseudocode example: vlm, convert_to_waypoints, plan_subgoal_sequence,
    # avoid and approach are placeholders, not a real API
    def parse_instruction(instruction: str, map_image: Image, ego_view: Image, current_pose):
        # Step 1: use the VLM to find the landmarks named in the instruction
        landmarks = vlm.extract_landmarks(
            instruction,  # "Go around Building 2, land on the east side of Building 3"
            map_image     # bird's-eye-view map
        )
        # → ["building_2 (polygon: [[x1,y1],...])",
        #    "landing_zone: east_side_of_building_3"]
        # Step 2: convert the landmarks to coordinates
        goal_pose = convert_to_waypoints(landmarks, ego_view)
        # Step 3: generate the subgoal sequence
        subgoals = plan_subgoal_sequence(
            current_pose, goal_pose,
            constraints=[avoid(building_2), approach(building_3, side="east")]
        )
        # → ["fly_north_50m", "turn_east", "descend_20m", "hover_at_landing_zone"]
        return subgoals
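Once the VLM has returned a building polygon, turning "east side of Building 3" into coordinates can be plain geometry. The sketch below is one possible way to do it, assuming a map frame where x points east and y points north, and using the polygon's bounding box; the function name, the frame convention and the 10 m margin are all illustrative assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in metres; x = east, y = north (assumed frame)

# Unit vectors for each compass side in the assumed east-north frame.
SIDE_DIRECTIONS = {
    "east": (1.0, 0.0), "west": (-1.0, 0.0),
    "north": (0.0, 1.0), "south": (0.0, -1.0),
}


def ground_side_of_building(polygon: List[Point], side: str, margin: float) -> Point:
    """Return a point `margin` metres outside the bounding box on the given side."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    cx, cy = (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2
    half_w, half_h = (max(xs) - min(xs)) / 2, (max(ys) - min(ys)) / 2
    dx, dy = SIDE_DIRECTIONS[side]
    return (cx + dx * (half_w + margin), cy + dy * (half_h + margin))


# Hypothetical footprint of Building 3 in map coordinates.
building_3 = [(100.0, 40.0), (140.0, 40.0), (140.0, 80.0), (100.0, 80.0)]
print(ground_side_of_building(building_3, "east", margin=10.0))  # (150.0, 60.0)
```

A bounding box is a crude stand-in for the real footprint; for L-shaped or rotated buildings the landing point would need a check against the actual polygon and the obstacle map.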
Model selection:
| Model | Advantages | Disadvantages | Applicable scenarios |
|------|------|------|----------|
| GPT-4V / GPT-4o | Strong reasoning, strong multimodal ability | Requires an internet connection, high latency | Cloud, latency-insensitive |
| Gemini 2.0 Flash | Fast, free tier available | Cloud API only (cannot be deployed locally), mediocre understanding of Chinese instructions | Cloud, latency-sensitive |
| LLaVA 7B | Locally deployable, open source | Weak at understanding complex instructions | Edge deployment on the drone |
| Qwen2-VL | Good Chinese support, open source | Needs quantization for edge deployment | Deployments in China |
Recent Progress (2024-2025):
- LLaVA-Plan: adds a planning head on top of LLaVA, specializing in task decomposition
- GPT-4o Live Voice: end-to-end voice command understanding, no ASR required, no interruptions
3.2 Layer 2: Trajectory planning layer - from semantics to spatial paths
This layer receives the subgoals from the semantic layer and outputs geometric paths.
Classical methods (non-learning):
    # Pseudocode: rrt_star, build_esdf_from_lidar and mpc are placeholders
    # RRT* global path planning
    path = rrt_star(
        start=current_pose,
        goal=subgoal,
        obstacles=building_2_obstacle,  # from the semantic layer's output
        max_iterations=1000,
        connection_radius=5.0
    )
    # ESDF local obstacle avoidance (real time)
    esdf_map = build_esdf_from_lidar(ego_view)
    safe_direction = esdf_map.gradient_at(current_position)
    # MPC trajectory optimization
    trajectory = mpc.optimize(
        horizon=20,
        dynamics=uav_dynamics,
        obstacles=esdf_map,
        cost=[trajectory_smoothness, progress_to_goal, control_effort]
    )
Learning-based method (reinforcement learning):
    # Policy network: current state + goal in, control action out
    # (pseudocode: PPO and env stand for whichever RL library and simulator are used)
    policy = PPO(
        obs_dim=state_dim,   # position, velocity, attitude, nearby obstacles
        act_dim=action_dim,  # velocity command (vx, vy, vz)
    )
    # Train in simulation and use Domain Randomization to improve generalization
    # DR parameters: wind speed, latency, sensor noise, air quality
    env.set_domain_randomization(
        wind=(0, 5),            # m/s
        comm_latency=(0, 100),  # ms
        sensor_noise=(0, 0.05)  # normalized noise
    )
**Why not replace this layer with pure RL?**
Because a pure RL trajectory carries no theoretical safety guarantee: RL may produce paths that "look feasible but actually hit a wall". The ESDF + MPC combination enforces obstacle clearance as a hard constraint, so as long as the MPC finds a feasible solution, the planned trajectory stays clear of every obstacle represented in the ESDF map.
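The clearance constraint can be illustrated with a tiny distance field. The sketch below is a simplification under stated assumptions: it builds an unsigned distance field by brute force on a toy 10 x 10 grid (a real ESDF is signed and computed incrementally), and it only checks waypoints, not the segments between them. All names and the 2 m margin are illustrative.

```python
import math
from typing import List, Tuple

Cell = Tuple[int, int]  # (x, y) grid indices


def build_distance_field(width: int, height: int, obstacles: List[Cell],
                         resolution: float) -> List[List[float]]:
    """Brute-force distance (metres) from every cell to the nearest obstacle cell."""
    field = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            nearest = min(math.hypot(x - ox, y - oy) for ox, oy in obstacles)
            field[y][x] = nearest * resolution
    return field


def trajectory_is_safe(field: List[List[float]], waypoints: List[Cell],
                       margin: float) -> bool:
    """Constraint the planner must satisfy: clearance >= margin at every waypoint."""
    return all(field[y][x] >= margin for x, y in waypoints)


# Assumed toy map: 10 x 10 cells of 1 m, a building occupying cells (4..5, 4..5).
building = [(x, y) for x in (4, 5) for y in (4, 5)]
field = build_distance_field(10, 10, building, resolution=1.0)
print(trajectory_is_safe(field, [(0, 0), (1, 4), (1, 8)], margin=2.0))  # True
print(trajectory_is_safe(field, [(0, 4), (3, 4), (8, 4)], margin=2.0))  # False
```

The second path is rejected because the cell (3, 4) is only 1 m from the building, which is the kind of check a learned policy does not provide by construction.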
3.3 Layer 3: Control execution layer - real-time stable tracking
This layer is the most mature, and traditional control theory is completely sufficient:
    # Nonlinear MPC control (pseudocode: NMPC, diag and motor_mixer are placeholders)
    class UAVController:
        def __init__(self):
            self.mpc = NMPC(
                horizon=10,         # predict 10 steps (~1 s)
                dt=0.1,             # control period 0.1 s (10 Hz)
                Q=diag([1, 1, 1]),  # position error weights
                R=diag([0.1, 0.1])  # control effort weights
            )

        def control(self, state, ref_trajectory):
            # ref_trajectory comes from Layer 2
            u = self.mpc.solve(state, ref_trajectory)
            return self.motor_mixer.mix(u)  # convert to motor speeds

        def safety_check(self, state):
            # Real-time safety fallback: below 2 m and slower than 0.5 m/s,
            # hand over to landing
            if state.altitude < 2.0 and state.speed < 0.5:
                return "LANDING"
            return "FLY"
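For the PID option mentioned in the architecture overview, a minimal runnable sketch is shown below. It tracks a constant altitude on a one-dimensional unit-mass model at 100 Hz; the gains, the model and the absence of thrust limits are all simplifying assumptions, not tuned values for any real airframe.

```python
def simulate_altitude_pid(target: float, seconds: float = 10.0, hz: int = 100) -> float:
    """Track a constant altitude with PID on a 1-D unit-mass model; returns final altitude."""
    kp, ki, kd = 12.0, 8.0, 6.0   # assumed gains (closed-loop poles near s = -2)
    g = 9.81                      # gravity acts as a constant disturbance
    dt = 1.0 / hz
    z, vz, integral = 0.0, 0.0, 0.0
    for _ in range(int(seconds * hz)):
        error = target - z
        integral += error * dt
        # Derivative term acts on the measured velocity, not on the error.
        thrust_acc = kp * error + ki * integral - kd * vz
        vz += (thrust_acc - g) * dt   # semi-implicit Euler step
        z += vz * dt
    return z


final_z = simulate_altitude_pid(target=2.0)
print(round(final_z, 3))
```

The integral term is what cancels gravity here; without it the drone would settle below the target. A real controller would also clamp the thrust and add anti-windup.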
4. Key Research Work
4.1 Compositional Foundation Models for Hierarchical Planning
Ajay et al., arXiv:2309.08587 (2023)
This paper is important foundational work and proposes the concept of "compositional foundation models":
- Core idea: Combine several specialized foundation models, with each layer doing one job, and complete complex tasks through their composition
- Architecture: visual encoder + language model + action decoder, hierarchical cascading
- Experiment: Verified on a robot arm, showing that layering generalizes better than pure end-to-end
Why it's inspiring for UAVs: Hierarchical planning lets each layer reuse pre-trained models independently, without retraining the entire system for UAV scenarios.
4.2 LangStrands: Natural language controlled robot
LangStrands (2024): Uses natural language to control robots performing inspection/operation tasks in industrial scenarios:
- Supports complex instructions: "Check the equipment in area A first, and if any abnormality is found, report to location B"
- Parses instructions into a Task Graph, supporting conditional branches and loops
- Supports multi-robot collaboration, with each robot receiving different subtasks
Takeaway for UAVs: LangStrands' task-graph parsing approach can be transferred directly to UAVs, for example to complex missions such as "first scout 5 target points, then return to base".
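To show what a task graph with a conditional branch might look like, here is a minimal sketch. It is not LangStrands' actual data structure or API, which is not described here; `TaskNode`, `run_task_graph` and the `world` dictionary are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class TaskNode:
    name: str
    action: Callable[[dict], bool]   # returns True when the node's condition holds
    on_true: Optional[str] = None    # next node if the action returns True
    on_false: Optional[str] = None   # next node otherwise


def run_task_graph(nodes: Dict[str, TaskNode], start: str, world: dict) -> List[str]:
    """Walk the graph from `start`, following the branch chosen by each action."""
    visited, current = [], start
    while current is not None:
        node = nodes[current]
        visited.append(node.name)
        current = node.on_true if node.action(world) else node.on_false
    return visited


# "Check the equipment in area A first; if an abnormality is found, report to location B."
graph = {
    "inspect_A": TaskNode("inspect_A", lambda w: w["abnormal"],
                          on_true="report_B", on_false="done"),
    "report_B": TaskNode("report_B", lambda w: True, on_true="done"),
    "done": TaskNode("done", lambda w: True),
}
print(run_task_graph(graph, "inspect_A", {"abnormal": True}))   # ['inspect_A', 'report_B', 'done']
print(run_task_graph(graph, "inspect_A", {"abnormal": False}))  # ['inspect_A', 'done']
```

Loops fit the same structure by pointing a branch back at an earlier node, although a real executor would then need an iteration limit.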
4.3 Embodied Tree of Thoughts: World Model Assisted Planning
Xu et al., arXiv:2512.08188 (2025)
- Use World Model to predict the physical consequences (environmental state changes) after action execution
- Use Tree of Thoughts to search for the optimal sub-goal sequence before execution
- More physics-grounded than pure VLM planning, avoiding paths that "look right but are physically impossible"
Value to UAVs: While the UAV is in flight, a world model can predict the effect of gusts and the reduced hover capability as the battery drains, so a safer trajectory can be planned in advance.
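The idea of searching over subgoal sequences with a predictive model can be sketched in a few lines. This is not the paper's method: a plain depth-limited tree search stands in for Tree of Thoughts, and a hand-written battery model stands in for a learned world model. The action table, costs and reserve threshold are invented for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical subgoal options: name -> (battery cost in %, progress toward the goal).
ACTIONS: Dict[str, Tuple[float, int]] = {
    "direct_over_roof": (30.0, 2),
    "detour_around": (15.0, 1),
    "hover_and_wait": (5.0, 0),
}


def world_model(battery: float, progress: int, action: str) -> Tuple[float, int]:
    """Toy stand-in for a learned world model: predicts the next (battery, progress)."""
    cost, gain = ACTIONS[action]
    return battery - cost, progress + gain


def search(battery: float, progress: int, goal: int, reserve: float,
           depth: int) -> Optional[List[str]]:
    """Depth-limited tree search; prunes branches predicted to break the battery reserve."""
    if progress >= goal:
        return []
    if depth == 0:
        return None
    best: Optional[List[str]] = None
    for action in ACTIONS:
        next_battery, next_progress = world_model(battery, progress, action)
        if next_battery < reserve:
            continue  # predicted to be unsafe, so this branch is never executed
        tail = search(next_battery, next_progress, goal, reserve, depth - 1)
        if tail is not None and (best is None or len(tail) + 1 < len(best)):
            best = [action] + tail
    return best


print(search(battery=70.0, progress=0, goal=3, reserve=20.0, depth=4))
# ['direct_over_roof', 'detour_around']
print(search(battery=60.0, progress=0, goal=3, reserve=20.0, depth=4))
# None: no sequence reaches the goal without dipping below the reserve
```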
4.4 OpenVLA: Open Source Robot VLA
OpenVLA (2024): open-source VLA model released by researchers from Stanford, UC Berkeley, and collaborating institutions:
- 7B parameters
- Trained on roughly 970,000 real-robot demonstration episodes from the Open X-Embodiment dataset
- Can run on a consumer GPU (RTX 3090)
- Potential for UAVs: Although OpenVLA currently targets robotic arms, the VLA architecture (visual encoder + LLM + action head) can in principle be carried over to UAV scenarios
4.5 Embodied Arena: Embodied Intelligence Unified Evaluation Platform
Ni et al., arXiv:2509.15273 (2025)
- Covers 250+ embodied intelligence tasks and unified evaluation standards
- Including indoor navigation, operation, aerial flight and other task types
- Provides performance benchmarks for UAV layering (accuracy, latency, success rate for each layer)
Importance: With a unified evaluation platform, each layer of the layered architecture can be independently evaluated, and optimization can be based on evidence.
5. Sim2Real: How to transfer trained policies to real drones
A key advantage of the layered architecture: each layer can go through Sim2Real transfer on its own, so no end-to-end transfer of the full pipeline is needed.
5.1 Sim2Real difficulty analysis of each layer
| Layer | Training environment | Transfer difficulty | Core challenge |
|---|---|---|---|
| Layer 1 (VLM) | Any image/map | Low | Little to do: the VLM is already pre-trained and generalizes well |
| Layer 2 (RL) | AirSim / Flightmare | Medium | Mismatched aerodynamic parameters |
| Layer 3 (MPC) | Parameter tuning on the real drone | Low | Only the motor parameters need calibration |
5.2 Sim2Real Strategy for Layer 2
Domain Randomization (DR):
    import random

    # Randomize key physical parameters during simulation training
    class SimEnv:
        def reset(self):
            self.wind = random.uniform(-3, 3)              # gust, m/s
            self.motor_lag = random.uniform(0.8, 1.2)      # motor response coefficient
            self.battery_level = random.uniform(0.7, 1.0)  # battery state
            self.gps_noise = random.uniform(0, 0.5)        # GPS noise (m)
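A slightly more structured variant of the same idea is sketched below: the ranges are kept in one table and the sampler takes a seeded generator, so a training run can be reproduced. The ranges are copied from the `SimEnv` example above; `PhysicsParams` and `sample_params` are illustrative names.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicsParams:
    wind: float           # gust, m/s
    motor_lag: float      # motor response coefficient
    battery_level: float  # fraction of full charge
    gps_noise: float      # metres


# Ranges copied from the SimEnv example above.
RANGES = {
    "wind": (-3.0, 3.0),
    "motor_lag": (0.8, 1.2),
    "battery_level": (0.7, 1.0),
    "gps_noise": (0.0, 0.5),
}


def sample_params(rng: random.Random) -> PhysicsParams:
    """Draw one set of physics parameters uniformly from RANGES."""
    return PhysicsParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})


rng = random.Random(0)  # fixed seed so the sequence of episodes is reproducible
episodes = [sample_params(rng) for _ in range(1000)]
print(min(p.wind for p in episodes), max(p.wind for p in episodes))
```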
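**Real2Sim Calibration (calibrating against the real drone):**
    # Run system identification on the real drone
    # (pseudocode: real_uav, motor_model and sim_env are placeholders)
    def calibrate_dynamics(real_uav):
        # Excitation signal: step inputs of increasing amplitude
        step_responses = []
        for amplitude in [0.1, 0.3, 0.5]:
            response = real_uav.step_input(thrust=amplitude)
            step_responses.append(response)
        # Fit the real motor response curve
        motor_model.fit(step_responses)
        # Write the calibrated parameters back into the simulator
        sim_env.set_dynamics(motor_model.params)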
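The "fit the motor response curve" step can be made concrete if the motor is assumed to behave like a first-order lag, which is a common simplification but an assumption here. The sketch below recovers the time constant from a step response by linearising the exponential; the data is synthetic and the 0.05 s value is invented.

```python
import math
from typing import List, Tuple


def fit_time_constant(samples: List[Tuple[float, float]], steady_state: float) -> float:
    """Least-squares fit of tau in y(t) = K * (1 - exp(-t / tau)), with K known."""
    # Linearise: ln(1 - y/K) = -t / tau, then fit a line through the origin.
    num = sum(t * math.log(1.0 - y / steady_state) for t, y in samples)
    den = sum(t * t for t, _ in samples)
    return -den / num


# Synthetic "measured" step response with an assumed true tau of 0.05 s.
true_tau, gain = 0.05, 1.0
data = [(ms / 1000.0, gain * (1.0 - math.exp(-(ms / 1000.0) / true_tau)))
        for ms in range(5, 150, 5)]
print(round(fit_time_constant(data, steady_state=gain), 4))  # 0.05
```

With real, noisy measurements the samples close to the steady state should be dropped before taking the logarithm, because `1 - y/K` can reach zero or go negative there.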
5.3 Practical case: MADER's Sim2Real
MADER (Multi-Agent DEep Reinforcement learning for aerial swarms) is one of the stronger UAV Sim2Real results of recent years:
- Uses MADDPG in AirSim to train a multi-drone cooperative obstacle avoidance policy
- Key trick: Adds sensor delay (20-50 ms) during training so the policy learns to work under delay
- Results: Zero-shot transfer to real Tello drones, obstacle avoidance success rate > 85%
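Injecting sensor delay during training usually comes down to a small buffer between the simulator and the policy. The sketch below is one way to write it, not the implementation used in the work described above; the class name and the 100 Hz step size are assumptions.

```python
from collections import deque
from typing import Any, Deque


class DelayedSensor:
    """Returns each observation `delay_steps` ticks late.

    Until the buffer has filled, the first observation is repeated.
    """

    def __init__(self, delay_steps: int):
        self.delay_steps = delay_steps
        self.buffer: Deque[Any] = deque()

    def observe(self, fresh: Any) -> Any:
        self.buffer.append(fresh)
        if len(self.buffer) > self.delay_steps + 1:
            self.buffer.popleft()
        return self.buffer[0]   # oldest reading still in the buffer


# Assumed 100 Hz simulation step, so 3 steps = 30 ms (inside the 20-50 ms range).
sensor = DelayedSensor(delay_steps=3)
print([sensor.observe(t) for t in range(6)])  # [0, 0, 0, 0, 1, 2]
```

Sampling `delay_steps` per episode, rather than fixing it, would combine this trick with domain randomization.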
6. Engineering implementation: building a hierarchical VLM drone from scratch
6.1 Recommended technology stack
    Hardware:                                     Software:
    - Pixhawk flight controller (or Crazyflie)    - PX4 / ArduPilot firmware
    - Jetson Orin NX (edge computing)             - ROS 2 Humble
    - Livox LiDAR / RealSense                     - Depth camera + IMU
    Software layers:
    ┌──────────────────────────────────────────────┐
    │ VLM (Layer 1): LLaVA 7B / Qwen2-VL           │
    │ Inference engine: llama.cpp / vLLM           │
    │ Inference hardware: Jetson Orin NX (INT8)    │
    └──────────────────────────────────────────────┘
    ┌──────────────────────────────────────────────┐
    │ Planner (Layer 2):                           │
    │   - Global: RRT* (OMPL)                      │
    │   - Local: OSQP / Crocoddyl (MPC)            │
    └──────────────────────────────────────────────┘
    ┌──────────────────────────────────────────────┐
    │ Controller (Layer 3): PX4 SITL /             │
    │                       ArduPilot guided mode  │
    └──────────────────────────────────────────────┘
6.2 ROS 2 message interface
    # Layer 1 → Layer 2 message
    class SubgoalMsg(Message):
        position: Point                # target point
        constraints: List[Constraint]  # obstacle avoidance constraints
        priority: int                  # priority
        timeout: float                 # timeout

    # Layer 2 → Layer 3 message
    class TrajectoryMsg(Message):
        waypoints: List[PoseStamped]   # waypoint sequence
        velocities: List[float]        # desired velocities
        start_time: Time               # planned start time
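How Layer 2 might consume the `priority` and `timeout` fields is not specified above, so the sketch below is only one plausible reading: pick the highest-priority subgoal that has not expired, and ask Layer 1 to replan when none is left. It uses plain dataclasses rather than ROS 2 message types, and the timeout is assumed to be in seconds.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Subgoal:
    position: Tuple[float, float, float]   # target point (x, y, z) in metres
    priority: int = 0
    timeout: float = 10.0                  # assumed unit: seconds
    constraints: List[str] = field(default_factory=list)


def next_subgoal(queue: List[Subgoal], elapsed: float) -> Subgoal:
    """Pick the highest-priority subgoal whose timeout has not yet expired."""
    alive = [s for s in queue if elapsed < s.timeout]
    if not alive:
        raise TimeoutError("all subgoals expired; Layer 1 should replan")
    return max(alive, key=lambda s: s.priority)


queue = [
    Subgoal(position=(0.0, 50.0, 20.0), priority=1, timeout=5.0),
    Subgoal(position=(30.0, 50.0, 20.0), priority=2, timeout=2.0),
]
print(next_subgoal(queue, elapsed=1.0).position)  # (30.0, 50.0, 20.0)
print(next_subgoal(queue, elapsed=3.0).position)  # (0.0, 50.0, 20.0)
```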
6.3 Latency budget (real-time guarantee)
    Total latency budget: < 500 ms (acceptable)
    ├─ Image capture: 30 ms (30 fps)
    ├─ VLM inference: 200-400 ms (LLaVA 7B @ INT8)
    ├─ Trajectory planning: 50 ms (RRT* + MPC)
    └─ Controller tracking: real time (100 Hz)
If VLM inference is too slow, you can:
- Use streaming inference, so the planning layer can start from intermediate results before the full output is ready
- Use a lightweight model (LLaVA 3B / Qwen2-VL 2B)
- Cache the planning results of commonly used instructions (valid when the map is fixed)
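The budget check and the caching idea can both be sketched with the standard library. Below, the worst-case stage latencies are taken from the budget above, and `plan_for` is a placeholder for the slow VLM call; the normalisation rule and the `map_id` key are assumptions about what "the same instruction on a fixed map" would mean in practice.

```python
from functools import lru_cache

# Worst-case stage latencies in ms, taken from the budget above.
BUDGET_MS = 500
STAGES_MS = {"image_capture": 30, "vlm_inference": 400, "trajectory_planning": 50}


def within_budget(stages: dict, budget_ms: int) -> bool:
    """True if the summed stage latencies stay under the total budget."""
    return sum(stages.values()) < budget_ms


@lru_cache(maxsize=128)
def plan_for(instruction: str, map_id: str) -> tuple:
    """Placeholder for the slow VLM call; results are cached per (instruction, map)."""
    return ("fly_north_50m", "turn_east", "descend_20m", "hover_at_landing_zone")


def cached_plan(instruction: str, map_id: str) -> tuple:
    # Normalise case and whitespace so trivially different phrasings hit the cache.
    return plan_for(" ".join(instruction.lower().split()), map_id)


print(within_budget(STAGES_MS, BUDGET_MS))  # True: 30 + 400 + 50 = 480 ms
cached_plan("Land east of Building 3", "campus_v1")
cached_plan("land  east of building 3", "campus_v1")
print(plan_for.cache_info().hits)           # 1: the second request was a cache hit
```

Keying the cache on the map identifier matters: once the map changes, a cached plan may route through a new obstacle and has to be discarded.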
7. Current challenges and future directions
7.1 Core Challenges
- VLM inference latency: LLaVA 7B takes about 200-400 ms per inference on an edge GPU, which is too slow for safety-critical reactions
- Command ambiguity: Vague commands such as "land in a safe place" are difficult for a VLM to handle
- Multi-layer error accumulation: Semantic error in Layer 1 → path deviation in Layer 2 → control jitter in Layer 3
- Dynamic obstacles: The update rate of Layer 2's ESDF map cannot keep up with fast-moving obstacles (such as flying birds)
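The dynamic-obstacle problem is easy to quantify: between two map updates an obstacle moves its speed divided by the update rate. The numbers in the sketch below (a bird at 15 m/s, a 10 Hz map, 0.5 m of tolerated drift) are assumptions for illustration only.

```python
def obstacle_travel_per_update(speed_mps: float, map_update_hz: float) -> float:
    """Distance an obstacle moves between two consecutive map updates, in metres."""
    return speed_mps / map_update_hz


def required_update_hz(speed_mps: float, max_travel_m: float) -> float:
    """Minimum map rate so the obstacle moves at most `max_travel_m` per update."""
    return speed_mps / max_travel_m


# Assumed numbers: a bird at 15 m/s, an ESDF refreshed at 10 Hz, 0.5 m tolerated drift.
print(obstacle_travel_per_update(15.0, 10.0))  # 1.5 m of unseen motion per update
print(required_update_hz(15.0, 0.5))           # 30.0 Hz needed
```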
7.2 Future Directions
- Multi-modal command fusion: Joint understanding of voice + gesture + gaze point, backup when a single modality fails
- Lifelong Learning: Update Layer 2 strategies online during flight to adapt to unknown environments
- Multi-drone cooperative VLM: One VLM coordinates multiple drones instead of each operating independently
- World Model Prediction: Use a generative model to predict air traffic flow in the next 5 seconds and avoid it in advance
8. Summary
Hierarchical VLM planning is currently the most feasible route for UAV intelligence:
- Layer 1 (VLM) handles semantic understanding by calling large models in the cloud or at the edge
- Layer 2 (Planner) converts semantics into geometry and can combine RL with MPC
- Layer 3 (Controller) handles real-time tracking, where traditional control theory is fully sufficient
The core advantage of this architecture: let the large model do what it is good at (understanding) and let classical methods do what they are good at (safe planning), with each part doing its own job, instead of making one black-box end-to-end model carry all the risk.
*References (in order of citation in the text)*
1. Ajay et al., "Compositional Foundation Models for Hierarchical Planning", arXiv:2309.08587, 2023
2. Kim et al., "OpenVLA: An Open-Source Vision-Language-Action Model", arXiv:2406.09246, 2024
3. Liu et al., "Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI", arXiv:2407.06886, 2024
4. Xu et al., "Embodied Tree of Thoughts: Deliberate Manipulation Planning with Embodied World Model", arXiv:2512.08188, 2025
5. Ni et al., "Embodied Arena: A Comprehensive Evaluation Platform for Embodied AI", arXiv:2509.15273, 2025
6. Mu et al., "Embodied AI-Enhanced IoMT Edge Computing: UAV Trajectory Optimization", arXiv:2512.20902, 2025
7. Zhou et al., "OmniShow: Unifying Multimodal Conditions for Human-Object Interaction", arXiv:2604.11804, 2026
Author: Kagura Tart | 2026-04-15