Hierarchical VLM planning: Let the drone understand instructions such as "land on the east side of Building 3"

An in-depth look at vision-language-action (VLA) models for UAV path planning: tracing the evolution from monolithic end-to-end models to hierarchical semantic planning, covering key work such as RT-2, OpenVLA, Compositional Foundation Models, and LangStrands, explaining why a hierarchical architecture is the best fit for UAV VLA, and giving implementation guidelines.


1. Question: Why doesn't "direct end-to-end" work for drones?

Imagine you give a command to a drone: "Go around Building 2 and land in the open space on the east side of Building 3".

This sentence is simple for humans, but it contains three levels of meaning:

  1. Semantic Layer: Find the locations of Building 2 and Building 3
  2. Spatial Reasoning: Starting from the current location, bypass Building 2 and reach the east side of Building 3
  3. Physical Execution: Generate smooth 3D trajectories and control motors for real-time tracking

If you use pure end-to-end VLA (Vision-Language-Action) to directly output motor control signals from camera images, you will face two fundamental problems:

Therefore, hierarchical VLM planning has become a common choice for industry and academia.

2. Overview of layered architecture

User instruction: "Go around Building 2 and land on the east side of Building 3"

┌────────────────────────────────────────────────────────────────────┐
│ Layer 1: Semantic understanding layer (VLM/LLM)                    │
│ Input: text instruction + map/image                                │
│ Output: "first fly north, go around Building 2, then fly east..."  │
│         (subgoal sequence)                                         │
│ Models: GPT-4V / LLaVA / Gemini 2.0 Flash                          │
└────────────────────────────────────────────────────────────────────┘
                    ↓ subgoals
┌────────────────────────────────────────────────────────────────────┐
│ Layer 2: Trajectory planning layer (MPC / RRT* / ESDF)             │
│ Input: current state + subgoal                                     │
│ Output: sequence of spatial waypoints (x, y, z, t)                 │
│ Safety: theoretical guarantees (reachability, collision checks)    │
└────────────────────────────────────────────────────────────────────┘
                    ↓ waypoints
┌────────────────────────────────────────────────────────────────────┐
│ Layer 3: Control execution layer (PID / nonlinear MPC)             │
│ Input: desired trajectory + current state                          │
│ Output: motor speeds / PWM signals                                 │
│ Rate: 100-400Hz (real-time)                                        │
└────────────────────────────────────────────────────────────────────┘

The benefit of this layering is that the layers are decoupled: each one can be trained or optimized independently with the method best suited to it, with no need for end-to-end training of the full pipeline.
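
To make the decoupling concrete, here is a minimal sketch of the three layers as independent interfaces. Every name in it (`SemanticLayer`, `PlanningLayer`, `ControlLayer`, `run_pipeline`, and the three stub classes) is an illustrative assumption rather than part of any real framework, and the stubs stand in for the VLM, the planner, and the controller.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres, map frame


@dataclass
class Subgoal:
    name: str
    target: Point


class SemanticLayer(Protocol):
    def parse(self, instruction: str) -> List[Subgoal]: ...


class PlanningLayer(Protocol):
    def plan(self, start: Point, subgoal: Subgoal) -> List[Point]: ...


class ControlLayer(Protocol):
    def track(self, state: Point, waypoint: Point) -> Point: ...


class StubSemantic:
    """Stands in for a VLM: returns a fixed, hypothetical subgoal list."""

    def parse(self, instruction: str) -> List[Subgoal]:
        return [Subgoal("bypass_building_2", (0.0, 50.0, 20.0)),
                Subgoal("landing_zone", (40.0, 50.0, 0.0))]


class StraightLinePlanner:
    """Stands in for RRT*/MPC: linear interpolation, no obstacle handling."""

    def __init__(self, steps: int = 5):
        self.steps = steps

    def plan(self, start: Point, subgoal: Subgoal) -> List[Point]:
        return [tuple(s + (g - s) * k / self.steps
                      for s, g in zip(start, subgoal.target))
                for k in range(1, self.steps + 1)]


class PerfectTracker:
    """Stands in for PID/NMPC: assumes each waypoint is reached exactly."""

    def track(self, state: Point, waypoint: Point) -> Point:
        return waypoint


def run_pipeline(instruction: str, start: Point, semantic: SemanticLayer,
                 planner: PlanningLayer, controller: ControlLayer) -> Point:
    """Chain the layers; any layer can be swapped without touching the others."""
    state = start
    for subgoal in semantic.parse(instruction):
        for waypoint in planner.plan(state, subgoal):
            state = controller.track(state, waypoint)
    return state


final_state = run_pipeline("land on the east side of Building 3",
                           (0.0, 0.0, 20.0),
                           StubSemantic(), StraightLinePlanner(), PerfectTracker())
print(final_state)  # (40.0, 50.0, 0.0)
```

Because each layer only depends on the interface of its neighbor, replacing `StraightLinePlanner` with a real planner would not require changes to the semantic or control stubs.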

3. Detailed explanation of each layer

3.1 Layer 1: Semantic understanding layer - let VLM understand the instructions

This is the "smartest" layer in the layered architecture, and where large models are most useful.

Core mission:

Key Technology: Visual Grounding

Map an abstract reference in the language ("east side of Building 3") to a concrete location in the map/image:

# Pseudocode example: vlm, convert_to_waypoints, plan_subgoal_sequence,
# avoid and approach are placeholders, not a real API
def parse_instruction(instruction: str, map_image: Image, ego_view: Image, current_pose: Pose):
    # Step 1: use the VLM to identify the landmarks named in the instruction
    landmarks = vlm.extract_landmarks(
        instruction,  # "Go around Building 2 and land on the east side of Building 3"
        map_image     # bird's-eye-view map
    )
    # → ["building_2 (polygon: [[x1,y1],...])",
    #    "landing_zone: east_side_of_building_3"]

    # Step 2: convert the landmarks into coordinates
    goal_pose = convert_to_waypoints(landmarks, ego_view)

    # Step 3: generate the subgoal sequence
    subgoals = plan_subgoal_sequence(
        current_pose, goal_pose,
        constraints=[avoid("building_2"), approach("building_3", side="east")]
    )
    # → ["fly_north_50m", "turn_east", "descend_20m", "hover_at_landing_zone"]

    return subgoals
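
Once the VLM has returned a building footprint, the geometric half of grounding "east side of Building 3" is simple. The sketch below assumes a map frame with x pointing east and y pointing north, a footprint given as a list of polygon vertices, and a hypothetical helper `side_anchor` with a made-up clearance value; none of these come from a specific library.

```python
from typing import List, Tuple

Point2D = Tuple[float, float]  # (east, north) in metres


def side_anchor(footprint: List[Point2D], side: str, clearance: float = 5.0) -> Point2D:
    """Return a point `clearance` metres outside the footprint's bounding box on `side`."""
    xs = [p[0] for p in footprint]
    ys = [p[1] for p in footprint]
    cx = (min(xs) + max(xs)) / 2.0
    cy = (min(ys) + max(ys)) / 2.0
    if side == "east":
        return (max(xs) + clearance, cy)
    if side == "west":
        return (min(xs) - clearance, cy)
    if side == "north":
        return (cx, max(ys) + clearance)
    if side == "south":
        return (cx, min(ys) - clearance)
    raise ValueError(f"unknown side: {side}")


# Hypothetical footprint for Building 3: a 30 m x 40 m rectangle
building_3 = [(100.0, 40.0), (130.0, 40.0), (130.0, 80.0), (100.0, 80.0)]
landing_xy = side_anchor(building_3, "east", clearance=10.0)
print(landing_xy)  # (140.0, 60.0)
```

A bounding box is a deliberate simplification: for L-shaped or rotated buildings the anchor would need to be checked against the actual polygon and the free-space map.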

Model selection:

| Model | Advantages | Disadvantages | Applicable scenarios |
|------|------|------|----------|
| GPT-4V / GPT-4o | Strong reasoning, strong multimodal ability | Requires an internet connection, high latency | Cloud, latency-insensitive tasks |
| Gemini 2.0 Flash | Fast, free tier available | Cloud API only (not locally deployable); mediocre on Chinese instructions | Cloud, latency-sensitive tasks |
| LLaVA 7B | Locally deployable, open source | Weak on complex instructions | Edge drones |
| Qwen2-VL | Strong on Chinese, open source | Needs quantization for edge deployment | Chinese-language scenarios |

Recent Progress (2024-2025):

3.2 Layer 2: Trajectory planning layer - from semantics to spatial paths

This layer receives the sub-goals of the semantic layer and outputs geometric paths.

Classical methods (non-learning):

# Pseudocode: rrt_star, build_esdf_from_lidar and mpc are placeholders

# RRT* global path planning
path = rrt_star(
    start=current_pose,
    goal=subgoal,
    obstacles=building_2_obstacle,  # from the semantic layer's output
    max_iterations=1000,
    connection_radius=5.0
)

# ESDF local obstacle avoidance (real-time)
esdf_map = build_esdf_from_lidar(lidar_scan)
safe_direction = esdf_map.gradient_at(current_position)

# MPC trajectory optimization along the global path
trajectory = mpc.optimize(
    horizon=20,
    dynamics=uav_dynamics,
    reference=path,
    obstacles=esdf_map,
    cost=[trajectory_smoothness, progress_to_goal, control_effort]
)

Learning-based method (reinforcement learning):

# Policy network: current state + goal → control action
policy = PPO(
    obs_dim=state_dim,      # position, velocity, attitude, nearby obstacles
    act_dim=action_dim,     # velocity command (vx, vy, vz)
)

# Train in simulation; Domain Randomization (DR) improves generalization
# DR parameters: wind speed, latency, sensor noise, air conditions
env.set_domain_randomization(
    wind=(0, 5),           # m/s
    comm_latency=(0, 100), # ms
    sensor_noise=(0, 0.05) # normalized noise
)

**Why not replace this layer with pure RL?**

Because pure RL trajectories come with no theoretical safety guarantees: RL may produce paths that "look feasible but actually hit a wall". The ESDF + MPC combination provides a guarantee instead: obstacle clearance enters the MPC as a constraint, so any solution the MPC returns is collision-free with respect to the ESDF map.
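
The clearance check behind that guarantee can be illustrated with a toy distance field. This is a minimal sketch under stated assumptions: a 2D grid instead of a 3D voxel map, a brute-force unsigned distance rather than a true signed ESDF, and hypothetical helper names `build_distance_field` and `path_is_safe`.

```python
import math
from typing import List, Tuple

Cell = Tuple[int, int]  # (x, y) grid indices


def build_distance_field(width: int, height: int, occupied: List[Cell],
                         resolution: float) -> List[List[float]]:
    """Distance in metres from every cell to the nearest occupied cell (brute force)."""
    field = []
    for y in range(height):
        row = []
        for x in range(width):
            nearest = min(math.hypot(x - ox, y - oy) for ox, oy in occupied)
            row.append(nearest * resolution)
        field.append(row)
    return field


def path_is_safe(field: List[List[float]], path: List[Cell], margin: float) -> bool:
    """A path is accepted only if every cell keeps at least `margin` metres of clearance."""
    return all(field[y][x] >= margin for x, y in path)


# A 2 x 2 cell obstacle in the middle of a 10 x 10 grid, 1 m per cell
occupied = [(4, 4), (5, 4), (4, 5), (5, 5)]
field = build_distance_field(10, 10, occupied, resolution=1.0)

path_far = [(x, 0) for x in range(10)]   # passes 4 m from the obstacle
path_near = [(x, 3) for x in range(10)]  # passes 1 m from the obstacle

print(path_is_safe(field, path_far, margin=2.0))   # True
print(path_is_safe(field, path_near, margin=2.0))  # False
```

The guarantee is only as good as the map: an obstacle missing from the field is invisible to this check, which is why the map update rate matters for dynamic obstacles.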

3.3 Layer 3: Control execution layer - real-time stable tracking

This layer is the most mature, and traditional control theory is completely sufficient:

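# Nonlinear MPC control (pseudocode: NMPC, MotorMixer and diag are placeholders)
class UAVController:
    def __init__(self):
        self.mpc = NMPC(
            horizon=10,          # predict 10 steps (~1 second)
            dt=0.1,              # control period 0.1 s (10Hz)
            Q=diag([1,1,1]),     # position error weights
            R=diag([0.1,0.1])    # control effort weights
        )
        self.motor_mixer = MotorMixer()  # maps control inputs to per-motor commands

    def control(self, state, ref_trajectory):
        # ref_trajectory comes from Layer 2
        u = self.mpc.solve(state, ref_trajectory)
        return self.motor_mixer.mix(u)  # convert to motor speeds

    def safety_check(self, state):
        # Real-time safety fallback hook. As written it only flags the landing
        # phase (low altitude and low speed); forcing a hover in dangerous
        # states is not implemented here
        if state.altitude < 2.0 and state.speed < 0.5:
            return "LANDING"
        return "FLY"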
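
For a feel of what tracking at this layer involves, here is a much simpler stand-in than NMPC: a PD law holding altitude on a unit-mass double integrator with gravity assumed to be already compensated. The function name `simulate_altitude_pd`, the gains, and the 100Hz step are assumptions chosen for illustration, not tuned values for any real vehicle.

```python
def simulate_altitude_pd(z_ref: float, z0: float = 0.0, kp: float = 4.0,
                         kd: float = 4.0, dt: float = 0.01,
                         duration: float = 5.0) -> float:
    """Track a constant altitude reference and return the altitude at the end."""
    z, vz = z0, 0.0
    for _ in range(round(duration / dt)):
        # PD law: pull toward the reference, damp the vertical velocity
        accel_cmd = kp * (z_ref - z) - kd * vz
        # Semi-implicit Euler integration of the double integrator
        vz += accel_cmd * dt
        z += vz * dt
    return z


final_altitude = simulate_altitude_pd(z_ref=10.0)
print(f"altitude after 5 s: {final_altitude:.2f} m")
```

With kp = 4 and kd = 4 the closed loop is critically damped (natural frequency 2 rad/s), so the altitude settles on the reference without overshoot in this idealized model.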

4. Key Research Work

4.1 Compositional Foundation Models for Hierarchical Planning

Ajay et al., arXiv:2309.08587 (2023)

This paper is important foundational work and proposes the concept of "compositional foundation models":

- Core idea: compose several specialized foundation models so that each layer does one thing, and complete complex tasks through their composition

Why it matters for UAVs: hierarchical planning lets each layer reuse a pre-trained model independently, without retraining the whole system for UAV scenarios.

4.2 LangStrands - Natural-language-controlled robots

LangStrands (2024) - uses natural language to control robots performing reconnaissance/operation tasks in industrial settings:

- Supports complex instructions: "Check the equipment in area A first, and if any abnormality is found, report to location B"

Takeaway for UAVs: LangStrands' task-graph parsing approach can be transferred directly to UAVs, for compound tasks such as "first scout 5 target points, then return to base".

4.3 Embodied Tree of Thoughts - World-model-assisted planning

Xu et al., arXiv:2512.08188 (2025)

Value to UAVs: in flight, a world model can predict the effect of gusts and the remaining hover capability as the battery degrades, so a safer trajectory can be planned in advance.

4.4 OpenVLA - Open-source robot VLA

OpenVLA (2024) - open-source VLA model released by a team from Stanford, UC Berkeley, and collaborating institutions:

4.5 Embodied Arena - Unified evaluation platform

Ni et al., arXiv:2509.15273 (2025)

Importance: With a unified evaluation platform, each layer of the layered architecture can be independently evaluated, and optimization can be based on evidence.

5. Sim2Real: How to transfer trained strategies to real drones

A key advantage of the layered architecture: each layer can go through Sim2Real transfer independently, which removes the need to transfer the full pipeline end to end.

5.1 Sim2Real difficulty analysis of each layer

| Layer | Training environment | Transfer difficulty | Core challenge |
|------|------|------|----------|
| Layer 1 (VLM) | Any image/map | Low | The VLM is already pre-trained and generalizes well |
| Layer 2 (RL) | AirSim / Flightmare | Medium | Aerodynamic parameter mismatch |
| Layer 3 (MPC) | Parameter tuning on the real drone | Low | Only the motor parameters need calibration |

5.2 Sim2Real Strategy for Layer 2

Domain Randomization (DR):

import random

# Randomize key physical parameters at the start of each simulated episode
class SimEnv:
    def reset(self):
        self.wind = random.uniform(-3, 3)      # gust speed (m/s)
        self.motor_lag = random.uniform(0.8, 1.2)  # motor response coefficient
        self.battery_level = random.uniform(0.7, 1.0)  # battery state
        self.gps_noise = random.uniform(0, 0.5)  # GPS noise (m)

**Real2Sim calibration (on the real vehicle):**

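# Run system identification on the real drone
# (pseudocode: real_uav, motor_model and sim_env are placeholders)
def calibrate_dynamics(real_uav, motor_model, sim_env):
    # Excitation signal: step inputs of increasing amplitude
    step_responses = []
    for amplitude in [0.1, 0.3, 0.5]:
        response = real_uav.step_input(thrust=amplitude)
        step_responses.append(response)

    # Fit the real motor response curve to all recorded responses
    motor_model.fit(step_responses)

    # Write the calibrated parameters back into the simulation
    sim_env.set_dynamics(motor_model.params)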
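
What the fitting step might look like depends on the motor model. The sketch below assumes a first-order lag, y(t) = K(1 - exp(-t/tau)), which the text above does not specify, and recovers the time constant tau by least squares on the linearized response. `fit_time_constant` is a hypothetical helper, and the data is synthetic rather than measured.

```python
import math
from typing import List, Tuple


def fit_time_constant(samples: List[Tuple[float, float]], steady_state: float) -> float:
    """Estimate tau for y(t) = K * (1 - exp(-t / tau)).

    Linearization: ln(1 - y/K) = -t / tau, a line through the origin, so the
    least-squares slope gives tau = -sum(t^2) / sum(t * ln(1 - y/K)).
    """
    num = 0.0
    den = 0.0
    for t, y in samples:
        ratio = 1.0 - y / steady_state
        if t <= 0.0 or ratio <= 1e-6:
            continue  # skip t = 0 and saturated samples where the log is ill-conditioned
        num += t * t
        den += t * math.log(ratio)
    if den == 0.0:
        raise ValueError("not enough usable samples")
    return -num / den


# Synthetic step response sampled every 5 ms with a true time constant of 50 ms
tau_true = 0.05
samples = [(0.005 * k, 1.0 - math.exp(-0.005 * k / tau_true)) for k in range(41)]
tau_est = fit_time_constant(samples, steady_state=1.0)
print(f"estimated tau: {tau_est * 1000:.1f} ms")
```

On real data the samples are noisy, so the late part of the response (where 1 - y/K is close to zero) should be weighted down or dropped, as the `ratio` threshold does crudely here.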

5.3 Practical case: MADER's Sim2Real

MADER (Trajectory Planner in Multi-Agent and Dynamic Environments), an optimization-based multi-UAV planner rather than a reinforcement learning method, is one of the standout recent examples of UAV Sim2Real:

6. Implementation guidelines

6.1 Hardware and software stack

Hardware:                                   Software:
- Pixhawk flight controller (or Crazyflie)  - PX4 / ArduPilot firmware
- Jetson Orin NX (edge computing)           - ROS 2 Humble
- Livox lidar / RealSense                   - Depth camera + IMU

Software layers:
┌────────────────────────────────────────────┐
│ VLM (Layer 1): LLaVA 7B / Qwen2-VL         │
│ Inference engine: llama.cpp / vLLM         │
│ Inference hardware: Jetson Orin NX (INT8)  │
└────────────────────────────────────────────┘
┌────────────────────────────────────────────┐
│ Planner (Layer 2):                         │
│ - Global: RRT* (OMPL)                      │
│ - Local: OSQP / Crocoddyl (MPC)            │
└────────────────────────────────────────────┘
┌────────────────────────────────────────────┐
│ Controller (Layer 3): PX4 SITL /           │
│               ArduPilot guided mode        │
└────────────────────────────────────────────┘

6.2 ROS 2 message interface

# Layer 1 → Layer 2 message
class SubgoalMsg(Message):
    position: Point  # target point
    constraints: List[Constraint]  # obstacle-avoidance constraints
    priority: int  # priority
    timeout: float  # timeout

# Layer 2 → Layer 3 message
class TrajectoryMsg(Message):
    waypoints: List[PoseStamped]  # waypoint sequence
    velocities: List[float]       # desired velocities
    start_time: Time              # planned start time

6.3 Latency budget (real-time guarantee)

Total latency budget: < 500ms (acceptable)
├─ Image capture: 30ms (30fps)
├─ VLM inference: 200-400ms (LLaVA 7B @ INT8)
├─ Trajectory planning: 50ms (RRT* + MPC)
└─ Controller tracking: real-time (100Hz)
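
Summing the sequential stages of the budget shows how little headroom is left in the worst case. The stage figures are the ones listed above; the controller is left out of the sum on the assumption that it runs as its own 100Hz loop in parallel rather than in series with planning.

```python
BUDGET_MS = 500.0

# (stage, best-case ms, worst-case ms), taken from the budget above
stages = [
    ("image capture", 30.0, 30.0),
    ("VLM inference", 200.0, 400.0),
    ("trajectory planning", 50.0, 50.0),
]

best_ms = sum(stage[1] for stage in stages)
worst_ms = sum(stage[2] for stage in stages)
headroom_ms = BUDGET_MS - worst_ms

print(f"best {best_ms:.0f} ms, worst {worst_ms:.0f} ms, headroom {headroom_ms:.0f} ms")
# best 280 ms, worst 480 ms, headroom 20 ms
```

With only 20ms to spare in the worst case, any extra serialization or transport delay between layers is enough to break the budget, which motivates the mitigations that follow.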

If VLM inference is too slow, you can:

  1. Use streaming inference so the planning layer can start from intermediate results
  2. Use a lightweight model (LLaVA 3B / Qwen2-VL 2B)
  3. Cache the planning results of frequently used instructions (valid while the map is fixed)
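
The caching option can be sketched as a small wrapper around the Layer 1 planner. `PlanCache` and its normalization rule are hypothetical; the key point is that the map version is part of the cache key, so a map change never serves a stale plan.

```python
from typing import Callable, Dict, List, Tuple


class PlanCache:
    """Caches subgoal lists keyed by (normalized instruction, map version)."""

    def __init__(self, planner: Callable[[str], List[str]]):
        self._planner = planner
        self._store: Dict[Tuple[str, str], List[str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _normalize(instruction: str) -> str:
        # Lowercase and collapse whitespace so trivial variations share one entry
        return " ".join(instruction.lower().split())

    def get(self, instruction: str, map_version: str) -> List[str]:
        key = (self._normalize(instruction), map_version)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._planner(instruction)
        return list(self._store[key])  # copy so callers cannot mutate the cache


planner_calls = []


def slow_vlm_planner(instruction: str) -> List[str]:
    """Stand-in for the expensive VLM call; records each invocation."""
    planner_calls.append(instruction)
    return ["fly_north_50m", "turn_east", "descend_20m", "hover_at_landing_zone"]


cache = PlanCache(slow_vlm_planner)
cache.get("Land on the east side of Building 3", "map-v1")
cache.get("  land on the EAST side of building 3 ", "map-v1")  # cache hit
cache.get("Land on the east side of Building 3", "map-v2")     # new map, planner runs again
print(len(planner_calls), cache.hits, cache.misses)  # 2 1 2
```

Exact-match keys only help with repeated phrasings; paraphrases of the same instruction would still miss and go to the VLM.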

7. Current challenges and future directions

7.1 Core Challenges

  1. VLM inference latency: LLaVA 7B takes about 200-400ms per inference on an edge GPU, too slow for safety-critical reactions
  2. Command ambiguity: vague commands such as "land in a safe place" are hard for a VLM to resolve
  3. Multi-layer error accumulation: a semantic error in Layer 1 → path deviation in Layer 2 → control jitter in Layer 3
  4. Dynamic obstacles: Layer 2's ESDF map updates cannot keep up with fast-moving obstacles (such as birds in flight)
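
A quick calculation shows why latency and map update rate matter. The speeds and the 10Hz map update rate below are assumptions chosen for illustration; only the 200-400ms inference latency comes from the text.

```python
def blind_distance_m(speed_mps: float, latency_ms: float) -> float:
    """Distance covered at constant speed while waiting `latency_ms` for a result."""
    return speed_mps * latency_ms / 1000.0


# Drone cruising at an assumed 5 m/s while the VLM is still inferring
print(blind_distance_m(5.0, 200.0))  # 1.0
print(blind_distance_m(5.0, 400.0))  # 2.0

# Bird at an assumed 15 m/s between two ESDF updates at an assumed 10 Hz (100 ms)
print(blind_distance_m(15.0, 100.0))  # 1.5
```

Under these assumptions the drone flies 1 to 2 metres on stale semantics per VLM call, and a fast obstacle moves 1.5 metres between map updates, so the safety margin in Layer 2 has to absorb at least that much.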

7.2 Future Directions

8. Summary

Hierarchical VLM planning is currently the most feasible route for UAV intelligence:

The core advantage of this architecture: let the large model do what it is good at (understanding) and let classical methods do what they are good at (safe planning), each with a clear responsibility, rather than having one black-box end-to-end model carry all the risk.


*References (in order of citation in the text)*

1. Ajay et al., "Compositional Foundation Models for Hierarchical Planning", arXiv:2309.08587, 2023
2. Kim et al., "OpenVLA: An Open-Source Vision-Language-Action Model", arXiv:2406.09246, 2024
3. Liu et al., "Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI", arXiv:2407.06886, 2024
4. Xu et al., "Embodied Tree of Thoughts: Deliberate Manipulation Planning with Embodied World Model", arXiv:2512.08188, 2025
5. Ni et al., "Embodied Arena: A Comprehensive Evaluation Platform for Embodied AI", arXiv:2509.15273, 2025
6. Mu et al., "Embodied AI-Enhanced IoMT Edge Computing: UAV Trajectory Optimization", arXiv:2512.20902, 2025
7. Zhou et al., "OmniShow: Unifying Multimodal Conditions for Human-Object Interaction", arXiv:2604.11804, 2026

Author: Kagura Tart | 2026-04-15