# Paper E Experimental Task Book v2: Verification and Error Correction UAV Language Planning for AAAI

This file keeps the file name paper-e-vera-uav-experiment-taskbook-v1-20260517.md because this round requires “direct modification on the V1 version”. The text, title and release notes have all been upgraded to v2. This document is not a final paper draft but an executable experimental task statement. It specifies the research positioning of Paper E, the real, citable literature, the algorithm design, data construction, comparative experiments, ablation experiments, evaluation indicators, the boundaries of theoretical completeness, and the follow-up AAAI/T-ITS plan. The 2026-05-19 supplement focuses on data leakage prevention, failure taxonomy, parameter budgeting, indicator formulas, chart planning, and AAAI compliance risks.

## 1. Research background and goals

Urban low-altitude UAV mission planning is moving from “engineer-preset routes” to “natural-language, mission-driven” operation. In practice, operators are more likely to give instructions such as:

“First check the east facade of Building 3, then go to the landing point on the roof and wait.”
“Avoid the air above the hospital and reach the temporary delivery area within 30 seconds.”
“If the south corridor is occupied, bypass the west corridor but keep a safe distance of more than 20 meters throughout.”

These instructions simultaneously involve semantic understanding, temporal order, spatial constraints, continuous trajectory safety, and reachability judgments. Large language models (LLMs) are good at understanding natural language and generating candidate plans, but they cannot guarantee that the output plan is executable in physical space, nor that aviation safety constraints are met. Formal methods such as Linear Temporal Logic (LTL) and Signal Temporal Logic (STL) provide verifiable semantics, but writing specifications by hand requires expert knowledge and is poorly suited to non-expert operators.

Existing work has shown that natural language to LTL translation can significantly lower the barrier to writing robot task specifications. For example, Lang2LTL converts complex navigation commands into LTL and evaluates generalization in unseen environments [1]; NL2LTL provides an open source Python package for translating natural language to LTL [2]; LTLCodeGen uses code generation to improve the syntactic correctness of LTL and integrates it into robot path planning [3]; ConformalNL2LTL further attempts to use conformal prediction to guarantee translation accuracy [4]. These works provide an important foundation for this study.

For low-altitude UAV scenarios, however, NL-to-LTL translation alone is not enough. UAV missions have three additional requirements:

Continuous safety constraints: Constraints on flight altitude, speed, obstacle distance, and time windows are naturally constraints on continuous signals, and are better evaluated with STL robustness.
Executable trajectory closed loop: A correct specification does not imply a feasible trajectory; feasibility must be verified against the map, the dynamics, and the planner.
Repairable errors: LLM errors should not merely be flagged as errors; the verifier should convert them into counterexample or robustness feedback that then drives LLM correction.

Therefore, this work proposes VERA-UAV, a neuro-symbolic planning framework with verification and error correction for natural language UAV tasks. The AAAI version prioritizes answering one core question:

Given a natural language UAV mission, how can a locally deployed open source LLM generate verifiable, repairable, and executable LTL/STL mission specifications and trajectories, rather than textual plans that merely look reasonable but are not provably safe?

The AAAI main conference version focuses on AI planning, neuro-symbolic verification, and LLM self-repair. System-level content such as AirSim, real low-altitude logistics, and multi-UAV airspace throughput is deferred to the subsequent T-ITS extended version.

## 2. Problem definition and core assumptions

### 2.1 Input and output

Given a UAV task instance:

$$(x_{\mathrm{NL}}, \mathcal{M}, s_0),$$

where $x_{\mathrm{NL}}$ is the natural language task instruction, $\mathcal{M}$ is the semantically annotated urban low-altitude map, and $s_0$ is the initial UAV state. The map contains buildings, no-fly zones, passable airspace, landing points, inspection targets, dynamic obstacles, and altitude levels.

System output:

$$\varphi_{\mathrm{LTL}} = G(\neg \mathrm{collision}) \wedge F(\mathrm{reach}(\mathrm{goal})) \wedge G(\neg \mathrm{enter}(\mathrm{no\_fly\_zone}))$$

$$\varphi_{\mathrm{STL}} = G_{[0,T]}(d_{\mathrm{obs}}(t) \ge d_{\min}) \wedge G_{[0,T]}(h_{\min} \le h(t) \le h_{\max}) \wedge F_{[0,30]}(\mathrm{reach}(\mathrm{goal}))$$

A verifier result of `PASS` indicates that the trajectory satisfies the specification; on failure, the verifier returns the violated clause, the violation time, and the minimum safety margin.
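A minimal sketch of how the robustness of the STL template above could be evaluated on a sampled trajectory, assuming discrete-time semantics (minimum over samples for $G$, maximum for $F$, minimum for conjunction). The function names, the argument layout, and the use of `goal_radius - distance_to_goal` as the margin of `reach(goal)` are illustrative assumptions, not part of the task book.

```python
from typing import Sequence

def always(margins: Sequence[float]) -> float:
    """Robustness of G over a window: the worst (minimum) margin."""
    return min(margins)

def eventually(margins: Sequence[float]) -> float:
    """Robustness of F over a window: the best (maximum) margin."""
    return max(margins)

def stl_robustness(times, d_obs, h, d_goal, *, d_min, h_min, h_max,
                   goal_radius, deadline):
    """Robustness of G(d_obs >= d_min) AND G(h_min <= h <= h_max)
    AND F_[0,deadline](reach(goal)) on a sampled trajectory."""
    clearance = always([d - d_min for d in d_obs])
    altitude = always([min(z - h_min, h_max - z) for z in h])
    # Assumed predicate: the goal is reached when d_goal <= goal_radius.
    in_window = [goal_radius - g for t, g in zip(times, d_goal) if t <= deadline]
    reach = eventually(in_window) if in_window else float("-inf")
    return min(clearance, altitude, reach)  # conjunction = minimum
```

A positive return value means every clause holds with some margin; a negative value identifies a violated specification, and the smallest of the three terms points to the clause responsible.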

### 4.4 Counterexample-driven repair

Instead of just returning pass/fail, the verifier returns a structured diagnostic:

{
  "status": "FAILED",
  "stage": "STL_ROBUSTNESS",
  "violated_clause": "G[0,T](distance_to_obstacle >= 10)",
  "counterexample_trace": [
    {"t": 14.2, "x": 38, "y": 51, "z": 30, "distance_to_obstacle": 6.4}
  ],
  "robustness": -3.6,
  "repair_hint": "Increase safety margin or route around building_7 west side."
}

The repair prompt does not let the LLM improvise freely; it restricts the LLM to modifying only the relevant fields of the TaskIR:

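The TaskIR you generated failed STL verification.
Violated clause: G[0,T](distance_to_obstacle >= 10)
Counterexample: at t=14.2s the distance to building_7 is only 6.4m.
Modify only the route constraint or the safety margin; do not change the user's original goal.
Output the new TaskIR JSON.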

The point of this design is to narrow the LLM's search space and make the repair behavior explainable, loggable, and reproducible.
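As an illustration of the diagnostic-to-prompt step, the following sketch builds the structured diagnostic from sampled distances and renders an English repair prompt from it. The helper names `build_diagnostic` and `render_repair_prompt`, the `PASSED` record shape, and the prompt wording are assumptions made for this example; only the field names of the failed diagnostic follow the JSON shown above.

```python
def build_diagnostic(samples, d_min, obstacle_id):
    """Diagnostic for G[0,T](distance_to_obstacle >= d_min) on sampled states."""
    worst = min(samples, key=lambda s: s["distance_to_obstacle"])
    robustness = round(worst["distance_to_obstacle"] - d_min, 6)
    if robustness > 0:
        return {"status": "PASSED", "stage": "STL_ROBUSTNESS",
                "robustness": robustness}
    return {
        "status": "FAILED",
        "stage": "STL_ROBUSTNESS",
        "violated_clause": f"G[0,T](distance_to_obstacle >= {d_min})",
        "counterexample_trace": [worst],
        "robustness": robustness,
        "repair_hint": f"Increase safety margin or route around {obstacle_id}.",
    }

def render_repair_prompt(diag, obstacle_id):
    """Turn a FAILED diagnostic into a constrained repair prompt."""
    point = diag["counterexample_trace"][0]
    return "\n".join([
        "The TaskIR you generated failed STL verification.",
        f"Violated clause: {diag['violated_clause']}",
        f"Counterexample: at t={point['t']}s the distance to {obstacle_id} "
        f"is only {point['distance_to_obstacle']}m.",
        "Modify only the route constraint or the safety margin; "
        "do not change the user's original goal.",
        "Output the new TaskIR JSON.",
    ])
```

Because the prompt is generated from the diagnostic rather than written by hand, each repair round can be logged as a (diagnostic, prompt, response) triple and replayed.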

If LLM repair still fails after the allowed number of consecutive rounds, the system enters the symbolic enumeration fallback. The enumeration scope is bounded by the TaskIR DSL depth, the map entity set, the allowed constraint templates, and the maximum task horizon. The enumerator expands the most relevant fields first, based on the diagnostic results, such as safe distance, detour side, time window, target sequence, and fallback landing pad.
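One way to picture the bounded, diagnostic-prioritized enumeration is the sketch below. The field names, value sets, and the stage-to-field `PRIORITY` table are hypothetical placeholders; the actual TaskIR DSL would supply them.

```python
from itertools import product

# Hypothetical bounded repair space; field names and values are illustrative.
REPAIR_SPACE = {
    "safety_margin_m": [10, 15, 20],
    "detour_side": ["west", "east"],
    "time_window_s": [30, 45, 60],
}

# Hypothetical mapping from verifier stage to the fields most likely to matter.
PRIORITY = {
    "STL_ROBUSTNESS": ["safety_margin_m", "detour_side"],
    "TIME_WINDOW": ["time_window_s"],
}

def enumerate_repairs(task_ir, diagnostic, space=REPAIR_SPACE):
    """Yield candidate TaskIRs, varying diagnostic-relevant fields first."""
    relevant = [f for f in PRIORITY.get(diagnostic["stage"], []) if f in space]
    others = [f for f in space if f not in relevant]
    seen = {tuple(sorted(task_ir.items()))}  # never re-propose the failed TaskIR
    # Pass 1 varies only the relevant fields; pass 2 varies every field.
    for fields in (relevant, relevant + others):
        for values in product(*(space[f] for f in fields)):
            candidate = {**task_ir, **dict(zip(fields, values))}
            key = tuple(sorted(candidate.items()))
            if key not in seen:
                seen.add(key)
                yield candidate
```

Since every field has a finite value set, the generator is exhausted after finitely many candidates, which is the property the fallback relies on.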

### 4.5 Trajectory generation

The AAAI version uses a lightweight reproducible trajectory generator:

2D grid A*: for basic reach-avoid and sequential tasks.
3D grid A*: used for height levels and urban low-altitude corridors.
RRT*: for supplementary verification in continuous space.
MPC-lite/trajectory smoothing: used to check whether turning radius, speed change and height change satisfy simplified dynamics constraints.

The trajectory generator is not the contribution of this work. Its role is to push the specification translation problem to the level of “whether an executable trajectory really exists”.
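For the 2D grid A* baseline, a minimal sketch could look as follows, assuming a 4-connected occupancy grid with unit step cost and a Manhattan heuristic. The grid encoding (`1` = blocked) and the function name are choices made for this example.

```python
import heapq

def astar_grid(grid, start, goal):
    """4-connected A* on a 2D occupancy grid (1 = blocked).
    Returns the list of cells from start to goal, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):  # Manhattan distance: admissible for unit-cost 4-connected moves
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    parent = {start: None}
    best_g = {start: 0}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        if g > best_g[cell]:
            continue  # stale heap entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    parent[nxt] = cell
                    heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None
```

Returning `None` when the goal is unreachable gives the verification loop an explicit "no trajectory exists on this grid" signal instead of a partial path.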

## 5. Proof of theoretical properties and relative completeness

v1 only claimed that “verification and error correction can improve reliability”, without stating any mathematical boundary. v2 makes the algorithmic properties explicit: VERA-UAV does not claim that the LLM itself is complete; it claims relative completeness under the assumptions of a finite DSL, a decidable verifier, and a complete underlying planner.

### 5.1 Formal setting

Discretize the urban low-altitude map into a finite weighted graph:

$$G=(V,E,w), \quad |V|<\infty, \quad |E|<\infty.$$

Each node $v\in V$ carries a set of atomic propositions $L(v)$, such as `goal_A`, `building_7_margin`, `no_fly_zone`, `altitude_layer_3`. Trajectories are finite sequences:

$$\tau = (v_0, v_1, \ldots, v_T), \quad (v_t,v_{t+1})\in E.$$
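To make the labeled-graph setting concrete, the sketch below checks that a node sequence is a valid trajectory in $G$ and evaluates a reach-avoid property on the finite trace using the labels $L(v)$. The edge-set and label-dictionary encodings, the function names, and the shape of the returned counterexample are assumptions for this example.

```python
def is_valid_trajectory(tau, edges):
    """True if every consecutive pair of nodes in tau is an edge of G."""
    return all((u, v) in edges for u, v in zip(tau, tau[1:]))

def check_reach_avoid(tau, labels, goal="goal_A", forbidden=("no_fly_zone",)):
    """Finite-trace check of F(goal) AND G(not forbidden) using labels L(v).
    Returns (True, None) or (False, counterexample)."""
    for step, node in enumerate(tau):
        hit = [p for p in forbidden if p in labels.get(node, set())]
        if hit:
            return False, {"violated": f"G(!{hit[0]})", "step": step, "node": node}
    if not any(goal in labels.get(node, set()) for node in tau):
        return False, {"violated": f"F({goal})", "step": None, "node": None}
    return True, None
```

The counterexample records the first violating step, which is the kind of information a repair prompt or a symbolic enumerator can act on.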

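The bounded TaskIR DSL is the set of TaskIR programs $\psi$ within the depth, horizon, and entity limits:

$$\mathcal{D}_{H,D} = \{\psi: \mathrm{depth}(\psi)\le D,\ \mathrm{horizon}(\psi)\le H,\ \mathrm{entities}(\psi)\subseteq \mathcal{E}(\mathcal{M})\}.$$

The compiler $C$ maps a TaskIR to its LTL/STL specification pair:

$$C(\psi)=(\varphi_{\mathrm{LTL}},\varphi_{\mathrm{STL}}).$$

The verifier $V$ checks a trajectory against the compiled specification, with $\rho$ denoting STL robustness:

$$V(\tau, C(\psi)) = \begin{cases} \mathrm{PASS}, & \tau \models \varphi_{\mathrm{LTL}}\ \land\ \rho(\tau,\varphi_{\mathrm{STL}})>0, \\ \mathrm{FAIL}(\eta), & \text{otherwise}, \end{cases}$$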

where $\eta$ is a counterexample, unsat core, or robustness trace.

### 5.2 Algorithm pseudocode

```text
Algorithm VERA-UAV
Input: natural language x_NL, map M, initial state s0
Output: verified trajectory tau or UNSAT / NEED_CLARIFICATION

Q ← LLM_PROPOSE(x_NL, M)
Q ← TYPECHECK_AND_RANK(Q)
Visited ← ∅
for iter = 1 ... B do
    if Q has no unvisited candidate:
        Q ← Q ∪ SYMBOLIC_ENUMERATE_NEXT(D, H)
        if Q still has no unvisited candidate:
            return UNSAT
    ψ ← POP_UNVISITED(Q, Visited)
    Visited ← Visited ∪ {ψ}
    if ψ has missing entity or underspecified field:
        η ← type / grounding diagnostic
        Q ← Q ∪ REPAIR(ψ, η)
        if all remaining candidates require the same external information:
            return NEED_CLARIFICATION
        continue
    (φ_LTL, φ_STL) ← COMPILE(ψ)
    if compiler or syntax verifier fails:
        η ← compiler diagnostic
        Q ← Q ∪ REPAIR(ψ, η)
        continue
    τ ← COMPLETE_PLANNER(G, s0, φ_LTL, φ_STL)
    if τ exists and VERIFY(τ, φ_LTL, φ_STL) = PASS:
        return τ
    η ← counterexample / unsat core / robustness trace
    Q ← Q ∪ LLM_REPAIR(ψ, η)
    if LLM repair budget exhausted:
        Q ← Q ∪ SYMBOLIC_ENUMERATE(ψ, η, D, H)
return UNSAT
```

### 5.3 Theorem 1: Termination

**Theorem 1 (Termination).** If the TaskIR DSL $\mathcal{D}_{H,D}$ is finite and the algorithm sets a finite candidate budget $B$, then VERA-UAV returns a verified trajectory, `UNSAT`, or `NEED_CLARIFICATION` in a finite number of steps.

**Proof sketch.** Each iteration pops one unvisited candidate TaskIR from the queue $Q$, and `Visited` prevents any candidate from being expanded twice. The number of LLM repair rounds is bounded, the symbolic enumeration space $\mathcal{D}_{H,D}$ is finite, and the outer loop executes at most $B$ times. Every branch either returns or proceeds to the next iteration of this bounded loop, so the algorithm cannot run forever. This completes the proof sketch.
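A simplified Python rendering of the VERA-UAV loop may help when wiring the components together. It is a sketch under stated assumptions: the grounding, `NEED_CLARIFICATION`, and compiler-failure branches are omitted, symbolic enumeration is folded into a single `enumerate_next` callable that is consulted whenever the queue runs dry, candidates are assumed hashable, and all callables are injected stand-ins rather than real components.

```python
def vera_uav_loop(propose, compile_ir, plan, verify, llm_repair,
                  enumerate_next, budget, repair_budget):
    """Bounded propose / verify / repair loop.
    Returns ("VERIFIED", tau) or ("UNSAT", None)."""
    queue = list(propose())
    visited = set()
    repairs_used = 0
    for _ in range(budget):  # outer loop runs at most `budget` (B) times
        pending = [c for c in queue if c not in visited]
        if not pending:
            # Queue exhausted: fall back to symbolic enumeration.
            pending = [c for c in enumerate_next() if c not in visited]
            if not pending:
                return "UNSAT", None
            queue.extend(pending)
        psi = pending[0]
        visited.add(psi)  # never expand the same candidate twice
        spec = compile_ir(psi)
        tau = plan(spec)
        ok, eta = verify(tau, spec) if tau is not None else (False, {"stage": "PLANNER"})
        if ok:
            return "VERIFIED", tau
        if repairs_used < repair_budget:
            repairs_used += 1
            queue.extend(llm_repair(psi, eta))
    return "UNSAT", None
```

The two termination arguments of the proof sketch are visible directly: the `for` loop is bounded by `budget`, and `visited` guarantees that a finite candidate space is eventually exhausted.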
### 5.4 Theorem 2: Soundness

**Theorem 2 (Soundness).** If VERA-UAV returns a trajectory $\tau$, then, relative to the given map model, monitor semantics, and trajectory discretization accuracy, $\tau$ satisfies the compiled LTL/STL specification:

$$\tau \models \varphi_{\mathrm{LTL}} \quad \text{and} \quad \rho(\tau,\varphi_{\mathrm{STL}})>0.$$

$$\mathrm{FSR} = \frac{\#\{\text{unsafe but returned as executable}\}}{\#\{\text{all returned executable}\}}.$$

In the AAAI paper, FSR should be treated as the most critical negative indicator on the safety side. The main selling point of VERA-UAV is not producing an output for every task, but avoiding false safety.

**Statistical Test**

- For binary indicators such as ESS, FSR, and UNSAT detection, use the paired McNemar test.
- For continuous indicators such as robustness, optimality gap, and runtime, use a paired bootstrap 95% CI and the Wilcoxon signed-rank test.
- For comparisons against multiple baselines, apply the Holm-Bonferroni correction.
- A conclusion is written into the main text only when $p<0.05$ and the effect size reaches the pre-registered threshold.

**Success Criteria**

The minimum conditions for the main AAAI conclusion to hold:

1. The ESS of VERA-UAV full is significantly higher than that of the LTLCodeGen-style and T3-style baselines.
2. The FSR of VERA-UAV full is significantly lower than that of all LLM-only baselines.
3. After removing STL robustness feedback, failures related to continuous safety constraints increase significantly.
4. Symbolic fallback provides a measurable gain on samples where LLM repair fails.

### 8.4 Generalization experiment

Generalization dimensions:

- Unseen maps.
- Unseen entity names.
- Natural language paraphrases.
- Longer temporal compositions.
- Tighter time windows.
- A higher proportion of unsatisfiable tasks.

The generalization experiments focus on reporting whether VERA-UAV can identify unsatisfiable or ambiguous tasks instead of outputting erroneous trajectories.

### 8.5 Case study

Prepare at least three visualization cases:

1. **Syntax repair case**: the LLM outputs an illegal STL formula, Spot/RTAMT reports an error, and the system repairs it.
2. **Trajectory safety case**: LTL is satisfied but STL robustness is negative; after a detour the robustness becomes positive.
3. **Unsatisfiable case**: the user requirements conflict, and the system outputs `UNSAT`.
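For the statistical tests listed under **Statistical Test**, the binary-indicator part can be sketched with the standard library alone: an exact two-sided McNemar test on discordant pair counts and Holm step-down adjustment of the resulting p-values. The function names are chosen for this example, and a maintained statistics package may be preferable in the real pipeline.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.
    b: pairs where only method A succeeds; c: pairs where only method B succeeds."""
    n = b + c
    if n == 0:
        return 1.0
    # Binomial(n, 0.5) tail probability of a split at least as extreme.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def holm_bonferroni(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # The smallest p is multiplied by m, the next by m-1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running = max(running, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running
    return adjusted
```

A comparison would then be reported only if its adjusted p-value is below 0.05 and the effect size meets the pre-registered threshold.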
### 8.6 AAAI Main Text Chart Plan

Space in the AAAI main text is very tight, and every chart must serve the core argument. It is recommended that only five charts be included in the main text, with the rest moved to the appendix:

| Chart | Purpose | Placement |
|------|------|----------|
| Figure 1: VERA-UAV pipeline | Shows at a glance the closed loop of typed IR, verification, repair, and fallback | Method |
| Table 1: Core literature positioning matrix | Shows that this work is not a simple NL-to-LTL application | Related Work |
| Table 2: Main experiment results | Paired comparison of ESS, FSR, robustness, runtime | Experiments |
| Figure 2: Failure taxonomy stacked chart | Illustrates which failure types the method mainly reduces | Experiments |
| Figure 3: Case study trajectory | Shows how counterexample feedback corrects negative robustness to positive | Experiments / Appendix |

It is not recommended to expand on the prompts, the complete DSL grammar, or all map screenshots in the main text. This material should be placed in the code/data appendix so that it does not crowd out the contribution argument.

---

## 9. Ablation Experiment Design

| Ablation | Variant | Purpose |
|--------|------|------|
| Remove typed IR | Direct LTL/STL generation | Verify whether the structured intermediate representation improves reliability |
| Remove counterexample feedback | Generic retry | Verify whether counterexample feedback is more effective than a plain retry |
| Remove STL robustness feedback | LTL-only verification | Verify the importance of continuous safety constraints |
| One-shot repair | Repair at most 1 time | Evaluate the benefit of repair rounds |
| Iterative repair | Repair up to 3 times | Evaluate the upper limit of multi-round repair |
| Different model sizes | Qwen3-8B / Qwen3-14B / DeepSeek-R1-Distill-Qwen-14B | Evaluate the relationship between model capability and the verification framework |
| Remove UNSAT detection | Force trajectory generation | Verify the contribution of the ability to refuse to safety |
| Remove symbolic fallback | LLM-only repair | Verify the contribution of the relative completeness component to failure recovery |
| Remove planner final verification | Verify formulas only, not trajectories | Show that the execution closed loop is not optional |

The core of the ablation experiments is not to "prove that the components are effective", but to find out which components contribute most to the safety and executability indicators that AAAI reviewers care about most.

---

## 10. Evaluation indicators

### 10.1 Specification generation indicators

| Indicators | Definition |
|------|------|
| Syntax validity | Whether the LTL/STL formula is accepted by the parser |
| Entity grounding accuracy | Whether entities in the instruction are correctly mapped to map entities |
| Semantic F1 | Field-level precision / recall / F1 of the generated TaskIR against the gold TaskIR |
| Semantic match | Whether the generated specification is equivalent or approximately equivalent to the gold TaskIR / gold formula |
| UNSAT detection accuracy | Whether unsatisfiable tasks are correctly identified |
| Clarification accuracy | Whether ambiguous tasks trigger `NEED_CLARIFICATION` |
| False executable rate | Proportion of unsatisfiable or ambiguous tasks that are incorrectly executed |

### 10.2 Planning execution indicators

| Indicators | Definition |
|------|------|
| ESS | Proportion of tasks that simultaneously satisfy semantics, trajectory feasibility, LTL, STL, and safety constraints |
| FSR | Proportion of results returned as executable that are actually unsafe |
| Mean STL robustness | Average robustness of the final trajectory with respect to the STL specification |
| Worst-case STL robustness | Distribution of the minimum robustness per trajectory |
| Minimum safety margin | Minimum obstacle distance along the trajectory |
| Optimality gap | $(J(\tau)-J^\star)/J^\star$ |
| Path length / flight time | Trajectory cost and flight time |

### 10.3 Repair efficiency indicators

| Indicators | Definition |
|------|------|
| Repair success rate | Rate of successful repair after a verification failure |
| Fail-to-pass conversion | Proportion of initially failing samples that pass after repair |
| Average repair rounds | Mean number of repair rounds |
| Fallback contribution | Proportion of cases where LLM repair fails but symbolic fallback succeeds |
| Runtime overhead | Extra time introduced by the repair mechanism |
| Token overhead | Additional tokens consumed by repair prompts and diagnostics |

### 10.4 Indicator calculation details

The main experiment needs to implement the following indicators directly in code, to avoid manual collation during the paper writing stage:

**Semantic F1**

Flatten TaskIR into a set of field-level constraints $\mathcal{C}$, such as `reach(A)`, `avoid(zone_B)`, `time_window(A,30)`. Let the prediction set be $\hat{\mathcal{C}}$ and the gold standard set be $\mathcal{C}^\star$:

$$P = \frac{|\hat{\mathcal{C}}\cap \mathcal{C}^\star|}{|\hat{\mathcal{C}}|}, \quad R = \frac{|\hat{\mathcal{C}}\cap \mathcal{C}^\star|}{|\mathcal{C}^\star|}, \quad F1 = \frac{2PR}{P+R}.$$
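A direct implementation of the Semantic F1 formula, assuming each flattened constraint is represented as a string such as `reach(A)`. Returning zeros when a set is empty or there is no overlap is a convention chosen here to avoid division by zero; the task book does not specify it.

```python
def semantic_f1(predicted, gold):
    """Field-level precision, recall, and F1 between two constraint sets."""
    pred, ref = set(predicted), set(gold)
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    # With no overlap, P + R = 0, so F1 is defined as 0 by convention.
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1
```

For example, a prediction that gets `reach(A)` and `avoid(zone_B)` right but predicts `time_window(A,45)` instead of `time_window(A,30)` scores two matches out of three on both sides.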

$$\mathrm{SVR} = \frac{\#\{\tau: \mathrm{collision} \lor \mathrm{nofly} \lor \mathrm{altitude\_violation} \lor \rho(\tau,\varphi_{\mathrm{STL}})\le 0\}} {\#\{\text{returned trajectories}\}}.$$

$$\mathrm{Gap}(\tau)=\frac{J(\tau)-J^\star}{\max(J^\star,\epsilon)}.$$

$$\mathrm{FailToPass} = \frac{\#\{\text{initial fail, final pass}\}} {\#\{\text{initial fail}\}}, \quad \mathrm{FallbackContribution} = \frac{\#\{\text{LLM repair fail, symbolic fallback pass}\}} {\#\{\text{final pass}\}}.$$
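The three formulas above translate into short functions over per-run records. The record field names (`collision`, `nofly`, `altitude_violation`, `robustness`, `initial_pass`, `final_pass`, `llm_repair_failed`, `fallback_pass`) are hypothetical logging keys assumed for this sketch, as is the choice to return 0 when a denominator is empty.

```python
def safety_violation_rate(trajectories):
    """SVR: share of returned trajectories with any violation or robustness <= 0."""
    if not trajectories:
        return 0.0
    unsafe = sum(
        1 for t in trajectories
        if t["collision"] or t["nofly"] or t["altitude_violation"]
        or t["robustness"] <= 0
    )
    return unsafe / len(trajectories)

def optimality_gap(cost, best_cost, eps=1e-9):
    """Gap(tau) with the denominator clamped by eps, as in the formula."""
    return (cost - best_cost) / max(best_cost, eps)

def repair_metrics(runs):
    """Return (FailToPass, FallbackContribution) over a list of run records."""
    initial_fail = [r for r in runs if not r["initial_pass"]]
    final_pass = [r for r in runs if r["final_pass"]]
    converted = [r for r in initial_fail if r["final_pass"]]
    by_fallback = [r for r in final_pass
                   if r["llm_repair_failed"] and r["fallback_pass"]]
    fail_to_pass = len(converted) / len(initial_fail) if initial_fail else 0.0
    fallback = len(by_fallback) / len(final_pass) if final_pass else 0.0
    return fail_to_pass, fallback
```

Computing these from logged records rather than by hand keeps the reported numbers reproducible from the raw experiment output.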