Paper G1 Complete paper proposal v1: Verifiable LLM Agent for low-altitude traffic cloud brain

Core judgment: The first paper should not be written as “fine-tuning a large low-altitude traffic model”, but should be written as a verifiable, reproducible, and deployable low-altitude traffic LLM Agent method paper.
Recommended topic: CloudBrain-Agent: Tool-Augmented and Verification-Guided LLM Agents for Low-Altitude Traffic Operation.

1. Paper positioning and submission judgment

1.1 Positioning in one sentence

This paper studies LLM agents for the low-altitude traffic cloud brain: given a natural language task, the urban low-altitude airspace state, the UAV fleet state and safety constraints, how can an LLM agent generate safe, executable and interpretable low-altitude traffic operation decisions through a structured intermediate representation, tool invocation, formal verification and simulation feedback?

1.2 Recommended submission venues

Preferred: AAAI/IJCAI main track.
Alternatives: AAMAS, IROS/ICRA workshop, T-ITS follow-up expansion.

As of 2026-05-20, the specific edition still needs to be aligned with the next AAAI/IJCAI CFP. The paper is nonetheless designed in the style of the AAAI/IJCAI main conference, because AAAI emphasizes AI methods, application fields and reproducibility, and the IJCAI-ECAI AI and Robotics track clearly focuses on robot agents, generative AI, reasoning, structured modeling and action consequences [1] [2].

1.3 Why this paper should come before “Low-altitude traffic large model fine-tuning”

Directly fine-tuning a LowAltitudeGPT will encounter three review risks:

LoRA, QLoRA, and DPO are already mature training paradigms. Simply changing domain data is not enough to constitute the main contribution [3] [4] [5].
Low-altitude traffic is a safety-critical system, and reviewers are unlikely to accept an LLM that directly outputs control actions.
Real low-altitude traffic operation data is scarce. If the first paper focuses on “large model training”, it will be questioned on data scale, training budget, and model novelty.

Therefore the first paper should focus on Agent + Tools + Verifier + Simulator Feedback. The large model is not the final controller, but a layer for task understanding, tool orchestration, counterexample repair, and interpretation. This setting connects naturally with agent/tool-use/planning work such as ReAct, ToolLLM and LLM+P [6] [7] [8], and also with TrafficGPT’s discussion of the interaction between traffic foundation models and LLMs [9].

1.4 2026-05-22 Writing Calibration: Don’t write G1 as a TR-C story, but keep the traffic system evidence

The first submission target for G1 is AAAI/IJCAI, so the main contribution must be the AI agent method rather than a transportation-journal-style system narrative. A more accurate way to frame it is:

CloudBrain-Agent is an AI agent method evaluated in a safety-critical low-altitude traffic domain.

In other words, the traffic scenario provides real difficulty and safety constraints, but the paper still needs to answer questions from the agent field: whether tool calls are reliable, whether state is kept consistent, whether counterexample repair is effective, whether the model hallucinates, and whether the evaluation is reproducible.

At the same time, G1 cannot report only task_success and tool_call_accuracy. Because low-altitude traffic is a safety-critical domain, traffic system evidence must be preserved from the first version of the experiments:

| Level | AAAI/IJCAI main text focus | Follow-up T-ITS expansion focus |
|------|---------------------|---------------------|
| Agent capabilities | IR validity, tool-call accuracy, repair success, hallucination rate | human confirmation, operator workload, stateful consistency |
| Safety | safety violation, NFZ violation, battery violation | LoWC/NMAC proxy, risk ratio, weather/communication degradation |
| Efficiency | executable decision, latency, runtime | delay, extra distance, energy, throughput |
| Generalization | unseen city, stress, UNSAT/ambiguous tasks | high-density corridor, non-cooperative UAV, communication loss, real context city split |
| System implications | When verifier feedback is necessary | Which scenarios must be handed back from the LLM agent to a deterministic solver/human supervisor |

Therefore, the boundary conditions of G1 must be written clearly:

Do not claim real deployment;
Do not claim end-to-end automatic control;
Do not claim that the LLM replaces the scheduler/planner/validator;
Claim only that the LLM agent is responsible for task understanding, orchestration, repair and interpretation within the tool chain and verification feedback loop;
Write the transportation system conclusions only as “observable operational implications”, without inflating them into policy recommendations.

1.5 2026-05-23 Compilation: frozen list of submission versions

The first version of the G1 submission must freeze three claims to avoid turning into a low-altitude platform specification:

1. Domain-grounded tool-use benchmark: CloudBrain-Bench not only tests the JSON format, but also tests function selection, parameter grounding, state dependency, policy compliance and multi-round consistency in the low-altitude transportation chain.
2. Verifier-guided repair: Safety errors, unexecutable errors and ambiguous tasks in low-altitude traffic missions must be converted into structured repair signals through LTL/STL verifier, route planner and simulator feedback.
3. Local-deployable agent implementation: The main experiment must be reproducible on a local open source model, and the API model only serves as teacher or upper bound.

The first version must complete:

| Modules | Freeze Requirements |
|------|----------|
| LowAltitudeIR | Fixed schema, type checker, error codes and JSON examples |
| Tools | At least 6: airspace query, fleet status, assignment, route planner, LTL/STL verifier, scenario simulator / risk estimator |
| CloudBrain-Bench | dev/validation/test/stress split, covering SAT, UNSAT, ambiguous, resource-limited, stress scenarios |
| Baselines | Direct LLM, JSON-only, ReAct, LLM+P / planner-only, tool-use no verifier, CloudBrain full |
| Metrics | task success, tool-call accuracy, executable decision, safety violation, repair success, hallucination rate, latency/cost |
| Ablations | no IR, no verifier, no simulator, no repair, API teacher vs local model |
| Data layer | synthetic master data + OSM/FAA/OD/SUMO real context fields; do not present real data as a deployed system |
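To make the “fixed schema, type checker, error codes and JSON examples” requirement for LowAltitudeIR more concrete, here is a minimal sketch of what such a checker could look like. The field names follow the IR fields named in the metrics section (intent, entities, constraints, tool plan); the intent vocabulary, the error codes and the example payload are all hypothetical and are not the frozen schema.

```python
import json

# Hypothetical intent vocabulary and error codes, for illustration only.
ALLOWED_INTENTS = {"delivery", "inspection", "emergency_response", "return", "charging"}

REQUIRED_FIELDS = {
    "intent": str,
    "entities": dict,
    "constraints": list,
    "tool_plan": list,
}


def check_ir(ir: dict) -> list:
    """Return a list of error codes; an empty list means the IR is well-typed."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in ir:
            errors.append(f"E_MISSING_FIELD:{name}")
        elif not isinstance(ir[name], expected_type):
            errors.append(f"E_BAD_TYPE:{name}")
    if isinstance(ir.get("intent"), str) and ir["intent"] not in ALLOWED_INTENTS:
        errors.append("E_UNKNOWN_INTENT")
    # Every planned tool call must name a tool and carry a dict of arguments.
    tool_plan = ir.get("tool_plan")
    if isinstance(tool_plan, list):
        for i, call in enumerate(tool_plan):
            if not (isinstance(call, dict)
                    and isinstance(call.get("tool"), str)
                    and isinstance(call.get("args"), dict)):
                errors.append(f"E_BAD_TOOL_CALL:{i}")
    return errors


# A made-up JSON example of the kind the schema freeze would ship with.
example = json.loads("""
{
  "intent": "delivery",
  "entities": {"uav": "uav_03", "destination": "hospital_a"},
  "constraints": ["avoid_nfz", "deadline<=900s"],
  "tool_plan": [
    {"tool": "airspace_query", "args": {"region": "district_1"}},
    {"tool": "route_planner", "args": {"uav": "uav_03", "goal": "hospital_a"}}
  ]
}
""")
print(check_ir(example))
```

Returning error codes rather than raising on the first problem keeps the checker usable as a repair signal: the agent can be shown every schema defect at once.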

Content deferred from the first version:

Complete MCP productization;
Multi-agent collaboration as the main contribution;
Presenting the fine-tuned LowAltitudeGPT model as the main method;
Real UAV deployment or flight;
VLA/world model/embodied AGI proposition.

The purpose of this frozen list is to control the boundaries of the paper: G1 only demonstrates that a verifiable LLM agent is feasible in the safety-critical low-altitude traffic domain, and the subsequent G2/G3/G4 will deal with fine-tuning, multi-agent and embodied extensions respectively.

---

## 2. Draft abstract

Urban low-altitude traffic operations require real-time decision-making under dynamic tasks, limited airspace resources, UAV state constraints, and safety rules. Large language models can understand natural language and decompose complex tasks, but when used directly for UAV scheduling and path planning they are prone to hallucinations, unexecutable plans, and safety violations. This paper proposes CloudBrain-Agent, a tool-augmented, verification-guided LLM agent framework for the low-altitude traffic cloud brain. CloudBrain-Agent parses natural language tasks and system states into a typed LowAltitudeIR, invokes airspace query, UAV allocation, path planning, LTL/STL verification, scenario simulation and risk assessment tools, and iteratively repairs decisions using verifier counterexamples and simulation feedback. We build CloudBrain-Bench to cover emergency delivery, inspection, no-fly zone avoidance, corridor congestion, charging bottlenecks, multi-mode fallback and unsatisfiable tasks. The experiments will compare direct LLM, prompt-only ReAct, tool-use without verification, LLM+P, TrafficGPT-style orchestration and the full CloudBrain-Agent. The pre-registered expectation is that CloudBrain-Agent significantly outperforms prompt-only and tool-only baselines in task success, executable decision rate, safety violation rate, hallucination rate, and repair success, while maintaining acceptable local deployment latency.

3. Research questions and core hypotheses

3.1 Research questions

RQ1: Can the LLM agent reliably generate well-typed, tool-executable decision chains in low-altitude traffic missions?

RQ2: Can formal verification and simulation feedback significantly reduce unexecutable plans, safety violations, and hallucinations in LLM outputs?

RQ3: Compared with directly fine-tuning a vertical domain model, can the combination of general LLM + typed IR + MCP/tools + verifier form a reproducible, deployable, and scalable research system faster?

RQ4: Can a local open source model approach the performance of strong closed-source models when given teacher-API-generated data and rule feedback, and thereby support the subsequent LowAltitudeGPT paper?

3.2 Core hypotheses

H1: typed LowAltitudeIR can significantly improve structured output quality and tool-call accuracy.
H2: Verification-guided repair can significantly improve the executable decision rate and reduce the safety violation rate.
H3: Simulator feedback is most critical for generalization of unseen dangerous scenes.
H4: There is no need to train the vertical foundation model in the first stage; the general model + agent tool layer + verifier post-processing is enough to complete the G1 paper.
H5: After the local Qwen3 / DeepSeek-R1-Distill model is deployed through vLLM, it can be used as a reproducible main experimental model; API models such as GPT-5.2 serve as teachers and performance upper limits [10] [11] [12].

4. Paper contribution design

It is recommended that the final contributions of the paper be written as three items to avoid scattering them:

1. CloudBrain-Agent framework. A typed tool-use LLM agent is proposed for the low-altitude traffic cloud brain, which unifies natural language tasks, urban airspace status, UAV fleet status and safety constraints into LowAltitudeIR.
2. Verification-guided repair for low-altitude traffic operation. Failure feedback from LTL/STL verifiers, route planners, and simulators is transformed into structured counterexamples that drive the LLM to repair tool invocations, task constraints, and path/scheduling recommendations.
3. CloudBrain-Bench and evaluation protocol. A low-altitude traffic cloud brain benchmark is built, covering indicators such as tool-call accuracy, executable decision, safety violation, repair success, generalization, latency and human trust.
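The verification-guided repair contribution can be pictured as a bounded propose-check-feedback loop. The sketch below is only an illustration under stated assumptions: `Counterexample`, `repair_loop`, `nfz_checker` and `stub_agent` are hypothetical names, the stub stands in for the LLM, and the single checker stands in for the LTL/STL verifier, route planner and simulator.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Counterexample:
    """Structured repair signal; the fields are illustrative assumptions."""
    code: str    # e.g. "NFZ_VIOLATION"
    detail: str  # description passed back to the agent


# A checker returns None on success or a Counterexample on failure.
Checker = Callable[[dict], Optional[Counterexample]]


def repair_loop(propose: Callable[[list], dict], checkers: list, max_iters: int = 3):
    """Ask `propose` for a plan, run all checkers, and feed failures back.

    Returns (plan, attempts) on success, or (None, max_iters) if the
    iteration budget is exhausted.
    """
    feedback: list = []
    for attempt in range(1, max_iters + 1):
        plan = propose(feedback)
        failures = [cx for check in checkers if (cx := check(plan)) is not None]
        if not failures:
            return plan, attempt
        feedback = failures
    return None, max_iters


def nfz_checker(plan: dict) -> Optional[Counterexample]:
    # Toy verifier: flags any route that passes through the zone "nfz_1".
    if "nfz_1" in plan["waypoints"]:
        return Counterexample("NFZ_VIOLATION", "route crosses nfz_1")
    return None


def stub_agent(feedback: list) -> dict:
    # Stand-in for the LLM: reroutes only after seeing an NFZ counterexample.
    if any(cx.code == "NFZ_VIOLATION" for cx in feedback):
        return {"waypoints": ["depot", "corridor_b", "hospital_a"]}
    return {"waypoints": ["depot", "nfz_1", "hospital_a"]}


plan, attempts = repair_loop(stub_agent, [nfz_checker])
print(plan, attempts)
```

Bounding the loop with `max_iters` matters for the evaluation: it is what makes a repair success rate "within K iterations" well defined, and it gives a natural point at which to hand the task to a deterministic solver or human supervisor.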

It is not recommended to write the contribution as “We trained a large low-altitude traffic model”. Fine-tuning can be done as an experimental extension or as the next G2.

4.1 Paper positioning matrix after the second round of research

After online research, the best entry point for G1 is more clearly domain-grounded agent evaluation + safety verification, rather than a general LLM application. AgentBench shows that LLM agents need to be evaluated on reasoning and decision-making in interactive environments [34]; BFCL shows that function calling requires checking function selection, parameters, parallel calls and relevance detection [35]; τ-bench further emphasizes multi-round interaction, APIs, domain policy and the consistency metric pass^k [36]; ToolSandbox points out that state dependency, canonicalization and insufficient information are the key difficulties for tool-based agents [37].

The lesson for G1 from these works is that CloudBrain-Bench must evaluate not only “whether JSON is output”, but also the agent’s state updates, rule compliance, tool dependencies, failure repair and multi-round consistency in the low-altitude transportation chain.

| Existing direction | Representative work | Limitations | Differences in G1 |
|----------|----------|------|-----------|
| General agent benchmark | AgentBench, τ-bench, ToolSandbox [34] [36] [37] | Does not include low-altitude traffic safety constraints or the UAV tool chain | Domain tools, policy, verifier for UTM/UAV |
| Function calling benchmark | BFCL [35] | Focuses on the correctness of function calls, not on physical executability and safety | Tool calls must go through planner/verifier/simulator |
| LLM + traffic | TrafficGPT, ITS LLM survey [9] [13] [14] | Mostly focuses on ground traffic or traffic model interaction | Extension to low-altitude airspace, UAV fleet and formal safety |
| NL-to-LTL / robot task spec | Lang2LTL, LTLCodeGen, ConformalNL2LTL [21] [22] [23] | Mainly solves specification generation | Puts specification verification into the complete cloud brain decision-making closed loop |
| UTM/UAM simulation | NASA TCL4, CORUS-XUAM, AAM-Gym [38] [39] [40] | LLM agent tool orchestration is usually not studied | Supports CloudBrain-Bench with UTM/UAM concepts and scenarios |

5.1 LLM for transportation

TrafficGPT shows that an LLM can serve as the interaction and processing entry point for traffic foundation models, but also points out that traffic numerical data, simulation and model interaction cannot be handled by plain-text generation alone [9]. Recent ITS surveys further position LLMs as traffic semantic interfaces, decision aids, and tools for multi-source data understanding [13] [14]. UrbanGPT and UniST represent the urban spatio-temporal foundation model direction and are suitable for supporting urban state understanding, but they are not low-altitude UAV operation tool chains [15] [16].

### 5.2 LLM agents and tool use

ReAct interleaves reasoning traces and actions and is the basis of the agent loop in this paper [6]. Toolformer and ToolLLM show that LLMs can learn API/tool usage, but they do not address low-altitude traffic safety verification or mission executability [7] [17]. MCP and the OpenAI Agents SDK provide a more standardized way to connect tools, which helps turn the scheduler, planner, verifier and simulator into replaceable tools [18] [19].

After the second round of research, the related work should also cover agent evaluation: AgentBench is a multi-environment LLM-as-agent benchmark [34]; BFCL specifically evaluates function calling and relevance detection [35]; τ-bench uses multi-round user-agent-tool interaction and pass^k to evaluate reliability [36]; ToolSandbox emphasizes tool execution state, implicit dependencies and insufficient-information scenarios [37]. The G1 evaluation protocol should incorporate these ideas but change the environment to the low-altitude traffic cloud brain.

5.3 LLM planning and formal verification

LLM+P and PlanBench show that LLMs alone are not reliable planners and need to be combined with external planners, formal representations and evaluation protocols [8] [20]. Lang2LTL, LTLCodeGen and ConformalNL2LTL show that natural-language-to-temporal-logic translation is maturing, but they focus mainly on specification generation and do not fully cover the scheduling, routing, simulation and risk closed loop of the low-altitude traffic cloud brain [21] [22] [23]. Spot and RTAMT can be used as LTL and STL verification tools, respectively [24] [25].
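To illustrate what an STL-style check can add beyond a pass/fail verdict, the sketch below computes the quantitative robustness of a hypothetical requirement "always (separation >= d_min)" over a sampled trace. It is a hand-rolled stand-in written for this proposal, not the RTAMT or Spot API; the function names, the trace values and the 50 m threshold are all assumptions.

```python
def always_geq_robustness(signal, threshold):
    """Robustness of G (signal >= threshold) over a finite discrete trace.

    Positive: satisfied with that margin. Negative: violated by that margin.
    """
    if not signal:
        raise ValueError("empty trace")
    return min(value - threshold for value in signal)


def first_violation(signal, threshold):
    """Index of the first sample violating the bound, or None if none does."""
    for i, value in enumerate(signal):
        if value < threshold:
            return i
    return None


# Made-up pairwise separation samples in metres.
separation_m = [120.0, 95.0, 60.0, 42.0, 70.0]
D_MIN = 50.0  # hypothetical minimum separation in metres

rho = always_geq_robustness(separation_m, D_MIN)
print(rho, first_violation(separation_m, D_MIN))
```

The signed margin and the index of the first violating sample are the kind of information that could be packaged into a structured counterexample, rather than returning only "unsafe".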

5.4 UAV, UTM, and simulation data

FAA UTM defines low-altitude UAV traffic management as a collaborative ecosystem that supports flight planning, authorization, surveillance, and conflict management [26]. FAA UAS Facility Maps provide reference altitudes at which Part 107 operations in controlled airspace can be quickly authorized, and are suitable as a proxy for airspace rules [27]. OSM/Overpass, NYC TLC OD data, SUMO, AirSim and Flightmare can jointly support the synthetic-to-real benchmark [28] [29] [30] [31] [32].

To strengthen credibility on the low-altitude traffic side, G1 should further cite the NASA TCL4 Nevada flight tests: these tests include BVLOS, urban canyon, weather front, concert emergency response and CNS issue scenarios, and are suitable as a source for the scenario taxonomy and for the discussion of human-system information quality [38]. The European CORUS-XUAM project provides the U-space/UAM operational concept, U3/U4 service models, ATM-U-space coordination, vertiport guidance and human-in-the-loop evidence [39]. AAM-Gym can be used as a simulation reference, being an AI testbed for advanced air mobility, especially for corridor separation assurance [40].

6. Problem Formulation

6.1 System status

At each discrete decision time $t$, the low-altitude traffic cloud brain receives the system state, which consists of:

- UAV set: each UAV has position, power, load, speed, and mission status.
- Task set: delivery, inspection, emergency response, return, and charging.
- Airspace state: corridors, no-fly zones, altitude, weather, and capacity.
- City map: OSM road network, POIs, buildings, and functional areas.
- Safety and operational constraints: LTL/STL specifications, deadline, distance, and energy.
- History: past events, failure cases, human feedback and verifier feedback.

Given a natural language instruction, the goal is to generate an executable decision $\pi_t$, comprising the LowAltitudeIR $z_t$, the tool call sequence $a_{1:k}$, the scheduling/path/risk decision $y_t$, and an explanation.
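As a rough picture of how the state components listed above might be held in code, here is a minimal data-structure sketch. All class names, field names and units are assumptions made for illustration; they are not the LowAltitudeIR schema.

```python
from dataclasses import dataclass, field


@dataclass
class UAV:
    uav_id: str
    position: tuple      # (x, y, altitude); metres assumed
    battery: float       # remaining power as a fraction in [0, 1] (assumption)
    load_kg: float
    speed_mps: float
    mission_status: str  # e.g. "idle", "en_route", "charging"


@dataclass
class Task:
    task_id: str
    kind: str            # delivery, inspection, emergency response, return, charging
    deadline_s: float


@dataclass
class SystemState:
    t: int
    uavs: list
    tasks: list
    airspace: dict = field(default_factory=dict)     # corridors, no-fly zones, weather, capacity
    city_map: dict = field(default_factory=dict)     # OSM road network, POIs, buildings
    constraints: list = field(default_factory=list)  # LTL/STL formulas, deadlines, energy limits
    history: list = field(default_factory=list)      # events, failures, human/verifier feedback

    def idle_uavs(self, min_battery: float = 0.3) -> list:
        """UAVs that are idle and at or above a (hypothetical) battery reserve."""
        return [u for u in self.uavs
                if u.mission_status == "idle" and u.battery >= min_battery]


state = SystemState(
    t=0,
    uavs=[
        UAV("uav_01", (0.0, 0.0, 60.0), 0.8, 1.2, 12.0, "idle"),
        UAV("uav_02", (100.0, 50.0, 80.0), 0.2, 0.0, 0.0, "idle"),
        UAV("uav_03", (300.0, 20.0, 90.0), 0.9, 0.5, 15.0, "en_route"),
    ],
    tasks=[Task("task_1", "delivery", 900.0)],
)
print([u.uav_id for u in state.idle_uavs()])
```

A fleet-status tool could expose queries like `idle_uavs` so that the agent reads candidate UAVs from the state instead of naming them from memory.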

6.2 Safe Executable Targets

A decision is considered successful if and only if:

Schema validity: satisfies the LowAltitudeIR type constraint.
Tool executability: All tool call parameters are legal and return non-error results.
Planning feasibility: Scheduling and path planning are executable.
Temporal safety: LTL/STL specifications verified.
Simulation robustness: Does not trigger collisions, no-fly zone violations, or deadline violations in specified scenario seeds.
Human interpretability: Interpretation does not involve non-existent entities, tools or rules.
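The criteria above can be read as a conjunction of independent validators, where a single failing check makes the whole decision unsuccessful. A minimal sketch, assuming each validator is a Boolean function over a decision record: the record keys and the placeholder lambdas are hypothetical, and the interpretability check is omitted for brevity.

```python
# Each validator maps a decision record to a truthy/falsy value. The lambdas
# are placeholders standing in for the real schema checker, tool runner,
# planner, LTL/STL verifier and simulator.
VALIDATORS = {
    "schema": lambda d: isinstance(d.get("ir"), dict),
    "tool": lambda d: all(r.get("status") == "ok" for r in d.get("tool_results", [])),
    "plan": lambda d: bool(d.get("route")),
    "logic": lambda d: d.get("spec_satisfied", False),
    # A missing simulation result counts as a failure, not as a pass.
    "sim": lambda d: not d.get("sim_violations", ["unknown"]),
}


def evaluate(decision: dict) -> tuple:
    """Return (success, failed_checks); success is the AND of all validators."""
    failed = [name for name, check in VALIDATORS.items() if not check(decision)]
    return (not failed, failed)


decision = {
    "ir": {"intent": "delivery"},
    "tool_results": [{"tool": "route_planner", "status": "ok"}],
    "route": ["depot", "hospital_a"],
    "spec_satisfied": True,
    "sim_violations": ["deadline"],
}
print(evaluate(decision))
```

Reporting which validators failed, rather than only the final Boolean, is what allows a per-stage breakdown such as executable decision rate versus full task success.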

Formally:

$$
\text{Success}(\pi_t) = \mathbb{1}[ V_\text{schema}(z_t) \land V_\text{tool}(a_{1:k}) \land V_\text{plan}(y_t) \land V_\text{logic}(y_t) \land V_\text{sim}(y_t) ]
$$

$$
E_\text{route} = L_\text{route} \cdot q_{0.9}(e \mid v, h, p, w)
$$

$$
\text{IR-EM} = \frac{1}{N}\sum_i \mathbb{1}[z_i = z_i^*]
$$

**IR field F1**: Calculate precision, recall, and F1 separately for fields such as intent, entities, constraints, and tool plan.

### 12.2 Tool call indicator

**Tool-call accuracy**:

$$
\text{TCA} = \frac{\#\text{correct tool calls}}{\#\text{all tool calls}}
$$

$$
\text{TDS} = \frac{\#\text{tool chains satisfying all data dependencies}}{\#\text{tool chains}}
$$

TDS measures whether the agent first queries airspace/city status, then plans and verifies, rather than jumping directly to downstream tools.

### 12.3 Executability Indicators

**Executable decision rate**:

$$
\text{EDR} = \frac{\#\text{planner executable decisions}}{N}
$$

**Task success rate**:

$$
\text{TSR} = \frac{\#\text{fully verified and simulated successful tasks}}{N}
$$

### 12.4 Safety indicators

**Safety violation rate**:

$$
\text{SVR} = \frac{\#\text{safety violated tasks}}{N}
$$

Violation types include:

- no-fly zone intrusion;
- altitude violation;
- min separation violation;
- battery reserve violation;
- deadline violation;
- unsafe fallback;
- hallucinated permission.

For the extended low-altitude transportation version, additional transportation safety indicators are recommended:

| Indicators | Definition | Purpose |
|------|------|------|
| LoWC proxy | Ratio below well-clear separation at any time | Measure the risk of loss of separation |
| NMAC proxy | Number of times below the near-mid-air-collision threshold | Measure severe near-mid-air-collision risk |
| Risk ratio | Proportion of risk events relative to the rule-based safe baseline | Make different scenarios comparable |
| Safe-refusal precision | Proportion of refusals/requests for manual confirmation where execution would truly be unsafe | Prevent the agent from being overly conservative |

The AAAI/IJCAI main text may report only SVR and the violation type breakdown; the T-ITS extension should report the LoWC/NMAC proxies and risk ratio.

### 12.5 Hallucination Indicator

**Hallucination rate**:

$$
\text{HR} = \frac{\#\text{outputs containing nonexistent entity/tool/rule}}{N}
$$

### 12.6 Repair indicators

**Repair success rate**:

$$
\text{RSR} = \frac{\#\text{failed first attempts repaired within K iterations}}{\#\text{failed first attempts}}
$$

$$
\text{pass}^k = \frac{\#\text{tasks successful in all } k \text{ repeated runs}}{N}
$$
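The rate metrics defined in this section (TCA, EDR, TSR, SVR, HR, RSR and pass^k) are all simple ratios over per-task records. The following sketch shows one way they might be aggregated; the record keys and the toy data are hypothetical, and TDS is left out because it needs a tool dependency graph.

```python
def rate(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def summarize(records):
    """Aggregate per-task records into the rate metrics."""
    n_calls = sum(r["tool_calls"] for r in records)
    n_correct = sum(r["correct_tool_calls"] for r in records)
    failed_first = [r for r in records if not r["first_attempt_ok"]]
    return {
        "TCA": n_correct / n_calls if n_calls else 0.0,
        "EDR": rate(r["executable"] for r in records),
        "TSR": rate(r["success"] for r in records),
        "SVR": rate(r["safety_violated"] for r in records),
        "HR": rate(r["hallucinated"] for r in records),
        # RSR is conditioned on tasks whose first attempt failed.
        "RSR": rate(r["repaired"] for r in failed_first),
    }


def pass_hat_k(runs_per_task):
    """pass^k: fraction of tasks that succeed in all k repeated runs."""
    return rate(all(runs) for runs in runs_per_task)


# Toy records, made up for illustration.
records = [
    {"tool_calls": 4, "correct_tool_calls": 4, "first_attempt_ok": True, "repaired": False,
     "executable": True, "success": True, "safety_violated": False, "hallucinated": False},
    {"tool_calls": 5, "correct_tool_calls": 4, "first_attempt_ok": False, "repaired": True,
     "executable": True, "success": True, "safety_violated": False, "hallucinated": False},
    {"tool_calls": 3, "correct_tool_calls": 1, "first_attempt_ok": False, "repaired": False,
     "executable": False, "success": False, "safety_violated": True, "hallucinated": True},
    {"tool_calls": 4, "correct_tool_calls": 3, "first_attempt_ok": True, "repaired": False,
     "executable": True, "success": False, "safety_violated": False, "hallucinated": False},
]
metrics = summarize(records)
consistency = pass_hat_k([[True, True, True], [True, False, True],
                          [True, True, True], [False, False, False]])
print(metrics, consistency)
```

Note that pass^k requires success in every one of the k runs, so it can only stay equal or fall as k grows, unlike the more common pass@k.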