Multi-agent reinforcement learning and graph attention network: an end-to-end solution for UAV cluster conflict resolution

In the previous article, we surveyed the algorithm landscape for UAV conflict resolution. There, reinforcement learning (especially MARL) was labeled the “most realistic option” for swarms of 50+ drones. This article follows that route: it starts from the foundations of single-agent RL, moves to the core challenges of multi-agent scenarios, analyzes mainstream algorithms such as MADDPG, QMIX, COMA, and MAPPO, and then focuses on how GAT (Graph Attention Network) gives MARL scalable topology awareness, leading to an end-to-end conflict resolution strategy.

## 1. From single agent to multi-agent: why is MARL so difficult?

### 1.1 Single-agent RL review

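Let’s start with the familiar single-agent case. A single-agent MDP is described by the quadruple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r)$ (state space, action space, transition dynamics, reward function), together with a discount factor $\gamma \in [0, 1)$:

- State value function: $V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$
- Action value function: $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$
- Optimal strategy: $\pi^*(s) = \arg\max_a Q^*(s, a)$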

The core assumption of single-agent RL: the environment is stationary. No matter how many episodes you train, the dynamics of the environment remain unchanged.
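To make the three definitions concrete, here is a minimal value-iteration sketch on a hypothetical two-state MDP. The transition table `P`, the rewards, and `GAMMA` are made-up illustrative numbers, not values from any UAV scenario; the point is only to show how $V$, $Q$, and the greedy policy relate when the dynamics are fixed.

```python
# Tiny hypothetical MDP: 2 states, 2 actions (all numbers are illustrative).
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9


def q_from_v(V, s, a):
    """Q(s, a) = sum over s' of P(s'|s,a) * (r + gamma * V(s'))."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def value_iteration(tol=1e-10, max_iters=10_000):
    """Repeat V(s) <- max_a Q(s, a) until the largest change is below tol."""
    V = {s: 0.0 for s in P}
    for _ in range(max_iters):
        new_V = {s: max(q_from_v(V, s, a) for a in P[s]) for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            break
    return V


V_star = value_iteration()
# Greedy policy with respect to the converged values
pi_star = {s: max(P[s], key=lambda a: q_from_v(V_star, s, a)) for s in P}
print(V_star, pi_star)
```

Because `P` never changes between sweeps, the iteration converges. This is exactly the property that breaks once other learning agents are part of the environment.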

### 1.2 Three essential difficulties of multi-agent learning

Multi-agent scenarios break this assumption and bring about three fundamental difficulties:

**① Environmental non-stationarity**

While agent $i$ is learning its policy $\pi_i$, the policies of the other agents are changing too. From agent $i$'s point of view, the transition dynamics therefore depend on what the others currently do:

$$
P(s' \mid s, a_i, \pi_1, \dots, \pi_n) \neq P(s' \mid s, a_i, \pi_1', \dots, \pi_n') \quad \text{whenever some } \pi_j \neq \pi_j'
$$

**② Credit assignment**

In cooperative tasks the team often receives a single shared reward:

$$
r_t = f(\mathbf{s}_t, \mathbf{a}_t, \mathbf{s}_{t+1})
$$

For example: multiple UAVs cooperate to avoid an obstacle. How much did each agent contribute? If only a few agents are rewarded, the others stop learning.

**③ Exponential explosion of the joint action space**

With $n$ UAVs, each with $|\mathcal{A}|$ action options, the joint action space $|\mathcal{A}|^n$ grows exponentially with $n$. The coverage of $\epsilon$-greedy exploration in the joint space approaches zero.

### 1.3 MARL algorithm classification

In response to these difficulties, the academic community has developed three main routes:

| Route | Representative algorithm | Core idea | Representative paper |
|------|----------|----------|-----------|
| **Independent Learning (IL)** | IQL, independent DQN | Each agent learns on its own and ignores the influence of the others | Tan, 1993 |
| **Centralized training + decentralized execution (CTDE)** | MADDPG, COMA, VDN, QMIX, MAPPO | Use global information during training and local observations during execution | Lowe et al., 2017; Foerster et al., 2018 |
| **Fully decentralized** | Networked-agent actor-critic | Purely local strategy, no centralized training | Zhang et al., 2018 |

> **CTDE is the current mainstream paradigm for UAV conflict resolution**: it uses global information during training to improve learning efficiency, while keeping real-time decision-making under limited communication during execution.

---

## 2. CTDE framework: train with a God’s-eye view, execute on local observations

### 2.1 Design philosophy of the centralized critic

The core insight of CTDE is: **the training phase and the execution phase can have different information available**.
```
┌──────────────────────────────────────────────────────────────────────┐
│ Centralized Training                                                 │
│   Critic(s₁,...,sₙ, a₁,...,aₙ) → Q(s,a)                              │
│   ✅ Can access the global state and all agents' actions              │
│   ✅ Environment is "stationary" given the global state-action pair   │
│                                                                      │
│ Decentralized Execution                                              │
│   πᵢ(oᵢ → aᵢ)                                                        │
│   ✅ Depends only on the local observation oᵢ                         │
│   ✅ Still runs when communication fails                              │
└──────────────────────────────────────────────────────────────────────┘
```

### 2.2 MADDPG: The pioneer of CTDE in continuous action space

**MADDPG (Multi-Agent DDPG)** was proposed by OpenAI in 2017 and is a milestone in multi-agent deep reinforcement learning for continuous action spaces.

**Core formula:** Each agent $i$ maintains an Actor-Critic structure:

$$
\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{\mathbf{s}, \mathbf{a} \sim \mathcal{D}}\left[ \nabla_{\theta_i} \pi_i(o_i) \, \nabla_{a_i} Q_i^\pi(\mathbf{s}, a_1, \dots, a_n) \Big|_{a_i = \pi_i(o_i)} \right]
$$

Key difference: the inputs to $Q_i^\pi$ are the global state $\mathbf{s}$ and the joint actions $\mathbf{a}$ of all agents, not local observations.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# ============================================================
# MADDPG core implementation (for the UAV conflict resolution scenario)
# ============================================================

class ReplayBuffer:
    """Shared experience replay buffer (experience of all agents is stored together)."""

    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, actions, reward, next_state, done):
        self.buffer.append((state, actions, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (np.array(states), np.array(actions), np.array(rewards),
                np.array(next_states), np.array(dones))

    def __len__(self):
        return len(self.buffer)


class Actor(nn.Module):
    """Actor network: local observation -> action (decentralized execution)."""

    def __init__(self, obs_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),  # continuous action output (velocity change)
            nn.Tanh(),                          # bound actions to [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)


class Critic(nn.Module):
    """Critic network: global state + joint action -> Q value (centralized training)."""

    def __init__(self, total_obs_dim, total_action_dim, hidden_dim=64):
        super().__init__()
        # Input: global state concatenated with the actions of all agents.
        # total_action_dim already equals n_agents * action_dim.
        input_dim = total_obs_dim + total_action_dim
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # single Q value
        )

    def forward(self, states, all_actions):
        """
        states:      (batch, total_obs_dim) global state
        all_actions: (batch, n_agents * action_dim) actions of all agents
        """
        x = torch.cat([states, all_actions], dim=1)
        return self.net(x)


class MADDPGAgent:
    """MADDPG agent."""

    def __init__(self, obs_dim, action_dim, n_agents, agent_id,
                 lr_actor=1e-3, lr_critic=1e-3, gamma=0.95, tau=0.01):
        self.agent_id = agent_id
        self.obs_dim = obs_dim
        self.action_dim = action_dim
        self.n_agents = n_agents
        self.gamma = gamma
        self.tau = tau

        # Actor networks (local policy)
        self.actor = Actor(obs_dim, action_dim)
        self.actor_target = Actor(obs_dim, action_dim)
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=lr_actor)

        # Critic networks (global Q); all agents are assumed to share obs_dim/action_dim
        total_obs = obs_dim * n_agents
        total_act = action_dim * n_agents
        self.critic = Critic(total_obs, total_act)
        self.critic_target = Critic(total_obs, total_act)
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=lr_critic)

        # Initialize the target networks
        self.hard_update(self.actor_target, self.actor)
        self.hard_update(self.critic_target, self.critic)

    def hard_update(self, target, source):
        """Hard update (one-off copy)."""
        target.load_state_dict(source.state_dict())

    def soft_update(self, target, source):
        """Soft update (exponential moving average)."""
        for tp, sp in zip(target.parameters(), source.parameters()):
            tp.data.copy_(self.tau * sp.data + (1 - self.tau) * tp.data)

    def select_action(self, obs, noise=0.1):
        """Select an action (Gaussian noise is added for exploration)."""
        obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():
            action = self.actor(obs_t).squeeze(0).numpy()
        action = action + noise * np.random.randn(self.action_dim)
        return np.clip(action, -1, 1)

    def update(self, agents, batch):
        """One update step. `batch` is the tuple returned by ReplayBuffer.sample."""
        states, all_actions, rewards, next_states, dones = batch
        states = torch.as_tensor(states, dtype=torch.float32)
        all_actions = torch.as_tensor(all_actions, dtype=torch.float32)
        next_states = torch.as_tensor(next_states, dtype=torch.float32)
        # Assumes rewards has shape (batch, n_agents) and dones has shape (batch,)
        reward = torch.as_tensor(rewards, dtype=torch.float32)[:, self.agent_id].unsqueeze(1)
        done = torch.as_tensor(dones, dtype=torch.float32).view(-1, 1)

        # ----- Critic update -----
        with torch.no_grad():
            # Target actions come from every agent's target actor
            next_actions = [
                agent.actor_target(next_states[:, j * self.obs_dim:(j + 1) * self.obs_dim])
                for j, agent in enumerate(agents)
            ]
            next_q = self.critic_target(next_states, torch.cat(next_actions, dim=1))
            # TD target: y = r + gamma * (1 - done) * Q'(s', a')
            target_q = reward + self.gamma * (1.0 - done) * next_q

        current_q = self.critic(states, all_actions)
        critic_loss = nn.functional.mse_loss(current_q, target_q)

        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # ----- Actor update -----
        # This agent acts with its current policy; the other agents keep
        # the actions stored in the replay buffer.
        lo = self.agent_id * self.obs_dim
        own_action = self.actor(states[:, lo:lo + self.obs_dim])
        a_lo = self.agent_id * self.action_dim
        joint_actions = torch.cat(
            [all_actions[:, :a_lo], own_action, all_actions[:, a_lo + self.action_dim:]],
            dim=1,
        )
        actor_loss = -self.critic(states, joint_actions).mean()

        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # ----- Soft update of the target networks -----
        self.soft_update(self.actor_target, self.actor)
        self.soft_update(self.critic_target, self.critic)

        return actor_loss.item(), critic_loss.item()
```

### 2.3 QMIX: Value decomposition to solve credit assignment

MADDPG handles continuous action spaces, but its critic requires the global state $\mathbf{s}$. In real UAV scenarios, the central node may not be able to obtain the global state.

The core innovation of **QMIX** (Rashid et al., 2018) is: **decomposing the joint Q-value into per-agent Q-values**.

$$
Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) = g_\theta\big(Q_1(\tau_1, u_1; \boldsymbol{\phi}_1), \dots, Q_n(\tau_n, u_n; \boldsymbol{\phi}_n)\big)
$$

Where $\tau_i$ is the action-observation trajectory of agent $i$, and $g_\theta$ is a monotonic mixing network, satisfying:

$$
\frac{\partial Q_{tot}}{\partial Q_i} \geq 0, \quad \forall i \in \{1, \dots, n\}
$$

The monotonicity constraint guarantees a key property: during decentralized execution, each agent’s independent greedy maximization of $Q_i$ is equivalent to the global maximization of $Q_{tot}$.
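The equivalence claimed above can be checked by brute force on a small case. The sketch below is only an illustration: the per-agent Q-tables and the mixing weights `W1`, `B1`, `W2`, `B2` are hypothetical numbers, and the mixer is a hand-written two-layer network with strictly positive weights rather than a trained hypernetwork.

```python
import itertools
import math

# Hypothetical per-agent Q-tables: Q[i][u] for 3 agents with 3 actions each.
Q = [
    [0.2, 1.5, -0.3],
    [2.0, 0.1, 0.4],
    [-1.0, -0.5, 0.7],
]

# Strictly positive mixing weights (illustrative values); biases may be any sign.
W1 = [[0.5, 0.1], [0.3, 0.9], [0.2, 0.4]]  # shape: n_agents x hidden
B1 = [0.1, -0.2]
W2 = [0.7, 1.2]
B2 = 0.05


def elu(x):
    """Strictly increasing activation, so monotonicity is preserved."""
    return x if x > 0 else math.exp(x) - 1.0


def q_tot(qs):
    """Two-layer monotonic mixer: Q_tot is increasing in every Q_i."""
    hidden = [
        elu(sum(q * W1[i][k] for i, q in enumerate(qs)) + B1[k])
        for k in range(len(B1))
    ]
    return sum(h * w for h, w in zip(hidden, W2)) + B2


# Decentralized execution: each agent picks its own greedy action.
greedy_joint = tuple(max(range(len(q)), key=q.__getitem__) for q in Q)

# Centralized check: brute-force search over the whole joint action space.
best_joint = max(
    itertools.product(*(range(len(q)) for q in Q)),
    key=lambda joint: q_tot([Q[i][u] for i, u in enumerate(joint)]),
)
print(greedy_joint, best_joint)
```

Both searches return the same joint action, but the decentralized one evaluates 9 table entries while the brute-force one evaluates all 27 joint actions. That gap is what grows exponentially with the number of UAVs.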

```python
# Reuses the imports (random, torch, nn, optim) and ReplayBuffer from the MADDPG listing.

class QMIXMixingNetwork(nn.Module):
    """
    Monotonic mixing network: mixes the per-agent Q_i into the global Q_tot.
    Key constraint: all mixing weights are non-negative (guarantees monotonicity).
    """
    def __init__(self, n_agents, state_dim, embed_dim=64):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        # Hypernetworks map the global state to the mixing weights and biases
        self.hyper_w1 = nn.Sequential(
            nn.Linear(state_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, n_agents * embed_dim),  # (n_agents x embed_dim) weights
        )
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        self.hyper_b2 = nn.Linear(state_dim, 1)

    def forward(self, q_values, state):
        """
        q_values: (batch, n_agents) per-agent Q values
        state:    (batch, state_dim) global state (input to the hypernetworks)
        """
        batch_size = q_values.size(0)

        # Layer 1: W1 * Q + b1
        w1 = torch.abs(self.hyper_w1(state))                     # (batch, n_agents * embed_dim)
        w1 = w1.view(batch_size, self.n_agents, self.embed_dim)  # (batch, n_agents, embed_dim)
        b1 = self.hyper_b1(state).unsqueeze(1)                   # (batch, 1, embed_dim)

        q_hidden = torch.relu(torch.bmm(q_values.unsqueeze(1), w1) + b1)  # (batch, 1, embed_dim)

        # Layer 2: W2 * h + b2. The weights are generated from the state, not from
        # q_hidden; otherwise Q_tot would no longer be monotonic in each Q_i.
        w2 = torch.abs(self.hyper_w2(state)).unsqueeze(-1)       # (batch, embed_dim, 1)
        b2 = self.hyper_b2(state).unsqueeze(1)                   # (batch, 1, 1)

        q_tot = torch.bmm(q_hidden, w2) + b2                     # (batch, 1, 1)
        return q_tot.view(batch_size)                            # (batch,)


class QMIXAgent:
    """QMIX agent."""
    def __init__(self, obs_dim, action_dim, n_agents, agent_id):
        self.agent_id = agent_id
        self.action_dim = action_dim

        # Per-agent RNN (processes the action-observation history)
        self.rnn = nn.GRUCell(obs_dim + action_dim, obs_dim)
        # Q network
        self.q_net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, action_dim)
        )

        self.target_rnn = nn.GRUCell(obs_dim + action_dim, obs_dim)
        self.target_q_net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, action_dim)
        )
        self.hard_update()

    def parameters(self):
        """Trainable parameters (RNN + Q head) for the optimizer."""
        return list(self.rnn.parameters()) + list(self.q_net.parameters())

    def hard_update(self):
        self.target_rnn.load_state_dict(self.rnn.state_dict())
        self.target_q_net.load_state_dict(self.q_net.state_dict())

    def get_q_values(self, hidden, obs, last_action):
        """Given (hidden, obs, last_action), return Q(tau, .) and the new hidden state."""
        rnn_input = torch.cat([obs, last_action], dim=1)
        new_hidden = self.rnn(rnn_input, hidden)
        q_values = self.q_net(new_hidden)
        return q_values, new_hidden

    def select_action_epsilon_greedy(self, q_values, epsilon):
        """Epsilon-greedy policy; q_values is a 1-D tensor over actions."""
        if random.random() < epsilon:
            return random.randint(0, self.action_dim - 1)
        return q_values.argmax(dim=-1).item()


def train_qmix(env):
    """QMIX training loop (sketch; the loss computation is left out).

    `env` is assumed to expose reset() -> (n_agents, obs_dim) observations and
    step(actions) -> (next_states, rewards, done).
    """
    n_agents, obs_dim, action_dim = 8, 12, 5
    n_episodes = 50000

    agents = [QMIXAgent(obs_dim=obs_dim, action_dim=action_dim, n_agents=n_agents, agent_id=i)
              for i in range(n_agents)]
    # The global state is taken to be the concatenation of all local observations
    mixer = QMIXMixingNetwork(n_agents, state_dim=n_agents * obs_dim)

    optimizers = [optim.Adam(agent.parameters(), lr=2e-4) for agent in agents]
    mixer_optimizer = optim.Adam(mixer.parameters(), lr=2e-4)

    replay = ReplayBuffer(capacity=100000)

    for ep in range(n_episodes):
        # Environment interaction
        states = env.reset()  # (n_agents, obs_dim)
        hidden = [torch.zeros(1, obs_dim) for _ in range(n_agents)]
        last_actions = [torch.zeros(1, action_dim) for _ in range(n_agents)]
        episode_reward = 0
        done = False

        while not done:
            actions = []
            for i, agent in enumerate(agents):
                obs_i = torch.as_tensor(states[i], dtype=torch.float32).unsqueeze(0)
                with torch.no_grad():
                    q_vals, hidden[i] = agent.get_q_values(hidden[i], obs_i, last_actions[i])
                a = agent.select_action_epsilon_greedy(q_vals.squeeze(0), epsilon=0.1)
                actions.append(a)
                # One-hot encode the chosen action as the next step's "last action"
                last_actions[i] = torch.zeros(1, action_dim)
                last_actions[i][0, a] = 1.0

            next_states, rewards, done = env.step(actions)
            replay.push(states, actions, rewards, next_states, done)
            states = next_states
            episode_reward += sum(rewards)

        # Learning
        if len(replay) > 1024:
            batch = replay.sample(32)
            # QMIX loss computation ...
            # Monotonic mixing + centralized training ...
```
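The training loop above leaves the loss step elided. As a rough orientation only, QMIX is trained end to end on a squared TD error of $Q_{tot}$, and the sketch below shows that arithmetic on plain Python lists. The function names `qmix_td_targets` and `qmix_td_loss` and the batch values are hypothetical; in a real implementation `q_tot` would come from the mixer applied to the chosen-action $Q_i$, and `next_q_tot_target` from the target mixer applied to each agent's maximal target $Q_i$.

```python
def qmix_td_targets(rewards, dones, next_q_tot_target, gamma=0.99):
    """y = r + gamma * (1 - done) * Q_tot_target(next step, greedy next actions)."""
    return [
        r + gamma * (0.0 if d else 1.0) * q_next
        for r, d, q_next in zip(rewards, dones, next_q_tot_target)
    ]


def qmix_td_loss(q_tot, targets):
    """Mean squared TD error over the batch."""
    return sum((q - y) ** 2 for q, y in zip(q_tot, targets)) / len(q_tot)


# Hypothetical batch of 3 transitions: team reward, episode-end flag, mixer outputs.
rewards = [1.0, 0.0, -1.0]
dones = [False, False, True]
q_tot = [2.0, 1.0, 0.5]
next_q_tot_target = [2.0, 1.0, 3.0]

targets = qmix_td_targets(rewards, dones, next_q_tot_target, gamma=0.5)
loss = qmix_td_loss(q_tot, targets)
print(targets, loss)
```

Note that the terminal transition ignores its bootstrap value entirely, which is why the third target equals the raw reward.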

### 2.4 MAPPO: The victory of policy gradient in highly parallel scenarios

MAPPO (Multi-Agent PPO) extends the PPO algorithm to multi-agent scenarios and has performed well in UAV cluster tasks in recent years (multiple top conference papers from 2022 to 2024).

Core advantage of PPO: the clipped surrogate objective approximates a trust-region constraint, which stabilizes training and avoids much of the hyperparameter sensitivity of the DDPG family.

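PPO-clip objective:

$$
L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta)\hat{A}_t,\ \text{clip}\big(\rho_t(\theta), 1-\epsilon, 1+\epsilon\big)\hat{A}_t\right)\right]
$$

Where $\rho_t(\theta) = \frac{\pi_\theta(a_t \mid o_t)}{\pi_{\theta_{old}}(a_t \mid o_t)}$ is the probability ratio and $\hat{A}_t$ is the advantage estimated with GAE (Generalized Advantage Estimation).

Typical configuration of MAPPO in UAV conflict resolution: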

| Parameter | Recommended value | Description |
|------|------|----------|
| Clip ratio $\epsilon$ | 0.2 | PPO default |
| Horizon | 128–256 | Number of rollout steps per epoch |
| PPO epochs | 2–4 | Number of repeated updates per batch |
| GAE $\lambda$ | 0.95 | Bias-variance trade-off of the advantage estimate |
| Hidden layer dimension | 64–128 | Sufficient for UAV scenarios |
| Normalization | Observation + reward normalization | Key! Large impact on multi-agent convergence |
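To show where the clip ratio and the GAE coefficient from the table enter the computation, here is a dependency-free sketch of both pieces. The function names and the example numbers are illustrative assumptions; a real MAPPO implementation would work on batched tensors and also handle episode boundaries, which this sketch ignores.

```python
import math


def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one rollout without episode ends.

    rewards[t] and values[t] cover t = 0..T-1; last_value bootstraps V(s_T).
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # discounted sum of residuals
        advantages[t] = running
        next_value = values[t]
    return advantages


def ppo_clip_objective(new_logp, old_logp, advantages, eps=0.2):
    """Mean clipped surrogate objective (a quantity to maximize)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Illustrative rollout of 3 steps for one agent
adv = gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], last_value=0.0, gamma=1.0, lam=1.0)
obj = ppo_clip_objective([math.log(1.5)] * 3, [0.0] * 3, adv, eps=0.2)
print(adv, obj)
```

With a positive advantage, a probability ratio of 1.5 is cut back to 1.2, so the policy gains nothing from moving further than the clip range allows. With a negative advantage the unclipped term is kept, which is the pessimistic side of the `min`.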

## 3. GAT: Let MARL learn “who to pay attention to”

### 3.1 Why does MARL need a graph structure?

In a UAV cluster, not all agents are equally important. Take conflict resolution as an example:

- A UAV about to collide with me → high attention
- UAVs far out of sensing range → can be ignored
- Approaching moving obstacles → requires dynamic attention

However, traditional MARL (such as MADDPG, QMIX) treats all neighbors equally: either fully connected ($O(n^2)$ communication), or a fixed topology (such as a ring or nearest neighbors).

The introduction of GAT solves two core problems:

- Adaptive neighbor weights: the attention mechanism learns which neighbors matter more for the current decision
- Scalability: the number of parameters does not grow with the number of drones, and dynamic topologies are supported

### 3.2 GAT core principles

At each layer, GAT performs neighbor aggregation on the features $\mathbf{h}_i$ of node $i$, with weights computed dynamically by the attention mechanism:

$$
\alpha_{ij} = \frac{\exp\left(\text{LeakyReLU}\left(\mathbf{a}^\top[\mathbf{W}\mathbf{h}_i \Vert \mathbf{W}\mathbf{h}_j]\right)\right)}
{\sum_{k \in \mathcal{N}_i} \exp\left(\text{LeakyReLU}\left(\mathbf{a}^\top[\mathbf{W}\mathbf{h}_i \Vert \mathbf{W}\mathbf{h}_k]\right)\right)}
$$

$$
\mathbf{h}_i' = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} \mathbf{W}\mathbf{h}_j\right)
$$
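The two formulas translate almost line by line into code. The following single-head layer is a minimal sketch on plain Python lists; the feature layout (distance, closing speed), the neighbor sets, and the values of `W` and `a` are hypothetical, and `tanh` stands in for the nonlinearity $\sigma$. A practical implementation would use a tensor library and multiple attention heads.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(W, h):
    """Matrix-vector product W h for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def gat_layer(h, neighbors, W, a, activation=math.tanh):
    """Single-head GAT layer.

    h: list of node feature vectors; neighbors[i]: indices in N_i (here including i);
    W: (out_dim x in_dim) weight matrix; a: attention vector of length 2 * out_dim.
    Returns the updated features and the attention weights per node.
    """
    Wh = [matvec(W, hi) for hi in h]
    new_h, alphas = [], []
    for i, nbrs in enumerate(neighbors):
        # e_ij = LeakyReLU(a^T [W h_i || W h_j]); list "+" is the concatenation
        scores = [leaky_relu(sum(x * y for x, y in zip(a, Wh[i] + Wh[j]))) for j in nbrs]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        alpha = [e / z for e in exps]
        # h_i' = sigma(sum_j alpha_ij W h_j)
        agg = [sum(al * Wh[j][d] for al, j in zip(alpha, nbrs)) for d in range(len(Wh[i]))]
        new_h.append([activation(x) for x in agg])
        alphas.append(alpha)
    return new_h, alphas


# Hypothetical example: 3 UAVs, features = [normalized distance, closing speed]
h = [[0.1, 0.9], [0.8, 0.1], [0.5, 0.5]]
neighbors = [[0, 1, 2], [0, 1], [0, 2]]  # neighbor sets can differ in size
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.5, -0.5, 1.0, 2.0]
new_h, alphas = gat_layer(h, neighbors, W, a)
print(alphas)
```

The same `W` and `a` are shared by every node, so adding a fourth UAV changes only `h` and `neighbors`, not the number of parameters.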

A safety layer projects the action proposed by the policy onto the safe action set:

$$
\pi_{safe}(s) = \text{Proj}_{\mathcal{A}_{safe}(s)}\, \pi(s)
$$

where $\mathcal{A}_{safe}(s)$ is the set of safe actions in state $s$ (such as the velocity space that satisfies the collision avoidance constraints). This is more reliable than penalizing collisions in the reward function: hard constraints take precedence over soft rewards.

---

## 6. Summary: Technical overview of GAT-MARL

From single-agent RL to multi-agent reinforcement learning, and on to graph attention enhancement, we have followed a **scalable end-to-end conflict resolution** route:

| Level | Technology | Problem solved |
|------|------|----------|
| **Learning Framework** | CTDE (centralized training + decentralized execution) | Environmental non-stationarity |
| **Algorithm** | MADDPG / MAPPO / QMIX | Credit assignment + continuous/discrete actions |
| **Topology Modeling** | GAT | Adaptive neighbor weights + scalability |
| **Safety Constraints** | Safety layer / hard constraints | Collision-avoidance guarantee (vs. soft rewards) |
| **Training Paradigm** | PPO (clipped objective, in the spirit of TRPO) | Training stability |

The most noteworthy future directions:

- **Foundation Model + UAV**: use a large language model for task-level instruction understanding and MARL for low-level control
- **Real flight verification**: Sim-to-Real transfer is still a core challenge
- **Communication-limited scenarios**: GAT robustness with no communication or with communication delay

---

**References:**

1. Lowe, R., et al. (2017). *Multi-agent actor-critic for mixed cooperative-competitive environments (MADDPG).* Conference on Neural Information Processing Systems (NeurIPS).
2. Foerster, J., et al. (2018). *Counterfactual multi-agent policy gradients (COMA).* AAAI Conference on Artificial Intelligence.
3. Rashid, T., et al. (2018). *QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning.* International Conference on Machine Learning (ICML).
4. Veličković, P., et al. (2018). *Graph attention networks.* International Conference on Learning Representations (ICLR).
5. Everett, M., et al. (2021). *Collision avoidance in dense traffic with deep reinforcement learning.* IEEE International Conference on Robotics and Automation (ICRA).
6. Hu, E. J., et al. (2021). *LoRA: Low-Rank Adaptation of Large Language Models.* International Conference on Learning Representations (ICLR).
7. Fan, T., et al. (2020). *Distributed Multi-Robot Collision Avoidance via Deep Reinforcement Learning for Navigation in Complex Scenarios.* The International Journal of Robotics Research (IJRR).
8. Mao, H., et al. (2020). *Learning Multi-Agent Communication with Double Attentional Deep Reinforcement Learning.* Autonomous Agents and Multi-Agent Systems (JAAMAS).
9. Yu, L., et al. (2025). *Hybrid Transformer Based Multi-Agent Reinforcement Learning for Multiple Unpiloted Aerial Vehicle Coordination in Air Corridors.* IEEE Transactions on Mobile Computing (TMC).
10. Zhu, Y., et al. (2025). *Multi-Task Multi-Agent Reinforcement Learning With Task-Entity Transformers and Value Decomposition Training.* IEEE Transactions on Automation Science and Engineering (TASE).
11. Jiang, C., et al. (2024). *Distributed Sampling-Based Model Predictive Control via Belief Propagation for Multi-Robot Formation Navigation.* IEEE Robotics and Automation Letters (RA-L).
12. Goeckner, A., et al. (2024). *Graph Neural Network-based Multi-Agent Reinforcement Learning for Resilient Distributed Coordination of Multi-Robot Systems.* IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).