Motivation

  • State-of-the-art approaches in MARL (MADDPG, VDN, QTran, QMix, QPlex, MAPPO, DICG) focus on the learning model.
  • Little attention is paid to the observations and states themselves: they are plain concatenations of features fed to NNs.
  • Where do these features come from?

Information Channels

  • Information usually comes from multiple sources: observed entities, different sensors, communication channels.
  • Agents can differentiate between information channels because they always occupy the same vector positions.

Entities Features

  • Often, the same subset of features describes each of the observed entities.
  • This is a graph structure!
  • Self-attention can be used to learn the edges of the latent graph (Li et al., 2021), as sketched below.
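
For intuition, here is a minimal sketch in plain NumPy (the entity features and the weight matrices \(W_q\), \(W_k\) are illustrative placeholders, not taken from the cited work): the row-normalized self-attention scores over the entity vertices can be read as soft edge weights of the latent graph.

```python
import numpy as np

def self_attention_adjacency(X, Wq, Wk):
    """Row-stochastic attention matrix over entities: entry (i, j) is the learned
    edge weight from entity i to entity j in the latent graph."""
    Q, K = X @ Wq, X @ Wk                            # (k, d) query / key projections
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (k, k) scaled pairwise scores
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)         # softmax over each row

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                          # 5 entities, 8 features each
A = self_attention_adjacency(X, rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))
print(A.shape, A.sum(axis=-1))                       # (5, 5), each row sums to 1
```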

Observation and State Matrices

  • We reformulate the entire MARL problem in terms of graphs.
  • The inputs to NNs are well-structured vertex features.
  • The goal of the NNs is to learn the latent coordination graph.

$$ \begin{eqnarray} \mathbf{O}_t^a = \begin{bmatrix} ent_1 \\ \vdots \\ ent_k \end{bmatrix}_t^a = \begin{bmatrix} f_{1,1} & \cdots & f_{1,z} \\ \vdots & \ddots & \vdots \\ f_{k,1} & \cdots & f_{k,z} \end{bmatrix} _t^a \end{eqnarray} $$

$$ \begin{eqnarray} \mathbf{S}_t = \begin{bmatrix} ent_1 \\ \vdots \\ ent_k \end{bmatrix}_t = \begin{bmatrix} f_{1,1} & \cdots & f_{1,z} \\ \vdots & \ddots & \vdots \\ f_{k,1} & \cdots & f_{k,z} \end{bmatrix}_t \end{eqnarray} $$
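
As a concrete toy example (a sketch; the feature names are hypothetical, not tied to a specific environment), \(\mathbf{O}_t^a\) is assembled by stacking one row of features per observed entity; \(\mathbf{S}_t\) is built the same way from the global entity list.

```python
import numpy as np

# Illustrative per-entity features (z = 4); any environment-specific set works.
FEATURES = ["x", "y", "vel_x", "vel_y"]

def entity_matrix(entities):
    """Stack one feature row per entity into a k x z matrix (O_t^a or S_t)."""
    return np.array([[e[f] for f in FEATURES] for e in entities], dtype=np.float32)

entities = [
    {"x": 0.1, "y": 0.3, "vel_x": 0.0, "vel_y": 0.5},   # entity 1
    {"x": 0.7, "y": 0.2, "vel_x": 0.1, "vel_y": 0.0},   # entity 2
]
O = entity_matrix(entities)                             # shape (k, z) = (2, 4)
```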

Graph Approach

| Advantages | Disadvantages |
| --- | --- |
| 1. Better represents coordination problems. | 1. Cannot differentiate entities a priori in observations (e.g., raw images), although we can for states. |
| 2. Allows the use of more appropriate NNs (GNNs, Transformers). | 2. Cannot directly include the last action and a one-hot encoding of the agent id in the observation. |
| 3. Makes the NN parameters invariant to the number of agents. | |
| 4. Makes transfer learning and curriculum learning easy to implement. | |

  • Some information is lost when dropping the concatenation:
    • Which of the vertices represent "me"?
    • Which of the vertices are "collaborative agents"?
  • A simple solution is to add the following flags to the vertex features (where \(A\) denotes the set of agents):

$$ \begin{eqnarray} f_{i,\texttt{IS_SELF}}^a = \begin{cases} 1, & \text{if } i = a\\ 0, & \text{otherwise.} \end{cases} \end{eqnarray} $$

$$ \begin{eqnarray} f_{i, \texttt{IS_AGENT}}^a = \begin{cases} 1, & \text{if } i \in A\\ 0, & \text{otherwise.} \end{cases} \end{eqnarray} $$
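
A minimal sketch of appending these two flags as extra columns of the vertex-feature matrix (NumPy; the dummy observation and the indices are illustrative):

```python
import numpy as np

def add_flags(entities, self_index, agent_indices):
    """Append IS_SELF and IS_AGENT columns to a k x z entity-feature matrix."""
    k = entities.shape[0]
    is_self = np.zeros((k, 1), dtype=entities.dtype)
    is_self[self_index] = 1.0                          # vertex i == a ("me")
    is_agent = np.zeros((k, 1), dtype=entities.dtype)
    is_agent[list(agent_indices)] = 1.0                # vertices i in A (collaborative agents)
    return np.hstack([entities, is_self, is_agent])    # shape (k, z + 2)

O = np.random.rand(4, 6).astype(np.float32)            # dummy observation: 4 entities, 6 features
O_flagged = add_flags(O, self_index=0, agent_indices=[0, 1])
```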

Model

Transformer Mixer

Transformer hypernetwork:

  • Agents' hidden states (\(W_1\)): q-values embedding.
  • Recurrent mechanism (\(b_1, W_2, b_2\)): projection onto \(Q_{tot}\).
  • State embedder: environment reasoning (see the sketch below).
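
The sketch below illustrates this hypernetwork idea in PyTorch under stated assumptions: a standard QMIX-style monotonic mixing with absolute weights, \(W_1\) read from the transformer outputs at the agents' hidden-state positions, and \(b_1\), \(W_2\), \(b_2\) projected from the recurrent token's output. Layer names and sizes are illustrative, not the exact TransfQMix implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerMixerSketch(nn.Module):
    """Illustrative transformer hypernetwork producing QMIX-style mixing parameters."""

    def __init__(self, feat_dim, emb_dim, n_agents, n_heads=4, n_layers=2):
        super().__init__()
        self.n_agents = n_agents
        self.state_embedder = nn.Linear(feat_dim, emb_dim)       # environment reasoning
        layer = nn.TransformerEncoderLayer(emb_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # projections of the recurrent-token output into b1, W2, b2
        self.b1 = nn.Linear(emb_dim, emb_dim)
        self.w2 = nn.Linear(emb_dim, emb_dim)
        self.b2 = nn.Linear(emb_dim, 1)

    def forward(self, q_values, state_entities, agent_hidden, recurrent_token):
        # q_values:        (batch, n_agents)        per-agent Q-values
        # state_entities:  (batch, k, feat_dim)     rows of S_t
        # agent_hidden:    (batch, n_agents, emb)   agents' hidden states
        # recurrent_token: (batch, 1, emb)          mixer's recurrent vertex
        s = self.state_embedder(state_entities)
        out = self.encoder(torch.cat([s, agent_hidden, recurrent_token], dim=1))
        w1 = out[:, -self.n_agents - 1:-1, :]        # W1 from the agents' positions
        h = out[:, -1, :]                            # recurrent-token output
        # monotonic mixing as in QMIX (absolute weights)
        hidden = F.elu(torch.bmm(q_values.unsqueeze(1), torch.abs(w1))
                       + self.b1(h).unsqueeze(1))
        q_tot = torch.bmm(hidden, torch.abs(self.w2(h)).unsqueeze(-1)) \
                + self.b2(h).unsqueeze(1)
        return q_tot.view(-1)                        # (batch,) joint Q_tot
```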

Transformer Agent

The architecture is similar to UPDeT, but:

  • Q-values are derived from the agent's hidden state, reinforcing the recurrent passing of the gradient.
  • A separate output layer is used for policy decoupling (see the sketch below).
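
A minimal, non-authoritative PyTorch sketch (layer names and the split between basic and per-entity actions are illustrative assumptions): the new hidden state is read from the transformer output and the q-values are computed from it, with a separate head handling entity-conditioned actions.

```python
import torch
import torch.nn as nn

class TransformerAgentSketch(nn.Module):
    """Illustrative transformer agent: Q-values are read from the new hidden state."""

    def __init__(self, feat_dim, emb_dim, n_basic_actions, n_heads=4, n_layers=2):
        super().__init__()
        self.obs_embedder = nn.Linear(feat_dim, emb_dim)
        layer = nn.TransformerEncoderLayer(emb_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.q_basic = nn.Linear(emb_dim, n_basic_actions)    # e.g. movement actions
        self.q_entity = nn.Linear(2 * emb_dim, 1)             # decoupled, per-entity actions

    def forward(self, obs_entities, hidden):
        # obs_entities: (batch, k, feat_dim)  rows of O_t^a
        # hidden:       (batch, 1, emb)       recurrent hidden state
        e = self.obs_embedder(obs_entities)
        out = self.encoder(torch.cat([e, hidden], dim=1))
        new_hidden = out[:, -1:, :]                           # recurrent gradient path
        q_basic = self.q_basic(new_hidden.squeeze(1))         # (batch, n_basic_actions)
        # one Q-value per observed entity, conditioned on the hidden state
        h_rep = new_hidden.expand(-1, e.size(1), -1)
        q_entity = self.q_entity(torch.cat([out[:, :-1, :], h_rep], dim=-1)).squeeze(-1)
        return torch.cat([q_basic, q_entity], dim=-1), new_hidden
```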

Experiments

Spread

  • Scenarios: Spread 3v3, Spread 4v4, Spread 5v5, Spread 6v6.

StarCraft 2

  • Maps: 5m_vs_6m, 8m_vs_9m, 27m_vs_30m, 6h_vs_8z, 5s10z, 3s5z_vs_3s6z, MMM2, corridor.

Transfer Learning

  • TransfQMix's parameters are invariant with respect to the number of entities.
  • This enables transfer learning, curriculum learning and zero-shot transfer.

Number of Model Parameters

| Model | Agent | Mixer |
| --- | --- | --- |
| TransfQMix | 50k | 50k |
| QMix | 27k | 18k |
| QPlex | 27k | 251k |
| O-CWQMix | 27k | 179k |

Spread 3v3

| Model | Agent | Mixer |
| --- | --- | --- |
| TransfQMix | 50k | 50k |
| QMix | 28k | 56k |
| QPlex | 28k | 597k |
| O-CWQMix | 28k | 301k |

Spread 6v6

| Model | Agent | Mixer |
| --- | --- | --- |
| TransfQMix | 50k | 50k |
| QMix | 49k | 283k |
| QPlex | 49k | 3184k |
| O-CWQMix | 49k | 1021k |

SC2 27m_vs_30m

Experiments

  • The learned policy is transferable between different teams of agents.
  • Learning can be sped up by transferring a learned policy.

| Model | 3v3 | 4v4 | 5v5 | 6v6 |
| --- | --- | --- | --- | --- |
| TransfQMix (3v3) | 0.98 | 0.88 | 0.8 | 0.75 |
| TransfQMix (4v4) | 0.96 | 0.93 | 0.9 | 0.86 |
| TransfQMix (5v5) | 0.88 | 0.85 | 0.82 | 0.82 |
| TransfQMix (6v6) | 0.91 | 0.88 | 0.85 | 0.84 |
| TransfQMix (CL) | 0.88 | 0.88 | 0.87 | 0.87 |
| State-of-the-art | 0.76 | 0.45 | 0.36 | 0.33 |

Spread: Zero-Shot Transfer (POL)

SC2: 8m_vs_9m to 5m_vs_6m

SC2: 5s10z to 3s5z_vs_3s6z