Motivation

  • State-of-the-art approaches in MARL (MADDPG, VDN, QTran, QMix, QPlex, MAPPO, DICG) focus on the learning model.
  • Little attention is paid to the observations and states themselves: they are plain concatenations of features fed to NNs.
  • Where do these features come from?

Information Channels

  • Information usually comes from multiple sources: observed entities, different sensors, communication channels.
  • Agents can differentiate between information channels because they always occupy the same vector positions.

Entities Features

  • Often, the same subset of features describes each of the observed entities.
  • This is a graph structure!
  • Self-attention can be used to learn the edges of the latent graph (Li et al., 2021), as sketched below.
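
For intuition, here is a minimal sketch in plain NumPy (the entity features and the weight matrices \(W_q\), \(W_k\) are illustrative placeholders, not taken from the cited work): the row-normalized self-attention scores over the entity vertices can be read as soft edge weights of the latent graph.

```python
import numpy as np

def self_attention_adjacency(X, Wq, Wk):
    """Row-stochastic attention matrix over entities: entry (i, j) is the learned
    edge weight from entity i to entity j in the latent graph."""
    Q, K = X @ Wq, X @ Wk                            # (k, d) query / key projections
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (k, k) scaled pairwise scores
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)         # softmax over each row

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                          # 5 entities, 8 features each
A = self_attention_adjacency(X, rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))
print(A.shape, A.sum(axis=-1))                       # (5, 5), each row sums to 1
```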

Observation and State Matrices

  • We reformulate the entire MARL problem in terms of graphs.
  • The inputs to NNs are well-structured vertex features.
  • The goal of the NNs is to learn the latent coordination graph.

$$ \begin{eqnarray} \mathbf{O}_t^a = \begin{bmatrix} ent_1 \\ \vdots \\ ent_k \end{bmatrix}_t^a = \begin{bmatrix} f_{1,1} & \cdots & f_{1,z} \\ \vdots & \ddots & \vdots \\ f_{k,1} & \cdots & f_{k,z} \end{bmatrix} _t^a \end{eqnarray} $$

$$ \begin{eqnarray} \mathbf{S}_t = \begin{bmatrix} ent_1 \\ \vdots \\ ent_k \end{bmatrix}_t = \begin{bmatrix} f_{1,1} & \cdots & f_{1,z} \\ \vdots & \ddots & \vdots \\ f_{k,1} & \cdots & f_{k,z} \end{bmatrix}_t \end{eqnarray} $$
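
As a concrete toy example (a sketch; the feature names are hypothetical, not tied to a specific environment), \(\mathbf{O}_t^a\) is assembled by stacking one row of features per observed entity; \(\mathbf{S}_t\) is built the same way from the global entity list.

```python
import numpy as np

# Illustrative per-entity features (z = 4); any environment-specific set works.
FEATURES = ["x", "y", "vel_x", "vel_y"]

def entity_matrix(entities):
    """Stack one feature row per entity into a k x z matrix (O_t^a or S_t)."""
    return np.array([[e[f] for f in FEATURES] for e in entities], dtype=np.float32)

entities = [
    {"x": 0.1, "y": 0.3, "vel_x": 0.0, "vel_y": 0.5},   # entity 1
    {"x": 0.7, "y": 0.2, "vel_x": 0.1, "vel_y": 0.0},   # entity 2
]
O = entity_matrix(entities)                             # shape (k, z) = (2, 4)
```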

Graph Approach

| Advantages | Disadvantages |
| --- | --- |
| 1. Better represents coordination problems. | 1. Cannot differentiate entities a priori in observations (e.g., raw images), although we can for states. |
| 2. Allows the use of more appropriate NNs (GNNs, Transformers). | 2. Cannot directly include the last action and a one-hot encoding of the agent id in the observation. |
| 3. Makes the NN parameters invariant to the number of agents. | |
| 4. Makes transfer learning and curriculum learning easy to implement. | |

  • Some information is lost when dropping the concatenation:
    • Which of the vertices represent "me"?
    • Which of the vertices are "collaborative agents"?
  • A simple solution is to add the following flags to the vertex features (where \(A\) denotes the set of agents):

$$ \begin{eqnarray} f_{i,\texttt{IS_SELF}}^a = \begin{cases} 1, & \text{if } i = a\\ 0, & \text{otherwise.} \end{cases} \end{eqnarray} $$

$$ \begin{eqnarray} f_{i, \texttt{IS_AGENT}}^a = \begin{cases} 1, & \text{if } i \in A\\ 0, & \text{otherwise.} \end{cases} \end{eqnarray} $$
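
A minimal sketch of appending these two flags as extra columns of the vertex-feature matrix (NumPy; the dummy observation and the indices are illustrative):

```python
import numpy as np

def add_flags(entities, self_index, agent_indices):
    """Append IS_SELF and IS_AGENT columns to a k x z entity-feature matrix."""
    k = entities.shape[0]
    is_self = np.zeros((k, 1), dtype=entities.dtype)
    is_self[self_index] = 1.0                          # vertex i == a ("me")
    is_agent = np.zeros((k, 1), dtype=entities.dtype)
    is_agent[list(agent_indices)] = 1.0                # vertices i in A (collaborative agents)
    return np.hstack([entities, is_self, is_agent])    # shape (k, z + 2)

O = np.random.rand(4, 6).astype(np.float32)            # dummy observation: 4 entities, 6 features
O_flagged = add_flags(O, self_index=0, agent_indices=[0, 1])
```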

Model

Transformer Mixer

Transformer hypernetwork:

  • Agents' hidden states (\(W_1\)): q-values embedding.
  • Recurrent mechanism (\(b_1, W_2, b_2\)): projection onto \(Q_{tot}\).
  • State embedder: environment reasoning (see the sketch below).
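
The sketch below illustrates this hypernetwork idea in PyTorch under stated assumptions: a standard QMIX-style monotonic mixing with absolute weights, \(W_1\) read from the transformer outputs at the agents' hidden-state positions, and \(b_1\), \(W_2\), \(b_2\) projected from the recurrent token's output. Layer names and sizes are illustrative, not the exact TransfQMix implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerMixerSketch(nn.Module):
    """Illustrative transformer hypernetwork producing QMIX-style mixing parameters."""

    def __init__(self, feat_dim, emb_dim, n_agents, n_heads=4, n_layers=2):
        super().__init__()
        self.n_agents = n_agents
        self.state_embedder = nn.Linear(feat_dim, emb_dim)       # environment reasoning
        layer = nn.TransformerEncoderLayer(emb_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # projections of the recurrent-token output into b1, W2, b2
        self.b1 = nn.Linear(emb_dim, emb_dim)
        self.w2 = nn.Linear(emb_dim, emb_dim)
        self.b2 = nn.Linear(emb_dim, 1)

    def forward(self, q_values, state_entities, agent_hidden, recurrent_token):
        # q_values:        (batch, n_agents)        per-agent Q-values
        # state_entities:  (batch, k, feat_dim)     rows of S_t
        # agent_hidden:    (batch, n_agents, emb)   agents' hidden states
        # recurrent_token: (batch, 1, emb)          mixer's recurrent vertex
        s = self.state_embedder(state_entities)
        out = self.encoder(torch.cat([s, agent_hidden, recurrent_token], dim=1))
        w1 = out[:, -self.n_agents - 1:-1, :]        # W1 from the agents' positions
        h = out[:, -1, :]                            # recurrent-token output
        # monotonic mixing as in QMIX (absolute weights)
        hidden = F.elu(torch.bmm(q_values.unsqueeze(1), torch.abs(w1))
                       + self.b1(h).unsqueeze(1))
        q_tot = torch.bmm(hidden, torch.abs(self.w2(h)).unsqueeze(-1)) \
                + self.b2(h).unsqueeze(1)
        return q_tot.view(-1)                        # (batch,) joint Q_tot
```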

Transformer Agent

The architecture is similar to UPDeT, but:

  • Q-values are derived from the agent's hidden state, reinforcing the recurrent passing of the gradient.
  • A separate output layer is used for policy decoupling (see the sketch below).
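
A minimal, non-authoritative PyTorch sketch (layer names and the split between basic and per-entity actions are illustrative assumptions): the new hidden state is read from the transformer output and the q-values are computed from it, with a separate head handling entity-conditioned actions.

```python
import torch
import torch.nn as nn

class TransformerAgentSketch(nn.Module):
    """Illustrative transformer agent: Q-values are read from the new hidden state."""

    def __init__(self, feat_dim, emb_dim, n_basic_actions, n_heads=4, n_layers=2):
        super().__init__()
        self.obs_embedder = nn.Linear(feat_dim, emb_dim)
        layer = nn.TransformerEncoderLayer(emb_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.q_basic = nn.Linear(emb_dim, n_basic_actions)    # e.g. movement actions
        self.q_entity = nn.Linear(2 * emb_dim, 1)             # decoupled, per-entity actions

    def forward(self, obs_entities, hidden):
        # obs_entities: (batch, k, feat_dim)  rows of O_t^a
        # hidden:       (batch, 1, emb)       recurrent hidden state
        e = self.obs_embedder(obs_entities)
        out = self.encoder(torch.cat([e, hidden], dim=1))
        new_hidden = out[:, -1:, :]                           # recurrent gradient path
        q_basic = self.q_basic(new_hidden.squeeze(1))         # (batch, n_basic_actions)
        # one Q-value per observed entity, conditioned on the hidden state
        h_rep = new_hidden.expand(-1, e.size(1), -1)
        q_entity = self.q_entity(torch.cat([out[:, :-1, :], h_rep], dim=-1)).squeeze(-1)
        return torch.cat([q_basic, q_entity], dim=-1), new_hidden
```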

Experiments

Spread

  • Scenarios: Spread 3v3, Spread 4v4, Spread 5v5, Spread 6v6.

StarCraft 2

  • Maps: 5m_vs_6m, 8m_vs_9m, 27m_vs_30m, 6h_vs_8z, 5s10z, 3s5z_vs_3s6z, MMM2, corridor.

Transfer Learning

  • TransfQMix's parameters are invariant with respect to the number of entities.
  • This enables transfer learning, curriculum learning and zero-shot transfer.

Number of Model Parameters

| Model | Agent | Mixer |
| --- | --- | --- |
| TransfQMix | 50k | 50k |
| QMix | 27k | 18k |
| QPlex | 27k | 251k |
| O-CWQMix | 27k | 179k |

Spread 3v3

| Model | Agent | Mixer |
| --- | --- | --- |
| TransfQMix | 50k | 50k |
| QMix | 28k | 56k |
| QPlex | 28k | 597k |
| O-CWQMix | 28k | 301k |

Spread 6v6

| Model | Agent | Mixer |
| --- | --- | --- |
| TransfQMix | 50k | 50k |
| QMix | 49k | 283k |
| QPlex | 49k | 3184k |
| O-CWQMix | 49k | 1021k |

SC2 27m_vs_30m

Experiments

  • The learned policy is transferable between different teams of agents.
  • Learning can be sped up by transferring a learned policy.

| Model | 3v3 | 4v4 | 5v5 | 6v6 |
| --- | --- | --- | --- | --- |
| TransfQMix (3v3) | 0.98 | 0.88 | 0.8 | 0.75 |
| TransfQMix (4v4) | 0.96 | 0.93 | 0.9 | 0.86 |
| TransfQMix (5v5) | 0.88 | 0.85 | 0.82 | 0.82 |
| TransfQMix (6v6) | 0.91 | 0.88 | 0.85 | 0.84 |
| TransfQMix (CL) | 0.88 | 0.88 | 0.87 | 0.87 |
| State-of-the-art | 0.76 | 0.45 | 0.36 | 0.33 |

Spread: Zero-Shot Transfer (POL)

SC2: 8m_vs_9m to 5m_vs_6m

SC2: 5s10z to 3s5z_vs_3s6z