Scaling Multi-Agent RL for Underwater Acoustic Tracking

GPU-accelerated simulation and Transformer-based MARL for cooperative underwater tracking.

6 min read


We are excited to present our ICRA 2026 paper on scaling Multi-Agent Reinforcement Learning (MARL) for cooperative underwater acoustic tracking. The core challenge: high-fidelity simulators like Gazebo’s LRAUV are too slow for MARL training, yet essential for closing the sim-to-real gap. We solve this with a two-part pipeline — a GPU-accelerated JAX environment (JaxLrauv) for fast training, and a Python interface to Gazebo (PyLrauv) for high-fidelity evaluation — combined with a Transformer-based MAPPO variant (TransfMAPPO) that learns fleet-size invariant policies via curriculum learning.

📝 paper (ICRA 2026)

🚀 jax implementation (JaxLrauv)


TL;DR #

Training cooperative fleets of autonomous underwater vehicles (AUVs) with MARL is bottlenecked by simulation speed. Gazebo’s LRAUV simulator runs at only ~10× real-time for a single robot, and slows down further with every vehicle added — far too slow for the billions of environment steps MARL training requires. Our contributions:

  1. JaxLrauv: A GPU-accelerated JAX environment that achieves up to 30,000× speedup over Gazebo while preserving its dynamics. Train a tracking policy in 10 minutes instead of weeks.
  2. PyLrauv: A Python/ROS2 interface to the high-fidelity LRAUV Gazebo simulator enabling seamless evaluation and real-robot deployment.
  3. TransfMAPPO: A Transformer-based MAPPO that learns policies invariant to fleet size and number of targets, enabling curriculum learning across progressively harder scenarios.
  4. Final policies track 5 simultaneous fast-moving targets with only 5 vehicles, maintaining tracking errors below 5 m — while state-of-the-art methods require up to 12 vehicles to track 4 targets.

⚡️ speedup #

JaxLrauv provides massive speedups over the Gazebo simulator across all agent–target configurations:

| Config | PyLrauv (SPS) | JaxLrauv 1 Env | JaxLrauv 128 Envs | JaxLrauv 1024 Envs |
|--------|---------------|----------------|-------------------|--------------------|
| 1A, 1T | 2.7 | 477× | 25,352× | 30,229× |
| 2A, 2T | 1.0 | 1,084× | 20,017× | 21,867× |
| 3A, 3T | 0.4 | 2,439× | 21,139× | 22,663× |
| 4A, 4T | 0.3 | 3,571× | 18,125× | 19,070× |
| 5A, 5T | 0.2 | 4,751× | 17,607× | 18,142× |

These gains are the difference between training taking months vs. minutes.
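The speedups come from JIT-compiling the whole step function and vectorizing it across environments with `jax.vmap`. The sketch below shows that pattern with toy dynamics; `reset`, `step`, and all shapes are illustrative stand-ins, not JaxLrauv’s actual API:

```python
import jax
import jax.numpy as jnp

# Toy stand-ins for an env API: the state is a 2D position and the
# "dynamics" a unit move along the commanded heading. Not JaxLrauv's
# real interface -- just the vmap/scan pattern that produces the speedup.
def reset(rng):
    return jax.random.uniform(rng, (2,), minval=-100.0, maxval=100.0)

def step(state, heading):
    return state + jnp.array([jnp.cos(heading), jnp.sin(heading)])

@jax.jit
def batched_rollout(rng, num_envs=1024, num_steps=128):
    rng, reset_rng = jax.random.split(rng)
    states = jax.vmap(reset)(jax.random.split(reset_rng, num_envs))

    def one_step(states, step_rng):
        headings = jax.random.uniform(step_rng, (num_envs,), maxval=2 * jnp.pi)
        states = jax.vmap(step)(states, headings)
        return states, states

    _, traj = jax.lax.scan(one_step, states, jax.random.split(rng, num_steps))
    return traj

print(batched_rollout(jax.random.PRNGKey(0)).shape)  # (128, 1024, 2)
```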


🌊 the pipeline #

PyLrauv #

PyLrauv wraps the official Gazebo LRAUV C++ library with a Python/ROS2 interface, exposing a Gym-like API for controlling any number of LRAUV vehicles and targets in the high-fidelity simulator. Since real LRAUV vehicles use the same ROS2 stack, the same controllers work for real-world deployment.

The UTrackingEnv manages acoustic communication (via the Gazebo Acoustic Comms Plugin), range measurements, tracking model updates, observation construction, and reward calculation. A step lasts 30 seconds of simulated time — agents listen for range signals at the start and broadcast their positions and observations at the end.
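As a rough illustration, a control loop over the Gym-like API might look like the following. The constructor arguments, action format, and attribute names here are assumptions for the sketch, not PyLrauv’s documented interface:

```python
import numpy as np
from pylrauv import UTrackingEnv  # assumed import path

env = UTrackingEnv(n_agents=3, n_targets=3)  # assumed constructor signature
obs = env.reset()

for _ in range(100):  # 100 steps = ~50 minutes of simulated mission time
    # One command per vehicle; the (speed, rudder angle) format is an assumption.
    actions = {agent: np.array([1.0, 0.1]) for agent in env.agents}
    # Each step spans 30 s: agents listen for range signals at the start
    # and broadcast positions and observations at the end.
    obs, rewards, dones, infos = env.step(actions)
```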

JaxLrauv #

Gazebo models LRAUV dynamics at the millisecond level, but missions span hours. JaxLrauv focuses on what matters for tracking: the position change over a 30-second step. Given the current position $\mathbf{p}_t$, heading $\psi_t$, speed $v$, and rudder angle $\gamma$, we predict the new position as:

$$\mathbf{p}_{t+1} = \mathbf{p}_t + v\,\delta_t \begin{bmatrix} \cos(\psi_t + \delta_\psi) & \sin(\psi_t + \delta_\psi) \end{bmatrix}$$

where $\delta_\psi = \theta(\gamma)$ is approximated by a linear model fit to Gazebo trajectories ($R^2 = 0.99$, MAE < 0.015 rad). All remaining aspects — sensor noise, communication dropout, partial observability, and collision handling — are modeled from Gazebo data.
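In code, this transition model is a few lines of JAX. The fit coefficients below are placeholders; the paper fits $\theta(\gamma)$ to Gazebo trajectories:

```python
import jax.numpy as jnp

# Placeholder coefficients for the linear heading-change model
# theta(gamma) ~= A * gamma + B; the real values are fit to Gazebo
# trajectories (R^2 = 0.99, MAE < 0.015 rad).
A, B = 0.9, 0.0
DT = 30.0  # seconds of simulated time per step

def step_position(p, psi, v, gamma):
    """Advance 2D position p given heading psi, speed v, rudder angle gamma."""
    d_psi = A * gamma + B                                   # heading change
    psi = psi + d_psi
    p = p + v * DT * jnp.array([jnp.cos(psi), jnp.sin(psi)])
    return p, psi
```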

The entire environment runs in JAX, including a vectorized Particle Filter that updates all particles across targets, agents, and parallel environments in a single GPU pass. This makes end-to-end GPU training possible.
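A minimal sketch of what such a vectorized update can look like: a range-only weight update, vmapped over environments and targets. Shapes and the Gaussian range-noise model are illustrative assumptions, and resampling and motion updates are omitted:

```python
import jax
import jax.numpy as jnp

# Range-only particle-filter weight update, vectorized with nested vmap
# over (environments, targets).
def update_weights(particles, agent_pos, measured_range, sigma=1.0):
    """particles: (N, 2); agent_pos: (2,); measured_range: scalar."""
    predicted = jnp.linalg.norm(particles - agent_pos, axis=-1)   # (N,)
    log_w = -0.5 * ((predicted - measured_range) / sigma) ** 2    # Gaussian likelihood
    return jax.nn.softmax(log_w)                                  # normalized weights

# One GPU pass over 1024 envs x 5 targets x 1000 particles:
batched_update = jax.vmap(jax.vmap(update_weights))   # inner: targets, outer: envs

particles = jnp.zeros((1024, 5, 1000, 2))
agent_pos = jnp.zeros((1024, 5, 2))
ranges = jnp.ones((1024, 5))
weights = batched_update(particles, agent_pos, ranges)  # (1024, 5, 1000)
```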

Training pipeline: GPU-accelerated JaxLrauv for fast training, high-fidelity PyLrauv for evaluation.

🤖 TransfMAPPO #

Standard MAPPO with RNNs fails to generalize across different fleet sizes. TransfMAPPO addresses this by treating the multi-agent coordination problem as learning a latent coordination graph via self-attention.

Each agent (and the centralized critic) is implemented as a Transformer network. The $n$ agents and targets are represented as a set of entity vectors — each encoding relative 3D distance, velocity, and a one-hot role identifier (self / teammate / target). These vectors become the graph vertices; self-attention learns which entities to attend to and how.

This gives two key properties:

  • Permutation invariance: the Transformer doesn’t care about the order of agents/targets.
  • Size invariance: adding a new agent just means adding a new vertex — no architecture changes needed.

The TransformerAgent (actor) operates on local observations only, enabling decentralized execution. The TransformerCritic uses the full global state (true positions, velocities, orientations) and is used only during training, following the centralized-training, decentralized-execution (CTDE) paradigm. As in our prior work, applying LayerNorm throughout the network is crucial for stability.
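A minimal flax sketch of the idea follows. Layer sizes, the readout, and the feature layout are assumptions for illustration, not the paper’s exact architecture:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

class EntityActor(nn.Module):
    """Entity-set actor in the spirit of TransfMAPPO (illustrative sizes)."""
    hidden: int = 64
    num_heads: int = 4
    num_actions: int = 5

    @nn.compact
    def __call__(self, entities):  # entities: (num_entities, features)
        x = nn.LayerNorm()(nn.Dense(self.hidden)(entities))
        # Self-attention over the entity set: permutation- and size-invariant.
        x = x + nn.MultiHeadDotProductAttention(num_heads=self.num_heads)(x, x)
        x = nn.LayerNorm()(x)
        # Read out the "self" entity (assumed to sit at index 0) for the action.
        return nn.Dense(self.num_actions)(x[0])

# Per-entity features: relative 3D distance (3) + velocity (3) + one-hot role (3).
entities = jnp.zeros((6, 9))   # adding an entity adds a row -- no architecture change
model = EntityActor()
params = model.init(jax.random.PRNGKey(0), entities)
logits = model.apply(params, entities)  # (num_actions,)
```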

TransfMAPPO: both the actors and the centralized critic are Transformer networks processing entity sets.

📈 curriculum learning #

Fleet-size invariance enables a curriculum learning (CL) pipeline that progressively trains policies on increasingly hard scenarios:

  1. Single-agent pre-training ($10^{10}$ steps, ~1.5 days on an H100): Train one agent to track a fast target (moving at 0.6× the agent’s speed) with a tracking reward. We progressively extend episode length from 128 → 256 → 512 → 1024 steps until the policy tracks for 10,000+ steps (3+ days of real time) without losing the target.

  2. Multi-agent fine-tuning (up to $2 \times 10^9$ steps): Initialize all agents from the single-agent checkpoint (identical weights, one per vehicle). Reset the critic once, since the credit-assignment structure changes. Fine-tune on two branches (sketched after this list):

    • N agents, N targets (follow reward): progressively increase to 5 agents/targets.
    • N agents, 1 very fast target (tracking reward): progressively increase to 3 agents.
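A schematic of the two-stage schedule. The `train` function is a placeholder for the MAPPO update loop and its argument names are assumptions; only the NvN “follow” branch is shown:

```python
# Placeholder `train` stands in for the MAPPO update loop; the fast-target
# branch would be an analogous loop up to 3 agents.
def run_curriculum(train, params):
    # Stage 1: single-agent pre-training with progressively longer episodes.
    for episode_len in (128, 256, 512, 1024):
        params = train(params, n_agents=1, n_targets=1,
                       episode_len=episode_len, target_speed=0.6)
    # Stage 2: multi-agent fine-tuning from the single-agent checkpoint;
    # the critic is reset once at the start (new credit-assignment structure).
    for n in (2, 3, 4, 5):
        params = train(params, n_agents=n, n_targets=n,
                       reward="follow", reset_critic=(n == 2))
    return params
```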
Curriculum learning: from single-agent tracking to coordinated multi-target following.

The benefits of CL vs. training from scratch are stark:

Multi-robot tracking (3v1): curriculum vs. training from scratch.

Multi-target following (NvN): vanilla MAPPO struggles, the curriculum succeeds.

🔬 results in Gazebo #

Final policies trained in JaxLrauv are evaluated directly in the high-fidelity Gazebo simulator over 50 episodes. Performance transfers remarkably well:

| Configuration | Avg. Tracking Error (JaxLrauv) | Avg. Tracking Error (Gazebo) | P(lose target), Gazebo |
|---------------|--------------------------------|------------------------------|------------------------|
| 1A, 1T (slow, 10 min training) | 5.12 m | 5.89 m | 0% |
| 1A, 1T (fast) | 17.40 m | 20.33 m | 7.1% |
| 3A, 1T (very fast) | 2.65 m | 3.03 m | 0% |
| 3A, 3T (moderate) | 4.85 m | 5.29 m | 5% |
| 5A, 5T (moderate) | 3.80 m | 4.25 m | 5.3% |

Key takeaways:

  • A model trained for just 10 minutes achieves reliable single-target tracking.
  • 3 agents tracking a very fast target achieves 3 m average error — far better than a single agent.
  • 5 agents following 5 targets lose a target with only ~5% probability.
  • JaxLrauv and Gazebo results are closely matched, confirming sim-to-sim transfer.

emergent coordination #

The agents develop interpretable coordination strategies entirely from reward signals:

3 agents tracking a very fast target: agents wait for each other after a target direction change to preserve communication, then move together.

5 agents following 5 targets: agents make circles to yield right-of-way, and dynamically reassign targets when a handover is beneficial.

Multi-robot evaluation across conditions (target speed, motion unpredictability, ocean current perturbations):

Robustness increases monotonically with the number of agents. Three agents is a sweet spot: negligible collision probability with near-optimal tracking.

related projects #

  • JaxMARL — GPU-accelerated MARL environments in JAX
  • PureJaxRL — pure JAX RL implementations
  • PQN — our prior work on parallelised Q-learning
  • Mava — JAX-based MARL framework

citation #

```bibtex
@inproceedings{Gallici2025jaxlrauv,
  title={Scaling Multi-Agent Reinforcement Learning for Underwater Acoustic Tracking via Autonomous Vehicles},
  author={Matteo Gallici and Ivan Masmitja and Mario Martín},
  booktitle={IEEE International Conference on Robotics and Automation (ICRA)},
  year={2026},
}
```