This blog provides a general overview of PQN. For a deeper theoretical analysis of PQN, check out this other blog post.
TL;DR #
We leverage environment vectorisation and network normalisation to train a modern version of Deep Q-Learning that is fast and stable. We call the resulting algorithm PQN (Parallelised Q-Network). The main features of PQN are:
- Simplicity: PQN is a very simple baseline, essentially an online Q-learner with vectorised environments and network normalisation.
- Speed: PQN runs without a replay buffer or target networks, yielding significant speed-ups without sacrificing sample efficiency.
- Stability: PQN uses network normalisation to stabilise training.
- Flexibility: PQN is easily compatible with RNNs, $Q(\lambda)$ returns, and multi-agent tasks.
Useful links
- JAX implementation
- PyTorch CleanRL implementation
- Colab Demo
- Paper
We're very excited to introduce Parallelised Q-Network (PQN), the result of an effort to bring Q-Learning into the world of pure-GPU training based on JAX!
— Matteo Gallici (@MatteoGallici) July 12, 2024
What's the issue? Pure-GPU training can accelerate RL by orders of magnitude. However, Q-Learning heavily relies on… pic.twitter.com/aBA0IPF0By
Abstract #
Q-learning has played a foundational role in the field of reinforcement learning (RL). However, TD algorithms with off-policy data, such as Q-learning, or nonlinear function approximation like deep neural networks, require several additional tricks to stabilise training, primarily a replay buffer and target networks. Unfortunately, the delayed updating of frozen network parameters in the target network harms sample efficiency, and similarly, the replay buffer introduces memory and implementation overheads. In this paper, we investigate whether it is possible to accelerate and simplify TD training while maintaining its stability. Our key theoretical result demonstrates for the first time that regularisation techniques such as LayerNorm can yield provably convergent TD algorithms without the need for a target network, even with off-policy data. Empirically, we find that online, parallelised sampling enabled by vectorised environments stabilises training without the need for a replay buffer. Motivated by these findings, we propose PQN, our simplified deep online Q-Learning algorithm. Surprisingly, this simple algorithm is competitive with more complex methods like Rainbow in Atari, R2D2 in Hanabi, QMix in Smax, and PPO-RNN in Craftax, and can be up to 50x faster than traditional DQN without sacrificing sample efficiency. In an era where PPO has become the go-to RL algorithm, PQN re-establishes Q-learning as a viable alternative.
Quick Stats #
With PQN and a single NVIDIA A40 (achieving similar performance to an RTX 3090), you can:
- Train agents for simple tasks like CartPole and Acrobot in a few seconds.
- Train thousands of seeds in parallel in a few minutes.
- Train a MinAtar agent in less than a minute, and complete 10 parallel seeds in less than 5 minutes.
- Train an Atari agent for 200M frames in one hour (with environments running on a single CPU using EnvPool, tested on an AMD EPYC 7513 32-core processor).
- Solve simple games like Pong in a few minutes and less than 10M timesteps.
- Train a Q-Learning agent in Craftax much faster than when using a replay buffer.
- Train a strong Q-Learning baseline with VDN in multi-agent tasks.
[Figures: CartPole | Atari | Craftax]
Performance #
Atari #
Currently, after around 4 hours of training and 400M environment frames, PQN achieves a median score similar to that reported in the original Rainbow paper in ALE, surpassing human performance in 40 of the 57 Atari games. While this is far from the latest SOTA in ALE, it can serve as a good starting point for faster ALE research.
[Figures: Median Score | Performance Profile | Training Speed]
Craftax #
When combined with an RNN, PQN offers a more sample-efficient baseline than PPO. As an off-policy algorithm, PQN could be an interesting starting point for population-based training in Craftax!

Multi-Agent (JaxMARL) #
When combined with Value Decomposition Networks (VDN), PQN is a strong baseline for multi-agent tasks; a minimal sketch of the value-decomposition idea follows the figures below.
[Figures: Smax | Overcooked]
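For intuition, here is a minimal sketch of the VDN idea as we would write it ourselves, not the JaxMARL or PQN implementation: the joint action value is decomposed as the sum of per-agent Q-values, so the usual PQN TD loss can be applied to the summed value. All names and shapes below are illustrative.

```python
import jax
import jax.numpy as jnp

def vdn_td_loss(q_values, next_q_values, actions, rewards, dones, gamma=0.99):
    """Illustrative VDN-style TD loss.

    q_values, next_q_values: (batch, n_agents, n_actions) per-agent Q-values.
    actions: (batch, n_agents) integer actions; rewards, dones: (batch,) team signals.
    """
    # Q_tot(s, a) = sum_i Q_i(o_i, a_i): gather each agent's chosen Q-value and sum.
    chosen = jnp.take_along_axis(q_values, actions[..., None], axis=-1).squeeze(-1)
    q_tot = chosen.sum(axis=-1)
    # Greedy bootstrap per agent, then sum (no target network, as in PQN).
    next_tot = next_q_values.max(axis=-1).sum(axis=-1)
    target = rewards + gamma * (1.0 - dones) * next_tot
    return jnp.mean((q_tot - jax.lax.stop_gradient(target)) ** 2)
```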
How does it work? #
The idea behind PQN is simple. DQN originally used a replay buffer to sample batches of experiences, because training in batches is crucial for deep learning. However, if we can acquire enough experiences directly through parallelised interaction with vectorised environments and use them as our batches, storing experiences in a replay buffer is no longer necessary. This is the idea behind algorithms like A2C and PPO, and it is particularly beneficial in modern pure-GPU training, where we can sample many experiences in parallel but have only limited GPU memory to store them. (Note that this is the opposite of the scenario in which DQN was originally developed, where sampling experiences from single environments on the CPU was expensive, but storing experiences in RAM or on disk was cheap.)

A2C and PPO, however, are on-policy algorithms. PQN demonstrates that, if vectorisation is done properly and learning is stabilised via network normalisation, the same regime can be applied to an off-policy algorithm such as Q-Learning with epsilon-greedy exploration. This effectively brings standard Q-Learning into the pure-GPU regime!
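To make this concrete, here is a minimal sketch of ours (not PQN's exact code) of epsilon-greedy action selection across a batch of parallel environments. Because every quantity already carries a `num_envs` batch dimension, the transitions produced by one vectorised environment step can be fed straight into the TD update, with no replay buffer in between.

```python
import jax
import jax.numpy as jnp

def epsilon_greedy(rng, q_values, epsilon):
    """Pick one action per parallel environment; q_values has shape (num_envs, num_actions)."""
    rng_explore, rng_random = jax.random.split(rng)
    num_envs, num_actions = q_values.shape
    random_actions = jax.random.randint(rng_random, (num_envs,), 0, num_actions)
    explore = jax.random.uniform(rng_explore, (num_envs,)) < epsilon
    return jnp.where(explore, random_actions, q_values.argmax(axis=-1))

# One vectorised env step with these actions yields obs, actions, rewards, dones and
# next_obs, each with leading dimension num_envs: already a training batch.
```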

To train an off-policy algorithm in this vectorised, online manner, learning must be both fast and stable. This is where network normalisation comes into play. Check out our other blog post, where we delve into the theory behind the role of network normalisation in TD learning and advocate the use of Layer Normalisation, showing how it can effectively stabilise training without the need for target networks.
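As an illustration, a PQN-style Q-network can be as simple as an MLP with LayerNorm applied at each hidden layer; the depth and width below are placeholders, not the exact configuration from the paper.

```python
import flax.linen as nn

class QNetwork(nn.Module):
    num_actions: int
    hidden_size: int = 128

    @nn.compact
    def __call__(self, x):
        for _ in range(2):
            x = nn.Dense(self.hidden_size)(x)
            # Normalising each hidden layer is what lets PQN drop the target network.
            x = nn.LayerNorm()(x)
            x = nn.relu(x)
        return nn.Dense(self.num_actions)(x)  # one Q-value per action
```

Parameters are then initialised as usual with Flax, e.g. `QNetwork(num_actions=4).init(rng, dummy_obs)`.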
Code example #
Implementing and using PQN is simple. The self-contained script in the repository trains PQN fully on GPU in MinAtar, including testing, WANDB logging, and running multiple seeds in parallel on the same GPU. You can try this code in Colab now!
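To give a feel for the structure, below is a heavily condensed sketch of such a training loop written by us, not the repository script. It assumes gymnax's environment API (`make`, `reset(key, params)`, `step(key, state, action, params)`), uses a 1-step TD target instead of $Q(\lambda)$ returns for brevity, and all hyperparameters are illustrative.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn
import gymnax
import optax

NUM_ENVS, NUM_UPDATES, GAMMA, EPSILON, LR = 128, 10_000, 0.99, 0.1, 2.5e-4

class QNetwork(nn.Module):
    num_actions: int

    @nn.compact
    def __call__(self, x):
        x = x.reshape((x.shape[0], -1))  # flatten MinAtar's grid observation
        for _ in range(2):
            x = nn.relu(nn.LayerNorm()(nn.Dense(128)(x)))
        return nn.Dense(self.num_actions)(x)

env, env_params = gymnax.make("Breakout-MinAtar")
v_reset = jax.vmap(env.reset, in_axes=(0, None))
v_step = jax.vmap(env.step, in_axes=(0, 0, 0, None))

rng = jax.random.PRNGKey(0)
rng, rng_init, rng_reset = jax.random.split(rng, 3)
obs, env_state = v_reset(jax.random.split(rng_reset, NUM_ENVS), env_params)

network = QNetwork(num_actions=env.action_space(env_params).n)
params = network.init(rng_init, obs)
tx = optax.adam(LR)
opt_state = tx.init(params)

def td_loss(params, obs, actions, rewards, dones, next_obs):
    q = network.apply(params, obs)
    q_taken = jnp.take_along_axis(q, actions[:, None], axis=-1).squeeze(-1)
    target = rewards + GAMMA * (1.0 - dones) * network.apply(params, next_obs).max(axis=-1)
    return jnp.mean((q_taken - jax.lax.stop_gradient(target)) ** 2)

@jax.jit
def update(carry, rng):
    params, opt_state, obs, env_state = carry
    rng_eps, rng_rand, rng_step = jax.random.split(rng, 3)
    # Epsilon-greedy action for every parallel environment at once.
    q = network.apply(params, obs)
    actions = jnp.where(
        jax.random.uniform(rng_eps, (NUM_ENVS,)) < EPSILON,
        jax.random.randint(rng_rand, (NUM_ENVS,), 0, q.shape[-1]),
        q.argmax(axis=-1),
    )
    next_obs, env_state, rewards, dones, _ = v_step(
        jax.random.split(rng_step, NUM_ENVS), env_state, actions, env_params
    )
    # The freshly collected parallel transitions are the minibatch: no replay buffer.
    loss, grads = jax.value_and_grad(td_loss)(params, obs, actions, rewards, dones, next_obs)
    updates, opt_state = tx.update(grads, opt_state)
    params = optax.apply_updates(params, updates)
    return (params, opt_state, next_obs, env_state), loss

# The real implementation compiles the whole loop with jax.lax.scan; a plain
# Python loop over the jitted update keeps this sketch short.
carry = (params, opt_state, obs, env_state)
for step in range(NUM_UPDATES):
    rng, rng_update = jax.random.split(rng)
    carry, loss = update(carry, rng_update)
```

The actual PQN implementation additionally computes $Q(\lambda)$ targets over short rollouts rather than single environment steps.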
Related Projects #
The following repositories are related to pure-GPU RL training:
Citation #
@misc{gallici2024simplifyingdeeptemporaldifference,
      title={Simplifying Deep Temporal Difference Learning},
      author={Matteo Gallici and Mattie Fellows and Benjamin Ellis and Bartomeu Pou and Ivan Masmitja and Jakob Nicolaus Foerster and Mario Martin},
      year={2024},
      eprint={2407.04811},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2407.04811},
}