We are excited to introduce the extension of Parallelised Q-Network (PQN) to continuous action control. As always, PQN relies on parallelisation and network normalisation—no replay buffers, no target networks. Just basic Q-learning, now combined with a deterministic actor trained jointly with the critic, DDPG-style.
Exploration is achieved by perturbing the deterministic policy with additive Gaussian noise, while stability emerges naturally from Layer Normalisation and large-scale parallelisation. PQN for continuous control can learn many robotic and control tasks within MuJoCo Playground entirely on GPU, making the resulting policies directly transferable to real robots.
⚡️ quick stats #
We evaluate Actor–Critic PQN across three main domains of MuJoCo Playground, for a total of 50 tasks:
- DeepMind Control Suite – Classic continuous control benchmarks including CartPole, Walker, Cheetah, and Hopper.
- Locomotion Tasks – Control of quadrupeds and humanoids such as Unitree Go1, Boston Dynamics Spot, Google Barkour, Unitree H1/G1, Berkeley Humanoid, Booster T1, and Robotis OP3.
- Manipulation Tasks – Prehensile and non-prehensile manipulation using robotic arms and hands, such as the Franka Emika Panda and Robotiq gripper.
Our baselines include PPO (the original implementation from MuJoCo Playground) and FastTD3 (the state-of-the-art off-policy algorithm for continuous control, which uses target networks and replay buffers).
The following results are averaged across 5 seeds, with all algorithms trained for a maximum of 1e8 environment steps on H100 GPUs.
[Results plots: DM Suite (CartPole, Walker, Cheetah, Hopper, ...) | Locomotion (Unitree Go1, H1/G1, Booster T1, Berkeley Humanoid, ...) | Manipulation (Franka Emika Panda, Robotiq gripper, ...)]
🎭 actor–critic PQN #
PQN can be naturally extended to continuous action spaces by adopting an actor–critic framework, in which an additional deterministic actor network is trained jointly with the $Q$-network (critic).
This configuration closely resembles Deep Deterministic Policy Gradient (DDPG), while preserving PQN’s online, parallelised, and normalised structure.
PQN remains a strong and minimal baseline for continuous control, without requiring additional regularisation such as twin critics, delayed target networks, entropy bonuses, replay buffers, or other stabilisation tricks from TD3 and SAC.
critic update #
The critic $Q_\phi(s,a)$ is trained exactly as in the discrete version of PQN, by minimising the regularised temporal-difference loss over $\lambda$-returns. We use the standard Huber loss in this case.
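As a reference, a minimal JAX sketch of this critic update might look as follows; the rollout layout `(T, num_envs)`, the bootstrap convention, and the function names are illustrative assumptions, not the exact purejaxql implementation.

```python
import jax
import optax

def compute_lambda_returns(rewards, dones, bootstrap_q, gamma=0.99, lam=0.95):
    """Backward recursion for lambda-returns over a rollout of shape (T, num_envs).

    bootstrap_q[t] holds Q_phi(s_{t+1}, pi_theta(s_{t+1})), evaluated online (no target net).
    """
    def step(next_return, inputs):
        r, d, q_next = inputs
        # Blend the one-step bootstrap with the lambda-return of the next step.
        g = r + gamma * (1.0 - d) * ((1.0 - lam) * q_next + lam * next_return)
        return g, g

    _, returns = jax.lax.scan(
        step, bootstrap_q[-1], (rewards, dones, bootstrap_q), reverse=True
    )
    return returns

def critic_loss(critic_params, critic_apply, obs, actions, lambda_returns):
    """Huber regression of Q_phi(s_t, a_t) towards the lambda-return targets."""
    q_values = critic_apply(critic_params, obs, actions)  # (T, num_envs)
    return optax.huber_loss(q_values, lambda_returns).mean()
```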
actor update #
The actor $\pi_\theta(s)$ is trained to produce deterministic actions that maximise the critic’s estimated $Q$-values:
$$\mathcal{L}_\text{actor} = -\mathbb{E}_{s_t}\big[ Q_\phi(s_t, \pi_\theta(s_t)) \big].$$
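In JAX this objective reduces to a few lines; `actor_apply` and `critic_apply` are placeholder names for the network forward passes.

```python
import jax

def actor_loss(actor_params, critic_params, actor_apply, critic_apply, obs):
    """Maximise the critic's value at the actor's deterministic action."""
    actions = actor_apply(actor_params, obs)
    q_values = critic_apply(critic_params, obs, actions)
    # Minimising -Q ascends the deterministic policy gradient through the critic.
    return -q_values.mean()

# Only the actor parameters are differentiated; the critic acts as a fixed scorer here.
actor_grad_fn = jax.grad(actor_loss, argnums=0)
```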
exploration #
Exploration in continuous control is achieved by perturbing the deterministic policy with additive Gaussian noise rather than using $\epsilon$-greedy exploration. Specifically,
$$a_t = \pi_\theta(s_t) + \mathcal{N}(0, \sigma_t^2),$$
where $\mathcal{N}$ denotes a Gaussian distribution with decaying standard deviation $\sigma_t$, analogous to the decaying $\epsilon$ in discrete control.
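A possible implementation of this noisy action selection is sketched below; the linear schedule, its bounds, and the $[-1, 1]$ action range are assumptions for illustration.

```python
import jax
import jax.numpy as jnp

def select_action(actor_params, actor_apply, obs, rng, step,
                  sigma_start=0.2, sigma_end=0.05, decay_steps=1_000_000):
    """Deterministic action perturbed by Gaussian noise with a decaying scale."""
    frac = jnp.clip(step / decay_steps, 0.0, 1.0)
    sigma = sigma_start + frac * (sigma_end - sigma_start)  # linear decay of sigma_t
    action = actor_apply(actor_params, obs)
    noise = sigma * jax.random.normal(rng, action.shape)
    # Clip back into the assumed [-1, 1] action range after perturbation.
    return jnp.clip(action + noise, -1.0, 1.0)
```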
stability #
To further stabilise online training, we can regularise large deviations in the actor’s output. The actor loss becomes:
$$\mathcal{L}_{\text{actor}} = - Q_\phi(s_t, \pi_{\theta}(s_t)) + \beta \max\big(0, |\pi_{\theta}(s_t) - \tilde{a}_t| - \tau \big)^2,$$
where $\tilde{a}_t$ is the previous (noise-free) action, $\beta$ controls the penalty strength, and $\tau$ is the deviation threshold. Put simply, when the actor’s output deviates too much from the last action taken, a quadratic penalty discourages large jumps in action space.
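A sketch of this regularised actor loss; the default values of $\beta$ and $\tau$ and the sum over action dimensions are illustrative assumptions.

```python
import jax.numpy as jnp

def regularised_actor_loss(actor_params, critic_params, actor_apply, critic_apply,
                           obs, prev_actions, beta=1.0, tau=0.1):
    """Actor loss with a quadratic penalty on large deviations from the previous action."""
    actions = actor_apply(actor_params, obs)
    q_values = critic_apply(critic_params, obs, actions)
    # prev_actions is the previous noise-free action; only the deviation beyond tau is penalised.
    excess = jnp.maximum(0.0, jnp.abs(actions - prev_actions) - tau)
    penalty = jnp.square(excess).sum(axis=-1)
    return (-q_values + beta * penalty).mean()
```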
Despite this mild regularisation, the primary stabilising factors remain Layer Normalisation within networks and large-scale vectorisation, which ensures diverse and ergodic sampling.
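For completeness, a critic with Layer Normalisation after each hidden layer could be written in Flax as below; the hidden sizes and activation are assumptions.

```python
import flax.linen as nn
import jax.numpy as jnp

class LayerNormCritic(nn.Module):
    """Q-network with LayerNorm after every hidden layer, mirroring discrete PQN."""
    hidden_dims: tuple = (256, 256)

    @nn.compact
    def __call__(self, obs, action):
        x = jnp.concatenate([obs, action], axis=-1)
        for dim in self.hidden_dims:
            x = nn.Dense(dim)(x)
            x = nn.LayerNorm()(x)
            x = nn.relu(x)
        return nn.Dense(1)(x).squeeze(-1)
```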
🔬 ablations #
We perform ablation studies in the DeepMind Control Suite to isolate the contribution of each design component.
- LayerNorm and Actor Regularisation – Both independently improve stability; the best results are achieved when they are combined.
- Twin Critics – Using two critics to mitigate overestimation, as in TD3, actually harms performance in this setting. Averaging across a critic ensemble performs slightly better, but similar results can be achieved by simply widening the critic.
- Vectorisation Scale – Increasing the number of parallel environments (up to thousands) significantly enhances stability and convergence speed, mirroring the discrete PQN findings.
[Ablation plots: Stabilisation | Critic type | #Environments]
👾 purejaxql #
We now provide a unified repository where you can train PQN across a wide range of tasks — from Atari, Craftax, and multi-agent domains to the new robotic control tasks powered by Actor–Critic PQN.
simplified jax scripts #
Additionally, we also provide simplified JAX scripts at purejaxql/simplified to smooth the JAX learning curve. These scripts are designed to be more accessible and easier to understand for those who are new to JAX. They cover basic implementations of PQN for various environments, including MinAtar, Atari, and MuJoCo Playground.
complete results #
The following results are averaged across 5 seeds, with all algorithms trained for a maximum of 1e8 environment steps (5e7 for FastTD3 and PQN in DM Suite) on H100 GPUs. PQN performs very well on most DM Suite and Locomotion tasks, but struggles more in the manipulation domain. Some environments are not included in the plots because we were not able to replicate the original results, or because the environment consistently returned NaN rewards during training. These include:
- DM Suite: HumanoidStand, HumanoidWalk, HumanoidRun
- Locomotion: BarkourJoystick
- Manipulation: PandaOpenCabinet
[Complete results plots: DM Suite | Locomotion | Manipulation]
code example #
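Below is a minimal sketch of how the pieces above could be wired into a single update, assuming the hypothetical `actor_apply`/`critic_apply` forward functions and the loss functions sketched in the previous sections; it is illustrative and not the exact purejaxql training script.

```python
import jax
import optax

# Assumed to be defined as in the sketches above:
#   actor_apply(params, obs) -> actions
#   critic_apply(params, obs, actions) -> q_values
#   compute_lambda_returns, critic_loss, regularised_actor_loss

optimiser = optax.adam(3e-4)  # illustrative learning rate

def train_step(actor_params, critic_params, actor_opt_state, critic_opt_state, batch):
    """Update the critic on lambda-return targets, then the actor on the updated critic."""
    obs, actions, rewards, dones, next_obs, prev_actions = batch

    # Bootstrap values Q_phi(s_{t+1}, pi_theta(s_{t+1})), computed online (no target network).
    next_actions = actor_apply(actor_params, next_obs)
    bootstrap_q = critic_apply(critic_params, next_obs, next_actions)
    targets = jax.lax.stop_gradient(compute_lambda_returns(rewards, dones, bootstrap_q))

    # Critic step: Huber regression towards the lambda-returns.
    c_loss, c_grads = jax.value_and_grad(critic_loss)(
        critic_params, critic_apply, obs, actions, targets)
    c_updates, critic_opt_state = optimiser.update(c_grads, critic_opt_state)
    critic_params = optax.apply_updates(critic_params, c_updates)

    # Actor step against the freshly updated critic, with the deviation penalty.
    a_loss, a_grads = jax.value_and_grad(regularised_actor_loss)(
        actor_params, critic_params, actor_apply, critic_apply, obs, prev_actions)
    a_updates, actor_opt_state = optimiser.update(a_grads, actor_opt_state)
    actor_params = optax.apply_updates(actor_params, a_updates)

    return actor_params, critic_params, actor_opt_state, critic_opt_state, (c_loss, a_loss)
```

In practice this step is jitted and applied to rollouts collected from thousands of parallel environments, which is where the stability and speed discussed above come from.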
📦 Related Projects #
Citation #
@article{Gallici25simplifying,
  title={Simplifying Deep Temporal Difference Learning},
  author={Matteo Gallici and Mattie Fellows and Benjamin Ellis and Bartomeu Pou and Ivan Masmitja and Jakob Nicolaus Foerster and Mario Martin},
  year={2025},
  eprint={2407.04811},
  journal={The International Conference on Learning Representations (ICLR)},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.04811},
}
Notice: PQN for Continuous Control will be included in a new version of the above paper after review.