Distributed Distributional DDPG
The Distributed Distributional Deep Deterministic Policy Gradient (D4PG) algorithm is given as follows:
- Initialize the critic network parameter $\theta$ and the actor network parameter $\phi$
- Initialize the target critic network parameter $\theta'$ and the target actor network parameter $\phi'$ by copying from $\theta$ and $\phi$, respectively
- Initialize the replay buffer $\mathcal{D}$
- Launch L actors
- For each of N episodes, repeat the following steps
- For each time step in the episode, that is, for $t = 0, \ldots, T-1$:
- Randomly sample a minibatch of K transitions $(s_i, a_i, r_i, s'_i)$ from the replay buffer $\mathcal{D}$
- Compute the target value distribution of the critic, that is, $y_i = r_i + \gamma Z_{\theta'}\big(s'_i, \mu_{\phi'}(s'_i)\big)$ (a code sketch of this and the following update steps appears after the listing)
- Compute the loss of the critic network as the cross-entropy distance $d$ between the projected target distribution and the predicted distribution, $L(\theta) = \frac{1}{K}\sum_{i} d\big(\Phi\, y_i,\; Z_{\theta}(s_i, a_i)\big)$, where $\Phi$ denotes the projection onto the support of the value distribution, and calculate its gradient $\nabla_{\theta} L(\theta)$
- After computing the gradient, update the critic network parameter using gradient descent: $\theta \leftarrow \theta - \alpha \nabla_{\theta} L(\theta)$, where $\alpha$ is the critic learning rate
- Compute the gradient of the actor network objective: $\nabla_{\phi} J(\phi) = \frac{1}{K}\sum_{i} \nabla_{\phi}\,\mu_{\phi}(s_i)\, \mathbb{E}\big[\nabla_{a} Z_{\theta}(s_i, a)\big]\big|_{a=\mu_{\phi}(s_i)}$
- Update the actor network parameter by gradient ascent: $\phi \leftarrow \phi + \beta \nabla_{\phi} J(\phi)$, where $\beta$ is the actor learning rate
...
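To make the distributional critic update concrete, here is a minimal PyTorch-style sketch of how the target value distribution, the cross-entropy loss, and the gradient-descent step could be computed for one sampled minibatch. It assumes a categorical (C51-style) critic whose forward pass returns atom probabilities; the network objects, optimizer, batch tensors, and hyperparameters (`v_min`, `v_max`, `num_atoms`) are illustrative placeholders, not code from the algorithm above.

```python
import torch

def critic_update(critic, target_critic, target_actor, critic_opt, batch,
                  gamma=0.99, v_min=-10.0, v_max=10.0, num_atoms=51):
    """Sketch of the D4PG critic step on a minibatch of K transitions (assumed interfaces)."""
    states, actions, rewards, next_states, dones = batch        # tensors sampled from the replay buffer
    support = torch.linspace(v_min, v_max, num_atoms)           # fixed atoms of the value distribution
    delta_z = (v_max - v_min) / (num_atoms - 1)

    with torch.no_grad():
        # target distribution: y_i = r_i + gamma * Z_theta'(s'_i, mu_phi'(s'_i))
        next_actions = target_actor(next_states)
        next_probs = target_critic(next_states, next_actions)   # (K, num_atoms) atom probabilities
        tz = rewards.unsqueeze(1) + gamma * (1.0 - dones.unsqueeze(1)) * support.unsqueeze(0)
        tz = tz.clamp(v_min, v_max)

        # categorical projection of the shifted atoms back onto the fixed support
        b = (tz - v_min) / delta_z                               # fractional atom positions
        lower = b.floor().long()
        upper = (lower + 1).clamp(max=num_atoms - 1)
        w_upper = b - lower.float()                              # linear interpolation weights
        w_lower = 1.0 - w_upper
        target_probs = torch.zeros_like(next_probs)
        target_probs.scatter_add_(1, lower, next_probs * w_lower)
        target_probs.scatter_add_(1, upper, next_probs * w_upper)

    # cross-entropy loss between the projected target and the predicted distribution
    log_probs = torch.log(critic(states, actions) + 1e-8)       # critic outputs atom probabilities
    critic_loss = -(target_probs * log_probs).sum(dim=1).mean()

    # gradient descent on the critic parameter theta
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    return critic_loss.item()
```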
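Continuing the same sketch, the actor can be updated by gradient ascent on the expected value of the critic's distribution under the current policy; in practice this is done by minimizing the negative expected Q-value with an optimizer. The objects involved are again assumed placeholders.

```python
import torch

def actor_update(critic, actor, actor_opt, states,
                 v_min=-10.0, v_max=10.0, num_atoms=51):
    """Sketch of the D4PG actor step: gradient ascent on E[Z_theta(s, mu_phi(s))]."""
    support = torch.linspace(v_min, v_max, num_atoms)

    # expected Q-value under the current deterministic policy mu_phi
    actions = actor(states)
    probs = critic(states, actions)                  # (K, num_atoms) atom probabilities
    expected_q = (probs * support).sum(dim=1)

    # minimizing the negative expectation is equivalent to gradient ascent on J(phi)
    actor_loss = -expected_q.mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return actor_loss.item()
```

The sketch only covers the per-minibatch update computations; the distributed actors feeding the replay buffer and the remaining steps elided from the listing are not shown.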