Twin Delayed DDPG
The algorithm for Twin Delayed DDPG (TD3) is given as follows:
- Initialize the two main critic network parameters, $\theta_1$ and $\theta_2$, and the main actor network parameter $\phi$
- Initialize the two target critic network parameters, $\theta'_1$ and $\theta'_2$, by copying the main critic network parameters $\theta_1$ and $\theta_2$, respectively
- Initialize the target actor network parameter $\phi'$ by copying the main actor network parameter $\phi$
- Initialize the replay buffer $\mathcal{D}$
- For each of $N$ episodes, repeat the following step:
- For each step in the episode, that is, for $t = 0, \dots, T-1$:
- Select action $a$ based on the policy $\mu_\phi(s)$ with exploration noise $\epsilon$, that is, $a = \mu_\phi(s) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma)$
- Perform the selected action $a$, move to the next state $s'$, get the reward $r$, and store the transition information in the replay buffer $\mathcal{D}$
- Randomly sample a minibatch of $K$ transitions from the replay buffer $\mathcal{D}$
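- Select the action $\tilde{a}$ for computing the target value using the target actor network, that is, $\tilde{a} = \mu_{\phi'}(s') + \epsilon$, where $\epsilon \sim \text{clip}(\mathcal{N}(0, \sigma), -c, +c)$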
- Compute the target value of the...
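
The steps listed so far, up to selecting the action used for the target value, can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation. The single linear layer standing in for each network, the dimensions, the noise scale `SIGMA`, the clip bound `NOISE_CLIP`, the `tanh` squashing, and the random placeholder environment are all assumptions made for the example.

```python
import copy
import random
from collections import deque

import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, ACTION_DIM = 3, 1   # hypothetical sizes
SIGMA, NOISE_CLIP = 0.2, 0.5   # hypothetical noise scale (sigma) and clip bound (c)
MAX_ACTION = 1.0               # hypothetical action bound


def init_linear(in_dim, out_dim):
    """Stand-in for a network: one linear layer stored as a dict (assumption)."""
    return {"W": rng.normal(0.0, 0.1, size=(in_dim, out_dim)), "b": np.zeros(out_dim)}


def forward(params, x):
    return x @ params["W"] + params["b"]


# Main networks: two critics (theta_1, theta_2) and one actor (phi).
theta_1 = init_linear(STATE_DIM + ACTION_DIM, 1)
theta_2 = init_linear(STATE_DIM + ACTION_DIM, 1)
phi = init_linear(STATE_DIM, ACTION_DIM)

# Target networks start as exact copies of the main networks.
theta_1_target = copy.deepcopy(theta_1)
theta_2_target = copy.deepcopy(theta_2)
phi_target = copy.deepcopy(phi)

# Replay buffer D.
replay_buffer = deque(maxlen=100_000)


def actor(params, state):
    """Deterministic policy mu(s), squashed into [-MAX_ACTION, MAX_ACTION]."""
    return MAX_ACTION * np.tanh(forward(params, state))


def select_action(state):
    """a = mu_phi(s) + eps, with eps ~ N(0, sigma)."""
    noise = rng.normal(0.0, SIGMA, size=ACTION_DIM)
    return actor(phi, state) + noise


def sample_minibatch(k):
    """Uniformly sample K transitions and stack them into arrays."""
    batch = random.sample(list(replay_buffer), k)
    states, actions, rewards, next_states = map(np.array, zip(*batch))
    return states, actions, rewards, next_states


def select_target_action(next_states):
    """a~ = mu_phi'(s') + eps, with eps ~ clip(N(0, sigma), -c, +c)."""
    noise = rng.normal(0.0, SIGMA, size=(len(next_states), ACTION_DIM))
    noise = np.clip(noise, -NOISE_CLIP, NOISE_CLIP)
    return actor(phi_target, next_states) + noise


# Collect a few transitions; random states and a toy reward replace a real environment.
state = rng.normal(size=STATE_DIM)
for t in range(64):
    action = select_action(state)
    next_state = rng.normal(size=STATE_DIM)   # placeholder for the environment step
    reward = float(-np.sum(state ** 2))       # placeholder reward
    replay_buffer.append((state, action, reward, next_state))
    state = next_state

states, actions, rewards, next_states = sample_minibatch(16)
target_actions = select_target_action(next_states)
```

Note that the target action uses the target actor `phi_target` and clipped noise, while the behavior action uses the main actor `phi` and unclipped noise. The critic networks are only initialized here, since the sketch stops before the target value is computed.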