Double DQN
The algorithm for double DQN is given as follows:
- Initialize the main network parameter $\theta$ with random values
- Initialize the target network parameter $\theta'$ by copying the main network parameter $\theta$
- Initialize the replay buffer $\mathcal{D}$

- For N episodes, repeat the following:
- For each step in the episode, that is, for $t = 0, \ldots, T-1$:
- Observe the state $s$ and select an action using the epsilon-greedy policy, that is, with probability $\epsilon$, select a random action $a$, and with probability $1-\epsilon$, select the action as $a = \arg\max_a Q(s, a; \theta)$ (sketched in code after this list)

- Perform the selected action, move to the next state $s'$, and obtain the reward $r$
- Store the transition information $(s, a, r, s')$ in the replay buffer (a minimal replay buffer sketch follows the list)

- Randomly sample a minibatch of K transitions from the replay buffer

- Compute the target value, that is, $y_i = r_i + \gamma Q\big(s'_i, \arg\max_{a'} Q(s'_i, a'; \theta); \theta'\big)$

- Compute the loss, $L(\theta) = \frac{1}{K}\sum_{i=1}^{K}\big(y_i - Q(s_i, a_i; \theta)\big)^2$

- Compute the gradients of the loss and update the main network parameter $\theta$ using gradient descent: $\theta = \theta - \alpha \nabla_\theta L(\theta)$ (the target, loss, and update are sketched together after the list)
...
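The epsilon-greedy action selection in the loop above can be written as a short helper. This is a minimal sketch, assuming a PyTorch Q-network `main_net` that maps a state vector to one Q-value per action; the function and variable names are illustrative, not from the text:

```python
import random

import torch


def select_action(main_net, state, epsilon, num_actions):
    """Epsilon-greedy action selection using the main network."""
    # With probability epsilon, explore: pick a random action.
    if random.random() < epsilon:
        return random.randrange(num_actions)
    # Otherwise exploit: pick the action with the highest Q-value
    # predicted by the main network for the current state.
    with torch.no_grad():
        state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        q_values = main_net(state_t)
    return int(q_values.argmax(dim=1).item())
```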
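The replay buffer used for storing and sampling transitions can be a simple fixed-capacity container. A minimal sketch, assuming transitions are stored as plain tuples with a terminal flag (the class name and interface are illustrative):

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer of (s, a, r, s', done) transitions."""

    def __init__(self, capacity):
        # Old transitions are dropped automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        # Store the transition information (s, a, r, s') plus a terminal flag.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, k):
        # Randomly sample a minibatch of K transitions.
        return random.sample(self.buffer, k)

    def __len__(self):
        return len(self.buffer)
```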
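The target, loss, and gradient step can be combined into one update function. The defining feature of Double DQN is visible here: the main network selects the next action and the target network evaluates it. This is a minimal sketch, assuming PyTorch networks `main_net` and `target_net` with identical architectures, an optimizer over `main_net`'s parameters, and a minibatch of K transitions already sampled from the replay buffer and converted to tensors; the names and the framework choice are assumptions for illustration:

```python
import torch
import torch.nn.functional as F


def double_dqn_update(main_net, target_net, optimizer,
                      states, actions, rewards, next_states, dones,
                      gamma=0.99):
    """One gradient step on a minibatch of K transitions.

    states:      float tensor of shape (K, state_dim)
    actions:     long tensor of shape (K,)
    rewards:     float tensor of shape (K,)
    next_states: float tensor of shape (K, state_dim)
    dones:       float tensor of shape (K,), 1.0 if s' is terminal else 0.0
    """
    # Q(s, a; theta) for the actions actually taken.
    q_values = main_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Action selection with the main network: argmax_a' Q(s', a'; theta).
        next_actions = main_net(next_states).argmax(dim=1, keepdim=True)
        # Action evaluation with the target network: Q(s', a*; theta').
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # Target y = r + gamma * Q(s', argmax_a' Q(s', a'; theta); theta'),
        # with the bootstrap term masked out for terminal states.
        targets = rewards + gamma * (1.0 - dones) * next_q

    # Mean squared error between the targets and the predicted Q-values.
    loss = F.mse_loss(q_values, targets)

    # Gradient descent on the main network parameters only; the target
    # network is updated separately by copying theta every few steps.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```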