Categorical DQN
The algorithm for a categorical DQN is given as follows:
1. Initialize the main network parameter $\theta$ with random values.
2. Initialize the target network parameter $\theta'$ by copying the main network parameter $\theta$.
3. Initialize the replay buffer $\mathcal{D}$, the number of atoms, and the support bounds $V_{\min}$ and $V_{\max}$.
4. For $N$ episodes, perform step 5.
5. For each step in the episode, that is, for $t = 0, \ldots, T-1$:
    - Feed the state $s$ and the support values to the main categorical DQN parameterized by $\theta$, and get the probability value $p_i(s, a)$ for each support value $z_i$. Then compute the Q value as $Q(s, a) = \sum_i z_i \, p_i(s, a)$ (a sketch of this computation is given after the list).
    - After computing the Q value, select an action using the epsilon-greedy policy; that is, with probability $\epsilon$ select a random action $a$, and with probability $1-\epsilon$ select the action $a = \arg\max_a Q(s, a)$.
    - Perform the selected action, move to the next state $s'$, and obtain the reward $r$.
    - Store the transition information in the replay buffer $\mathcal{D}$ (see the replay buffer sketch after the list).
    - Randomly sample a transition from the replay buffer $\mathcal{D}$...
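
To make the Q-value computation and the epsilon-greedy selection in step 5 concrete, here is a minimal NumPy sketch. The 51 atoms, the $[-10, 10]$ support range, and the action-space size are illustrative assumptions rather than values prescribed by the pseudocode above, and `probs` stands in for the probabilities returned by the main categorical DQN.

```python
import numpy as np

# Illustrative hyperparameters (assumptions, not taken from the text above):
# 51 atoms and a support range of [-10, 10].
n_atoms, v_min, v_max = 51, -10.0, 10.0
n_actions = 4  # assumed action-space size for this example

# Support values z_i: n_atoms evenly spaced points between V_min and V_max.
z = np.linspace(v_min, v_max, n_atoms)

def q_values(probs):
    """Collapse a categorical return distribution into Q values.

    probs has shape (n_actions, n_atoms); row a holds p_i(s, a) for every
    support value z_i, so Q(s, a) = sum_i z_i * p_i(s, a).
    """
    return probs @ z

def epsilon_greedy(probs, epsilon):
    """Random action with probability epsilon, otherwise argmax_a Q(s, a)."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(q_values(probs)))

# Dummy distribution standing in for the main network's output for a state s:
probs = np.random.dirichlet(np.ones(n_atoms), size=n_actions)
action = epsilon_greedy(probs, epsilon=0.1)
```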
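
The replay buffer used in the store and sample steps is standard DQN machinery. The following is a minimal sketch; the capacity and the transition fields are assumptions rather than details taken from the source.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal replay buffer: store transitions and sample a random batch."""

    def __init__(self, capacity=10_000):  # capacity is an assumed value
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones
```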