Soft Actor-Critic
The algorithm for Soft Actor-Critic (SAC) is given as follows:
1. Initialize the main value network parameter $\psi$, the Q network parameters $\theta_1$ and $\theta_2$, and the actor network parameter $\phi$
2. Initialize the target value network parameter $\psi'$ by just copying the main value network parameter $\psi$
3. Initialize the replay buffer $\mathcal{D}$
4. For N number of episodes, repeat step 5
5. For each step in the episode, that is, for $t = 0, \ldots, T-1$:
    1. Select action $a$ based on the policy $\pi_{\phi}(a|s)$, that is, $a \sim \pi_{\phi}(a|s)$
    2. Perform the selected action $a$, move to the next state $s'$, get the reward $r$, and store the transition information $(s, a, r, s')$ in the replay buffer $\mathcal{D}$
    3. Randomly sample a minibatch of K transitions $(s_i, a_i, r_i, s'_i)$ from the replay buffer $\mathcal{D}$
    4. Compute the target state value $y_{v,i} = \min_{j=1,2} Q_{\theta_j}(s_i, \tilde{a}_i) - \alpha \log \pi_{\phi}(\tilde{a}_i|s_i)$, where $\tilde{a}_i \sim \pi_{\phi}(\cdot|s_i)$ is a fresh action sampled from the current policy and $\alpha$ is the temperature
    5. Compute the loss of the value network $J_V(\psi) = \frac{1}{K}\sum_{i=1}^{K}\big(V_{\psi}(s_i) - y_{v,i}\big)^2$ and update the parameter using gradient descent, $\psi \leftarrow \psi - \lambda_V \nabla_{\psi} J_V(\psi)$
    6. Compute the target Q value $y_i = r_i + \gamma V_{\psi'}(s'_i)$
    7. Compute the loss of the Q networks $J_Q(\theta_j) = \frac{1}{K}\sum_{i=1}^{K}\big(Q_{\theta_j}(s_i, a_i) - y_i\big)^2$ for $j = 1, 2$ and update the parameters using gradient descent, $\theta_j \leftarrow \theta_j - \lambda_Q \nabla_{\theta_j} J_Q(\theta_j)$
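To make the update steps concrete, here is a minimal PyTorch-style sketch of one SAC iteration on a sampled minibatch, assuming small MLP networks and a tanh-squashed Gaussian policy. The names (ValueNet, QNet, GaussianPolicy, sac_update), the network sizes, and the hyperparameter values are illustrative assumptions rather than anything fixed by the algorithm above; the sketch also includes the policy update and the soft update of the target value network, which complete a full SAC iteration beyond the steps listed so far.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def mlp(in_dim, out_dim, hidden=256):
    # Two-hidden-layer MLP shared by all networks in this sketch.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))


class ValueNet(nn.Module):
    """State value network V_psi(s)."""
    def __init__(self, state_dim):
        super().__init__()
        self.net = mlp(state_dim, 1)

    def forward(self, s):
        return self.net(s).squeeze(-1)


class QNet(nn.Module):
    """Action value network Q_theta(s, a)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = mlp(state_dim + action_dim, 1)

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)


class GaussianPolicy(nn.Module):
    """Tanh-squashed Gaussian policy pi_phi(a|s)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = mlp(state_dim, 2 * action_dim)

    def sample(self, s):
        mu, log_std = self.net(s).chunk(2, dim=-1)
        std = log_std.clamp(-20, 2).exp()
        dist = torch.distributions.Normal(mu, std)
        u = dist.rsample()                 # reparameterized sample
        a = torch.tanh(u)                  # squash the action into (-1, 1)
        # log pi(a|s) with the tanh change-of-variables correction
        log_pi = dist.log_prob(u).sum(-1) - torch.log(1 - a.pow(2) + 1e-6).sum(-1)
        return a, log_pi


def sac_update(batch, v, v_target, q1, q2, policy, v_opt, q_opt, pi_opt,
               gamma=0.99, alpha=0.2, tau=0.005):
    """One SAC update on a minibatch of K transitions (s, a, r, s')."""
    s, a, r, s_next = batch

    # Target state value: y_v = min_j Q_theta_j(s, a~) - alpha * log pi_phi(a~|s)
    with torch.no_grad():
        a_new, log_pi_new = policy.sample(s)
        y_v = torch.min(q1(s, a_new), q2(s, a_new)) - alpha * log_pi_new

    # Value network loss J_V(psi) and gradient descent step on psi
    v_loss = F.mse_loss(v(s), y_v)
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()

    # Target Q value: y = r + gamma * V_psi'(s'), using the target value network
    with torch.no_grad():
        y_q = r + gamma * v_target(s_next)

    # Q network losses J_Q(theta_1) + J_Q(theta_2) and gradient descent step
    q_loss = F.mse_loss(q1(s, a), y_q) + F.mse_loss(q2(s, a), y_q)
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Policy update: minimize E[alpha * log pi_phi(a~|s) - min_j Q_theta_j(s, a~)]
    a_new, log_pi_new = policy.sample(s)
    pi_loss = (alpha * log_pi_new - torch.min(q1(s, a_new), q2(s, a_new))).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

    # Soft-update the target value network: psi' <- tau * psi + (1 - tau) * psi'
    with torch.no_grad():
        for p, p_targ in zip(v.parameters(), v_target.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
```

A hypothetical wiring of the networks and optimizers, with a random batch standing in for a sample from the replay buffer; note that the target value network starts as a copy of the main value network, as in step 2:

```python
state_dim, action_dim, K = 8, 2, 64
v, v_target = ValueNet(state_dim), ValueNet(state_dim)
v_target.load_state_dict(v.state_dict())       # target starts as a copy of V
q1, q2 = QNet(state_dim, action_dim), QNet(state_dim, action_dim)
policy = GaussianPolicy(state_dim, action_dim)
v_opt = torch.optim.Adam(v.parameters(), lr=3e-4)
q_opt = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), lr=3e-4)
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
batch = (torch.randn(K, state_dim), torch.rand(K, action_dim) * 2 - 1,
         torch.randn(K), torch.randn(K, state_dim))   # placeholder minibatch
sac_update(batch, v, v_target, q1, q2, policy, v_opt, q_opt, pi_opt)
```

Taking the minimum of the two Q networks in both the state value target and the policy objective is what counteracts the overestimation bias a single Q network would suffer from, which is why SAC maintains the parameters $\theta_1$ and $\theta_2$.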