REINFORCE with Baseline
The algorithm for REINFORCE with baseline is given as follows:
- Initialize the policy network parameter $\theta$
and the value network parameter $\phi$
- Generate $N$ trajectories $\{\tau^i\}_{i=1}^{N}$
following the policy $\pi_\theta$
- Compute the return (reward-to-go) $R_t$ for each time step $t$ of every trajectory
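- Compute the policy gradient, using the value network's estimate $V_\phi(s_t)$ as the baseline,
  $\nabla_\theta J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( R_t - V_\phi(s_t) \right) \right]$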
- Update the policy network parameter $\theta$
using gradient ascent, $\theta = \theta + \alpha \nabla_\theta J(\theta)$
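- Compute the mean squared error of the value network against the observed returns,
  $J(\phi) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \left( R_t - V_\phi(s_t) \right)^2$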
- Update the value network parameter $\phi$
using gradient descent, $\phi = \phi - \alpha \nabla_\phi J(\phi)$
- Repeat steps 2 to 7 for several iterations
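
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the environment is a hypothetical four-state corridor (start at state 0, goal at the last state, reward of -1 per step), and the policy and value "networks" are replaced by tables, so the parameters are one softmax preference per state-action pair and one value per state. The names `train`, `sample_trajectory`, and `rewards_to_go`, and the separate value learning rate `beta`, are choices made for this sketch.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def rewards_to_go(rewards, gamma=1.0):
    """Return R_t = sum over k >= t of gamma^(k - t) * r_k for every step t."""
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def sample_trajectory(theta, rng, n_states, max_steps=20):
    """Roll out one episode in the toy corridor (assumed environment)."""
    state, states, actions, rewards = 0, [], [], []
    for _ in range(max_steps):
        probs = softmax(theta[state])
        action = 0 if rng.random() < probs[0] else 1  # 0 = left, 1 = right
        states.append(state)
        actions.append(action)
        rewards.append(-1.0)  # every step costs -1
        state = max(0, state - 1) if action == 0 else state + 1
        if state == n_states - 1:  # reached the goal
            break
    return states, actions, rewards


def train(n_states=4, n_traj=20, iterations=300, alpha=0.05, beta=0.01, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize policy parameters (theta) and value parameters (phi).
    theta = [[0.0, 0.0] for _ in range(n_states)]
    phi = [0.0] * n_states
    for _ in range(iterations):  # Step 8: repeat steps 2 to 7
        # Step 2: generate N trajectories with the current policy.
        batch = [sample_trajectory(theta, rng, n_states) for _ in range(n_traj)]
        grad_theta = [[0.0, 0.0] for _ in range(n_states)]
        grad_phi = [0.0] * n_states
        for states, actions, rewards in batch:
            returns = rewards_to_go(rewards)  # Step 3: reward-to-go
            for s, a, ret in zip(states, actions, returns):
                advantage = ret - phi[s]  # R_t - V_phi(s_t)
                probs = softmax(theta[s])
                # Step 4: for a tabular softmax policy,
                # d log pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s).
                for b in range(2):
                    indicator = 1.0 if b == a else 0.0
                    grad_theta[s][b] += (indicator - probs[b]) * advantage / n_traj
                # Step 6: gradient of (R_t - V_phi(s_t))^2 with respect to phi[s].
                grad_phi[s] += -2.0 * advantage / n_traj
        for s in range(n_states):
            for b in range(2):
                theta[s][b] += alpha * grad_theta[s][b]  # Step 5: gradient ascent
            phi[s] -= beta * grad_phi[s]  # Step 7: gradient descent
    return theta, phi
```

Calling `theta, phi = train()` returns the learned preference table and value table; `softmax(theta[s])` then gives the action probabilities in state `s`. The value table is updated with a smaller step size than the policy because the squared-error gradient is summed over every visit to a state, which can make a larger step overshoot in this tabular setting.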