REINFORCE Policy Gradient
The algorithm for REINFORCE policy gradient is given as follows:
- Initialize the network parameter $\theta$ with random values
- Generate $N$ trajectories $\{\tau_i\}_{i=1}^{N}$ following the policy $\pi_\theta$
- Compute the return $R(\tau_i)$ of each trajectory
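- Compute the gradients:
  $$\nabla_\theta J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] R(\tau_i)$$
  where $T$ is the length of trajectory $\tau_i$, $(s_t, a_t)$ are its state-action pairs, and $R(\tau_i)$ is its return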
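- Update the network parameter by gradient ascent, $\theta = \theta + \alpha \nabla_\theta J(\theta)$, where $\alpha$ is the learning rate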
- Repeat steps 2 to 5 for several iterations
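The steps above can be sketched in plain Python. This is a minimal illustration under several assumptions that are not part of the text: a tabular softmax policy stands in for the neural network (so $\theta$ is a table of logits `theta[s][a]`), the environment is a hypothetical five-state corridor (`step`, `GOAL`, `MAX_STEPS` are illustrative names), and the return is taken as the discounted sum of rewards with an assumed factor `GAMMA`. As in the listed steps, no baseline is subtracted from $R(\tau_i)$.

```python
import math
import random

# Hypothetical toy environment (illustrative assumption, not from the text):
# a corridor of states 0..4 starting at state 0. Action 1 moves right,
# action 0 moves left (clamped at state 0). Reaching GOAL gives reward 1
# and ends the episode; episodes are cut off after MAX_STEPS steps.
N_STATES = 5
GOAL = N_STATES - 1
N_ACTIONS = 2
MAX_STEPS = 20
GAMMA = 0.9  # assumed discount factor used in the return R(tau)


def step(state, action):
    """Return (next_state, reward, done) for the toy corridor."""
    next_state = state + 1 if action == 1 else max(state - 1, 0)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def policy_probs(theta, state):
    """Tabular softmax policy pi_theta(. | state)."""
    logits = theta[state]
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_trajectory(theta, rng):
    """Step 2: roll out one episode following pi_theta as (s, a, r) tuples."""
    trajectory, state = [], 0
    for _ in range(MAX_STEPS):
        action = rng.choices(range(N_ACTIONS), weights=policy_probs(theta, state))[0]
        next_state, reward, done = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


def compute_return(trajectory):
    """Step 3: discounted return R(tau) = sum_t GAMMA**t * r_t."""
    return sum(GAMMA ** t * r for t, (_, _, r) in enumerate(trajectory))


def reinforce(n_iterations=500, n_trajectories=10, alpha=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize the parameters theta[s][a] with small random values.
    theta = [[rng.gauss(0.0, 0.01) for _ in range(N_ACTIONS)]
             for _ in range(N_STATES)]
    for _ in range(n_iterations):  # repeat steps 2 to 5
        grad = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
        for _ in range(n_trajectories):
            trajectory = sample_trajectory(theta, rng)  # step 2
            ret = compute_return(trajectory)            # step 3
            # Step 4: add (1/N) * [sum_t grad log pi(a_t|s_t)] * R(tau_i).
            # For a tabular softmax, d log pi(a|s) / d theta[s][b] = 1{a == b} - pi(b|s).
            for state, action, _ in trajectory:
                probs = policy_probs(theta, state)
                for b in range(N_ACTIONS):
                    indicator = 1.0 if b == action else 0.0
                    grad[state][b] += (indicator - probs[b]) * ret / n_trajectories
        # Step 5: gradient ascent, theta <- theta + alpha * grad.
        for s in range(N_STATES):
            for b in range(N_ACTIONS):
                theta[s][b] += alpha * grad[s][b]
    return theta


def average_return(theta, n_episodes=2000, seed=1):
    """Monte Carlo estimate of J(theta) = E[R(tau)] under pi_theta."""
    rng = random.Random(seed)
    return sum(compute_return(sample_trajectory(theta, rng))
               for _ in range(n_episodes)) / n_episodes


if __name__ == "__main__":
    uniform = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    trained = reinforce()
    print("average return, uniform policy:", average_return(uniform))
    print("average return, trained policy:", average_return(trained))
    print("pi(. | start state):", policy_probs(trained, 0))
```

Running the file compares the average return of a uniform policy with that of the trained one. Because no baseline is subtracted, the per-iteration gradient estimates are noisy; a small learning rate and many iterations are one simple way to keep the updates stable in this sketch.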