PPO-Clipped
The algorithm for the PPO-clipped method is given as follows:
- Initialize the policy network parameter $\theta$ and the value network parameter $\phi$
- Collect some $N$ number of trajectories following the policy $\pi_\theta$
- Compute the return (reward-to-go) $R_t$
- Compute the gradient of the objective function, $\nabla_\theta L(\theta)$, where $L(\theta) = \mathbb{E}_t\big[\min\big(r_t(\theta) A_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\big)\big]$ and $r_t(\theta) = \dfrac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$
- Update the policy network parameter using gradient ascent, $\theta = \theta + \alpha \nabla_\theta L(\theta)$
- Compute the mean squared error of the value network, $J(\phi) = \frac{1}{N}\sum_t \big(R_t - V_\phi(s_t)\big)^2$
- Compute the gradient of the value network, $\nabla_\phi J(\phi)$
- Update the value network parameter using gradient descent, $\phi = \phi - \alpha \nabla_\phi J(\phi)$
- Repeat steps 2 to 8 for several iterations
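
The two loss computations at the heart of the loop above can be sketched as plain functions. This is a minimal NumPy illustration, not a full training implementation: the function names, the toy batch values, and the fixed $\epsilon = 0.2$ are assumptions for the example, and in practice the ratios, advantages, and value predictions would come from the policy and value networks.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    # Unclipped surrogate: r_t(theta) * A_t
    unclipped = ratio * advantage
    # Clipped surrogate: clip(r_t(theta), 1 - eps, 1 + eps) * A_t
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Elementwise minimum of the two, averaged over the batch:
    # this is the objective L(theta) we ascend on
    return float(np.mean(np.minimum(unclipped, clipped)))

def value_loss(returns, values):
    # Mean squared error between the reward-to-go R_t and V_phi(s_t):
    # this is the loss J(phi) we descend on
    return float(np.mean((returns - values) ** 2))

# Toy batch (hypothetical numbers, just to exercise the functions)
ratio     = np.array([0.8, 1.0, 1.5])   # pi_theta / pi_theta_old per step
advantage = np.array([1.0, -0.5, 2.0])  # advantage estimates A_t
L = ppo_clip_objective(ratio, advantage, epsilon=0.2)
J = value_loss(np.array([1.0, 0.5]), np.array([0.8, 0.7]))
print(round(L, 4), round(J, 4))
```

Note how the third sample (ratio 1.5, positive advantage) is clipped at 1.2: the minimum prevents the update from profiting beyond the trust region, which is the whole point of the clipped objective.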