Trust Region Policy Optimization
The algorithm for Trust Region Policy Optimization (TRPO) is given as follows:
- Initialize the policy network parameter $\theta$ and the value network parameter $\phi$
- Generate $N$ trajectories $\{\tau^i\}_{i=1}^{N}$ following the policy $\pi_\theta$
- Compute the return (reward-to-go) $R_t$
- Compute the advantage value $A_t$
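- Compute the policy gradients, $g = \nabla_\theta J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, A_t$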
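- Compute the search direction $s = H^{-1} g$ using the conjugate gradient method, where $g$ is the policy gradient and $H$ is the Hessian of the KL divergence between the old and new policies
- Update the policy network parameter $\theta$ using the update rule $\theta = \theta + \alpha^j \sqrt{\frac{2\delta}{s^T H s}} \, s$, where $\delta$ is the trust region size and $\alpha^j$, with $\alpha \in (0, 1)$, is the backtracking line search coefficient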
- Compute the mean squared error of the value network, $J(\phi) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \left( R_t - V_\phi(s_t) \right)^2$
- Update the value network parameter $\phi$ using gradient descent, $\phi = \phi - \eta \nabla_\phi J(\phi)$, where $\eta$ is the learning rate
- Repeat steps 2 to 9 for several iterations
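
The return, conjugate gradient, and policy update steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a full TRPO implementation: the function names (`reward_to_go`, `conjugate_gradient`, `trpo_step`), the discount factor `gamma`, and the `hvp` callback (a function returning the Hessian-vector product $Hv$) are all hypothetical choices made for this example. The update is assumed to take the standard form $\theta + \alpha^j \sqrt{2\delta / (s^T H s)} \, s$, and the gradient `g` is assumed to be nonzero.

```python
import math

def reward_to_go(rewards, gamma=0.99):
    """Discounted return R_t = r_t + gamma * R_{t+1} for each step of one trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H s = g using only Hessian-vector products hvp(v) = H v."""
    s = [0.0] * len(g)
    r = list(g)   # residual g - H s (s starts at zero)
    p = list(g)   # current search direction
    rr = dot(r, r)
    for _ in range(iters):
        if rr < tol:
            break
        Hp = hvp(p)
        alpha = rr / dot(p, Hp)
        s = [si + alpha * pi for si, pi in zip(s, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, Hp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return s

def trpo_step(theta, g, hvp, delta=0.01, backtrack=1.0):
    """Return theta + backtrack * sqrt(2 * delta / (s^T H s)) * s.

    `backtrack` stands in for the line search coefficient alpha^j;
    a real line search would shrink it until the KL constraint holds.
    Assumes g is nonzero, so that s^T H s > 0.
    """
    s = conjugate_gradient(hvp, g)
    step_size = math.sqrt(2.0 * delta / dot(s, hvp(s)))
    return [th + backtrack * step_size * si for th, si in zip(theta, s)]
```

Passing `hvp` as a callback rather than a matrix mirrors why conjugate gradient is used here at all: it needs only products $Hv$, so $H$ never has to be formed or inverted explicitly. With `backtrack=1.0`, the quadratic approximation of the KL divergence for the resulting step, $\frac{1}{2} \Delta\theta^T H \Delta\theta$, equals `delta`.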