PPO-Penalty
The algorithm for the PPO-penalty method is given as follows:
- Initialize the policy network parameter $\theta$ and the value network parameter $\phi$, and initialize the penalty coefficient $\beta_0$ and the target KL divergence $\delta$
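- For iterations $i = 1, 2, \ldots$:
  - Collect $N$ trajectories following the policy $\pi_\theta$
  - Compute the return (reward-to-go) $R_t$
  - Compute the penalized objective $L(\theta) = \mathbb{E}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} A_t - \beta_i \, \text{KL}\left[\pi_{\theta_{\text{old}}}(\cdot|s_t), \pi_\theta(\cdot|s_t)\right]\right]$, where $A_t$ is the advantage estimate
  - Compute the gradient of the objective function, $\nabla_\theta L(\theta)$
  - Update the policy network parameter $\theta$ using gradient ascent, $\theta = \theta + \alpha \nabla_\theta L(\theta)$
  - Compute the mean KL divergence $d = \mathbb{E}_t\left[\text{KL}\left[\pi_{\theta_{\text{old}}}(\cdot|s_t), \pi_\theta(\cdot|s_t)\right]\right]$. If $d$ is greater than or equal to $1.5\delta$, then we set $\beta_{i+1} = 2\beta_i$; if $d$ is less than or equal to $\delta / 1.5$, then we set $\beta_{i+1} = \beta_i / 2$
  - Compute the mean squared error of the value network: $J(\phi) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t} \left(R_t - V_\phi(s_t)\right)^2$
  - Compute the gradient of the value network loss, $\nabla_\phi J(\phi)$
  - Update the value network parameter $\phi$ using gradient descent, $\phi = \phi - \alpha \nabla_\phi J(\phi)$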
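The per-iteration quantities in the steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a full training loop: it assumes discrete (categorical) action distributions, and the function names `kl_divergence`, `penalized_objective`, `update_beta`, and `value_mse`, as well as the example probabilities, are hypothetical choices made for this sketch. The thresholds 1.5 and the factor 2 follow the penalty-update step listed above.

```python
import math

def kl_divergence(p_old, p_new):
    """KL[p_old || p_new] for two categorical distributions over the same actions."""
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0)

def penalized_objective(ratios, advantages, kls, beta):
    """Sample mean of (probability ratio * advantage - beta * KL) over time steps."""
    terms = [r * a - beta * k for r, a, k in zip(ratios, advantages, kls)]
    return sum(terms) / len(terms)

def update_beta(beta, d, delta):
    """Adapt the penalty coefficient from the measured mean KL divergence d."""
    if d >= 1.5 * delta:
        return 2.0 * beta   # policy moved too far: penalize KL more strongly
    if d <= delta / 1.5:
        return beta / 2.0   # policy barely moved: relax the penalty
    return beta             # d is close to the target: keep beta unchanged

def value_mse(returns, values):
    """Mean squared error between reward-to-go returns and value predictions."""
    return sum((r - v) ** 2 for r, v in zip(returns, values)) / len(returns)

# Illustrative numbers only (assumed, not taken from any experiment).
p_old = [0.5, 0.3, 0.2]   # old policy's action probabilities in one state
p_new = [0.4, 0.4, 0.2]   # updated policy's action probabilities in that state
d = kl_divergence(p_old, p_new)
beta = update_beta(beta=1.0, d=d, delta=0.01)
print(beta)  # 2.0, because d exceeds 1.5 * delta
```

In a real implementation the ratios, advantages, and KL terms would come from the policy and value networks, and the gradients would be taken by an automatic differentiation library rather than computed by hand.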