On-Policy TD Control – SARSA
The on-policy TD control algorithm, SARSA, is given as follows (a minimal code sketch appears after the list):
- Initialize the Q function Q(s, a) with random values
- For each episode:
    - Initialize the state s
    - Extract a policy from Q(s, a) and select an action a to perform in the state s
    - For each step in the episode:
        - Perform the action a, move to the new state s', and observe the reward r
        - In the state s', select the action a' using the epsilon-greedy policy
        - Update the Q value as Q(s, a) = Q(s, a) + α [r + γ Q(s', a') - Q(s, a)]
        - Update s = s' and a = a' (update the next state s'-action a' pair to the current state s-action a pair)
        - If s is not the terminal state, repeat the steps above
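As a quick numeric check of the update rule, suppose α = 0.1, γ = 0.9, Q(s, a) = 0.5, r = 1, and Q(s', a') = 0.8 (all values chosen only for illustration). Then the updated value is Q(s, a) = 0.5 + 0.1 × (1 + 0.9 × 0.8 - 0.5) = 0.5 + 0.1 × 1.22 = 0.622.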
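To make the procedure concrete, here is a minimal, self-contained Python sketch of SARSA. The corridor environment, the `step` helper, and the hyperparameter values (`alpha`, `gamma`, `epsilon`) are illustrative assumptions, not from the text; any episodic environment with discrete states and actions would work the same way.

```python
import random
from collections import defaultdict

# A tiny 1-D corridor environment (hypothetical, for illustration only):
# states 0..4, start at state 0, terminal state 4, reward 1 on reaching the goal.
N_STATES = 5
ACTIONS = [-1, +1]  # move left, move right

def step(state, action):
    """Apply an action; return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def epsilon_greedy(Q, state, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(range(len(ACTIONS)))
    values = [Q[(state, a)] for a in range(len(ACTIONS))]
    return values.index(max(values))

def sarsa(episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Q(s, a) is initialized with a random value the first time a pair is seen
    Q = defaultdict(random.random)
    for _ in range(episodes):
        s = 0                                  # initialize the state s
        a = epsilon_greedy(Q, s, epsilon)      # select a from the epsilon-greedy policy
        done = False
        while not done:                        # for each step in the episode
            s_next, r, done = step(s, ACTIONS[a])
            a_next = epsilon_greedy(Q, s_next, epsilon)
            # SARSA update: Q(s,a) = Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a));
            # the bootstrap term is masked out on the terminal transition
            target = r + gamma * Q[(s_next, a_next)] * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next              # s = s' and a = a'
        # the episode ends once s is the terminal state
    return Q

if __name__ == "__main__":
    Q = sarsa()
    print({s: max(Q[(s, a)] for a in range(len(ACTIONS))) for s in range(N_STATES)})
```

Note that the action a' used in the update target is also the action actually executed on the next step, so SARSA evaluates and improves the same epsilon-greedy policy it follows; this is what makes it on-policy.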