Reinforcement learning (RL) is “learning by trial and error”: an agent takes actions in an environment, receives rewards, and updates its policy to collect more reward over time.
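Concretely, the loop is: act, observe a reward, nudge the policy toward what paid off. Here is a toy Python sketch of that loop (not the demo's code; everything in it is invented for illustration): a 1-D world where reaching +5 pays a reward, a two-action softmax policy, and a naive “reinforce whatever the successful episode did” update.

```python
import math, random

# Toy trial-and-error loop (illustrative only): the agent walks a 1-D line
# starting at 0 and is rewarded for reaching +5. The "policy" is one
# preference number per action, sampled through a softmax.
prefs = {"left": 0.0, "right": 0.0}

def act():
    actions = list(prefs)
    weights = [math.exp(prefs[a]) for a in actions]  # softmax over preferences
    return random.choices(actions, weights=weights)[0]

for episode in range(500):
    pos, taken = 0, []
    for _ in range(20):                    # episode ends after 20 steps or on success
        a = act()
        taken.append(a)
        pos += 1 if a == "right" else -1
        if pos == 5:                       # reward: reinforce the whole trajectory
            for a in taken:
                prefs[a] += 0.1
            break

print(prefs)  # "right" should end up with the higher preference
```

Successful episodes necessarily contain more “right” moves than “left” moves, so the preference for “right” grows faster and the policy drifts toward the reward.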
Here the square is the agent and the circle is the reward. The only actions are up / down / left / right. PPO (Proximal Policy Optimization) trains the policy with gradient descent, but clips each update so the policy can’t change too much at once. Click or drag in the canvas to move the reward circle.
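For reference, this is the clipped surrogate objective PPO maximizes (Schulman et al., 2017); $\hat{A}_t$ is the advantage estimate and $\epsilon$ is the clip range, typically 0.1 to 0.2:

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$

Taking the min with the clipped term removes the incentive to push the ratio $r_t(\theta)$ outside $[1-\epsilon,\,1+\epsilon]$, which is what keeps any single update small.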
This shows the update, with numbers taken from the most recent PPO minibatch. The policy and value function are tiny linear models, but the PPO recipe is the same as at any scale: compute an advantage, form an importance ratio r between the new policy and the old one, clip it, and take a gradient step (ascent on the policy objective, descent on the value loss).
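Here is a minimal numpy sketch of that recipe, assuming nothing about the demo's internals: the minibatch contents, shapes, and hyperparameters below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minibatch of 32 transitions (illustrative data, not the demo's):
S = rng.normal(size=(32, 4))      # state features
A = rng.integers(0, 4, size=32)   # chosen actions: up/down/left/right
G = rng.normal(size=32)           # empirical returns

W = np.zeros((4, 4))   # tiny linear policy: logits = S @ W
w = np.zeros(4)        # tiny linear value function: V(s) = s @ w
eps, lr = 0.2, 0.05    # clip range and learning rate

def action_probs(W):
    logits = S @ W
    logits -= logits.max(axis=1, keepdims=True)     # numerically stable softmax
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Advantages and old log-probs are frozen before the epochs begin.
adv = G - S @ w                                     # advantage = return - baseline
old_logp = np.log(action_probs(W)[np.arange(len(A)), A])
onehot = np.eye(4)[A]

for epoch in range(4):                              # several passes over one minibatch
    probs = action_probs(W)
    logp = np.log(probs[np.arange(len(A)), A])
    ratio = np.exp(logp - old_logp)                 # importance ratio r
    surr1 = ratio * adv                             # unclipped surrogate
    surr2 = np.clip(ratio, 1 - eps, 1 + eps) * adv  # clipped surrogate
    # Gradient flows only where the unclipped term attains the min; where the
    # clip binds (the clipped term is smaller), the sample contributes nothing.
    coef = np.where(surr1 <= surr2, ratio * adv, 0.0)
    W += lr * (S.T @ (coef[:, None] * (onehot - probs))) / len(A)  # ascend policy objective
    w -= lr * (S.T @ (2 * (S @ w - G))) / len(A)                   # descend value MSE
```

The `coef` line is the whole trick: where clipping binds, the sample drops out of the gradient, so no minibatch can drag the policy far from the one that collected the data.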