PPO explorer

Reinforcement learning (RL) is “learning by trial and error”: an agent takes actions in an environment, gets rewards, and updates a policy to get more reward over time.

Here the square is the agent. The circle is the reward. The only actions are up / down / left / right. PPO (Proximal Policy Optimization) trains the policy with gradient steps, but clips each update so the policy can’t change too much at once. Click/drag in the canvas to move the reward circle.
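To make the moving pieces concrete outside the canvas, here is a minimal Python sketch of this kind of gridworld (the grid size, reward shaping, and function names are illustrative assumptions, not the demo’s actual code):

import numpy as np

W, H = 10, 10                                   # placeholder grid size (assumed)
ACTIONS = [(0, -1), (0, 1), (-1, 0), (1, 0)]    # up, down, left, right

def reset(rng):
    # Random positions for the agent (square) and the reward (circle)
    agent = rng.integers([0, 0], [W, H])
    goal = rng.integers([0, 0], [W, H])
    return agent, goal

def step(agent, goal, a):
    # Move the square, keep it on the grid, reward it for reaching the circle
    agent = np.clip(agent + np.array(ACTIONS[a]), [0, 0], [W - 1, H - 1])
    done = bool(np.array_equal(agent, goal))
    reward = 1.0 if done else -0.01             # small step penalty (assumed shaping)
    return agent, reward, done

def observe(agent, goal):
    # State: position of the reward relative to the agent, plus a bias term
    return np.array([(goal[0] - agent[0]) / W, (goal[1] - agent[1]) / H, 1.0])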

// State: relative position to the reward
s = [ (xᵣ - x)/W, (yᵣ - y)/H, 1 ]

// Linear policy + value (tiny “neural nets”)
π(a|s) = softmax( Wπ s )        // a ∈ {up, down, left, right}
V(s) = wV · s

// PPO clipped objective (maximize)
r = exp( log π(a|s) - log πold(a|s) )
L = E[ min( rA, clip(r, 1-ε, 1+ε)A ) ]

// Gradient steps
Wπ ← Wπ + α ∇ L
wV ← wV - αᵥ ∇ ( V(s) - R )²
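Written out as code, the same update looks roughly like the numpy sketch below (the minibatch arrays S, A, ADV, RET, LOGP_OLD and the hyperparameter values are illustrative assumptions, not the demo’s internals):

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ppo_step(Wpi, wV, S, A, ADV, RET, LOGP_OLD, alpha=0.1, alpha_v=0.1, eps=0.2):
    """One clipped-PPO gradient step for a linear policy and value function.

    Wpi: (4, 3) policy weights, wV: (3,) value weights,
    S: (N, 3) states, A: (N,) actions, ADV: (N,) advantages,
    RET: (N,) returns, LOGP_OLD: (N,) log-probs under the old policy.
    """
    probs = softmax(S @ Wpi.T)                         # π(a|s) for all 4 actions
    logp = np.log(probs[np.arange(len(A)), A])         # log π(a|s) of the taken actions
    ratio = np.exp(logp - LOGP_OLD)                    # importance ratio r
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    unclipped_active = ratio * ADV <= clipped * ADV    # which branch min() selects

    # ∇L: the clipped branch is flat in the policy weights, so only samples on
    # the unclipped branch contribute; for a linear softmax policy,
    # ∇_logits log π(a|s) = onehot(a) - probs.
    dlogits = np.eye(Wpi.shape[0])[A] - probs
    coef = np.where(unclipped_active, ratio * ADV, 0.0)
    grad_Wpi = (coef[:, None] * dlogits).T @ S / len(A)
    Wpi = Wpi + alpha * grad_Wpi                       # gradient ascent on L

    # Value step: gradient descent on the squared error (V(s) - R)²
    grad_wV = 2 * (S @ wV - RET) @ S / len(A)
    wV = wV - alpha_v * grad_wV
    return Wpi, wV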

Train the square


The math this demo computes

This is the update filled in with numbers from the most recent PPO minibatch. The policy and value function are tiny linear models, but the PPO recipe is the same as for large networks: compute an advantage, form the importance ratio r, clip it, and take a gradient step (ascent on the policy objective, descent on the value loss).
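As a concrete instance of the clipped term, here is the arithmetic for one made-up transition (illustrative numbers, not values from the live minibatch):

import math

logp_old, logp_new = math.log(0.25), math.log(0.40)   # old/new probability of the taken action
A = 0.8                                               # positive advantage: the action was good
eps = 0.2

r = math.exp(logp_new - logp_old)                     # 0.40 / 0.25 = 1.6
clipped = min(max(r, 1 - eps), 1 + eps)               # clip(1.6, 0.8, 1.2) = 1.2
L = min(r * A, clipped * A)                           # min(1.28, 0.96) = 0.96
# The clip caps how much this one sample can push the policy toward the action.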