PPO explorer

Reinforcement learning (RL) is “learning by trial and error”: an agent takes actions in an environment, gets rewards, and updates a policy to get more reward over time.

Here the square is the agent. The circle is the reward. The only actions are up / down / left / right. PPO (Proximal Policy Optimization) trains the policy with gradient steps, but clips each update so the policy can’t change too much at once. Click/drag in the canvas to move the reward circle.
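To make the moving pieces concrete outside the canvas, here is a minimal Python sketch of this kind of gridworld (the grid size, reward shaping, and function names are illustrative assumptions, not the demo’s actual code):

import numpy as np

W, H = 10, 10                                   # placeholder grid size (assumed)
ACTIONS = [(0, -1), (0, 1), (-1, 0), (1, 0)]    # up, down, left, right

def reset(rng):
    # Random positions for the agent (square) and the reward (circle)
    agent = rng.integers([0, 0], [W, H])
    goal = rng.integers([0, 0], [W, H])
    return agent, goal

def step(agent, goal, a):
    # Move the square, keep it on the grid, reward it for reaching the circle
    agent = np.clip(agent + np.array(ACTIONS[a]), [0, 0], [W - 1, H - 1])
    done = bool(np.array_equal(agent, goal))
    reward = 1.0 if done else -0.01             # small step penalty (assumed shaping)
    return agent, reward, done

def observe(agent, goal):
    # State: position of the reward relative to the agent, plus a bias term
    return np.array([(goal[0] - agent[0]) / W, (goal[1] - agent[1]) / H, 1.0])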

// State: relative position to the reward
s = [ (xᵣ - x)/W, (yᵣ - y)/H, 1 ]

// Linear policy + value (tiny “neural nets”)
π(a|s) = softmax( Wπ s )        // a ∈ {up, down, left, right}
V(s) = wV · s

// PPO clipped objective (maximize)
r = exp( log π(a|s) - log πold(a|s) )
L = E[ min( rA, clip(r, 1-ε, 1+ε)A ) ]

// Gradient steps
Wπ ← Wπ + α ∇ L
wV ← wV - αᵥ ∇ ( V(s) - R )²
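Written out as code, the same update looks roughly like the numpy sketch below (the minibatch arrays S, A, ADV, RET, LOGP_OLD and the hyperparameter values are illustrative assumptions, not the demo’s internals):

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ppo_step(Wpi, wV, S, A, ADV, RET, LOGP_OLD, alpha=0.1, alpha_v=0.1, eps=0.2):
    """One clipped-PPO gradient step for a linear policy and value function.

    Wpi: (4, 3) policy weights, wV: (3,) value weights,
    S: (N, 3) states, A: (N,) actions, ADV: (N,) advantages,
    RET: (N,) returns, LOGP_OLD: (N,) log-probs under the old policy.
    """
    probs = softmax(S @ Wpi.T)                         # π(a|s) for all 4 actions
    logp = np.log(probs[np.arange(len(A)), A])         # log π(a|s) of the taken actions
    ratio = np.exp(logp - LOGP_OLD)                    # importance ratio r
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    unclipped_active = ratio * ADV <= clipped * ADV    # which branch min() selects

    # ∇L: the clipped branch is flat in the policy weights, so only samples on
    # the unclipped branch contribute; for a linear softmax policy,
    # ∇_logits log π(a|s) = onehot(a) - probs.
    dlogits = np.eye(Wpi.shape[0])[A] - probs
    coef = np.where(unclipped_active, ratio * ADV, 0.0)
    grad_Wpi = (coef[:, None] * dlogits).T @ S / len(A)
    Wpi = Wpi + alpha * grad_Wpi                       # gradient ascent on L

    # Value step: gradient descent on the squared error (V(s) - R)²
    grad_wV = 2 * (S @ wV - RET) @ S / len(A)
    wV = wV - alpha_v * grad_wV
    return Wpi, wV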

Train the square


The math this demo computes

This is the update filled in with numbers from the most recent PPO minibatch. The policy and value function are tiny linear models, but the PPO recipe is the same as for large networks: compute an advantage, form the importance ratio r, clip it, and take a gradient step (ascent on the policy objective, descent on the value loss).
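As a concrete instance of the clipped term, here is the arithmetic for one made-up transition (illustrative numbers, not values from the live minibatch):

import math

logp_old, logp_new = math.log(0.25), math.log(0.40)   # old/new probability of the taken action
A = 0.8                                               # positive advantage: the action was good
eps = 0.2

r = math.exp(logp_new - logp_old)                     # 0.40 / 0.25 = 1.6
clipped = min(max(r, 1 - eps), 1 + eps)               # clip(1.6, 0.8, 1.2) = 1.2
L = min(r * A, clipped * A)                           # min(1.28, 0.96) = 0.96
# The clip caps how much this one sample can push the policy toward the action.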