
A Second Head That Silently Corrupted PPO’s Ratio Math

What I Learned · 2026-04-19 · 7 min read

I added a card-choice head, wired its log-probs into the step log-probs, and made the PPO clip ratio a lie for weeks before I noticed.

The Outcome

  • Silent corruption window: weeks
  • Failure signal: value-loss drift, no loud failure
  • Heads on the net: 4
  • Size of the fix: 1 line
  • Head with no grad path: choice
  • What broke: the ratio

There are four heads on this network. I’ve written about three of them before — a policy head over game actions, a value head, a small discard-priority helper. This post is about the fourth. It’s one nn.Linear’s worth of code. And between the week I added it and the week I fixed it, the training loop was quietly doing the wrong math every time a card-reward prompt fired.

The plain-English version of the bug: training has a self-check that compares “how likely the current policy is to do this action” against “how likely it was at the moment the action was taken.” I fed a number into the second half of that comparison that the first half could never reproduce — so the self-check was quietly always a little bit off. Nothing blew up. A loss curve drifted in a direction I didn’t understand. That’s the whole story; the rest is how I got there and how I finally saw it.

The fourth head goes in

What I did
  • Added choice_proj = nn.Linear(HIDDEN_DIM, CARD_EMB_DIM) as a fourth head.
  • Wrote choose_cards(enc, option_card_ids, n_select) — project trunk hidden state, dot against card embeddings of the offered cards, sample without replacement.
  • Wired it into actor_proxy._handle_choice, which the environment calls on card-reward prompts.
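For concreteness, here is a minimal numpy sketch of the shape of that head. The toy sizes, the weight matrix standing in for choice_proj, and the embedding table are all made up for illustration; only the choose_cards signature follows the post.

```python
import numpy as np

HIDDEN_DIM, CARD_EMB_DIM = 8, 4  # toy sizes, not the real config
rng = np.random.default_rng(0)

# stand-ins for the fourth head's weights and the card embedding table
W_choice = rng.normal(size=(HIDDEN_DIM, CARD_EMB_DIM))  # choice_proj stand-in
card_embs = rng.normal(size=(32, CARD_EMB_DIM))         # one row per card id

def choose_cards(enc, option_card_ids, n_select):
    """Pick n_select cards without replacement; return picks and total log-prob."""
    query = enc @ W_choice                        # project trunk hidden state
    remaining = list(option_card_ids)
    picks, log_prob = [], 0.0
    for _ in range(n_select):
        logits = card_embs[remaining] @ query     # dot against offered cards
        logits -= logits.max()                    # stabilize softmax
        probs = np.exp(logits) / np.exp(logits).sum()
        i = rng.choice(len(remaining), p=probs)   # sample one pick
        log_prob += np.log(probs[i])              # accumulate per-pick log-prob
        picks.append(remaining.pop(i))            # shrink the remaining pool
    return picks, log_prob

picks, lp = choose_cards(rng.normal(size=HIDDEN_DIM), [3, 7, 11, 19], 2)
```

The detail that matters later: the returned log-prob is a sum over picks, each taken from a pool that shrinks as the selection proceeds.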

The curriculum was approaching the point where the agent would start making deck-building decisions, and a card-reward picker seemed like the smallest possible step in that direction. One linear layer, trained end-to-end with the rest of the network — or so I told myself.

What went wrong

Nothing visible. The head ran. The agent picked cards. Reward curves kept going up. The corruption had already started but the symptoms were days away.


The log-prob combination that looked obvious

What I did

choose_cards returns a log-prob — a numeric record of “how likely the policy said this pick was.” PPO uses log-probs on every transition as old_lp, its stored snapshot of that likelihood at the moment the action fired. The obvious thing, which I did without thinking about it too hard, was to add the choice log-prob to the step log-prob:

# what I shipped
step_log_prob = main_log_prob + choice_log_prob

I remember writing it. I remember not writing a test. That’s the receipt.

What went wrong

Short version: PPO’s core trick is a ratio — the importance weight — that compares the policy’s current opinion of an action against what it thought at rollout time. If the learner can’t reconstruct both sides of that ratio, the whole update is scaled by a number that’s subtly wrong.
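For reference, the update term the post keeps calling "the ratio math" fits in a few lines. This is a generic sketch of PPO's per-transition clipped surrogate, not the project's code; eps and the numbers are illustrative:

```python
import math

def ppo_clipped_term(new_lp, old_lp, advantage, eps=0.2):
    """One transition's clipped surrogate: min(r * A, clip(r) * A)."""
    ratio = math.exp(new_lp - old_lp)             # importance weight
    clipped = max(1 - eps, min(1 + eps, ratio))   # cap the update size
    return min(ratio * advantage, clipped * advantage)

# when the policy hasn't moved, the ratio is exactly 1 and the clip is inert
print(ppo_clipped_term(new_lp=-1.2, old_lp=-1.2, advantage=0.5))
```

The whole construction assumes new_lp and old_lp describe the same quantity, evaluated under two snapshots of the same policy. That is the invariant the fourth head broke.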

choice_proj has no gradient path from the PPO loss. PPO’s update scales by that importance ratio between the current policy and the policy under which the rollout was collected — both sides of that subtraction have to be recoverable at training time.

The policy head’s log-prob survives the round trip. choose_cards does not. Its log-prob is the sum of per-pick log-probs over a remaining_mask that evolves with the selection order, and the learner has no way to reproduce that from the stored state. Including it in old_lp throws the ratio on every choice step off by a factor of exp(choice_log_prob).
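With made-up numbers, the corruption is easy to see. Even with the policy completely unchanged, a choice term stored in old_lp that new_lp can never contain shifts the ratio away from 1 by exactly that factor:

```python
import math

main_lp = -1.5    # re-evaluable: the policy head's log-prob
choice_lp = -0.7  # not re-evaluable: choose_cards' summed per-pick log-prob

old_lp = main_lp + choice_lp  # what got stored at rollout time
new_lp = main_lp              # all the learner can reconstruct

ratio = math.exp(new_lp - old_lp)
# policy unchanged, so the ratio should be 1.0 -- instead it's exp(-choice_lp)
print(ratio)  # about 2.01, i.e. exp(0.7)
```

Every choice transition gets its gradient contribution scaled by that spurious factor, which is exactly the kind of quiet, directional bias the next section describes.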


Value-loss drift around choice events

What I did

Assumed training was fine because nothing was on fire. Reward curves tracked. Stage-0 win rate climbed. Entropy looked okay. What I was mostly monitoring was whether the runs blew up, and they weren’t.

What went wrong

Plain version: the net has a companion estimate of “how good is the current state” — the value function — and the training loss for that estimate is supposed to settle as it learns. Mine wasn’t settling. It was slowly walking in a direction I couldn’t attribute to anything.

Value loss had a persistent drift I couldn’t explain. Not a spike, not a divergence — a drift. Every time I pinned the window to a specific rollout segment and looked at the transitions, the segment contained at least one combat-end choice prompt. Once I saw that pattern three times it stopped being a coincidence.

The failure mode was not the one I was watching for. I had been waiting for training to blow up — a NaN, a loss spike, a collapse. What I got was worse: training looked fine, but the policy gradient was being pushed by numbers that didn’t mean what PPO expected them to mean.


One line and a comment

What I did

Dropped the choice log-prob from the step log-prob. Left a comment above the line so the next version of me doesn’t re-add it.

# Don't include choice log_probs — PPO can't re-evaluate them
# during training (choice_proj has no gradient path in the PPO loss).
# Including them corrupts the importance sampling ratio.
step_log_prob = log_prob.item()

What works now

PPO’s ratio math holds end-to-end. The clip term — the safety rail that caps how far a single update moves the policy — means what it’s supposed to mean on every step, including steps where a choice fired. choice_proj still picks cards; its log-probs are just honest about not being re-evaluable.

Training it is a separate problem. The current plan is a frozen-combat, V-as-reward setup on the choice head alone, which sidesteps the re-evaluation question entirely by not putting the choice log-prob through PPO in the first place.
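A sketch of what that separate training path could look like. Nothing here is shipped code: the reinforce_step and pick_probs helpers are made up, numpy stands in for torch, and treating the value head's estimate of the post-pick state as the reward is my reading of the plan, not an implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3)) * 0.1  # stand-in for choice_proj weights

def pick_probs(h, card_embs):
    """Softmax over the offered cards under the current head weights."""
    logits = card_embs @ (h @ W)
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_step(h, card_embs, pick, reward, lr=0.01):
    """One REINFORCE update on the choice head; reward = V(state after pick)."""
    global W
    probs = pick_probs(h, card_embs)
    dlogits = -probs                         # grad of log pi(pick) w.r.t. logits
    dlogits[pick] += 1.0                     # ... is one_hot(pick) - probs
    dquery = card_embs.T @ dlogits           # back through the dot product
    W += lr * reward * np.outer(h, dquery)   # ascend reward-weighted log-prob

# with a positive reward, the picked card's probability should go up
h = rng.normal(size=4)
embs = rng.normal(size=(5, 3))
before = pick_probs(h, embs)[2]
for _ in range(10):
    reinforce_step(h, embs, pick=2, reward=1.0)
after = pick_probs(h, embs)[2]
```

The point of the design is that the choice head's log-prob never enters PPO's ratio at all; it only ever drives its own isolated update.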


What past-me needed to hear

Before you add a second head, enumerate every tensor it produces and ask: does PPO’s ratio math still hold for this, end-to-end? The answer has to be yes on every step, not mostly, and “we’ll figure out training later” is the same shape as “this log-prob is wrong but I’ll ignore it.” If the learner can’t re-evaluate a log-prob from the stored state, that log-prob can’t be in old_lp. Drop it, or add a gradient path first.

The other lesson is about what looks like a training failure. Silent corruption doesn’t spike a loss curve. It biases a gradient in a specific direction on a specific subset of transitions. Reward curves kept going up because the policy was still improving on the combat actions that weren’t corrupted; the choice-step corruption was dragging against that improvement but not enough to reverse it. What it cost, quantitatively, I can’t say — by the time I noticed, I was re-running training from the fix forward, not A/B-testing against the broken version.

Takeaways

  1. Any log-prob that can’t be re-evaluated at training time is not a log-prob PPO can use. Drop it or give it a gradient path.
  2. “Training looks fine” is not an all-clear — silent corruption leaves the loud signals alone and biases the quiet ones.
  3. Before adding a second head, trace every tensor it produces through the update math. The ratio has to hold end-to-end.
  4. Comments above the fix, not just the fix. Future-me reads the comment before the code.