A Second Head That Silently Corrupted PPO’s Ratio Math
I added a card-choice head, wired its log-probs into the step log-probs, and made the PPO clip ratio a lie for weeks before I noticed.
There are four heads on this network. I’ve written about three of
them before — a policy head over game actions, a value head, a
small discard-priority helper. This post is about the fourth. It’s
one nn.Linear’s worth of code. And between the week I
added it and the week I fixed it, the training loop was quietly doing the
wrong math every time a card-reward prompt fired.
The plain-English version of the bug: training has a self-check that compares “how likely the current policy is to do this action” against “how likely it was at the moment the action was taken.” I fed a number into the second half of that comparison that the first half could never reproduce — so the self-check was quietly always a little bit off. Nothing blew up. A loss curve drifted in a direction I didn’t understand. That’s the whole story; the rest is how I got there and how I finally saw it.
The fourth head goes in
- Added choice_proj = nn.Linear(HIDDEN_DIM, CARD_EMB_DIM) as a fourth head.
- Wrote choose_cards(enc, option_card_ids, n_select) — project trunk hidden state, dot against card embeddings of the offered cards, sample without replacement (a sketch follows this list).
- Wired it into actor_proxy._handle_choice, which the environment calls on card-reward prompts.
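For concreteness, here is roughly the shape of that head and the sampling loop. The sizes, the card_embeddings table, and the loop body are my reconstruction from the description above, not the project's exact code.

import torch
import torch.nn as nn

HIDDEN_DIM, CARD_EMB_DIM, N_CARDS = 256, 64, 400       # placeholder sizes
card_embeddings = nn.Embedding(N_CARDS, CARD_EMB_DIM)  # assumed card table
choice_proj = nn.Linear(HIDDEN_DIM, CARD_EMB_DIM)      # the fourth head

def choose_cards(enc, option_card_ids, n_select):
    # Project the trunk hidden state into card-embedding space and score
    # each offered card by dot product.
    query = choice_proj(enc)                            # (CARD_EMB_DIM,)
    options = card_embeddings(option_card_ids)          # (n_opts, CARD_EMB_DIM)
    scores = options @ query                            # (n_opts,)

    picks = []
    total_log_prob = torch.tensor(0.0)
    remaining_mask = torch.ones_like(scores, dtype=torch.bool)
    for _ in range(n_select):
        # Sample without replacement: already-picked options are masked out,
        # so the distribution changes with every pick.
        masked = scores.masked_fill(~remaining_mask, float("-inf"))
        log_probs = torch.log_softmax(masked, dim=-1)
        idx = torch.distributions.Categorical(logits=masked).sample()
        total_log_prob = total_log_prob + log_probs[idx]
        remaining_mask[idx] = False
        picks.append(option_card_ids[idx])

    # total_log_prob depends on the selection order through remaining_mask;
    # this is the quantity the learner can't reconstruct from stored state.
    return picks, total_log_prob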
The curriculum was approaching the point where the agent would start making deck-building decisions, and a card-reward picker seemed like the smallest possible step in that direction. One linear layer, trained end-to-end with the rest of the network — or so I told myself.
Nothing visible. The head ran. The agent picked cards. Reward curves kept going up. The corruption had already started but the symptoms were days away.
The log-prob combination that looked obvious
choose_cards returns a log-prob — a numeric
record of “how likely the policy said this pick was.”
PPO stores a log-prob for every transition as old_lp, its
snapshot of that likelihood at the moment the action fired.
The obvious thing, which I did without thinking about it too hard,
was to add the choice log-prob to the step log-prob:
# what I shipped
step_log_prob = main_log_prob + choice_log_prob

I remember writing it. I remember not writing a test. That’s the receipt.
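The test past-me skipped is easy to sketch after the fact. Run once per rollout batch, before the first optimizer step (while the learner’s parameters still match rollout time), a check like the one below would have failed on the first choice step. The names rollout, step.obs, step.old_lp, and policy.log_prob are hypothetical, not the project’s actual API.

# Hypothetical consistency check: every stored old_lp must be reproducible
# from the stored observation and action alone. Run before the first
# optimizer step, while the parameters still match rollout time.
def check_old_lp(rollout, policy, atol=1e-4):
    for step in rollout:
        recomputed = policy.log_prob(step.obs, step.action)
        if abs(recomputed.item() - step.old_lp) > atol:
            raise AssertionError(
                f"old_lp {step.old_lp:.4f} not reproducible "
                f"(got {recomputed.item():.4f}): something went into the "
                f"step log-prob that the learner can't re-evaluate."
            )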
Short version: PPO’s core trick is a ratio — the importance weight — that compares the policy’s current opinion of an action against what it thought at rollout time. If the learner can’t reconstruct both sides of that ratio, the whole update is scaled by a number that’s subtly wrong.
choice_proj has no gradient path from the PPO loss, and the learner
never re-evaluates its output.
PPO’s update scales by that importance ratio between the
current policy and the policy under which the rollout was collected;
in log space the ratio is exp(new_lp - old_lp), and both sides of that
subtraction have to be recoverable at training time.
The policy head’s log-prob survives the round trip. choose_cards does not. Its
log-prob is the sum of per-pick log-probs over a
remaining_mask that evolves with the selection order,
and the learner has no way to reproduce that from the stored state.
Including it in old_lp makes the ratio on every choice
step wrong by a factor of exp(choice_log_prob): the learner’s side of
the subtraction never contains the choice term, so the ratio comes out
systematically inflated.
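To put a number on the distortion, here is a tiny sketch with made-up log-probs. Evaluated before any parameter update, the ratio on a choice step should be exactly 1; with the choice term folded into old_lp it is not.

import math

# Made-up log-probs for a single choice step, checked before any optimizer
# step, so the main head's log-prob is identical at rollout and training time.
main_lp_rollout = -1.20   # main policy head, rollout time
choice_lp       = -0.90   # choice head, rollout time (not re-evaluable later)
main_lp_train   = -1.20   # learner's re-evaluation: main head only

# Honest old_lp: the ratio on an unchanged policy is exactly 1.0.
print(math.exp(main_lp_train - main_lp_rollout))            # 1.0

# What I shipped: old_lp also carries the choice term the learner can never
# reproduce, so the ratio is inflated by exp(-choice_lp).
corrupted_old_lp = main_lp_rollout + choice_lp
print(math.exp(main_lp_train - corrupted_old_lp))           # ~2.46, not 1.0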
Value-loss drift around choice events
Assumed training was fine because nothing was on fire. Reward curves tracked. Stage-0 win rate climbed. Entropy looked okay. What I was mostly monitoring was whether the runs blew up, and they weren’t.
Plain version: the net has a companion estimate of “how good is the current state” — the value function — and the training loss for that estimate is supposed to settle as it learns. Mine wasn’t settling. It was slowly walking in a direction I couldn’t attribute to anything.
Value loss had a persistent drift I couldn’t explain. Not a spike, not a divergence — a drift. Every time I pinned the window to a specific rollout segment and looked at the transitions, the segment contained at least one combat-end choice prompt. Once I saw that pattern three times it stopped being a coincidence.
The failure mode was not the one I was watching for. I had been waiting for training to blow up — a NaN, a loss spike, a collapse. What I got was worse: training looked fine, but the policy gradient was being pushed by numbers that didn’t mean what PPO expected them to mean.
One line and a comment
Dropped the choice log-prob from the step log-prob. Left a comment above the line so the next version of me doesn’t re-add it.
# Don't include choice log_probs — PPO can't re-evaluate them
# during training (choice_proj has no gradient path in the PPO loss).
# Including them corrupts the importance sampling ratio.
step_log_prob = log_prob.item()
PPO’s ratio math holds end-to-end. The clip term — the
safety rail that caps how far a single update moves the policy
— means what it’s supposed to mean on every step,
including steps where a choice fired. choice_proj
still picks cards; its log-probs are just honest about not being
re-evaluable.
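For reference, the clip term in question is the standard clipped surrogate, written here with generic names rather than anything from this codebase; it only behaves as intended when old_lp is a quantity the learner can recompute.

import torch

def ppo_policy_loss(new_lp, old_lp, advantage, clip_eps=0.2):
    # Importance ratio: current policy vs. the rollout-time snapshot.
    # Only meaningful if old_lp can be re-evaluated from stored state.
    ratio = torch.exp(new_lp - old_lp)
    # The clip is the safety rail: cap how far one update moves the policy.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantage, clipped * advantage).mean()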
Training it is a separate problem. The current plan is a frozen-combat, V-as-reward setup on the choice head alone, which sidesteps the re-evaluation question entirely by not putting the choice log-prob through PPO in the first place.
What past-me needed to hear
Before you add a second head, enumerate every tensor it produces and ask:
does PPO’s ratio math still hold for this, end-to-end? The answer
has to be yes on every step, not mostly, and “we’ll figure out
training later” is the same shape as “this log-prob is wrong
but I’ll ignore it.” If the learner can’t re-evaluate a
log-prob from the stored state, that log-prob can’t be in
old_lp. Drop it, or add a gradient path first.
The other lesson is about what looks like a training failure. Silent corruption doesn’t spike a loss curve. It biases a gradient in a specific direction on a specific subset of transitions. Reward curves kept going up because the policy was still improving on the combat actions that weren’t corrupted; the choice-step corruption was dragging against that improvement but not enough to reverse it. What it cost, quantitatively, I can’t say — by the time I noticed, I was re-running training from the fix forward, not A/B-testing against the broken version.
Takeaways
- Any log-prob that can’t be re-evaluated at training time is not a log-prob PPO can use. Drop it or give it a gradient path.
- “Training looks fine” is not an all-clear — silent corruption leaves the loud signals alone and biases the quiet ones.
- Before adding a second head, trace every tensor it produces through the update math. The ratio has to hold end-to-end.
- Comments above the fix, not just the fix. Future-me reads the comment before the code.