The Card-Choice Head: the Fourth Head on the Trunk
The Neural Network deep dive named three heads. This post documents the fourth — a card-reward picker that scores offered cards by projecting the policy hidden state into card-embedding space, samples without replacement, and carries no PPO gradient path.
Why this head exists
The combat policy I’ve been writing about in the other Deep Dives plays turns, cards, and targets. It ends with a win or a loss. What it doesn’t do, yet, is choose the cards that go into the deck. The card-choice head is the smallest step toward that — a head that fires only at out-of-combat reward prompts and picks one of the offered cards.
Card-reward prompts look like this: after a combat ends, the game offers a small set of new cards and the player picks one (or sometimes skips). On the environment side it’s a choice_handler callback: the C# mod passes options: list[dict] plus min_select / max_select counts, and the actor side returns a list of selected indices plus a log-prob. The head’s whole job is to produce that selection.
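A minimal sketch of the actor-side handler, assuming the payload shape described above. The option-dict key "card_id" and the return-dict keys are illustrative, not the mod’s actual field names; choose_cards is the selection method discussed later in this post.
def handle_card_choice(policy, payload: dict) -> dict:
    options = payload["options"]                     # list[dict], one per offered card
    n_select = payload["max_select"]                 # min_select/max_select bound the pick count
    card_ids = [opt["card_id"] for opt in options]   # vocab indices for card_emb lookup

    selected, log_prob = policy.choose_cards(card_ids, n_select=n_select)
    return {"selected_indices": selected, "log_prob": log_prob.item()}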
The curriculum reaches the point where this matters at what I’ve been calling “stage 10.5” — a planned extension past the shipped 13-stage curriculum, a V-as-reward deck-building milestone where the trained combat policy stops playing random decks and starts building its own. Not live yet; the head exists for that moment. Everything else in this article is the machinery that gets it ready.
The projection
self.choice_proj = nn.Linear(HIDDEN_DIM, CARD_EMB_DIM)
One linear layer, no nonlinearity on the way out. HIDDEN_DIM
is 256 — the output of the shared trunk. CARD_EMB_DIM
is 16 — the width of the card embedding table. The projection
maps policy hidden state into the same 16-dim space the card embeddings
live in.
The choice to make this a nn.Linear rather than a small MLP was deliberate. A multi-layer projection would have more representational capacity, but it would also have to learn the card representation from scratch: no gradient reaches the choice head from the shared trunk during combat steps, because the choice head only fires at card-reward prompts. The linear projection keeps the parameter count tiny (256 × 16 = 4096 weights plus 16 biases) and keeps the relationship between hidden state and card space a simple linear map.
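For concreteness, the parameter count can be checked directly (a sketch; the constants come from the text above):
import torch.nn as nn

HIDDEN_DIM, CARD_EMB_DIM = 256, 16
choice_proj = nn.Linear(HIDDEN_DIM, CARD_EMB_DIM)

n_params = sum(p.numel() for p in choice_proj.parameters())
assert n_params == 256 * 16 + 16   # 4096 weights + 16 biases = 4112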
The scoring
h = self._trunk_forward(enc) # [1, HIDDEN_DIM]
proj = self.choice_proj(h) # [1, CARD_EMB_DIM]
opt_ids = torch.tensor(option_card_ids, dtype=torch.long)
opt_emb = self.card_emb(opt_ids) # [N, CARD_EMB_DIM]
scores = (proj @ opt_emb.T).squeeze(0) # [N]
opt_emb is the card embedding table looked up at the
offered-card vocab indices — the same table the
combat policy uses as input. That’s the hypothesis of this head
in a sentence: the combat trunk already learned a 16-dim representation
of every card in the vocabulary; similar cards cluster in that space
because they appear in similar contexts and benefit from similar value
estimates. The choice head’s projection asks “which of
these card embeddings is closest, in this space, to what the policy
would play right now?” The dot product is the answer.
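One way to poke at that hypothesis is to rank the whole vocabulary instead of just the offered cards (a sketch, assuming access to the model’s card_emb table; the function name is illustrative):
import torch

def nearest_cards(model, hidden_state, k=5):
    """Rank every card in the vocabulary by dot product against the
    projected hidden state -- the same score the head uses, just
    computed over the full table instead of the offered options."""
    proj = model.choice_proj(hidden_state)           # [1, CARD_EMB_DIM]
    all_emb = model.card_emb.weight                  # [VOCAB, CARD_EMB_DIM]
    scores = (proj @ all_emb.T).squeeze(0)           # [VOCAB]
    return torch.topk(scores, k).indices.tolist()    # vocab indices of the top-k cards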
If instead the head had its own MLP over card features, it would need
to re-learn the card similarity structure from a much smaller number of
gradient updates — card-reward prompts fire once per combat, not
once per step. Reusing card_emb inherits every card-level
gradient update the combat head produces. That’s the same bet the
rest of the curriculum is making at stage 10.5: that V(state)
from the combat model is a usable reward signal for the deck-building
model, because the combat model already understands cards.
No softmax temperature is applied to the scores. Entropy at this head
is controlled by how well-separated the card embeddings become during
combat training. If card_emb packs similar cards into
tight clusters, the head is sharper; if it spreads them out, softer.
That’s not a tunable parameter at this level — it’s
an emergent property of the combat training that this head is standing
on.
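That emergent sharpness can be read straight off the scores (a sketch over the scores tensor computed above):
import torch

def choice_entropy(scores: torch.Tensor) -> float:
    """Entropy of the categorical the head samples from. The score gaps,
    which come straight from the card-embedding geometry learned during
    combat training, set how peaked this distribution is."""
    dist = torch.distributions.Categorical(logits=scores)
    return dist.entropy().item()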
Sampling without replacement
selected = []
total_log_prob = torch.tensor(0.0)
remaining_mask = torch.ones(len(option_card_ids), dtype=torch.bool)
for _ in range(n_select):
    masked_scores = scores.masked_fill(~remaining_mask, -1e9)
    dist = torch.distributions.Categorical(logits=masked_scores)
    idx = dist.sample()
    total_log_prob = total_log_prob + dist.log_prob(idx)
    selected.append(idx.item())
    remaining_mask[idx] = False
return selected, total_log_prob
Standard card-reward prompts are pick-one-of-N, but the interface allows n_select > 1 for the cases where the game asks for multiple picks. The sampling loop draws one pick at a time, removing each picked index from the remaining pool before the next draw.
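Hypothetical call shape, assuming the loop above lives in a choose_cards method; the card ids are made up:
# Two-pick prompt over four offered cards. The returned indices are
# positions in the options list; log_prob is the sum of both per-pick terms.
selected, log_prob = policy.choose_cards(option_card_ids=[12, 47, 301, 88],
                                         n_select=2)
# selected -> e.g. [2, 0]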
masked_fill(~remaining_mask, -1e9) uses the same finite-margin trick the policy head uses for illegal-action masking: -1e9 instead of float('-inf'), so the logsumexp inside the softmax doesn’t poison downstream gradients with NaNs. The policy head’s write-up in the Neural Network deep dive has the full argument for the finite margin. Both heads share the trick.
The returned log_prob is the sum of per-pick log-probs
across the selection order — which, as it turns out, is exactly
the thing the PPO importance ratio can’t re-evaluate.
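A toy demonstration of that order dependence (a standalone sketch; the scores are made up): the same two picks taken in the opposite order sum to a different log-prob, which is why the stored value can’t be reproduced without the order.
import torch

def pick_in_order(scores, order):
    """Sum the per-pick log-probs for a fixed pick order, masking each
    picked index before the next draw -- mirroring the loop above."""
    mask = torch.ones_like(scores, dtype=torch.bool)
    total = torch.tensor(0.0)
    for idx in order:
        dist = torch.distributions.Categorical(
            logits=scores.masked_fill(~mask, -1e9))
        total = total + dist.log_prob(torch.tensor(idx))
        mask[idx] = False
    return total.item()

scores = torch.tensor([2.0, 0.5, -1.0])
print(pick_in_order(scores, [0, 1]))  # pick card 0, then card 1
print(pick_in_order(scores, [1, 0]))  # same set, opposite order, different sum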
The log-prob that never becomes old_lp
The log-prob this head returns is deliberately excluded from the step
log-prob pushed to the learner. PPO would need to recompute it from
the stored transition at training time, and it can’t:
choice_proj has no gradient path from the PPO loss, the
sampling order isn’t stored, and the remaining_mask
evolution can’t be reconstructed. The comment now in
actor_proxy.handle_connection — “Don’t
include choice log_probs — PPO can’t re-evaluate
them” — is the receipt for having learned that the hard
way. The Learned confession
choice-head-silent-corruption
is the full story.
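On the actor side, the exclusion is nothing more than leaving the field out of the pushed transition (a sketch; the dict fields are illustrative, only the omission is the point):
def build_transition(obs, action, combat_log_prob, value):
    """Transition pushed to the learner. The combat-step log-prob is stored
    because PPO can re-evaluate it at training time; the choice head's
    log-prob is deliberately left out, because it can't be (no gradient
    path, no stored sampling order, no reconstructable remaining_mask)."""
    return {
        "obs": obs,
        "action": action,
        "log_prob": combat_log_prob,   # PPO's old_lp
        "value": value,
        # no "choice_log_prob" key here, on purpose
    }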
Training the head
choice_proj has no PPO gradient path today, which means
the head is making selections but not learning from them. That’s
a training-loop question the project hasn’t fully resolved yet.
The two candidate answers:
A second PPO loop, choice-event-only. Store the
sampling order with each card-reward transition. At training time, run
choose_cards through the learner’s copy of the
policy against the stored order, produce a “new” log-prob,
and ratio it against the stored “old” log-prob. Build a
separate clipped-surrogate objective with its own clip range and
entropy coefficient. Train on episodic return as reward (the
long-horizon win rate the combat policy sees). The cost: a second set
of PPO hyperparameters, plus sampling-order storage in every stored
transition.
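Roughly what that choice-only surrogate would be, per stored choice event (a sketch of the math, not a shipped learner; the baseline and clip range are illustrative):
import torch

def choice_surrogate(new_log_prob: torch.Tensor,
                     old_log_prob: torch.Tensor,
                     episodic_return: float,
                     baseline: float,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate for one stored choice event. new_log_prob is
    choose_cards replayed against the stored sampling order; old_log_prob
    is the value saved at act time; the advantage is the long-horizon
    return minus a baseline."""
    ratio = torch.exp(new_log_prob - old_log_prob)
    advantage = episodic_return - baseline
    return -torch.min(ratio * advantage,
                      torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage)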
Frozen combat, V-as-reward. Hold the combat policy
and value networks fixed. Run the card-choice head forward on
card-reward prompts and take the selection. Use
V(state_after_pick) — the frozen value network’s
estimate of the post-pick state — as the reward signal. Update
choice_proj by policy-gradient directly on that reward.
No importance ratio, no sampling-order storage, no second PPO loop.
The cost: the combat policy is a fixed bar — improvements on the
choice side can’t back-propagate into better combat play.
The current plan is the second. The planned “stage 10.5” extension is defined around exactly that setup: freeze the combat network, train the card-choice head with V as reward, measure whether the policy with its own deck outperforms the policy with a random deck. Getting there requires the head to exist and produce selections first; that’s what this article is documenting.
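A minimal sketch of that frozen-combat, V-as-reward update, assuming the frozen value network is callable on a post-pick state and the optimizer is built over choice_proj.parameters() only, so card_emb and the trunk stay fixed. All field names here are illustrative:
import torch

def choice_head_update(policy, frozen_value_net, prompts, optimizer):
    """REINFORCE-style update on choice_proj alone. Reward is the frozen
    value network's estimate of the post-pick state; no importance ratio,
    no stored sampling order, no second PPO loop."""
    losses = []
    for prompt in prompts:
        selected, log_prob = policy.choose_cards(
            prompt["option_card_ids"], n_select=1)
        with torch.no_grad():  # the combat value network stays frozen
            reward = frozen_value_net(prompt["post_pick_states"][selected[0]])
        losses.append(-log_prob * reward)  # push probability toward high-V picks
    optimizer.zero_grad()
    torch.stack(losses).mean().backward()  # gradients reach choice_proj only
    optimizer.step()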
Takeaways
card_emb is the design point: the choice head inherits every card-level gradient update the combat head has already produced. remaining_mask drops already-picked indices from the pool; masked_fill(~mask, -1e9) keeps gradients finite. The choice log-prob never becomes PPO’s old_lp; a separate training loop (V-as-reward on frozen combat) picks up the gradient work.