
The Card-Choice Head: the Fourth Head on the Trunk

Deep Dive · 2026-04-19 · 8 min read

The Neural Network deep dive named three heads. This post documents the fourth — a card-reward picker that scores offered cards by projecting the policy hidden state into card-embedding space, samples without replacement, and carries no PPO gradient path.

[Figure: network architecture. card_emb 16d, enemy_emb 16d, relic_emb 8d, pile_emb 8d, power_emb 8d, scalars 18d → concat → Linear · LayerNorm · ReLU (256) → Linear · LayerNorm · ReLU (256) → policy head (61 actions), value V(s), discard head (aux · 10), choice head (no PPO grad path). 4 heads · 1 without a PPO gradient path.]

Why this head exists

The combat policy I’ve been writing about in the other Deep Dives plays turns, cards, and targets. It ends with a win or a loss. What it doesn’t do, yet, is choose the cards that go into the deck. The card-choice head is the smallest step toward that — a head that fires only at out-of-combat reward prompts and picks one of the offered cards.

Card-reward prompts look like this: after a combat ends, the game offers a small set of new cards and the player picks one (or sometimes skips). On the environment side, it’s a choice_handler callback: the C# mod passes options: list[dict] plus min_select / max_select counts, and the actor side returns a list of selected indices plus a log-prob. The head’s whole job is to produce that selection.
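
A minimal sketch of what that exchange might look like on the actor side. Only choice_handler and min_select / max_select come from the environment description above; the policy handle, the dict keys, the exact signature, and the choose_cards call (named later in this post) are stand-ins for whatever the actor actually does:

def choice_handler(options: list[dict], min_select: int, max_select: int):
    # "card_id" is an assumed key: whatever field maps an offered card to its
    # index in the card vocabulary.
    option_card_ids = [opt["card_id"] for opt in options]
    # Pick max_select cards (the common case is min_select == max_select == 1).
    selected, log_prob = policy.choose_cards(option_card_ids, n_select=max_select)
    return selected, log_prob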

The curriculum reaches the point where this matters at what I’ve been calling “stage 10.5” — a planned extension past the shipped 13-stage curriculum, a V-as-reward deck-building milestone where the trained combat policy stops playing random decks and starts building its own. Not live yet; the head exists for that moment. Everything else in this article is the machinery that gets it ready.

The projection

self.choice_proj = nn.Linear(HIDDEN_DIM, CARD_EMB_DIM)

One linear layer, no nonlinearity on the way out. HIDDEN_DIM is 256 — the output of the shared trunk. CARD_EMB_DIM is 16 — the width of the card embedding table. The projection maps policy hidden state into the same 16-dim space the card embeddings live in.

The choice to make this a nn.Linear rather than a small MLP was deliberate. A multi-layer projection would have more representational capacity, but it would also need to re-learn the card representation from scratch — there’s no gradient signal reaching the choice head from the shared trunk during combat steps, because the choice head only fires at card-reward prompts. The linear projection keeps the parameter count tiny (256 × 16 = 4096 weights plus 16 biases) and keeps the relationship between hidden state and card space as simple as a single matrix.
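
For concreteness, a sketch of how the two pieces coexist in one module. The class name and vocab size are placeholders; only card_emb and choice_proj mirror the real code:

import torch.nn as nn

HIDDEN_DIM, CARD_EMB_DIM, CARD_VOCAB = 256, 16, 512   # vocab size is illustrative

class PolicyNet(nn.Module):
    def __init__(self):
        super().__init__()
        # the same table the combat encoder uses to embed hands and piles
        self.card_emb = nn.Embedding(CARD_VOCAB, CARD_EMB_DIM)
        # ... trunk and the other three heads elided ...
        # the entire choice head: 256 * 16 weights + 16 biases = 4112 parameters
        self.choice_proj = nn.Linear(HIDDEN_DIM, CARD_EMB_DIM)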

The scoring

h = self._trunk_forward(enc)                  # [1, HIDDEN_DIM]
proj = self.choice_proj(h)                     # [1, CARD_EMB_DIM]
opt_ids = torch.tensor(option_card_ids, dtype=torch.long)
opt_emb = self.card_emb(opt_ids)               # [N, CARD_EMB_DIM]
scores = (proj @ opt_emb.T).squeeze(0)         # [N]

opt_emb is the card embedding table looked up at the offered-card vocab indices — the same table the combat policy uses as input. That’s the hypothesis of this head in a sentence: the combat trunk already learned a 16-dim representation of every card in the vocabulary; similar cards cluster in that space because they appear in similar contexts and benefit from similar value estimates. The choice head’s projection asks “which of these card embeddings is closest, in this space, to what the policy would play right now?” The dot product is the answer.

If instead the head had its own MLP over card features, it would need to re-learn the card similarity structure from a much smaller number of gradient updates — card-reward prompts fire once per combat, not once per step. Reusing card_emb inherits every card-level gradient update the combat head produces. That’s the same bet the rest of the curriculum is making at stage 10.5: that V(state) from the combat model is a usable reward signal for the deck-building model, because the combat model already understands cards.

No softmax temperature is applied to the scores. Entropy at this head is controlled by how well-separated the card embeddings become during combat training. If card_emb packs similar cards into tight clusters, the head is sharper; if it spreads them out, softer. That’s not a tunable parameter at this level — it’s an emergent property of the combat training that this head is standing on.
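
A toy illustration of that point, with made-up scores for three offered cards; nothing here is project code, it just shows that the spread of the scores is the only thing setting the sharpness of the pick:

import torch

close = torch.distributions.Categorical(logits=torch.tensor([2.1, 2.0, 1.9]))
far = torch.distributions.Categorical(logits=torch.tensor([4.0, 1.0, -2.0]))
print(close.probs)   # ~[0.37, 0.33, 0.30]: near-uniform, soft
print(far.probs)     # ~[0.95, 0.05, 0.00]: one dominant pick, sharp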

Sampling without replacement

selected = []
total_log_prob = torch.tensor(0.0)
remaining_mask = torch.ones(len(option_card_ids), dtype=torch.bool)

for _ in range(n_select):
    masked_scores = scores.masked_fill(~remaining_mask, -1e9)
    dist = torch.distributions.Categorical(logits=masked_scores)
    idx = dist.sample()
    total_log_prob = total_log_prob + dist.log_prob(idx)
    selected.append(idx.item())
    remaining_mask[idx] = False

return selected, total_log_prob

Standard card-reward prompts are pick-one-of-N, but the interface allows n_select > 1 for the cases where the game asks for multiple picks. The sampling loop runs one pick at a time, removing each picked index from the remaining pool before the next draw.

masked_fill(~remaining_mask, -1e9) uses the same finite-margin trick the policy head uses for illegal-action masking: -1e9 instead of float('-inf'), so logsumexp and its gradients stay finite instead of poisoning downstream terms with NaN. The policy head’s write-up in the Neural Network deep dive has the full argument for the finite margin. Both heads share the trick.
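
That write-up has the full argument; here is a toy repro of one concrete failure mode, assuming nothing beyond stock PyTorch: a hand-rolled p * log p term (the shape of an entropy bonus) hits 0 * (-inf) = nan with the -inf variant and stays finite with -1e9:

import torch

logits = torch.tensor([1.0, 2.0, 3.0])
remaining_mask = torch.tensor([True, True, False])   # last option already picked

inf_masked = logits.masked_fill(~remaining_mask, float('-inf'))
big_masked = logits.masked_fill(~remaining_mask, -1e9)

# p * log p on the masked entry is 0 * (-inf) = nan with the -inf variant...
print((torch.softmax(inf_masked, 0) * torch.log_softmax(inf_masked, 0)).sum())  # nan
# ...but 0 * (-1e9 - lse) = -0.0 with the finite margin, so the sum stays finite.
print((torch.softmax(big_masked, 0) * torch.log_softmax(big_masked, 0)).sum())  # finite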

The returned log_prob is the sum of per-pick log-probs across the selection order — which, as it turns out, is exactly the thing the PPO importance ratio can’t re-evaluate.

Caveat · the log-prob isn’t in old_lp

The log-prob this head returns is deliberately excluded from the step log-prob pushed to the learner. PPO would need to recompute it from the stored transition at training time, and it can’t: choice_proj has no gradient path from the PPO loss, the sampling order isn’t stored, and the remaining_mask evolution can’t be reconstructed. The comment now in actor_proxy.handle_connection — “Don’t include choice log_probs — PPO can’t re-evaluate them” — is the receipt for having learned that the hard way. The Learned confession choice-head-silent-corruption is the full story.
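
In terms of the step record the actor pushes, the exclusion looks roughly like this; every field name except old_lp is an assumption about the payload shape:

transition = {
    "obs": enc,                     # whatever the learner needs to re-encode the step
    "action": action_idx,
    "old_lp": combat_log_prob,      # the learner can re-evaluate this one
    # choice log-prob deliberately absent:
    # "Don't include choice log_probs -- PPO can't re-evaluate them"
}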

Training the head

choice_proj has no PPO gradient path today, which means the head is making selections but not learning from them. That’s a training-loop question the project hasn’t fully resolved yet. The two candidate answers:

A second PPO loop, choice-event-only. Store the sampling order with each card-reward transition. At training time, run choose_cards through the learner’s copy of the policy against the stored order, produce a “new” log-prob, and ratio it against the stored “old” log-prob. Build a separate clipped-surrogate objective with its own clip range and entropy coefficient. Train on episodic return as reward (the long-horizon win rate the combat policy sees). The cost: a second set of PPO hyperparameters, plus sampling-order storage in every stored transition.
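
A sketch of what the re-evaluation step in that first candidate might look like, under the assumption that each stored card-reward transition carries the pick order; the function name and shapes are mine, not the repo’s:

import torch

def reevaluate_choice_log_prob(scores, stored_order):
    # Replay the stored pick order against freshly computed scores so a
    # clipped surrogate can ratio this "new" log-prob against the stored one.
    remaining_mask = torch.ones(scores.shape[0], dtype=torch.bool)
    new_lp = torch.tensor(0.0)
    for idx in stored_order:
        masked = scores.masked_fill(~remaining_mask, -1e9)
        dist = torch.distributions.Categorical(logits=masked)
        new_lp = new_lp + dist.log_prob(torch.tensor(idx))
        remaining_mask[idx] = False
    return new_lp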

Frozen combat, V-as-reward. Hold the combat policy and value networks fixed. Run the card-choice head forward on card-reward prompts and take the selection. Use V(state_after_pick) — the frozen value network’s estimate of the post-pick state — as the reward signal. Update choice_proj by policy-gradient directly on that reward. No importance ratio, no sampling-order storage, no second PPO loop. The cost: the combat policy is a fixed bar — improvements on the choice side can’t back-propagate into better combat play.
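
And a sketch of the second candidate’s update, assuming an optimizer whose only parameter group is choice_proj; choose_cards is the routine from this post, while frozen_value_net, enc_after_pick, and option_card_ids are placeholders for the frozen value network, the post-pick state encoding, and the offered cards:

import torch

# At the prompt: sample a pick and keep its log-prob (grad reaches choice_proj).
selected, log_prob = policy.choose_cards(option_card_ids)

# After the pick is applied, the frozen value head scores the resulting state.
with torch.no_grad():
    v_after = frozen_value_net(enc_after_pick)   # placeholder handle

# REINFORCE-style step on the choice head alone; no ratio, no clip,
# no sampling-order storage.
loss = -(log_prob * v_after)
optimizer.zero_grad()
loss.backward()
optimizer.step()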

The current plan is the second of those. The planned “stage 10.5” extension is defined around exactly that setup: freeze the combat network, train the card-choice head with V as reward, measure whether the policy with its own deck outperforms the policy with a random deck. Getting there requires the head to exist and produce selections first; that’s what this article is documenting.

Takeaways

01 Reuse the combat card embedding. The dot product against card_emb is the design point — it inherits every card-level gradient the combat head has already produced.
02 Sampling without replacement = iterate + mask. The remaining_mask drops already-picked indices from the pool; masked_fill(~mask, -1e9) keeps gradients finite.
03 No PPO gradient path ⇒ no PPO training. The choice log-prob is deliberately excluded from old_lp; a separate training loop (V-as-reward on frozen combat) picks up the gradient work.
04 One linear layer, one dot product, one mask. The simplest structural thing that could work — and the right one to try before anything more elaborate.