# Input Vector Design — One-Hot Was a Trap
How the input vector got from ~1500 dimensions to 583, and how I found the second trap by accident — not by design.
**Before: per-slot one-hots** for cards and enemies, bitmaps for relics and powers.

- 10 hand slots × 80-card one-hot = 800 dims
- No semantic link between Strike and Strike+
- Every curriculum unlock regressed the policy
- Verdict: no generalization

**After: a learned `nn.Embedding` per entity**, with unordered collections mean-pooled.

- Cards 16d, enemies 16d, relics/powers 8d
- Similar cards cluster in embedding space
- Curriculum additions calibrate in a few hundred episodes
- Verdict: faster, and it generalizes
A quick glossary before the story: one-hot is a single “check one box out of N” column per card or enemy — 80 columns for 80 cards, exactly one of them lit at a time. An embedding is a small learned vector per card instead — 16 numbers the network tunes itself so that similar cards end up nearby. Mean-pooling is just averaging those vectors together when the game hands you an unordered pile of things.
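To make those three terms concrete, here's a minimal sketch. The card ids and table sizes are illustrative, not the project's real values:

```python
import torch
import torch.nn as nn

strike, strike_plus = 0, 1            # illustrative card ids

# One-hot: one "check one box out of 80" column per card, exactly one lit.
one_hot = torch.zeros(80)
one_hot[strike] = 1.0

# Embedding: a learned 16-number vector per card, tuned during training.
card_emb = nn.Embedding(80, 16)
strike_vec = card_emb(torch.tensor(strike))        # shape (16,)

# Mean-pooling: average the vectors of an unordered pile of cards.
pile = torch.tensor([strike, strike_plus, strike])
pile_summary = card_emb(pile).mean(dim=0)          # shape (16,)
```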
Two traps sat inside this design. One I spotted, one I didn't — for three iterations.
## The vector as a bridge
Before any of the interesting learning happens, there's a plumbing problem. The game speaks in objects and references — a Card has a cost and a type and a keyword list, an Enemy has intents and powers, a combat room has a draw pile and a discard pile and a player whose HP is at some number. The network speaks in floats. Somewhere between the two, I had to pick a shape for the input tensor and commit to it. Everything else — policy loss, value targets, the curriculum, the reward function — all sits on top of the decisions I made at that interface.
That's the part I kept underestimating. I thought of the input vector as a low-stakes choice I'd iterate on once the network was training. It turns out the encoding determines how fast the network trains, how much memory each batch costs, and whether similar entities look similar to the policy at all. Pick something that makes every card a stranger to every other card, and the network spends its capacity relearning that Strike and Strike+ are related. Pick something that silently clips the top of the numeric range, and the network stops being able to tell "bad" from "catastrophic." Both of those happened. Only one of them announced itself.
## Trap 1 — one-hot had no semantic generalization
The first pass was one-hot everything, because one-hot is the encoding every RL tutorial uses for discrete categorical data. A card in hand was an 80-dim one-hot over the full card pool, repeated across ten slots — 800 dimensions right there. Enemies were one-hots over the ~15-entity enemy pool across six slots. Relics were a 30-dim bitmap. Powers were a 40-dim bitmap. With scalars on top (HP, block, energy, turn, debuffs, a handful of action counters), the input vector cleared 1500 dimensions. The first fully-connected layer was enormous and the forward pass was sluggish, but I told myself that was just the cost of doing business.
The cost I didn't clock was the semantic one. One-hot gives every card an independent column. That sounds fine in the abstract. What it means in practice is that the network has no structural reason to treat Strike and Strike+ as related, no structural reason to treat Defend and Neutralize as different kinds of things, no structural reason to group the poison cards together. It has to discover all of that from the gradient signal — the small nudges training delivers each time the network gets a prediction wrong — one card at a time.
Curriculum made this worse on a schedule. Stage 3 introduces draw cards. Stage 4 introduces poison cards. Until each stage unlocked, the one-hot column for those cards had been zero for thousands of episodes — the weights pointing into it were essentially random. On the first episode a new card appeared, the network was reading noise. I kept noticing the policy regressing briefly at every stage transition and rationalizing it as "curricula are just bumpy." They're much less bumpy when similar cards share representation.
## The fix — replace one-hot with `nn.Embedding(vocab, 16)`

Same game state at less than 40% of the width. Hand and enemies fill nearly 90% of the 583-dim budget.
**Player (18).** The scalar block — HP, block, energy, turn number, a handful of debuff stacks, a small set of action counters. No embeddings here; these are just numbers, normalized and clipped. This part of the vector didn't change between the one-hot version and the embedding version, but it's worth naming anyway, because this is where Trap 2 was hiding: in the constants I divided each of these scalars by.
**Hand (270).** Ten slots, each containing a 16-dimensional card embedding plus eleven scalar extras — cost, type, keyword flags, an upgraded marker. The embedding table is `nn.Embedding(vocab_size, 16)` — PyTorch's built-in lookup table, one row of 16 numbers per card in the vocabulary — and it's trained end-to-end with the policy. Strike and Strike+ end up in neighboring parts of embedding space without me asking for that; poison cards cluster; defense cards form their own region. I didn't design for that, and I can't fully predict it from the gradient math, but it falls out because cards with similar effects appear in similar contexts and benefit from similar value estimates.
**Enemies (252).** Six slots, each a 16-dim embedding plus seventeen scalars (the four intent flags are included in those seventeen, not stacked on top) plus a 9-dim per-enemy power slot — an 8-dim power embedding and one amount scalar. 6 × (16 + 17 + 9) = 252. This is the widest single group after hand, and for good reason — enemy state drives almost all of my short-horizon decisions. The intent flags are a small but essential block; without them, the network is guessing at what's about to hit.
**Piles (24).** Draw, discard, and exhaust, each mean-pooled to 8 dimensions. This was one of the structural calls that felt wrong at first and turned out to be right. The first instinct is to encode a pile position by position — slot 0, slot 1, slot 2 — which is what you'd do for the hand. But pile contents are unordered. Slot 0 of the draw pile has no fixed semantic meaning. Mean-pooling the card embeddings gives the network a fixed-size summary of "what's in the draw pile" without baking in a positional lie.
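One way to sketch that pile summary. Here I project the 16-dim card embeddings down to 8 with a linear layer; the projection is my assumption, and a separate 8-dim table would work just as well:

```python
import torch
import torch.nn as nn

card_emb = nn.Embedding(80, 16)   # sizes illustrative
to_pile = nn.Linear(16, 8)        # assumed projection to the 8-dim summary

def encode_pile(card_ids: list[int]) -> torch.Tensor:
    # Empty pile -> zero vector; otherwise an order-free average.
    if not card_ids:
        return torch.zeros(8)
    embs = card_emb(torch.tensor(card_ids))   # (n, 16)
    return to_pile(embs.mean(dim=0))          # (8,)
```

Because the mean is permutation-invariant, shuffling the pile leaves the summary unchanged, which is exactly the property a positional encoding would have had to learn the hard way.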
**Relics (10).** All relics mean-pooled into an 8-dim embedding, plus a counter and an active flag. Fewer dimensions than piles, because relics are a small unordered set whose effects are either passive or triggered by rules the network can infer from play. Over-investing in relic representation at this stage would have been wasted capacity.
**Powers (9).** Active player powers mean-pooled — an 8-dim embedding plus an amount scalar. The smallest block in the vector, and also the one I came back to last when I was trying to understand why a run felt off. Powers usually don't dominate combat, but they shape it, and I wanted the representation to be at least non-zero.
| Group | Dims | What |
|---|---|---|
| Player scalars | 18 | HP, block, energy, turn, debuffs, counters |
| Hand | 270 | 10 × (16d embedding + 11 extras) |
| Enemies | 252 | 6 × (16d embedding + 17 scalars incl. 4 intent flags + 9 per-enemy power slot) |
| Piles | 24 | draw/discard/exhaust mean-pooled |
| Relics | 10 | all relics mean-pooled |
| Powers | 9 | player powers mean-pooled |
| Total | 583 | |
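The budget arithmetic in the table checks out:

```python
groups = {
    "player":  18,
    "hand":    10 * (16 + 11),       # 270
    "enemies": 6 * (16 + 17 + 9),    # 252
    "piles":   3 * 8,                # 24
    "relics":  8 + 2,                # embedding + counter + active flag
    "powers":  8 + 1,                # embedding + amount scalar
}
total = sum(groups.values())         # 583: under 40% of the ~1500-dim one-hot vector
```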
## Iteration 3 — vocabulary growth
The curriculum adds cards and enemies as it advances, which means the vocabulary has to grow during a single training run. My first approach was to preallocate the embedding table to the full-game vocabulary at start and leave the unused slots as untrained random embeddings, waiting for their entities to show up. It was the lazy right answer — until I noticed that those untrained slots were leaking random activations into downstream computations. The network had to learn to ignore them, which wasn't catastrophic but was wasted capacity.
The approach that worked: the learner holds a vocabulary index that grows as new entities are seen, and the `nn.Embedding` table gets rebuilt on growth — new random rows appended, existing rows preserved. The vocabulary is serialized alongside the policy state dict, so when actor proxies pull weights they also pull the updated vocabulary and resize their copy of the table. No retraining, no lost progress on previously seen entities.
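A minimal sketch of such a grow-only vocabulary index. The class and method names are mine, not the project's:

```python
class Vocab:
    """Grow-only mapping from entity names to embedding rows."""

    def __init__(self) -> None:
        self.index: dict[str, int] = {"<pad>": 0}   # row 0 reserved for empty slots

    def id_of(self, name: str) -> int:
        # Unseen entities get the next row; the embedding table is
        # resized to len(self) before the next forward pass.
        if name not in self.index:
            self.index[name] = len(self.index)
        return self.index[name]

    def __len__(self) -> int:
        return len(self.index)
```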
```python
import torch
import torch.nn as nn

def resize_embedding(old: nn.Embedding, new_size: int) -> nn.Embedding:
    # Nothing to do if the table is already big enough.
    if new_size <= old.num_embeddings:
        return old
    # Fresh table: the new rows start random, the trained rows are copied over.
    new_emb = nn.Embedding(new_size, old.embedding_dim)
    with torch.no_grad():
        new_emb.weight[:old.num_embeddings] = old.weight
    return new_emb
```

## Trap 2 — the silent ruin of normalization bounds
The bigger lesson: the normalization bounds matter as much as the encoding scheme, and getting them wrong silently ruins training.
Here's the one I didn't see for three iterations. Every numeric scalar in the input gets normalized to roughly [0, 1] or [-1, 1], with clipping outside the range. I'd written `_norm(x, lo, hi)` early on, picked some caps based on the maxes I saw in early training runs — Poison divided by 100, Strength divided by 50, Vulnerable divided by 50 — and stopped thinking about them.
Those caps were wildly too low. Late-game poison builds run poison stacks well above 100. Elite enemies hit Strength 150+. A heavily debuffed player can exceed Vulnerable 50. Above the cap, the input clipped at 1.0. The network lost all discrimination between "moderately high" and "extremely high" — both mapped to the same normalized value. That's not a training-failure signal I could see in the loss curve. It's a silent loss of information, which is worse.
I looked at the game source for help, hoping there was a canonical max somewhere. `PowerModel.SetAmount` caps at 999999999 — effectively unbounded, engineered as a defensive limit against integer wraparound. No useful information for normalization. The right answer was empirical: set caps based on the realistic ranges I'd seen playing STS2, roughly 2× the observed 99th percentile, and the network regains the discrimination the clipping had been stealing.
Set caps from gameplay, not from code defaults:

| Scalar | Cap |
|---|---|
| Vulnerable | 300 |
| Weak | 50 |
| Poison | 200 |
| Strength | ±300 |
| Artifact | 30 |
| Intangible | 20 |
| Thorns | 50 |
| Ritual | 5 |
| Dexterity | ±50 |
| Frail | 100 |
| Enemy max HP | 1000 |
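Applying the table is then one dictionary lookup per scalar. The names and helper below are mine; signed powers like Strength and Dexterity clip to [-1, 1], the rest to [0, 1]:

```python
# Empirical caps, roughly 2x the observed 99th percentile.
CAPS = {
    "vulnerable": 300.0, "weak": 50.0,      "poison": 200.0,
    "strength":   300.0, "artifact": 30.0,  "intangible": 20.0,
    "thorns":      50.0, "ritual": 5.0,     "dexterity": 50.0,
    "frail":      100.0, "enemy_max_hp": 1000.0,
}

def norm_power(name: str, amount: float) -> float:
    # Divide by the empirical cap, then clip; signed powers span [-1, 1].
    x = amount / CAPS[name]
    return max(min(x, 1.0), -1.0)
```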
## What I'd do first next time
- Use learned embeddings for categorical features from day one. One-hot is fine for small discrete action spaces. For entity IDs with any kind of structural similarity, embeddings win.
- Design for vocabulary growth from the start. Even if the curriculum doesn't literally add cards, the card pool gets refactored. The resize path is cheap to build up front.
- Derive normalization caps from gameplay, not from code. Internal code caps are almost always defensive and effectively infinite; they tell you nothing about realistic ranges.
- Pool unordered collections. Positional encoding of unordered data is wasted capacity.
## Key takeaways
- Embeddings beat one-hot for entity-like categorical features — more compact, more expressive, and similarity falls out for free.
- Let the network learn its own representation. Hand-designed features are tempting; learned embeddings consistently outperform them on data with real structure.
- Normalization caps matter enormously. Too low clips signal; too high wastes range; pick them empirically.
- Plan for vocabulary growth. Mid-training vocabulary change is a "when," not an "if."