
Input Vector Design — One-Hot Was a Trap

What I Learned · 2026-04-19 · 8 min read

How the input vector got from ~1500 dimensions to 583, and how I found the second trap by accident — not by design.

Before · one-hot

Per-slot one-hots for cards and enemies, bitmaps for relics and powers.

  • 10 hand slots × 80-card one-hot = 800 dims
  • No semantic link between Strike and Strike+
  • Every curriculum unlock regressed the policy
~1500 Dimensions · bloated — no generalization
After · embeddings

Learned nn.Embedding per entity, mean-pooled unordered collections.

  • Cards 16d, enemies 16d, relics/powers 8d
  • Similar cards cluster in embedding space
  • Curriculum additions calibrate within a few hundred episodes
583 Dimensions · tighter — faster, generalizes

A quick glossary before the story: one-hot is a single “check one box out of N” column per card or enemy — 80 columns for 80 cards, exactly one of them lit at a time. An embedding is a small learned vector per card instead — 16 numbers the network tunes itself so that similar cards end up nearby. Mean-pooling is just averaging those vectors together when the game hands you an unordered pile of things.
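
To make that concrete, here's a minimal sketch; the card IDs and the 80-card vocabulary are illustrative, and the embedding weights shown are just their random initialization, not anything trained:

import torch
import torch.nn as nn
import torch.nn.functional as F

CARD_VOCAB = 80                       # full card pool, per the one-hot layout below
strike, strike_plus = 4, 5            # made-up IDs for two related cards

# One-hot: 80 columns per card, and any two distinct cards are orthogonal.
one_hot = F.one_hot(torch.tensor([strike, strike_plus]), num_classes=CARD_VOCAB).float()
print(one_hot.shape)                  # torch.Size([2, 80])
print(one_hot[0] @ one_hot[1])        # tensor(0.): no shared structure, ever

# Embedding: 16 learned numbers per card; related cards can end up nearby.
card_emb = nn.Embedding(CARD_VOCAB, 16)
vectors = card_emb(torch.tensor([strike, strike_plus]))
print(vectors.shape)                  # torch.Size([2, 16])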

Two traps sat inside this design. One I spotted, one I didn't — for three iterations.

The vector as a bridge

Before any of the interesting learning happens, there's a plumbing problem. The game speaks in objects and references — a Card has a cost and a type and a keyword list, an Enemy has intents and powers, a combat room has a draw pile and a discard pile and a player whose HP is at some number. The network speaks in floats. Somewhere between the two, I had to pick a shape for the input tensor and commit to it. Everything else — policy loss, value targets, the curriculum, the reward function — all sits on top of the decisions I made at that interface.

That's the part I kept underestimating. I thought of the input vector as a low-stakes choice I'd iterate on once the network was training. It turns out the encoding choice determines how fast the network trains, how much memory each batch costs, and whether similar entities look similar to the policy at all. Pick something that makes every card a stranger to every other card, and the network spends its capacity relearning that Strike and Strike+ are related. Pick something that clips the top of the numeric range silently, and the network stops being able to tell "bad" from "catastrophic." Both of those happened. Only one of them announced itself.

Trap 1 — one-hot had no semantic generalization

The first pass was one-hot everything, because one-hot is the encoding every RL tutorial uses for discrete categorical data. A card in hand was an 80-dim one-hot over the full card pool, repeated across ten slots — 800 dimensions right there. Enemies were one-hots over the ~15-entity enemy pool across six slots. Relics were a 30-dim bitmap. Powers were a 40-dim bitmap. With scalars on top (HP, block, energy, turn, debuffs, a handful of action counters), the input vector cleared 1500 dimensions. The first fully-connected layer was enormous and the forward pass was sluggish, but I told myself that was just the cost of doing business.

The cost I didn't clock was the semantic one. One-hot gives every card an independent column. That sounds fine in the abstract. What it means in practice is that the network has no structural reason to treat Strike and Strike+ as related, no structural reason to treat Defend and Neutralize as different kinds of things, no structural reason to group the poison cards together. It has to discover all of that from the gradient signal — the small nudges training delivers each time the network gets a prediction wrong — one card at a time.

Curriculum made this worse on a schedule. Stage 3 introduces draw cards. Stage 4 introduces poison cards. Until each stage unlocked, the one-hot column for those cards had been zero for thousands of episodes — the weights pointing into it were essentially random. On the first episode a new card appeared, the network was reading noise. I kept noticing the policy regressing briefly at every stage transition and rationalizing it as "curricula are just bumpy." They're much less bumpy when similar cards share representation.

Trap · Bloat

One-hot encoding treats every card as completely unrelated to every other card.

The one-hot slot for that card had been zero for thousands of episodes. When the curriculum unlocked it, the weights on that slot were effectively random — the agent had to learn the new card's effect from scratch, with no help from similar cards.

FIX → replace one-hot with nn.Embedding(vocab, 16)

What the 583 actually looks like

Same game state, less than 40% the width. Hand and enemies fill nearly 90% of the budget.

Player (18). The scalar block — HP, block, energy, turn number, a handful of debuff stacks, a small set of action counters. No embeddings here; these are just numbers, normalized and clipped. This part of the vector didn't change between the one-hot version and the embedding version, but it's worth naming anyway because this is where Trap 2 was hiding, in the constants I divided each of these scalars by.

Hand (270). Ten slots, each containing a 16-dimensional card embedding plus eleven scalar extras — cost, type, keyword flags, an upgraded marker. The embedding table is nn.Embedding(vocab_size, 16) — PyTorch's built-in lookup table, one row of 16 numbers per card in the vocabulary — and it's trained end-to-end with the policy. Strike and Strike+ end up in neighboring parts of embedding space without me asking for that; poison cards cluster; defense cards form their own region. I didn't design for that, and I can't fully predict it from the gradient math, but it falls out because cards with similar effects appear in similar contexts and benefit from similar value estimates.
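
Here's a sketch of how this group gets built, reusing the imports and the 16-dim card table from the snippet above; the layout of the 11 per-card scalars is simplified, only the widths are real:

HAND_SLOTS, CARD_EXTRAS = 10, 11

def encode_hand(card_ids: torch.Tensor, extras: torch.Tensor) -> torch.Tensor:
    # card_ids: (10,) int64 card IDs (empty slots use a reserved pad ID);
    # extras:   (10, 11) float scalars: cost, type, keyword flags, upgraded marker, ...
    per_slot = torch.cat([card_emb(card_ids), extras], dim=-1)   # (10, 27)
    return per_slot.flatten()                                    # (270,)

hand_feats = encode_hand(torch.zeros(HAND_SLOTS, dtype=torch.long),
                         torch.zeros(HAND_SLOTS, CARD_EXTRAS))
print(hand_feats.shape)   # torch.Size([270])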

Enemies (252). Six slots, each a 16-dim embedding plus seventeen scalars (the four intent flags are included in those seventeen, not stacked on top) plus a 9-dim per-enemy power slot — an 8-dim power embedding and one amount scalar. 6 × (16 + 17 + 9) = 252. This is the widest single group after hand, and for good reason — enemy state drives almost all of my short-horizon decisions. The intent flags are a small-but-essential block; without them, the network is guessing at what's about to hit.
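
The same pattern for one enemy slot, with the per-enemy power sub-slot nested inside. The vocabulary sizes and the ordering of the 17 scalars are placeholders; only the widths come from the design above:

enemy_emb = nn.Embedding(32, 16)   # enemy vocabulary size is a placeholder
power_emb = nn.Embedding(64, 8)    # power vocabulary size is a placeholder

def encode_enemy(enemy_id, scalars, power_id, power_amount):
    # scalars: a (17,) float block of HP, block, the 4 intent flags, debuffs, ...
    power_slot = torch.cat([power_emb(power_id), power_amount])     # (9,)
    return torch.cat([enemy_emb(enemy_id), scalars, power_slot])    # (42,)

slot = encode_enemy(torch.tensor(3), torch.zeros(17),
                    torch.tensor(1), torch.zeros(1))
print(slot.shape)   # torch.Size([42]); six slots make the 252-dim group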

Piles (24). Draw, discard, and exhaust, each mean-pooled to 8 dimensions. This was one of the structural calls that felt wrong at first and turned out to be right. The first instinct is to encode a pile position by position — slot 0, slot 1, slot 2 — which is what you'd do for the hand. But pile contents are unordered. Slot 0 of the draw pile has no fixed semantic meaning. Mean-pooling the card embeddings gives the network a fixed-size summary of "what's in the draw pile" without baking in a positional lie.
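
A minimal sketch of the pooling; the separate 8-dim pile table and the empty-pile handling are simplified here, the point is that any number of cards collapses to one fixed-size, order-invariant summary:

pile_emb = nn.Embedding(CARD_VOCAB, 8)   # reuses CARD_VOCAB from the first sketch

def encode_pile(card_ids: torch.Tensor) -> torch.Tensor:
    if card_ids.numel() == 0:
        return torch.zeros(8)               # empty pile: all-zero summary
    return pile_emb(card_ids).mean(dim=0)   # (n, 8) -> (8,), order-invariant

draw_summary = encode_pile(torch.tensor([4, 4, 5, 12, 12]))
print(draw_summary.shape)   # torch.Size([8]), whether the pile has 1 card or 40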

Relics (10). All relics mean-pooled into an 8-dim embedding plus a counter and an active flag. Fewer dimensions than piles, because relics are a small unordered set whose effects are either passive or triggered by rules the network can infer from play. Over-investing in relic representation at this stage would have been wasted capacity.

Powers (9). Active player powers mean-pooled — 8-dim embedding plus an amount scalar. The smallest block in the vector, and also the one I came back to last when I was trying to understand why a run felt off. Powers usually don't dominate combat, but they shape it, and I wanted the representation to be at least non-zero.

Group            Dims   What
Player scalars     18   HP, block, energy, turn, debuffs, counters
Hand              270   10 × (16d embedding + 11 extras)
Enemies           252   6 × (16d embedding + 17 scalars incl. 4 intent flags + 9 per-enemy power slot)
Piles              24   draw/discard/exhaust mean-pooled
Relics             10   all relics mean-pooled
Powers              9   player powers mean-pooled
Total             583
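
Putting the groups side by side, with zeros standing in for the encoders sketched above; the only point is that the widths add up:

state = torch.cat([
    torch.zeros(18),    # player scalars
    torch.zeros(270),   # hand: 10 × (16 + 11)
    torch.zeros(252),   # enemies: 6 × (16 + 17 + 9)
    torch.zeros(24),    # piles: 3 × 8
    torch.zeros(10),    # relics: 8 + counter + active flag
    torch.zeros(9),     # powers: 8 + amount
])
assert state.shape == (583,)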

Iteration 3 — vocabulary growth

The curriculum adds cards and enemies as it advances, which means the vocabulary has to grow during a single training run. My first approach was to preallocate the embedding table to the full-game vocabulary at the start and leave the unused slots as untrained random embeddings, waiting for their entities to show up. It was the lazy right answer — until I noticed that those untrained slots were leaking random activations into downstream computations. The network had to learn to ignore them, which wasn't catastrophic but was wasted capacity.

The approach that worked: the learner holds a vocabulary index that grows as new entities are seen, and the nn.Embedding table gets rebuilt on growth — new random rows appended, existing rows preserved. The vocabulary is serialized alongside the policy state dict, so when actor proxies pull weights they also pull the updated vocabulary and resize their copy of the table. No re-training, no lost progress on previously-seen entities.

import torch
import torch.nn as nn

def resize_embedding(old: nn.Embedding, new_size: int) -> nn.Embedding:
    """Return a larger embedding table with the trained rows preserved."""
    if new_size <= old.num_embeddings:
        return old                      # nothing to grow
    new_emb = nn.Embedding(new_size, old.embedding_dim)
    with torch.no_grad():
        # Copy the trained rows; the appended rows keep their fresh random init.
        new_emb.weight[:old.num_embeddings] = old.weight
    return new_emb
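
A minimal usage sketch with a toy dict vocabulary; in the real run the vocabulary is serialized next to the policy state dict so actor proxies can resize their copies:

vocab = {"Strike": 0, "Strike+": 1, "Defend": 2}
card_emb = nn.Embedding(len(vocab), 16)

new_card = "Neutralize"                # a curriculum stage unlocks it
if new_card not in vocab:
    vocab[new_card] = len(vocab)
    card_emb = resize_embedding(card_emb, len(vocab))

print(card_emb.num_embeddings)         # 4: old rows untouched, one new random row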

Trap 2 — the silent ruin of normalization bounds

The bigger lesson: the normalization bounds matter as much as the encoding scheme, and getting them wrong silently ruins training.

Here's the one I didn't see for three iterations. Every numeric scalar in the input gets normalized to roughly [0, 1] or [-1, 1] with clipping outside the range. I'd written _norm(x, lo, hi) early on, picked some caps based on the maxes I saw in early training runs — Poison divided by 100, Strength divided by 50, Vulnerable divided by 50 — and stopped thinking about them.
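
The helper's exact body isn't the point, but a simplified reconstruction makes the failure mode obvious:

def _norm(x: float, lo: float, hi: float) -> float:
    x = max(lo, min(x, hi))          # clip into [lo, hi]
    return (x - lo) / (hi - lo)      # scale to [0, 1]

# With the old cap of 100, a poison stack of 120 and one of 400 become
# indistinguishable at the input: the silent information loss described next.
assert _norm(120, 0, 100) == _norm(400, 0, 100) == 1.0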

Those caps were wildly too low. Late-game poison builds run poison stacks well above 100. Elite enemies hit Strength 150+. A heavily debuffed player can exceed Vulnerable 50. Above the cap, the input clipped at 1.0. The network lost all discrimination between "moderately high" and "extremely high" — both mapped to the same normalized value. That's not a training-failure signal I could see in the loss curve. It's a silent loss of information, which is worse.

I looked at the game source for help, hoping there was a canonical max somewhere. PowerModel.SetAmount caps at 999999999 — effectively unbounded, engineered as a defensive limit against integer wraparound. No useful information for normalization. The right answer was empirical: set caps based on the realistic ranges I'd seen playing STS2, roughly 2× the observed 99th percentile, and the network regains the discrimination the clipping had been stealing.

Trap · Silent

These caps were wildly too low.

The network lost all discrimination between "moderately high" and "extremely high" — they all looked the same. I looked at the game source for help. PowerModel.SetAmount caps at 999999999, which is effectively unbounded; no useful information there.

FIX → set caps from gameplay, not from code defaults

Scalar          Cap
Vulnerable      300
Weak             50
Poison          200
Strength       ±300
Artifact         30
Intangible       20
Thorns           50
Ritual            5
Dexterity       ±50
Frail           100
Enemy max HP   1000

empirical caps · ~2× observed 99th percentile

What I'd do first next time

  1. Use learned embeddings for categorical features from day one. One-hot is fine for small discrete action spaces. For entity IDs with any kind of structural similarity, embeddings win.
  2. Design for vocabulary growth from the start. Even if the curriculum doesn't literally add cards, the card pool gets refactored. The resize path is cheap to build up front.
  3. Derive normalization caps from gameplay, not from code. Internal code caps are almost always defensive and effectively infinite; they tell you nothing about realistic ranges.
  4. Pool unordered collections. Positional encoding of unordered data is wasted capacity.

Key takeaways

  1. Embeddings beat one-hot for entity-like categorical features — more compact, more expressive, and similarity falls out for free.
  2. Let the network learn its own representation. Hand-designed features are tempting; learned embeddings consistently outperform them on data with real structure.
  3. Normalization caps matter enormously. Too low clips signal; too high wastes range; pick them empirically.
  4. Plan for vocabulary growth. Mid-training vocabulary change is a "when," not an "if."
  5. Pool unordered collections. Positional encoding of unordered data is wasted capacity.