Curriculum Learning: You Can’t Skip the Basics
I spent weeks trying to train an agent on the full game before giving up and building a curriculum. The curriculum version reached higher performance on the full game in days than my direct approach reached in weeks.
The direct approach: random decks, random enemies, all mechanics.
- Full ~80-card pool + all Act 1 enemies in rotation
- Powers, debuffs, upgrades, relics all on
- Two weeks with no measurable progress
Result: shallow, near-random play
The curriculum: starter deck + one enemy, adding one concept at a time.
- Stage 0: starter deck only vs one weak enemy
- Each stage adds one new idea, not five
- Auto-advance when average HP remaining plateaus
Result: a foundation that transfers up
Two weeks of full-game training got me nowhere. This is the write-up of what I did, what broke, and the 13-stage ladder that replaced it.
Full-game training from day one
Random deck construction from the full Silent card pool — roughly
80 cards. Random enemy selection across the full Act 1 roster
defined in stages.py — 21 normal encounters (single and multi) plus 6 elites, 27 distinct fights in all. Full mechanical
complexity: powers, debuffs, upgraded cards, relic interactions.
Everything on, all the time.
The theory was that if the agent experiences the full distribution, it will learn a policy that generalizes. Curriculum learning, in that frame, is training wheels — it narrows the distribution, which has to limit final performance.
That was wrong. After thousands of episodes, the agent had learned nothing useful at all.
The state space was simply too large for the learning signal to get traction. Four specific failure modes, in the order I noticed them:
- Rare card exposure. Eighty cards in the pool, ten cards in a deck. The agent saw most cards a handful of times per hundred episodes. “Play Strike, enemy HP drops, good” is a simple lesson, but you need to see Strike hundreds of times to learn it reliably. At a handful of exposures per hundred episodes, the signal could not stabilize.
- Enemy diversity without foundation. Fossil Stalker, Punch Construct, Sewer Clam, Vine Shambler, the Byrdonis and Phrog Parasite elites — each has distinct attack patterns, intents, and optimal responses. With random matchups, the agent couldn’t specialize enough to learn any single enemy well. It certainly couldn’t learn when to switch strategies.
- Win rate plateaued at ~15%. Better than random, but only because the agent had picked up the shallowest lesson available: play attacks when enemies have low HP, play defense on high-damage turns. That is essentially the policy a human reaches five minutes into their first game. It was not going to get deeper without more signal.
- Training signal was too noisy. Every episode was a completely different combat. The variance in reward per episode was huge, so gradient updates pointed in wildly different directions. PPO’s trust region — the per-update cap on how far the policy can shift, sketched below — kept the policy changes small, but not small enough to prevent drift around a noisy mean.
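For reference, this is the textbook form of PPO’s clipped surrogate, in PyTorch. It is not my learner’s actual loss code, just the mechanism the last bullet refers to; clip_eps = 0.2 is the common default.

```python
import torch

def ppo_clipped_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Textbook PPO clipped surrogate (written as a loss to minimize)."""
    # Probability ratio between the new and old policy for each action.
    ratio = torch.exp(logp_new - logp_old)
    # Clipping the ratio to [1 - eps, 1 + eps] caps how far one update
    # can move the policy: the "trust region" described above.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic bound: take the worse of the two, negate to get a loss.
    return -torch.min(unclipped, clipped).mean()
```

The clip bounds the size of each step, but it cannot fix the direction: when every episode is a different fight, the advantage estimates themselves are noisy, so the bounded steps still wander around a noisy mean.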
After about two weeks of this with no measurable progress, I pivoted.
Stage 0 — starter deck, one enemy — got the agent above 80% in roughly 500 episodes. The curriculum wasn’t a training wheel for this project; it was the only shape of task the signal could reach through.
I broke training into 13 stages, gradually introducing complexity.
Stage 0 is almost trivial — 4 unique cards, one enemy type, single-digit HP difference per turn. The agent could figure out the basics fast: 80%+ win rate in maybe 500 episodes.
By Stage 5, the agent had a solid foundation of card-game fundamentals: block when the enemy is about to hit hard, prioritize killing weak enemies first, upgrade cards for efficiency. Each subsequent stage added a few new concepts, not an entirely new game.
Stage 0: Starter deck · single normal enemy
Fixed 12-card starter (Strike, Defend, Neutralize, Survivor) against one Act 1 single-normal enemy from ENEMY_NORMAL_SINGLE — Fossil Stalker, Punch Construct, Sewer Clam, etc. No upgrades.
Stage 1: Adds the six basic damage and block cards
Bucket: BUCKET_A (Slice, Deflect, Dagger Spray, Backstab, Dash, Skewer). Deck size widens to 15–20, upgrade chance turns on at 10%. Enemy pool unchanged.
Stage 2: Adds the ten debuff cards — the first cards that apply status effects
Bucket: BUCKET_D (Neutralize, Piercing Wail, Sucker Punch, Expose, Leg Sweep, Strangle, Assassinate, Malaise, Blade of Ink, Suppress). Debuffs and status effects enter play. Enemy pool unchanged.
Stage 3: Adds draw-and-cycle cards, and multi-enemy fights unlock
Bucket: BUCKET_E (Backflip, Acrobatics, Expertise, Hidden Daggers, Predator, Adrenaline). Enemy pool expands to _ENEMIES_NM — normal-single plus normal-multi.
Stage 4: Adds poison cards — a second damage type
Bucket: BUCKET_B (Deadly Poison, Poisoned Stab, Snakebite, Bouncing Flask, Bubble Bubble). Poison as alternative damage. Enemy pool unchanged.
Stage 5: Adds shiv cards — a third damage mechanic
Bucket: BUCKET_C (Blade Dance, Cloak and Dagger, Leading Strike, Accuracy, Up My Sleeve, Storm of Steel, Knife Trap). Enemy pool unchanged.
Stage 6: Adds the sly / discard bucket — non-power pool now complete
Bucket: BUCKET_F (Survivor, Dagger Throw, Flick Flack, Prepared, Ricochet, Untouchable, Haze, Reflex, Tactician, Calculated Gamble), completing _CARDS_AF. Enemy pool unchanged.
Stage 7: Settle stage — no new cards, just consolidation on the non-power pool
No additions. _STAGE_DEFS gives stage 7 the same card and enemy pools as stage 6, so the plateau detector has more time to confirm the non-power pool is stable before elites show up.
Stage 8: Elites show up for the first time (no new cards)
Card pool stays at _CARDS_AF; enemy pool swaps to _ENEMIES_EL, adding the six Act 1 elites (Phantasmal Gardeners, Skulking Colony, Terror Eel, Bygone Effigy, Byrdonis, Phrog Parasite). First stage where elites appear.
Stage 9: Power cards unlock — persistent per-combat effects enter the mix
Bucket: BUCKET_G1 + G2 — poison/shiv scaling powers and general scaling powers, both at once. Upgrade chance jumps 10% → 30% and deck_size_max goes from 22 to 25. Enemy pool unchanged.
Stage 10: Combo / conditional cards unlock — full card pool now live
Bucket: BUCKET_H, completing _CARDS_ALL. Enemy pool unchanged from stage 8.
Stage 11: Settle stage — let performance stabilize on the full pool + elites
No additions. Another settle/confirm stage on the full pool + elites.
Stage 12: Final stage — the curriculum tops out here
Labeled “final” by design; no new cards or enemies. Identical to stages 10–11.
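To make the ladder concrete, here is one way a table like this can be encoded. The real _STAGE_DEFS in stages.py differs (it also carries deck-size bounds, and stage 7 reuses stage 6’s entry); the tuple layout, the STARTER bucket name, and the card_pool_for helper are all hypothetical:

```python
# Hypothetical encoding of the 13-stage ladder described above.
# Each entry: (buckets unlocked at this stage, enemy pool, upgrade chance).
_STAGE_DEFS = [
    ([],                         "ENEMY_NORMAL_SINGLE", 0.0),  # 0: starter only
    (["BUCKET_A"],               "ENEMY_NORMAL_SINGLE", 0.1),  # 1: damage/block
    (["BUCKET_D"],               "ENEMY_NORMAL_SINGLE", 0.1),  # 2: debuffs
    (["BUCKET_E"],               "_ENEMIES_NM",         0.1),  # 3: draw + multi
    (["BUCKET_B"],               "_ENEMIES_NM",         0.1),  # 4: poison
    (["BUCKET_C"],               "_ENEMIES_NM",         0.1),  # 5: shivs
    (["BUCKET_F"],               "_ENEMIES_NM",         0.1),  # 6: _CARDS_AF done
    ([],                         "_ENEMIES_NM",         0.1),  # 7: settle
    ([],                         "_ENEMIES_EL",         0.1),  # 8: elites
    (["BUCKET_G1", "BUCKET_G2"], "_ENEMIES_EL",         0.3),  # 9: powers
    (["BUCKET_H"],               "_ENEMIES_EL",         0.3),  # 10: _CARDS_ALL
    ([],                         "_ENEMIES_EL",         0.3),  # 11: settle
    ([],                         "_ENEMIES_EL",         0.3),  # 12: final
]

def card_pool_for(stage: int) -> list:
    """Cumulative pool: the starter cards plus every bucket unlocked
    at or before the given stage."""
    pool = ["STARTER"]
    for buckets, _, _ in _STAGE_DEFS[: stage + 1]:
        pool.extend(buckets)
    return pool
```

The cumulative structure is the point: each stage’s pool is the previous stage’s pool plus at most one new bucket, which is what keeps the agent’s existing policy mostly correct when the stage advances.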
Automatic stage advancement
The learner advances on an HP-based plateau, not a
win-rate threshold. learner.py::check_plateau compares
two sliding windows of average HP remaining:
PLATEAU_WINDOW = 100 updates per window, advance when
the delta between windows stays within
PLATEAU_THRESHOLD = 0.5 HP for
PLATEAU_CONFIRM = 50 consecutive updates, and only if
recent average HP is already above the
PLATEAU_MIN_HP = 25 HP gate. No separate elite rule —
elite stages run the same detector.
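A minimal sketch of that logic, assuming a deque-backed history; the real learner.py::check_plateau certainly differs in structure, but the conditions are the ones described above:

```python
from collections import deque

PLATEAU_WINDOW = 100     # updates per sliding window
PLATEAU_THRESHOLD = 0.5  # max HP delta between windows that counts as flat
PLATEAU_CONFIRM = 50     # consecutive flat updates required to advance
PLATEAU_MIN_HP = 25      # recent average HP must clear this gate

class PlateauDetector:
    """Advance the stage when average HP remaining stops improving."""

    def __init__(self) -> None:
        # Holds the last two windows of per-update average HP remaining.
        self.history = deque(maxlen=2 * PLATEAU_WINDOW)
        self.flat_streak = 0

    def check_plateau(self, avg_hp_remaining: float) -> bool:
        self.history.append(avg_hp_remaining)
        if len(self.history) < 2 * PLATEAU_WINDOW:
            return False  # not enough data for two full windows yet
        hp = list(self.history)
        older_mean = sum(hp[:PLATEAU_WINDOW]) / PLATEAU_WINDOW
        recent_mean = sum(hp[PLATEAU_WINDOW:]) / PLATEAU_WINDOW
        flat = abs(recent_mean - older_mean) <= PLATEAU_THRESHOLD
        # Flat only counts once performance already clears the HP gate.
        if flat and recent_mean >= PLATEAU_MIN_HP:
            self.flat_streak += 1
        else:
            self.flat_streak = 0
        return self.flat_streak >= PLATEAU_CONFIRM
```

The learner would call check_plateau once per update with that update’s average HP remaining, and advance the stage (resetting the detector) whenever it returns True.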
Early stages blow through quickly — maybe an hour of training each. Hard stages get as much time as they need. And I don’t have to manually decide when to advance, which is the productivity piece that actually matters. Manual advancement steals momentum every time you pause the run; automated advancement means I can leave training overnight and come back to a model that’s several stages further along.
Stage config lives in stage.json, written by
StageConfig.to_dict in stages.py. Six keys:
stage, deck_size_min, deck_size_max,
upgrade_chance, card_pool, enemy_pool.
Deck construction is weighted random draws from the stage’s
card_pool into a deck sized within
[deck_size_min, deck_size_max], with each card upgraded
independently at upgrade_chance. Per _STAGE_DEFS
that chance jumps from 0% (stage 0) to 10% (stages 1–8) to 30%
(stages 9–12) — stepped, not a smooth ramp. Rollback — falling back to a prior stage if the agent regresses — is something I considered but never implemented. It hasn’t been needed.
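Putting those two paragraphs into code. This is a sketch under assumptions: the real StageConfig in stages.py may not be a dataclass, the "+" upgrade suffix and build_deck are illustrative, and the example values are stage-1-shaped per the table above:

```python
import json
import random
from dataclasses import dataclass, asdict, field

@dataclass
class StageConfig:
    # The six keys that end up in stage.json.
    stage: int
    deck_size_min: int
    deck_size_max: int
    upgrade_chance: float
    card_pool: list = field(default_factory=list)   # eligible card names
    enemy_pool: list = field(default_factory=list)  # eligible encounters

    def to_dict(self) -> dict:
        return asdict(self)

def build_deck(cfg: StageConfig, weights=None) -> list:
    """Weighted random draws from the stage's card pool, each card
    upgraded independently at the stage's upgrade chance."""
    size = random.randint(cfg.deck_size_min, cfg.deck_size_max)
    cards = random.choices(cfg.card_pool, weights=weights, k=size)
    return [c + "+" if random.random() < cfg.upgrade_chance else c
            for c in cards]

# Stage-1-shaped example (card and enemy lists abbreviated).
cfg = StageConfig(
    stage=1, deck_size_min=15, deck_size_max=20, upgrade_chance=0.10,
    card_pool=["Strike", "Defend", "Slice", "Deflect", "Dagger Spray"],
    enemy_pool=["Fossil Stalker", "Punch Construct", "Sewer Clam"],
)
print(json.dumps(cfg.to_dict()))  # what gets written to stage.json
```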
What I’d do first next time
Start with the simplest possible version of the task. Before thinking about reward shaping or network architecture, identify the minimum viable game state: fewest cards, one enemy, no special mechanics. Train on that until win rate is above 80%. Add one dimension of complexity at a time — more cards, or more enemies, but not both at once. Only move on when the current stage is mastered.
Auto-advancement is worth the engineering cost. Manual stage advancement is a productivity killer — every pause kills momentum. The ~100 lines of plateau-detection code paid for themselves inside the first weekend of overnight runs.
And don’t trust the intuition that curriculum limits final performance. I was worried that if I trained on Stage 0 too long, the agent would “overfit” to simple combat and fail to generalize. That turned out to be almost entirely wrong. The fundamentals transfer — an agent that learned to block on Stage 0 keeps blocking on Stage 12.
Technical details
- Stage config: stage.json — keys stage, deck_size_min, deck_size_max, upgrade_chance, card_pool, enemy_pool (see stages.py::StageConfig.to_dict)
- Deck construction: weighted random draws from the stage's card_pool; per-card upgrade chance 0% at stage 0, 10% at stages 1–8, 30% at stages 9–12
- Advancement criteria: HP-based plateau detector (learner.py::check_plateau) on two sliding windows of average HP remaining
- Plateau params: PLATEAU_WINDOW = 100 updates, PLATEAU_CONFIRM = 50 consecutive, PLATEAU_THRESHOLD = 0.5 HP delta, PLATEAU_MIN_HP = 25 HP gate
- Rollback: considered but not implemented
Key takeaways
- Curriculum isn’t a training wheel, it’s a necessity. For sufficiently complex tasks, you cannot bootstrap a useful policy from random initialization without one. The agent has to stand somewhere before it can climb.
- Minimum viable task first. Don’t start with the full problem. Strip it down until a nearly-random policy can win sometimes. The whole curriculum rests on Stage 0 being easy enough that random play produces some wins.
- Automate stage transitions. Manual advancement steals time and attention you could spend on bigger questions. Write the plateau detector early.
- Each new stage should add one concept, not five. The goal is that the agent’s existing policy is still ~80% correct and it just has to learn the new piece. If you’re adding five concepts, the agent is back to random on a smaller domain.
- Trust that fundamentals transfer. An agent that masters the easy version almost always generalizes up, not down. The Stage 0 policy — block hard hits, kill weak enemies first — stays correct all the way through Stage 12.