Curriculum Learning: You Can’t Skip the Basics
I spent weeks trying to train an agent on the full game before giving up and building a curriculum. The curriculum version reached higher performance on the full game in days than my direct approach reached in weeks.
The direct approach: random decks, random enemies, all mechanics.
- Full ~80-card pool + all Act 1 enemies in rotation
- Powers, debuffs, upgrades, relics all on
- Two weeks with no measurable progress
Result: shallow, near-random play
The curriculum: starter deck + one enemy, adding one concept at a time.
- Stage 0: starter deck only vs one weak enemy
- Each stage adds one new idea, not five
- Auto-advance when average HP remaining plateaus
Result: a foundation that transfers up
Two weeks of full-game training got me nowhere. This is the write-up of what I did, what broke, and the 13-stage ladder that replaced it.
Full-game training from day one
Random deck construction from the full Silent card pool — roughly
80 cards. Random enemy selection across the full Act 1 roster
defined in stages.py — 21 normal encounters (single and multi) plus 6 elites, 27 distinct fights in all. Full mechanical
complexity: powers, debuffs, upgraded cards, relic interactions.
Everything on, all the time.
The theory was that if the agent experiences the full distribution, it will learn a policy that generalizes. Curriculum learning, in that frame, is training wheels — it narrows the distribution, which has to limit final performance.
That was wrong. After thousands of episodes, the agent had learned nothing useful at all.
The state space was simply too large for the learning signal to get traction. Four specific failure modes, in the order I noticed them:
- Rare card exposure. Eighty cards in the pool, ten cards in a deck. The agent saw most cards a handful of times per hundred episodes. “Play Strike, enemy HP drops, good” is a simple lesson, but you need to see Strike hundreds of times to learn it reliably. At a handful of exposures per hundred episodes, the signal could not stabilize.
- Enemy diversity without foundation. Fossil Stalker, Punch Construct, Sewer Clam, Vine Shambler, the Byrdonis and Phrog Parasite elites — each has distinct attack patterns, intents, and optimal responses. With random matchups, the agent couldn’t specialize enough to learn any single enemy well. It certainly couldn’t learn when to switch strategies.
- Win rate plateaued at ~15%. Better than random, but only because the agent had picked up the shallowest lesson available: play attacks when enemies have low HP, play defense on high-damage turns. That is essentially the policy a human reaches five minutes into their first game. It was not going to get deeper without more signal.
- Training signal was too noisy. Every episode was a completely different combat. The variance in reward per episode was huge, so gradient updates pointed in wildly different directions. PPO’s trust region — the per-update cap on how far the policy can shift, sketched below — kept the policy changes small, but not small enough to prevent drift around a noisy mean.
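For reference, this is the textbook form of PPO’s clipped surrogate, in PyTorch. It is not my learner’s actual loss code, just the mechanism the last bullet refers to; clip_eps = 0.2 is the common default.

```python
import torch

def ppo_clipped_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """Textbook PPO clipped surrogate (written as a loss to minimize)."""
    # Probability ratio between the new and old policy for each action.
    ratio = torch.exp(logp_new - logp_old)
    # Clipping the ratio to [1 - eps, 1 + eps] caps how far one update
    # can move the policy: the "trust region" described above.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic bound: take the worse of the two, negate to get a loss.
    return -torch.min(unclipped, clipped).mean()
```

The clip bounds the size of each step, but it cannot fix the direction: when every episode is a different fight, the advantage estimates themselves are noisy, so the bounded steps still wander around a noisy mean.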
After about two weeks of this with no measurable progress, I pivoted.
Stage 0 — starter deck, one enemy — got the agent above 80% in roughly 500 episodes. The curriculum wasn’t a training wheel for this project; it was the only shape of task the signal could reach through.
I broke training into 13 stages, gradually introducing complexity.
Stage 0 is almost trivial — 4 unique cards, one enemy type, single-digit HP difference per turn. The agent could figure out the basics fast: 80%+ win rate in maybe 500 episodes.
By Stage 5, the agent had a solid foundation of card-game fundamentals: block when the enemy is about to hit hard, prioritize killing weak enemies first, upgrade cards for efficiency. Each subsequent stage added a few new concepts, not an entirely new game.
Stage 0: Starter deck · single normal enemy
Fixed 12-card starter (Strike, Defend, Neutralize, Survivor) against one Act 1 single-normal enemy from ENEMY_NORMAL_SINGLE — Fossil Stalker, Punch Construct, Sewer Clam, etc. No upgrades.
Stage 1: Adds the six basic damage and block cards
Bucket: BUCKET_A (Slice, Deflect, Dagger Spray, Backstab, Dash, Skewer). Deck size widens to 15–20, upgrade chance turns on at 10%. Enemy pool unchanged.
Stage 2: Adds the ten debuff cards — the first cards that apply status effects
Bucket: BUCKET_D (Neutralize, Piercing Wail, Sucker Punch, Expose, Leg Sweep, Strangle, Assassinate, Malaise, Blade of Ink, Suppress). Debuffs and status effects enter play. Enemy pool unchanged.
Stage 3: Adds draw-and-cycle cards, and multi-enemy fights unlock
Bucket: BUCKET_E (Backflip, Acrobatics, Expertise, Hidden Daggers, Predator, Adrenaline). Enemy pool expands to _ENEMIES_NM — normal-single plus normal-multi.
Stage 4: Adds poison cards — a second damage type
Bucket: BUCKET_B (Deadly Poison, Poisoned Stab, Snakebite, Bouncing Flask, Bubble Bubble). Poison as alternative damage. Enemy pool unchanged.
Stage 5: Adds shiv cards — a third damage mechanic
Bucket: BUCKET_C (Blade Dance, Cloak and Dagger, Leading Strike, Accuracy, Up My Sleeve, Storm of Steel, Knife Trap). Enemy pool unchanged.
Stage 6: Adds the sly / discard bucket — non-power pool now complete
Bucket: BUCKET_F (Survivor, Dagger Throw, Flick Flack, Prepared, Ricochet, Untouchable, Haze, Reflex, Tactician, Calculated Gamble), completing _CARDS_AF. Enemy pool unchanged.
Stage 7: Settle stage — no new cards, just consolidation on the non-power pool
No additions. _STAGE_DEFS gives stage 7 the same card and enemy pools as stage 6, so the plateau detector has more time to confirm the non-power pool is stable before elites show up.
Stage 8: Elites show up for the first time (no new cards)
Card pool stays at _CARDS_AF; enemy pool swaps to _ENEMIES_EL, adding the six Act 1 elites (Phantasmal Gardeners, Skulking Colony, Terror Eel, Bygone Effigy, Byrdonis, Phrog Parasite). First stage where elites appear.
Stage 9: Power cards unlock — persistent per-combat effects enter the mix
Bucket: BUCKET_G1 + G2 — poison/shiv scaling powers and general scaling powers, both at once. Upgrade chance jumps 10% → 30% and deck_size_max goes from 22 to 25. Enemy pool unchanged.
Stage 10: Combo / conditional cards unlock — full card pool now live
Bucket: BUCKET_H, completing _CARDS_ALL. Enemy pool unchanged from stage 8.
Stage 11: Settle stage — let performance stabilize on the full pool + elites
No additions. Another settle/confirm stage on the full pool + elites.
Stage 12: Final stage — the curriculum tops out here
Labeled “final” by design; no new cards or enemies. Identical to stages 10–11.
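To make the ladder concrete, here is one way a table like this can be encoded. The real _STAGE_DEFS in stages.py differs (it also carries deck-size bounds, and stage 7 reuses stage 6’s entry); the tuple layout, the STARTER bucket name, and the card_pool_for helper are all hypothetical:

```python
# Hypothetical encoding of the 13-stage ladder described above.
# Each entry: (buckets unlocked at this stage, enemy pool, upgrade chance).
_STAGE_DEFS = [
    ([],                         "ENEMY_NORMAL_SINGLE", 0.0),  # 0: starter only
    (["BUCKET_A"],               "ENEMY_NORMAL_SINGLE", 0.1),  # 1: damage/block
    (["BUCKET_D"],               "ENEMY_NORMAL_SINGLE", 0.1),  # 2: debuffs
    (["BUCKET_E"],               "_ENEMIES_NM",         0.1),  # 3: draw + multi
    (["BUCKET_B"],               "_ENEMIES_NM",         0.1),  # 4: poison
    (["BUCKET_C"],               "_ENEMIES_NM",         0.1),  # 5: shivs
    (["BUCKET_F"],               "_ENEMIES_NM",         0.1),  # 6: _CARDS_AF done
    ([],                         "_ENEMIES_NM",         0.1),  # 7: settle
    ([],                         "_ENEMIES_EL",         0.1),  # 8: elites
    (["BUCKET_G1", "BUCKET_G2"], "_ENEMIES_EL",         0.3),  # 9: powers
    (["BUCKET_H"],               "_ENEMIES_EL",         0.3),  # 10: _CARDS_ALL
    ([],                         "_ENEMIES_EL",         0.3),  # 11: settle
    ([],                         "_ENEMIES_EL",         0.3),  # 12: final
]

def card_pool_for(stage: int) -> list:
    """Cumulative pool: the starter cards plus every bucket unlocked
    at or before the given stage."""
    pool = ["STARTER"]
    for buckets, _, _ in _STAGE_DEFS[: stage + 1]:
        pool.extend(buckets)
    return pool
```

The cumulative structure is the point: each stage’s pool is the previous stage’s pool plus at most one new bucket, which is what keeps the agent’s existing policy mostly correct when the stage advances.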
Automatic stage advancement
The learner advances on an HP-based plateau, not a
win-rate threshold. learner.py::check_plateau compares
two sliding windows of average HP remaining:
PLATEAU_WINDOW = 100 updates per window, advance when
the delta between windows stays within
PLATEAU_THRESHOLD = 0.5 HP for
PLATEAU_CONFIRM = 50 consecutive updates, and only if
recent average HP is already above the
PLATEAU_MIN_HP = 25 HP gate. No separate elite rule —
elite stages run the same detector.
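A minimal sketch of that logic, assuming a deque-backed history; the real learner.py::check_plateau certainly differs in structure, but the conditions are the ones described above:

```python
from collections import deque

PLATEAU_WINDOW = 100     # updates per sliding window
PLATEAU_THRESHOLD = 0.5  # max HP delta between windows that counts as flat
PLATEAU_CONFIRM = 50     # consecutive flat updates required to advance
PLATEAU_MIN_HP = 25      # recent average HP must clear this gate

class PlateauDetector:
    """Advance the stage when average HP remaining stops improving."""

    def __init__(self) -> None:
        # Holds the last two windows of per-update average HP remaining.
        self.history = deque(maxlen=2 * PLATEAU_WINDOW)
        self.flat_streak = 0

    def check_plateau(self, avg_hp_remaining: float) -> bool:
        self.history.append(avg_hp_remaining)
        if len(self.history) < 2 * PLATEAU_WINDOW:
            return False  # not enough data for two full windows yet
        hp = list(self.history)
        older_mean = sum(hp[:PLATEAU_WINDOW]) / PLATEAU_WINDOW
        recent_mean = sum(hp[PLATEAU_WINDOW:]) / PLATEAU_WINDOW
        flat = abs(recent_mean - older_mean) <= PLATEAU_THRESHOLD
        # Flat only counts once performance already clears the HP gate.
        if flat and recent_mean >= PLATEAU_MIN_HP:
            self.flat_streak += 1
        else:
            self.flat_streak = 0
        return self.flat_streak >= PLATEAU_CONFIRM
```

The learner would call check_plateau once per update with that update’s average HP remaining, and advance the stage (resetting the detector) whenever it returns True.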
Early stages blow through quickly — maybe an hour of training each. Hard stages get as much time as they need. And I don’t have to manually decide when to advance, which is the productivity piece that actually matters. Manual advancement steals momentum every time you pause the run; automated advancement means I can leave training overnight and come back to a model that’s several stages further along.
Stage config lives in stage.json, written by
StageConfig.to_dict in stages.py. Six keys:
stage, deck_size_min, deck_size_max,
upgrade_chance, card_pool, enemy_pool.
Deck construction is weighted random draws from the stage’s
card_pool into a deck sized within
[deck_size_min, deck_size_max], with each card upgraded
independently at upgrade_chance. Per _STAGE_DEFS
that chance jumps from 0% (stage 0) to 10% (stages 1–8) to 30%
(stages 9–12) — stepped, not a smooth ramp. Rollback — falling back to a prior stage if the agent regresses — is something I considered but never implemented. It hasn’t been needed.
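Putting those two paragraphs into code. This is a sketch under assumptions: the real StageConfig in stages.py may not be a dataclass, the "+" upgrade suffix and build_deck are illustrative, and the example values are stage-1-shaped per the table above:

```python
import json
import random
from dataclasses import dataclass, asdict, field

@dataclass
class StageConfig:
    # The six keys that end up in stage.json.
    stage: int
    deck_size_min: int
    deck_size_max: int
    upgrade_chance: float
    card_pool: list = field(default_factory=list)   # eligible card names
    enemy_pool: list = field(default_factory=list)  # eligible encounters

    def to_dict(self) -> dict:
        return asdict(self)

def build_deck(cfg: StageConfig, weights=None) -> list:
    """Weighted random draws from the stage's card pool, each card
    upgraded independently at the stage's upgrade chance."""
    size = random.randint(cfg.deck_size_min, cfg.deck_size_max)
    cards = random.choices(cfg.card_pool, weights=weights, k=size)
    return [c + "+" if random.random() < cfg.upgrade_chance else c
            for c in cards]

# Stage-1-shaped example (card and enemy lists abbreviated).
cfg = StageConfig(
    stage=1, deck_size_min=15, deck_size_max=20, upgrade_chance=0.10,
    card_pool=["Strike", "Defend", "Slice", "Deflect", "Dagger Spray"],
    enemy_pool=["Fossil Stalker", "Punch Construct", "Sewer Clam"],
)
print(json.dumps(cfg.to_dict()))  # what gets written to stage.json
```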
What I’d do first next time
Start with the simplest possible version of the task. Before thinking about reward shaping or network architecture, identify the minimum viable game state: fewest cards, one enemy, no special mechanics. Train on that until win rate is above 80%. Add one dimension of complexity at a time — more cards, or more enemies, but not both at once. Only move on when the current stage is mastered.
Auto-advancement is worth the engineering cost. Manual stage advancement is a productivity killer — every pause kills momentum. The ~100 lines of plateau-detection code paid for themselves inside the first weekend of overnight runs.
And don’t trust the intuition that curriculum limits final performance. I was worried that if I trained on Stage 0 too long, the agent would “overfit” to simple combat and fail to generalize. That turned out to be almost entirely wrong. The fundamentals transfer — an agent that learned to block on Stage 0 keeps blocking on Stage 12.
Technical details
- Stage config: stage.json — keys stage, deck_size_min, deck_size_max, upgrade_chance, card_pool, enemy_pool (see stages.py::StageConfig.to_dict)
- Deck construction: weighted random draws from the stage's card_pool; per-card upgrade chance 0% at stage 0, 10% at stages 1–8, 30% at stages 9–12
- Advancement criteria: HP-based plateau detector (learner.py::check_plateau) on two sliding windows of average HP remaining
- Plateau params: PLATEAU_WINDOW = 100 updates, PLATEAU_CONFIRM = 50 consecutive, PLATEAU_THRESHOLD = 0.5 HP delta, PLATEAU_MIN_HP = 25 HP gate
- Rollback: considered but not implemented
Key takeaways
- Curriculum isn’t a training wheel, it’s a necessity. For sufficiently complex tasks, you cannot bootstrap a useful policy from random initialization without one. The agent has to stand somewhere before it can climb.
- Minimum viable task first. Don’t start with the full problem. Strip it down until a nearly-random policy can win sometimes. The whole curriculum rests on Stage 0 being easy enough that random play produces some wins.
- Automate stage transitions. Manual advancement steals time and attention you could spend on bigger questions. Write the plateau detector early.
- Each new stage should add one concept, not five. The goal is that the agent’s existing policy is still ~80% correct and it just has to learn the new piece. If you’re adding five concepts, the agent is back to random on a smaller domain.
- Trust that fundamentals transfer. An agent that masters the easy version almost always generalizes up, not down. The Stage 0 policy — block hard hits, kill weak enemies first — stays correct all the way through Stage 12.