
Slay the Spire 2 — What the Agent Is Actually Playing

Deep Dive · 2026-04-19 · 6 min read

I chose Slay the Spire 2 as the RL target because three structural properties make it a harder problem than it looks: a combinatorial per-turn action space, partial observation, and long-horizon credit assignment. This post walks through the mechanics, but only far enough to explain the modeling decisions each one forced.

[Figure: a turn in Slay the Spire 2 with two enemies, a hand of cards, HP and energy indicators, and an End Turn button. Callouts: Player HP (55/55), Energy (3 per turn), Enemies (1–6), Intent (next action), Hand (5 cards), End Turn.]
A turn in Slay the Spire 2. Everything the RL agent cares about fits in this frame.

Why I chose this game as an RL target

Three properties made STS2 a harder RL problem than a first look suggested, and they are the reason I picked it over environments with smaller action spaces or fully-observable state.

The action space is combinatorial. A 5-card hand against 2 enemies with 3 energy has hundreds of legal turn sequences, most of which are bad. The agent doesn't just pick one action per turn; it picks a sequence, and the sequence branches on which cards it has drawn that turn. The action space changes every turn because the hand changes every turn. This is not the fixed-size discrete action space that most RL tutorial environments assume, and it's what drove the 61-logit-with-mask policy head decision in the network DD.
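As a rough sanity check on "hundreds", the count can be enumerated under simplifying assumptions that are mine, not the game's: every card costs 1 energy, every card needs exactly one target, and no enemy dies mid-turn. The function name count_turn_sequences is hypothetical.

```python
def count_turn_sequences(hand_size: int, energy: int, n_enemies: int) -> int:
    """Count ordered play sequences, including the empty turn (just End Turn).

    Assumptions (illustrative only): every card costs 1 energy, every card
    targets exactly one enemy, and the set of enemies never changes mid-turn.
    """
    def extend(cards_left: int, energy_left: int) -> int:
        total = 1  # ending the turn at this point is always legal
        if cards_left == 0 or energy_left == 0:
            return total
        # Pick any remaining card and any target, then continue the turn.
        total += cards_left * n_enemies * extend(cards_left - 1, energy_left - 1)
        return total

    return extend(hand_size, energy)


print(count_turn_sequences(hand_size=5, energy=3, n_enemies=2))  # 571
```

Under those assumptions the 5-card, 3-energy, 2-enemy turn has 571 distinct sequences. Zero-cost cards, card draw, and untargeted cards all move the real number, but the order of magnitude is the point.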

The state is partially observed. Three things are hidden: the order of the draw pile, the enemy intents beyond the current turn, and the composition of future card rewards. The draw pile is the biggest of these — the cards the agent will draw next turn are already determined by the shuffle, just not yet known. A policy that assumes "I'll be able to play Strike next turn" is making an inference about a distribution, not reading a fact. This is what forced the mean-pooled pile summary in the input-vector DD — a fixed-size "what's in the pile" signal rather than positional encoding of an unordered collection.
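A minimal sketch of why mean-pooling suits an unordered pile: the summary has a fixed size and does not change when the pile is shuffled. The 4-dim embeddings and the mean_pool helper are hypothetical, chosen only to keep the example small.

```python
import random


def mean_pool(embeddings: list[list[float]], dim: int) -> list[float]:
    """Average card embeddings into one fixed-size vector; empty pile gives zeros."""
    if not embeddings:
        return [0.0] * dim
    n = len(embeddings)
    return [sum(vec[i] for vec in embeddings) / n for i in range(dim)]


# Hypothetical 4-dim embeddings for three cards in the draw pile.
draw_pile = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 4.0],
    [0.0, 0.0, 1.0, 6.0],
]
summary = mean_pool(draw_pile, dim=4)

shuffled = draw_pile[:]
random.shuffle(shuffled)
# Reordering the pile leaves the summary unchanged (up to float rounding).
same = all(abs(a - b) < 1e-9 for a, b in zip(summary, mean_pool(shuffled, dim=4)))
```

The trade-off is that the summary says what is in the pile but nothing about draw order, which matches what the agent is actually allowed to know.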

Credit assignment is long-horizon. A run is ~40 combats across three acts. A single card pickup in act 1 might be the reason the run wins or loses in act 3. The reward signal is sparse: the agent wins or loses a combat, wins or loses a run, and the decisions that caused each outcome are dozens or hundreds of steps in the past. Any RL algorithm on this problem spends a lot of compute arguing with itself about which decisions mattered, which is what drove the iteration logged in the reward-shaping lesson.
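To make the sparsity concrete, here is the textbook discounted-return calculation applied to an episode with a single terminal reward. The episode length of 200 steps and the discount of 0.99 are assumptions for illustration, not values from my training setup.

```python
def discounted_returns(rewards: list[float], gamma: float) -> list[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every step, computed backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A 200-step episode with a single +1 reward at the very end.
rewards = [0.0] * 199 + [1.0]
returns = discounted_returns(rewards, gamma=0.99)

first_step_signal = returns[0]   # 0.99 ** 199, roughly 0.135
last_step_signal = returns[-1]   # 1.0
```

Every step gets some share of the final reward, but nothing in the return itself says which of the 200 decisions earned it.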

The rest of this article is the mechanics walk-through, kept only as far as each mechanic is load-bearing for one of the decisions above.

The 30-second version

Deck-building roguelike: climb a spire one combat at a time. Each combat is a turn-based card game: draw five, spend energy, end turn, enemies resolve. Between combats, the player picks up new cards, rests, buys relics, and fights elites. Three acts; most runs end in act 2. Losing ends the run and resets to floor 1 with a fresh deck. That frames the episode: one combat is roughly one RL episode for training, while a full run is the meta-level credit-assignment horizon that's currently out of scope.

A turn — and one RL step per card play

Every turn has the same five phases. Only one of them is an agent action; the rest are environment. That asymmetry is the first modeling call I had to make: the agent’s step is play one card or end turn, not “commit a whole turn at once.”

Start turn
Energy resets to 3. Any start-of-turn powers trigger — Strength, Ritual, Barricade, the long tail of passive effects that change how the turn will play out.
Draw 5
From the shuffled draw pile. If the pile empties mid-draw, the discard pile shuffles back in.
Play cards
This is the only phase where the agent acts. Each card in hand costs some amount of energy; the agent pays, picks a target if the card needs one, and the effect resolves. The agent can play zero cards or play until it's out of energy or playable cards.
End turn
Remaining cards in hand go to the discard pile. Unused energy does not carry over to the next turn.
Enemies resolve
Each enemy performs the action they telegraphed in their intent icon. Damage, buffs, debuffs apply. If the player is at 0 HP, combat ends and so does the run.

Stepping at the card level (rather than “one step per turn”) is why the action space is 61 logits of card-slot × target-slot plus end-turn, instead of a combinatorial enumeration of whole turns. The network picks one play, the environment ticks, and the state refreshes for the next play.
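One way the 61 indices could be laid out, with a legality mask applied before sampling. The 10 × 6 + 1 factorization is my assumption for this sketch (the figure callout only says 1–6 enemies, and the text above only gives the total); treating every card as targeted is a further simplification.

```python
import math

HAND_SLOTS = 10    # assumption: 10 x 6 + 1 = 61; only the total is stated above
TARGET_SLOTS = 6   # assumption, consistent with "Enemies 1-6" in the figure
END_TURN = HAND_SLOTS * TARGET_SLOTS  # index 60
N_ACTIONS = END_TURN + 1              # 61


def encode(card_slot: int, target_slot: int) -> int:
    return card_slot * TARGET_SLOTS + target_slot


def decode(action: int):
    """Return (card_slot, target_slot), or None for end turn."""
    if action == END_TURN:
        return None
    return divmod(action, TARGET_SLOTS)


def legal_mask(hand_costs: list[int], energy: int, enemies_alive: list[bool]) -> list[bool]:
    """True where an action is legal. Simplification: every card is targeted."""
    mask = [False] * N_ACTIONS
    mask[END_TURN] = True  # ending the turn is always allowed
    for card_slot, cost in enumerate(hand_costs):
        if cost > energy:
            continue
        for target_slot, alive in enumerate(enemies_alive):
            if alive:
                mask[encode(card_slot, target_slot)] = True
    return mask


def masked_softmax(logits: list[float], mask: list[bool]) -> list[float]:
    """Softmax over legal actions only; illegal actions get probability 0."""
    peak = max(l for l, ok in zip(logits, mask) if ok)
    exps = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


mask = legal_mask(hand_costs=[1, 1, 2, 0, 3], energy=2, enemies_alive=[True, True])
probs = masked_softmax([0.0] * N_ACTIONS, mask)
```

With 2 energy, four of the five cards are affordable, so 4 cards × 2 targets plus end-turn leaves 9 legal actions out of 61. The head stays a fixed size while the mask absorbs the turn-to-turn variation in the hand.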

Cards are the verbs

Every card is exactly one of three types. The split matters because it forced a decision about how to encode card type for the network — as keyword-flag scalars on each hand slot, not as a separate head, because the interactions (Strength boosts Attacks only; Weak reduces Attacks only) are cheap to learn from a flag and expensive to learn from an ID alone.

Attack cards deal damage. Strike deals 6 damage for 1 energy. Most attacks target a single enemy; some are multi-target or "deal X to all enemies." Attacks are the only cards that benefit from Strength (the player's damage-boost power) and the only cards the enemy's Weak debuff reduces.

Skill cards are everything non-attack that resolves immediately. Defend gains 5 block for 1 energy. Card draw, targeted debuffs, self-heals, energy gain, enemy-pile manipulation: all Skills. They don't scale with Strength and they're not reduced by Weak.

Power cards are persistent effects that stay on for the rest of the combat. Inflame adds 2 permanent Strength. Metallicize grants 3 block at the end of every turn. Powers leave the hand on play — they don't go to the discard pile, because conceptually they're no longer a card. They've been played and now they're part of the player.

The three-way split matters more than it looks. Attack vs Skill is a relevance filter for a handful of per-turn modifiers. Power vs the other two is a credit-assignment problem: a Power played on turn 1 pays off for the next eight turns. An Attack played on turn 8 pays off on turn 8.

Intent — reading, not predicting

Decision

Read the next enemy action from the intent icon; don’t train the agent to predict it. Intent is the central partial-observation signal of the game, and the whole shape of the state vector changes depending on whether you feed it in or make the policy infer it.

Every enemy in Slay the Spire 2 displays, above their sprite, an icon showing what they will do on their next action. Sword icon: they will attack for some amount. Shield: they will gain block. Curved-line icon: they will buff themselves or debuff you. The icon includes the relevant number — attack for 21, block for 8.

This is an unusual mechanic for a turn-based game. The player knows, before committing the turn, what each enemy will do when their turn comes. The uncertainty in the state is about future turns and the order of the draw pile — not about the next enemy action, which is public information.

For the agent, this shows up directly in the state vector. src/AutoSlayRestartMod/Rl/StateSerializer.cs walks each enemy's intent at combat time and serializes it as four category flags (attack, block, buff, debuff) plus the scalar number attached to the action. The agent does not have to predict the next action; it reads it. The interesting decisions are about which card to play against a known threat, not guessing which threat is coming.
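A Python mirror of that layout, as a sketch only: the real serializer is C#, and the Intent fields, the encode_intent name, and the normalization cap of 50 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Intent:
    attack: bool = False
    block: bool = False
    buff: bool = False
    debuff: bool = False
    amount: int = 0  # the number shown on the icon, 0 if none


def encode_intent(intent: Intent, amount_cap: float = 50.0) -> list[float]:
    """Four category flags followed by one clipped, normalized scalar."""
    scaled = min(max(intent.amount, 0), amount_cap) / amount_cap
    return [
        float(intent.attack),
        float(intent.block),
        float(intent.buff),
        float(intent.debuff),
        scaled,
    ]


features = encode_intent(Intent(attack=True, amount=21))
# features == [1.0, 0.0, 0.0, 0.0, 0.42]
```

Flags rather than a single category ID let one intent carry more than one category at once, for example an attack that also debuffs.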

Status effects drive the scalar budget

Four status effects do most of the work, and they’re the reason each enemy slot needs several scalar stacks on top of its embedding — you can’t infer “Vulnerable 2” from an enemy ID, it has to be a number per slot. The normalization caps on those scalars are the quiet footgun documented in the input-vector lesson.

  • Poison: ticks at the start of every turn, deals damage equal to its stack count, then decrements by 1.
  • Weak: reduces the affected side's attack damage by 25%. Stacks in duration, not severity.
  • Vulnerable: increases the affected side's incoming attack damage by 50%. Same stacking as Weak.
  • Block: soaks incoming damage before it reaches HP. Resets to 0 at the start of the player's next turn.

Poison compounds: a stack of 10 over three turns is 27 total damage (10 + 9 + 8). Block resets almost always — a small set of relics and powers carry a fraction across turns, which is why the agent has to learn "almost" rather than "always."
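The poison arithmetic as a small function, assuming no poison is reapplied in between and the target survives every tick. The name poison_total is hypothetical.

```python
def poison_total(stacks: int, turns: int) -> int:
    """Total damage from a poison stack over `turns` ticks, with no reapplication.

    Each tick deals the current stack count, then the stack drops by 1.
    """
    total = 0
    for _ in range(turns):
        if stacks <= 0:
            break
        total += stacks
        stacks -= 1
    return total


print(poison_total(10, 3))   # 27  (10 + 9 + 8)
print(poison_total(10, 99))  # 55  (10 + 9 + ... + 1, then the stack runs out)
```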

These four are why a Slay the Spire 2 combat is combinatorial rather than additive. Thirty attack damage into a Vulnerable target lands as 45, not 30, and an enemy that is also Weak hits back for 25% less. The order of operations within a turn (Weak the enemy first, then attack, then apply Vulnerable for next turn) is most of the agent's job.
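A sketch of how the modifiers multiply rather than add, using the percentages listed above. Rounding down once at the end is an assumption of this sketch, and the function names are hypothetical.

```python
import math


def attack_damage(base: int, strength: int = 0,
                  attacker_weak: bool = False,
                  target_vulnerable: bool = False) -> int:
    """Damage of one attack hit before Block is subtracted.

    Percentages come from the article (Weak: -25%, Vulnerable: +50%).
    Rounding down once at the end is an assumption of this sketch.
    """
    damage = float(base + strength)
    if attacker_weak:
        damage *= 0.75
    if target_vulnerable:
        damage *= 1.5
    return max(0, math.floor(damage))


def apply_hit(damage: int, block: int, hp: int):
    """Block soaks damage first; the remainder comes off HP."""
    absorbed = min(block, damage)
    return block - absorbed, hp - (damage - absorbed)


clean = attack_damage(6)                                              # 6
with_vuln = attack_damage(6, target_vulnerable=True)                  # 9
with_strength = attack_damage(6, strength=2, target_vulnerable=True)  # 12
weakened = attack_damage(6, attacker_weak=True)                       # 4 (4.5 rounded down)
block_left, hp_left = apply_hit(with_vuln, block=5, hp=40)            # (0, 36)
```

The same Strike is worth 6, 9, or 12 depending on what was played before it, which is why sequencing within the turn matters.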

Two views of the same combat

Left · what the player sees

2D art, animated intent icons, card hover tooltips, energy orbs, the whole game UI.

  • Screenshot at the top of this article.
  • Animations make the intent obvious at a glance.
  • Tooltips explain every card on hover.

Right · what the agent sees

A 583-dim float vector — embeddings, scalars, pooled pile summaries. Numbers.

  • Card embeddings, enemy embeddings.
  • Scalar HP, energy, intent flags, status-effect stacks.
  • Mean-pooled summaries of draw and discard piles.

Same game state. The agent's view is the one the input-vector deep-dive lays out in detail.

The serialization happens in the mod that hooks into the game (StateSerializer.cs, the same file referenced in the intent section). Every combat state is turned into a JSON blob and shipped to the training process over a local socket, and the training process converts it to the tensor the network reads. The socket is plumbing. The translation is where most of the design decisions live: the game speaks in objects and references, and the network speaks in floats.
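A toy version of the training-side translation, to show the shape of the problem rather than the real schema. The JSON field names, the six padded enemy slots, and the to_vector helper are all hypothetical; the actual vector is 583-dim and includes embeddings that this sketch leaves out.

```python
import json

# Hypothetical payload shape; the real schema lives in StateSerializer.cs.
payload = json.dumps({
    "player": {"hp": 55, "max_hp": 55, "energy": 3},
    "enemies": [
        {"hp": 30, "max_hp": 40, "intent_attack": True, "intent_amount": 21},
    ],
})

MAX_ENEMIES = 6  # assumption: pad to a fixed number of enemy slots


def to_vector(blob: str) -> list[float]:
    """Flatten one combat state into a fixed-length list of floats."""
    state = json.loads(blob)
    player = state["player"]
    vec = [player["hp"] / player["max_hp"], float(player["energy"])]
    for slot in range(MAX_ENEMIES):
        if slot < len(state["enemies"]):
            enemy = state["enemies"][slot]
            vec += [
                1.0,  # slot occupied
                enemy["hp"] / enemy["max_hp"],
                float(enemy["intent_attack"]),
                float(enemy["intent_amount"]),
            ]
        else:
            vec += [0.0, 0.0, 0.0, 0.0]  # empty slot padding
    return vec


vector = to_vector(payload)
```

The vector length is fixed at 2 + 6 × 4 = 26 no matter how many enemies are alive, which is the property the network needs; the explicit "slot occupied" flag keeps an empty slot distinguishable from an enemy whose features happen to be zero.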

Net: the three structural properties at the top — combinatorial action space, partial observation, long-horizon credit assignment — are why I trained on this game instead of one with a smaller action space or fully-observable state. If the agent learns, the thing it learned is worth something. If it doesn’t, the thing that didn’t work is interesting to diagnose. The rest of the site is the diagnosing.

Slay the Spire 2 is a trademark of Mega Crit Games. This is an explainer; I am not affiliated with Mega Crit.