
Reinforcement Learning, Grounded In One Project

Deep Dive · 2026-04-19 · 8 min read

Four files, one on-policy PPO learner, no replay buffer, no target network, no broker between actor and learner. This is a tour of what those four files commit to, and why each commitment looked like the lowest-risk call for a real-time STS2 environment.

PPO, on-policy
Picked over DQN/SAC for the smaller pipeline — no replay buffer, no target network, no off-policy correction to babysit.
One trunk, three heads
Policy + value + discard share a 256-dim backbone. PPO assumes it; one net means both objectives gradient-share through the same features.
Scale-matched reward
Per-step damage and terminal outcome both top out near ±2.0 — dying slowly is never much worse than dying fast, so the policy can't learn to stall.
WEIGHT_PULL_EVERY = 16
Tied to EPISODES_PER_UPDATE. Actors pull fresh weights every learner step — bounds the actor-learner drift PPO depends on.
ENTROPY_COEF = 0.04
4× the paper default. 61-logit action space gets masked down to 5–15 legal actions per step, so raw entropy is smaller than Atari-tuned defaults assume.
Two processes, TCP between
Actor proxy runs the env loop in patched STS2; learner runs PPO. One TCP socket, no broker — the distributed-systems write-up has the rest.

Six decisions, four files. The rest of this post is where each decision lives in the repo.

How the pieces fit

Each of the four files is one of those decisions made concrete. I'm skipping the agent-environment-arrows cartoon; a Deep Dive reader doesn't need it, and pointing at code is faster than relabeling a diagram.

The training pipeline splits across two processes. An actor proxy runs the agent-environment loop inside a patched STS2 build and collects rollouts. A central learner pulls rollouts off a TCP socket and runs PPO. That split is a distributed-systems detail with its own write-up; here the two processes are treated as one RL pipeline, with each of the six words pinned to whichever file owns it.

Policy · Value

Stop 1 — model.py (policy and value)

StsPolicy in src/bot/model.py is one network with two (really three) heads. Trunk is two Linear → LayerNorm → ReLU blocks at HIDDEN_DIM = 256. policy_head is Linear(256, MAX_ACTIONS) with MAX_ACTIONS = 61 (1 end-turn + 10 hand slots × 6 target slots). value_head is Linear(256, 1). A third discard_head is Linear(256, MAX_HAND) and scores which card to pitch at mid-turn discard prompts. Embeddings for cards, enemies, relics, and powers feed the trunk — the input-vector article covers those. (A fourth head, choice_proj, was added later for card-choice events; it sits outside the PPO loss and has its own write-up — see the Card-Choice Head deep dive.)

def forward(self, enc):
    h = self._trunk_forward(enc)              # shared 256-dim trunk features
    logits = self.policy_head(h)              # 61 action logits
    value  = self.value_head(h)               # scalar state-value estimate
    discard_scores = self.discard_head(h)     # scores for mid-turn discard prompts
    logits = logits.masked_fill(~enc.action_mask, -1e9)  # illegal slots get ~zero probability after softmax
    return logits, value, discard_scores

The policy head outputs 61 logits. Most of them aren't legal on any given turn — the hand reshuffles, enemies die, some cards exhaust. The environment supplies an action mask; illegal slots get -1e9 before softmax so their probability is zero. Fixed action space, dynamic legality — that's what makes the head compatible with a game where the move set changes every step.
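
The masking trick in isolation, as a standalone sketch (logits and legal slots invented for illustration):

import torch

logits = torch.randn(61)                          # one logit per action slot
legal = torch.zeros(61, dtype=torch.bool)
legal[[0, 3, 17]] = True                          # pretend only three actions are legal this step
masked = logits.masked_fill(~legal, -1e9)
probs = torch.softmax(masked, dim=-1)
assert probs[~legal].sum() < 1e-12                # illegal slots carry effectively zero probability
action = torch.distributions.Categorical(probs=probs).sample()  # only legal slots get sampled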

Shared trunk, two (really three) heads, one forward pass. Not a clever architecture — it's the shape PPO assumes and every reference implementation uses. The reason is pragmatic: PPO's loss combines a policy term and a value term, and running them through one backbone means both objectives flow gradients through the same features. Two separate networks would double the parameter count and force the value head to learn from scratch what the policy head has already learned about "good state, bad state."
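
A skeleton of that shape, reconstructed from the dimensions above. The embedding plumbing is omitted and MAX_HAND = 10 is inferred from the ten hand slots, so treat this as a sketch rather than the repo's StsPolicy:

import torch.nn as nn

HIDDEN_DIM, MAX_ACTIONS, MAX_HAND = 256, 61, 10   # MAX_HAND inferred, not quoted from the repo

def block(i, o):
    return nn.Sequential(nn.Linear(i, o), nn.LayerNorm(o), nn.ReLU())

class StsPolicySkeleton(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.trunk = nn.Sequential(block(input_dim, HIDDEN_DIM),
                                   block(HIDDEN_DIM, HIDDEN_DIM))
        self.policy_head  = nn.Linear(HIDDEN_DIM, MAX_ACTIONS)  # 61 action logits
        self.value_head   = nn.Linear(HIDDEN_DIM, 1)            # scalar state value
        self.discard_head = nn.Linear(HIDDEN_DIM, MAX_HAND)     # discard-prompt scores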

Reward

Stop 2 — reward.py (the scalar the agent is optimizing)

EpisodeRewardTracker in src/bot/reward.py assembles the reward from three components.

DEFAULT_DAMAGE_SCALE = 2.0
TURN_PENALTY         = 0.01

class EpisodeRewardTracker:
    def step_reward(self, curr_hp):     ...   # -(damage_taken / max_hp) * 2.0
    def turn_penalty(self):             ...   # -0.01 per end_turn
    def terminal_reward(self, won, hp_remaining):
        return (1.0 + hp_remaining/max_hp) if won else -1.0

Terminal: +1.0 + hp_fraction for a win, -1.0 for a loss. Per-step damage: a fraction of max HP lost this step, scaled by 2. The scale is deliberate — a full-HP bleed-out over an episode sums to -2.0, which is the same magnitude as the terminal reward. Losing is never much worse than dying slowly, and dying slowly is never much better than dying fast. Turn penalty: a flat -0.01 every time end_turn fires.
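
Filled in, the tracker is only a few lines. This is a minimal runnable sketch consistent with those formulas; the prev_hp bookkeeping is my assumption about how damage_taken is derived, not the repo's code:

DEFAULT_DAMAGE_SCALE = 2.0
TURN_PENALTY = 0.01

class EpisodeRewardTracker:
    def __init__(self, max_hp):
        self.max_hp = max_hp
        self.prev_hp = max_hp              # assumption: episodes start at full HP

    def step_reward(self, curr_hp):
        damage_taken = max(0.0, self.prev_hp - curr_hp)
        self.prev_hp = curr_hp
        return -(damage_taken / self.max_hp) * DEFAULT_DAMAGE_SCALE

    def turn_penalty(self):
        return -TURN_PENALTY

    def terminal_reward(self, won, hp_remaining):
        return (1.0 + hp_remaining / self.max_hp) if won else -1.0

Running the docstring's second example through it (70 max HP, 20 lost, 5 turns, win) sums to about +1.09, matching.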

The module docstring pins the scale to example outcomes:

Perfect game (win, 0 damage, 3 turns): +1.97
Win, 20 HP lost (70 max), 5 turns: +1.09
Barely win (60 HP lost, 8 turns): -0.65
Clean loss: -1.00

None of that is RL theory. It's scale design, and it's downstream of what I want the policy to prefer. Want the policy to prefer fast wins? Add a turn penalty. Want the policy to prefer surviving cleanly over surviving barely? Make the terminal bonus proportional to HP remaining. The agent's entire learned behavior is shaped by these four lines of Python, and if I got any of them wrong, the policy would optimize exactly what I specified and I'd have to go figure out why. (I did, several times — there's a dedicated article on that iteration.)

Agent + Environment

Stop 3 — actor_proxy.py (the agent-environment loop)

The loop lives in handle_connection in src/bot/actor_proxy.py. Most of the complexity is in what it's wrapping — the environment is a patched Steam game and the step takes hundreds of milliseconds — but the RL shape is a textbook five-line loop:

while not done:
    enc = encode_state(state, legal, card_vocab, enemy_vocab,
                       relic_vocab, power_vocab)
    action_idx, log_prob, _, value, _ = policy.get_action(enc)
    action_dict = action_index_to_dict(action_idx)
    result = env.step(action_dict, ...)
    reward = (tracker.terminal_reward(...) if result.done
              else tracker.step_reward(result.state["player"]["hp"]))
    if action_dict["type"] == "end_turn":
        reward += tracker.turn_penalty()
    raw_transitions.append({
        "state": state, "legal_actions": legal,
        "action": action_idx, "log_prob": log_prob,
        "value": value.item(), "reward": reward, "done": result.done,
    })
    done = result.done
    ...

Each transition is a dict with seven fields. At done, the list is shipped to the learner over TCP.
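
The wire format is the distributed-systems write-up's territory; purely for orientation, here is one minimal length-prefixed scheme in the same spirit (an illustration, not the repo's actual protocol):

import pickle, socket, struct

def send_rollout(sock, transitions):
    payload = pickle.dumps(transitions)            # pickle is fine between trusted local processes
    sock.sendall(struct.pack(">I", len(payload)) + payload)  # 4-byte length prefix, then body

def recv_rollout(sock):
    header = sock.recv(4, socket.MSG_WAITALL)
    (length,) = struct.unpack(">I", header)
    return pickle.loads(sock.recv(length, socket.MSG_WAITALL))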

The thing to carry from this stop into the PPO stop: this is an on-policy algorithm. The rollouts the learner uses for PPO have to come from the policy the learner is updating. The actor has a copy of that policy, pulled from the learner periodically; the knob for how periodically is WEIGHT_PULL_EVERY = 16. Every sixteen episodes, the actor pulls fresh weights. Too infrequent and the actor's policy drifts far enough from the learner's that PPO's importance-sampling ratio becomes nonsense; too frequent and the actors spend more time on TCP than on combat.

Off-policy methods (DQN, SAC) sidestep the drift problem by estimating action-values directly — they can train on any policy's rollouts, at the cost of replay buffers, target networks, and a more complicated set of moving parts. PPO's tradeoff is to stay on-policy and bound the per-update policy shift; the simpler pipeline is what I'd call this project's main reason for picking PPO, though I didn't think that hard about it at the time.

PPO Update

Stop 4 — learner.py (PPO itself)

ppo_update in src/bot/learner.py is where everything above turns into a gradient step. The function body is about 75 lines. The hyperparameter block at the top of the file is the list of knobs. Two stanzas carry most of the weight.

First, the clipped surrogate loss (learner.py lines 310–314):

ratio    = (new_lp - mb_olp).exp()                  # pi_new / pi_old for the taken action
clipped  = ratio.clamp(1 - CLIP_EPS, 1 + CLIP_EPS)  # bound to [0.8, 1.2]
pi_loss  = -torch.min(ratio * mb_adv, clipped * mb_adv).mean()
val_loss = F.mse_loss(values.squeeze(-1), mb_ret)   # value head vs GAE return
loss     = pi_loss + VALUE_COEF * val_loss - ENTROPY_COEF * entropy  # entropy: minibatch mean

ratio is π_new(a|s) / π_old(a|s) for the action that was taken in the rollout: new-policy probability over old-policy probability, computed as exp(new_log_prob - old_log_prob) for numerical sanity. clipped bounds it to [0.8, 1.2] with CLIP_EPS = 0.2. torch.min(ratio * mb_adv, clipped * mb_adv) is the move that makes PPO "proximal." If the update helps and the policy wants to move more than 20%, the clipped term caps the gradient. If the update hurts, the unclipped term pays the full penalty. Pessimistic about gains, honest about losses. That asymmetry replaces the KL constraint that older methods (TRPO) enforce explicitly.
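
A quick standalone check of that asymmetry with invented numbers:

CLIP_EPS = 0.2

def surrogate(ratio, adv):
    clipped = max(1 - CLIP_EPS, min(1 + CLIP_EPS, ratio))
    return min(ratio * adv, clipped * adv)

print(surrogate(1.5,  1.0))   # 1.2: the gain is capped at the clip boundary
print(surrogate(1.5, -1.0))   # -1.5: the loss is paid in full, no cap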

val_loss is MSE between the value head's prediction and the GAE-computed return. The entropy bonus (-ENTROPY_COEF * entropy) rewards the policy for not collapsing onto one action too quickly.

Second, compute_gae (lines 235–242):

# assumed from context: gae = 0.0 and nv = 0.0 are initialized just above this excerpt
for t in reversed(range(n)):
    tr    = transitions[t]
    delta = tr.reward + GAMMA * nv * (0.0 if tr.done else 1.0) - tr.value   # one-step TD error
    gae   = delta + GAMMA * GAE_LAMBDA * (0.0 if tr.done else 1.0) * gae
    advs[t] = gae
    rets[t] = gae + tr.value
    nv = tr.value

Generalized Advantage Estimation is what produces the mb_adv the surrogate multiplies against. Walk the trajectory in reverse. delta_t = r_t + γ·V_{t+1} − V_t is the one-step TD error — the amount the actual outcome beat the value function's prediction. A_t = δ_t + γ·λ·A_{t+1} smooths that error backward, weighted by γ·λ per step. GAE_LAMBDA = 0.95 is the bias-variance knob — at λ=1 the advantage is the full Monte Carlo return minus the baseline (unbiased, high variance); at λ=0 it's the one-step TD error (biased by V's errors, low variance); 0.95 is what the original GAE paper lands on and what virtually every PPO reference implementation uses.

The (0.0 if tr.done else 1.0) factors zero out the value bootstrap across episode terminations. Without them the advantage would leak across episode boundaries and the agent would train as if death had value.
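
To watch the recursion produce numbers, here is a standalone toy trajectory (rewards and values invented):

GAMMA, GAE_LAMBDA = 0.99, 0.95
rewards = [0.0, 0.0, 1.0]          # reward only at the terminal step
values  = [0.5, 0.6, 0.7]          # invented value-head predictions
dones   = [False, False, True]

advs = [0.0] * 3
gae, nv = 0.0, 0.0                 # nv is the next state's value; zero past the terminal
for t in reversed(range(3)):
    mask  = 0.0 if dones[t] else 1.0
    delta = rewards[t] + GAMMA * nv * mask - values[t]
    gae   = delta + GAMMA * GAE_LAMBDA * mask * gae
    advs[t] = gae
    nv = values[t]

print(advs)  # ≈ [0.447, 0.375, 0.300]: earlier steps inherit a discounted share of the final surprise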

One update, start to finish:

  1. Rollouts arrive on TCP.
  2. resize_embeddings grows embedding tables if the vocabs expanded.
  3. compute_gae produces advantages and returns.
  4. Advantages are normalized across the full batch.
  5. Shuffle indices, minibatches of 64.
  6. Forward pass per minibatch.
  7. Clipped loss + value MSE + entropy → backprop → clip_grad_norm.
  8. optimizer.step().

Four passes (PPO_EPOCHS = 4) through the batch before the rollouts are too stale to reuse. Every sixteen episodes (EPISODES_PER_UPDATE = 16), this cycle runs once.
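
Stitched together, the cycle has roughly this shape. This is a sketch, not the repo's ppo_update: batched tensors stand in for the encoding objects, and the embedding-resize and advantage-normalization steps are assumed to have run upstream:

import torch
import torch.nn.functional as F

PPO_EPOCHS, CLIP_EPS, VALUE_COEF, ENTROPY_COEF, MAX_GRAD_NORM = 4, 0.2, 0.5, 0.04, 0.5

def ppo_update(policy, optimizer, encs, actions, old_lp, advs, rets):
    # advs/rets come from compute_gae; advs already normalized across the batch
    n = actions.shape[0]
    for _ in range(PPO_EPOCHS):                       # four passes before rollouts go stale
        for idx in torch.randperm(n).split(64):       # shuffled minibatches of 64
            logits, values, _ = policy(encs[idx])     # forward pass per minibatch
            dist = torch.distributions.Categorical(logits=logits)
            new_lp = dist.log_prob(actions[idx])
            ratio = (new_lp - old_lp[idx]).exp()
            clipped = ratio.clamp(1 - CLIP_EPS, 1 + CLIP_EPS)
            pi_loss = -torch.min(ratio * advs[idx], clipped * advs[idx]).mean()
            val_loss = F.mse_loss(values.squeeze(-1), rets[idx])
            loss = pi_loss + VALUE_COEF * val_loss - ENTROPY_COEF * dist.entropy().mean()
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(policy.parameters(), MAX_GRAD_NORM)
            optimizer.step()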

Hyperparameters

Constant        Value   Why
GAMMA           0.99    Standard RL discount; episodes are short, far-future contribution near zero.
GAE_LAMBDA      0.95    PPO-paper default. Bias-variance compromise.
CLIP_EPS        0.2     PPO-paper default. Max 20% policy move per update.
VALUE_COEF      0.5     PPO-paper default. Weight of the value-loss term.
ENTROPY_COEF    0.04    Higher than the usual 0.01; the 61-logit masked action space has low effective entropy.
LR              3e-4    Default for Adam on a small net.
MAX_GRAD_NORM   0.5     Prevents a pathological batch from kicking the policy past the clip range.
PPO_EPOCHS      4       Four passes before rollouts are too off-policy to reuse.

Seven of eight are PPO-paper defaults. The one I tuned is ENTROPY_COEF. Per-step masking means the policy distribution is usually over 5–15 legal actions rather than 61, so the raw entropy term is smaller than what the Atari-benchmark-tuned default assumes. Multiplying by roughly 4× puts the exploration pressure back where the paper had it, which is about the least interesting tune I've ever made — and the only one that mattered.
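
The ceiling arithmetic behind that claim (standalone; realized entropy during training sits below these maxima):

import math

print(math.log(61))   # ≈ 4.11 nats: the most entropy an unmasked 61-way head could carry
print(math.log(10))   # ≈ 2.30 nats: the ceiling with ~10 legal actions after masking
print(math.log(5))    # ≈ 1.61 nats: the ceiling on the tightest turns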

What I'd tell someone learning RL from this project

  1. Start with reward.py. It's the smallest file and the most load-bearing — every behavior the agent ever learns is downstream of this scalar.
  2. Read actor_proxy.py::handle_connection next. The five-line loop is the clearest way to see how RL actually runs; everything upstream feeds it, everything downstream consumes it.
  3. Then model.py. Skip encode_state on the first read. Start at StsPolicy.forward — that's the part that maps state to (logits, value).
  4. Save learner.py::ppo_update for last. With the other three files loaded, it reads as a 75-line script end to end.