Reinforcement Learning, Grounded In One Project
Four files, one on-policy PPO learner, no replay buffer, no target network, no broker between actor and learner. This is a tour of what those four files commit to, and why each commitment looked like the lowest-risk call for a real-time STS2 environment.
EPISODES_PER_UPDATE. Actors pull fresh weights every learner step — bounds the actor-learner drift PPO depends on.

Six decisions, four files. The rest of this post is where each decision lives in the repo.
How the pieces fit
Each of the four files is one of those decisions made concrete. I'm skipping the agent-environment-arrows cartoon — the DD reader doesn't need it, and pointing at code is faster than relabeling a diagram.
The training pipeline splits across two processes. An actor proxy runs the agent-environment loop inside a patched STS2 build and collects rollouts. A central learner pulls rollouts off a TCP socket and runs PPO. That split is a distributed-systems detail with its own write-up; here the two processes are treated as one RL pipeline, with each of the six words pinned to whichever file owns it.
Stop 1 — model.py (policy and value)
StsPolicy in src/bot/model.py is one network with two (really three) heads. The trunk is two Linear → LayerNorm → ReLU blocks at HIDDEN_DIM = 256. policy_head is Linear(256, MAX_ACTIONS) with MAX_ACTIONS = 61 (1 end-turn + 10 hand slots × 6 target slots). value_head is Linear(256, 1). A third head, discard_head, is Linear(256, MAX_HAND) and scores which card to pitch at mid-turn discard prompts. Embeddings for cards, enemies, relics, and powers feed the trunk — the input-vector article covers those.
(A fourth head, choice_proj, was added later for card-choice
events; it sits outside the PPO loss and has its own write-up — see the
Card-Choice Head deep
dive.)
```python
def forward(self, enc):
    h = self._trunk_forward(enc)
    logits = self.policy_head(h)
    value = self.value_head(h)
    discard_scores = self.discard_head(h)
    logits = logits.masked_fill(~enc.action_mask, -1e9)
    return logits, value, discard_scores
```
The policy head outputs 61 logits. Most of them aren't legal on any
given turn — the hand reshuffles, enemies die, some cards exhaust. The
environment supplies an action mask; illegal slots get -1e9
before softmax so their probability is zero. Fixed action space,
dynamic legality — that's what makes the head compatible with a game
where the move set changes every step.
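A dependency-free sketch of the masking trick (plain lists instead of tensors; masked_softmax is a hypothetical helper for illustration, not a repo function):

```python
import math

MASK_FILL = -1e9

def masked_softmax(logits, legal):
    # Illegal slots get -1e9 before the softmax, so exp() underflows to 0.
    masked = [l if ok else MASK_FILL for l, ok in zip(logits, legal)]
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

# 3-action toy: slot 1 is illegal this turn.
probs = masked_softmax([2.0, 1.0, 0.5], [True, False, True])
```

probs[1] comes out exactly 0.0, and the remaining probability mass renormalizes over the legal slots.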
Shared trunk, two (really three) heads, one forward pass. Not a clever architecture — it's the shape PPO assumes and every reference implementation uses. The reason is pragmatic: PPO's loss combines a policy term and a value term, and running them through one backbone means both objectives flow gradients through the same features. Two separate networks would double the parameter count and force the value head to learn from scratch what the policy head has already learned about "good state, bad state."
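The parameter-count claim is easy to sanity-check with back-of-envelope arithmetic. INPUT_DIM below is a made-up placeholder (the real encoding width lives in encode_state), and LayerNorm's small parameter count is ignored:

```python
# Hypothetical sizes: INPUT_DIM is a guess; the rest match the constants above.
INPUT_DIM, HIDDEN_DIM, MAX_ACTIONS, MAX_HAND = 512, 256, 61, 10

def linear_params(n_in, n_out):
    return n_in * n_out + n_out  # weight matrix + bias

trunk = linear_params(INPUT_DIM, HIDDEN_DIM) + linear_params(HIDDEN_DIM, HIDDEN_DIM)
heads = (linear_params(HIDDEN_DIM, MAX_ACTIONS)   # policy_head
         + linear_params(HIDDEN_DIM, 1)           # value_head
         + linear_params(HIDDEN_DIM, MAX_HAND))   # discard_head
shared = trunk + heads

# Separate policy and value networks would each carry their own trunk:
separate = (trunk + linear_params(HIDDEN_DIM, MAX_ACTIONS)
            + trunk + linear_params(HIDDEN_DIM, 1))
```

With these stand-in numbers the trunk is over 90% of the parameters, so duplicating it nearly doubles the network.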
Stop 2 — reward.py (the scalar the agent is optimizing)
EpisodeRewardTracker in src/bot/reward.py
assembles the reward from three components.
```python
DEFAULT_DAMAGE_SCALE = 2.0
TURN_PENALTY = 0.01

class EpisodeRewardTracker:
    def step_reward(self, curr_hp): ...   # -(damage_taken / max_hp) * 2.0
    def turn_penalty(self): ...           # -0.01 per end_turn
    def terminal_reward(self, won, hp_remaining):
        return (1.0 + hp_remaining / max_hp) if won else -1.0
```
Terminal: +1.0 + hp_fraction for a win, -1.0
for a loss. Per-step damage: a fraction of max HP lost this step,
scaled by 2. The scale is deliberate — a full-HP bleed-out over an
episode sums to -2.0, which is the same magnitude as the
terminal reward. Losing is never much worse than dying slowly, and
dying slowly is never much better than dying fast. Turn penalty: a
flat -0.01 every time end_turn fires.
The module docstring pins the scale to example outcomes:
```
Perfect game (win, 0 damage, 3 turns):  +1.97
Win, 20 HP lost (70 max), 5 turns:      +1.09
Barely win (60 HP lost, 8 turns):       -0.65
Clean loss:                             -1.00
```
None of that is RL theory. It's scale design, and it's downstream of what I want the policy to prefer. Want the policy to prefer fast wins? Add a turn penalty. Want the policy to prefer surviving cleanly over surviving barely? Make the terminal bonus proportional to HP remaining. The agent's entire learned behavior is shaped by these four lines of Python, and if I got any of them wrong, the policy would optimize exactly what I specified and I'd have to go figure out why. (I did, several times — there's a dedicated article on that iteration.)
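The docstring arithmetic checks out. Here's a minimal re-implementation of the tracker as described above (the real class tracks more state; the max_hp plumbing here is my assumption):

```python
DEFAULT_DAMAGE_SCALE = 2.0
TURN_PENALTY = 0.01

class EpisodeRewardTracker:
    # Sketch of the three-component reward: per-step damage, turn penalty,
    # terminal win/loss bonus scaled by remaining HP.
    def __init__(self, max_hp):
        self.max_hp = max_hp
        self.prev_hp = max_hp

    def step_reward(self, curr_hp):
        damage = max(0, self.prev_hp - curr_hp)
        self.prev_hp = curr_hp
        return -(damage / self.max_hp) * DEFAULT_DAMAGE_SCALE

    def turn_penalty(self):
        return -TURN_PENALTY

    def terminal_reward(self, won, hp_remaining):
        return (1.0 + hp_remaining / self.max_hp) if won else -1.0

# Perfect game on 70 max HP: no damage, 3 end-turns, win at full health.
t = EpisodeRewardTracker(max_hp=70)
total = 3 * t.turn_penalty() + t.terminal_reward(won=True, hp_remaining=70)
```

round(total, 2) reproduces the docstring's first example, +1.97; the 20-HP-lost, 5-turn win comes out to +1.09 the same way.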
Stop 3 — actor_proxy.py (the agent-environment loop)
The loop lives in handle_connection in
src/bot/actor_proxy.py. Most of the complexity is in
what it's wrapping — the environment is a patched Steam game and the
step takes hundreds of milliseconds — but the RL shape is a textbook
five-line loop:
```python
while not done:
    enc = encode_state(state, legal, card_vocab, enemy_vocab,
                       relic_vocab, power_vocab)
    action_idx, log_prob, _, value, _ = policy.get_action(enc)
    action_dict = action_index_to_dict(action_idx)
    result = env.step(action_dict, ...)
    reward = (tracker.terminal_reward(...) if result.done
              else tracker.step_reward(result.state["player"]["hp"]))
    if action_dict["type"] == "end_turn":
        reward += tracker.turn_penalty()
    raw_transitions.append({
        "state": state, "legal_actions": legal,
        "action": action_idx, "log_prob": log_prob,
        "value": value.item(), "reward": reward, "done": result.done,
    })
    done = result.done
    ...
```
Each transition is a dict with seven fields. At done,
the list is shipped to the learner over TCP.
The thing to carry from this stop into the PPO stop: this is an
on-policy algorithm. The rollouts the learner uses
for PPO have to come from the policy the learner is updating. The
actor has a copy of that policy, pulled from the learner
periodically; the cadence is set by
WEIGHT_PULL_EVERY = 16: every sixteen episodes, the
actor pulls fresh weights. Too infrequent and the actor's policy
drifts far enough from the learner's that PPO's importance-sampling
ratio becomes nonsense; too frequent and the actors spend more time
on TCP than on combat.
Off-policy methods (DQN, SAC) sidestep the drift problem by estimating action-values directly — they can train on any policy's rollouts, at the cost of replay buffers, target networks, and a more complicated set of moving parts. PPO's tradeoff is to stay on-policy and bound the per-update policy shift; the simpler pipeline is what I'd call this project's main reason for picking PPO, though I didn't think that hard about it at the time.
Stop 4 — learner.py (PPO itself)
ppo_update in src/bot/learner.py is where
everything above turns into a gradient step. The function body is
about 75 lines. The hyperparameter block at the top of the file is
the list of knobs. Two stanzas carry most of the weight.
First, the clipped surrogate loss (learner.py lines 310–314):
```python
ratio = (new_lp - mb_olp).exp()
clipped = ratio.clamp(1 - CLIP_EPS, 1 + CLIP_EPS)
pi_loss = -torch.min(ratio * mb_adv, clipped * mb_adv).mean()
val_loss = F.mse_loss(values.squeeze(-1), mb_ret)
loss = pi_loss + VALUE_COEF * val_loss - ENTROPY_COEF * entropy
```

ratio is π_new(a|s) / π_old(a|s) for the action that was
taken in the rollout — new-policy probability over old-policy
probability, computed as exp(new_log_prob - old_log_prob)
for numerical sanity. clipped bounds it to
[0.8, 1.2] with CLIP_EPS = 0.2.
torch.min(ratio * mb_adv, clipped * mb_adv) is the move
that makes PPO "proximal." If the update helps and the policy wants
to move more than 20% — the clipped term caps the gradient. If the
update hurts — the unclipped term pays full penalty. Pessimistic
about gains, honest about losses. That asymmetry replaces the KL
constraint older methods (TRPO) enforce explicitly.
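The asymmetry is easy to poke at with plain floats. The clipped_surrogate helper below is a toy stand-in for the torch.min line, not repo code:

```python
CLIP_EPS = 0.2

def clipped_surrogate(ratio, adv):
    # min(ratio * A, clip(ratio) * A), per the PPO objective above
    clipped = max(1 - CLIP_EPS, min(1 + CLIP_EPS, ratio))
    return min(ratio * adv, clipped * adv)

# Good action (adv > 0), policy wants a 50% jump: gain capped at 1.2x.
assert clipped_surrogate(1.5, 1.0) == 1.2
# Good action, modest 10% move: inside the clip range, passes through.
assert abs(clipped_surrogate(1.1, 1.0) - 1.1) < 1e-12
# Bad action (adv < 0), same 50% jump: the full -1.5 penalty survives.
assert clipped_surrogate(1.5, -1.0) == -1.5
```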
val_loss is MSE between the value head's prediction and
the GAE-computed return. The entropy bonus
(-ENTROPY_COEF * entropy) rewards the policy for not
collapsing onto one action too quickly.
Second, compute_gae (lines 235–242):
```python
gae, nv = 0.0, 0.0  # initialization shown for completeness
for t in reversed(range(n)):
    tr = transitions[t]
    delta = tr.reward + GAMMA * nv * (0.0 if tr.done else 1.0) - tr.value
    gae = delta + GAMMA * GAE_LAMBDA * (0.0 if tr.done else 1.0) * gae
    advs[t] = gae
    rets[t] = gae + tr.value
    nv = tr.value
```
Generalized Advantage Estimation is what produces the
mb_adv the surrogate multiplies against. Walk the
trajectory in reverse. delta_t = r_t + γ·V_{t+1} − V_t
is the one-step TD error — the amount the actual outcome beat the
value function's prediction. A_t = δ_t + γ·λ·A_{t+1}
smooths that error backward, weighted by γ·λ per step.
GAE_LAMBDA = 0.95 is the bias-variance knob — at λ=1 the
advantage is the full Monte Carlo return minus the baseline
(unbiased, high variance); at λ=0 it's the one-step TD error (biased
by V's errors, low variance); 0.95 is what the original GAE paper
lands on and what virtually every PPO reference implementation uses.
The (0.0 if tr.done else 1.0) factors zero out the value
bootstrap across episode terminations. Without them the advantage
would leak across episode boundaries and the agent would train as if
death had value.
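Here's a self-contained version of the recursion (dict transitions stand in for the repo's transition objects), small enough to check the done-mask by hand:

```python
GAMMA, GAE_LAMBDA = 0.99, 0.95

def compute_gae(transitions):
    n = len(transitions)
    advs, rets = [0.0] * n, [0.0] * n
    gae, nv = 0.0, 0.0  # running advantage; value of the step after t
    for t in reversed(range(n)):
        tr = transitions[t]
        mask = 0.0 if tr["done"] else 1.0      # zero the bootstrap at terminals
        delta = tr["reward"] + GAMMA * nv * mask - tr["value"]
        gae = delta + GAMMA * GAE_LAMBDA * mask * gae
        advs[t] = gae
        rets[t] = gae + tr["value"]
        nv = tr["value"]
    return advs, rets

episode = [
    {"reward": 0.0, "value": 0.5, "done": False},
    {"reward": 1.0, "value": 0.2, "done": True},   # terminal step
]
advs, rets = compute_gae(episode)
# Terminal advantage is plain r - V = 0.8: the mask kept nv out of it.
```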
The cycle, end to end:
1. Rollouts arrive on TCP.
2. resize_embeddings grows tables if vocabs expanded.
3. compute_gae produces advantages and returns.
4. Advantages normalized across the full batch.
5. Shuffle indices, minibatches of 64.
6. Forward pass per minibatch.
7. Clipped loss + value MSE + entropy → backprop → clip_grad_norm.
8. optimizer.step().
Four passes (PPO_EPOCHS = 4) through the batch before the
rollouts are too stale to reuse. Every sixteen episodes
(EPISODES_PER_UPDATE = 16), this cycle runs once.
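Steps 5 and 6 plus the four-pass reuse have a compact shape. The minibatch_schedule helper below is a sketch of that schedule (hypothetical name; the repo's shuffling code may differ):

```python
import random

PPO_EPOCHS, MINIBATCH = 4, 64

def minibatch_schedule(n, seed=0):
    # Shuffle all transition indices, slice into minibatches of 64,
    # and repeat the whole pass PPO_EPOCHS times before discarding rollouts.
    rng = random.Random(seed)
    for _ in range(PPO_EPOCHS):
        idx = list(range(n))
        rng.shuffle(idx)
        for start in range(0, n, MINIBATCH):
            yield idx[start:start + MINIBATCH]

batches = list(minibatch_schedule(200))  # e.g. 200 transitions in the batch
# 4 epochs x ceil(200/64) = 16 minibatches; each epoch covers every index once.
```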
Hyperparameters
| Constant | Value | Why |
|---|---|---|
| GAMMA | 0.99 | Standard RL discount; episodes are short, far-future contribution near zero. |
| GAE_LAMBDA | 0.95 | PPO-paper default. Bias-variance compromise. |
| CLIP_EPS | 0.2 | PPO-paper default. Max 20% policy move per update. |
| VALUE_COEF | 0.5 | PPO-paper default. Weight of the value-loss term. |
| ENTROPY_COEF | 0.04 | Higher than the usual 0.01 — 61-logit masked action space has low effective entropy. |
| LR | 3e-4 | Default for Adam on a small net. |
| MAX_GRAD_NORM | 0.5 | Prevents a pathological batch from kicking the policy past the clip range. |
| PPO_EPOCHS | 4 | Four passes before rollouts are too off-policy to reuse. |
Seven of eight are PPO-paper defaults. The one I tuned is
ENTROPY_COEF. Per-step masking means the policy
distribution is usually over 5–15 legal actions rather than 61, so the
raw entropy term is smaller than what the Atari-benchmark-tuned
default assumes. Multiplying by roughly 4× puts the exploration
pressure back where the paper had it, which is about the least
interesting tune I've ever made — and the only one that mattered.
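The effective-entropy point is checkable with stdlib math: a uniform distribution over k actions has entropy ln(k), so masking down to ~10 legal actions shrinks the maximum achievable entropy well below the 61-action ceiling.

```python
import math

def entropy(probs):
    # Shannon entropy in nats, skipping zero-probability (masked) slots
    return -sum(p * math.log(p) for p in probs if p > 0)

full = entropy([1 / 61] * 61)   # all 61 slots legal: ln(61), about 4.11 nats
ten = entropy([1 / 10] * 10)    # typical masked turn: ln(10), about 2.30 nats
```

A smaller entropy term means the same ENTROPY_COEF applies less exploration pressure; the 4x bump compensates for that gap.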
What I'd tell someone learning RL from this project
- Start with `reward.py`. It's the smallest file and the most load-bearing — every behavior the agent ever learns is downstream of this scalar.
- Read `actor_proxy.py::handle_connection` next. The five-line loop is the clearest way to see how RL actually runs; everything upstream feeds it, everything downstream consumes it.
- Then `model.py`. Skip `encode_state` on the first read. Start at `StsPolicy.forward` — that's the part that maps state to `(logits, value)`.
- Save `learner.py::ppo_update` for last. With the other three files loaded, it reads as a 75-line script end to end.