
Distributed compute — eleven laptops, one TCP port, and a scheduled task

Deep Dive · 2026-04-19 · 8 min read

Eleven Windows laptops run an actor proxy each. One PPO learner runs on my desk. They talk over a single TCP port on the home LAN. This post is what the compute actually looks like today — numbers, configs, and the wire.

The fleet today · 11 machines · ~175 cpm
stsBot0 10.0.0.50 25 cpm
stsBot1 10.0.0.54 14 cpm
stsBot2 10.0.0.73 14 cpm
stsBot3 10.0.0.77 — cpm · stale 21 min
stsBot4 10.0.0.70 27 cpm
stsBot5 10.0.0.69 — cpm · fresh boot
stsBot6 10.0.0.64 20 cpm
stsBot7 10.0.0.84 37 cpm · top
stsBot8 10.0.0.60 — cpm · offline ~11h
stsBot9 10.0.0.75 13 cpm
stsBot10 10.0.0.82 25 cpm

Eleven laptops. Ten connected. Eight producing. One PPO learner, one TCP port, one restart script. That's the whole distributed surface.

The sibling article over on /learned/distributed-deployment covers how I got the fleet to stay up at all — the schtasks trick, the kill-by-window-title pattern, the three per-worker incidents I kept running into. This post takes the fleet as given and walks through what it's actually doing right now: the configs on each side of the wire, the cadence that keeps them in sync, and the numbers coming out of it.

The fleet today

Each tile above is one Windows laptop running actor_proxy.py (src/bot/actor_proxy.py) against the learner on my desk. The IPs are static name=ip entries from workers.txt — no dynamic discovery, no service registry:

stsBot0=10.0.0.50
stsBot1=10.0.0.54
stsBot2=10.0.0.73
...
stsBot10=10.0.0.82

Eleven lines. The file's the truth: restart_workers.bat reads it when I want to roll out a change, and each worker's own copy (derived from workers.txt.example) tells it where the learner lives.
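The format is simple enough that a reader fits in a few lines. This is a hypothetical Python equivalent of the parsing restart_workers.bat does in batch, not code from the repo:

```python
def parse_workers(path="workers.txt"):
    """Parse name=ip lines into a dict, skipping blanks and comments."""
    workers = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            name, _, ip = line.partition("=")
            workers[name.strip()] = ip.strip()
    return workers
```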

The CPM numbers on the tiles come from data/workers.json, a file the learner rewrites every five seconds. Each entry is one worker's current state: ip, connected, episodes_total, combats_per_min, last_seen_s. A snapshot from a moment earlier today:

[
  {"ip": "10.0.0.50", "connected": true,  "episodes_total": 33622, "combats_per_min": 25, "last_seen_s": 1.0},
  {"ip": "10.0.0.54", "connected": true,  "episodes_total": 15793, "combats_per_min": 14, "last_seen_s": 3.0},
  {"ip": "10.0.0.77", "connected": true,  "episodes_total": 34687, "combats_per_min": 0,  "last_seen_s": 1263.0},
  {"ip": "10.0.0.70", "connected": true,  "episodes_total": 41194, "combats_per_min": 27, "last_seen_s": 2.0},
  {"ip": "10.0.0.84", "connected": true,  "episodes_total": 39600, "combats_per_min": 37, "last_seen_s": 1.0},
  {"ip": "10.0.0.60", "connected": false, "episodes_total":  7217, "combats_per_min": 0,  "last_seen_s": 40049.0}
]

Three workers sit at zero CPM in that snapshot (the excerpt above shows two of them; stsBot5's entry is among those elided). stsBot3 is connected but hasn't pushed an episode in 1263 seconds — about twenty-one minutes — so its 60-second rolling CPM has decayed to zero. stsBot5 booted fresh from restart_workers.bat not long before and only has 420 episodes on the counter. stsBot8 has been disconnected for 40,049 seconds, roughly eleven hours — I shut its lid yesterday and forgot. The other eight are producing.

Fleet CPM at that moment: 175. Fleet episodes-to-date at that snapshot: about 259,000 (the home-page tile is live and will read higher).
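Rolling those per-worker entries up into the fleet numbers is a few sums over data/workers.json. The field names below match the snapshot; the aggregation helper itself is my sketch, not learner code:

```python
import json

def fleet_summary(path="data/workers.json"):
    """Aggregate the per-worker entries the learner rewrites every 5 s."""
    with open(path) as f:
        workers = json.load(f)
    return {
        "fleet_cpm": sum(w["combats_per_min"] for w in workers),
        "episodes_total": sum(w["episodes_total"] for w in workers),
        "connected": sum(1 for w in workers if w["connected"]),
        "producing": sum(1 for w in workers if w["combats_per_min"] > 0),
    }
```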

The configs that make this go

Three files set the cadence of the whole system. Two live on the worker side, one on the learner.

worker_config.toml
[learner]
host = "127.0.0.1"
port = 9998

[actor]
game_port = 9999
headless  = true
instances = 3

actor_proxy.py · constants
WEIGHT_PULL_EVERY = 16
_learner_host     = "127.0.0.1"
_learner_port     = 9998

learner.py · hyperparameters
GAMMA               = 0.99
GAE_LAMBDA          = 0.95
CLIP_EPS            = 0.2
VALUE_COEF          = 0.5
ENTROPY_COEF        = 0.04
LR                  = 3e-4
MAX_GRAD_NORM       = 0.5

EPISODES_PER_UPDATE = 16
PPO_EPOCHS          = 4
MINIBATCH_SIZE      = 64

learner.host is the one value that changes per worker, because on my desk the learner is loopback but on a worker it's the LAN address of my desk machine. actor.instances is how many headless game clients the proxy spawns; actor.game_port is the port the proxy listens on for those clients. Nothing ever reaches the rest of the network from that port — it's worker-local.

WEIGHT_PULL_EVERY is the cadence that matters most. Every sixteen completed episodes, the proxy blocks on a pull_weights() call and loads the updated policy state dict. Between pulls, it runs inference with whatever weights it has locally.

The interesting coincidence is EPISODES_PER_UPDATE and WEIGHT_PULL_EVERY both being sixteen. The learner waits for sixteen new episodes to show up on its rollout queue, runs four PPO epochs over them in minibatches of sixty-four, then the next actor that asks for weights gets the freshly updated policy. Across eleven workers producing at mixed rates, some actors will pull weights several updates behind and some will be right on the leading edge. That's fine — PPO's clipped objective is what keeps the stale-policy rollouts usable.
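Why the clip tolerates staleness is visible in the per-transition objective. A minimal scalar version (the real learner works on tensors, but the math is the same):

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO's pessimistic per-transition objective. ratio is
    pi_new(a|s) / pi_old(a|s); for rollouts collected with stale
    weights the ratio drifts away from 1, and the clip caps how
    much credit (or blame) such a transition can contribute."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

At CLIP_EPS = 0.2, a transition whose ratio has drifted to 1.5 contributes at most 1.2 × advantage, so an actor a few updates behind can't yank the policy hard in any direction.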

The wire

One TCP port, 9998. That's the whole distributed surface.

Data flow · actor ↔ learner
   11x actor_proxy.py     (workers, :9999 to their local games)
            │  episodes (one per completed combat)
            ▼
       TCP :9998          ◄── [the only distributed port]
            │  weights (pulled every 16 episodes)
            ▼
         learner.py       (host PC, single GPU)

No message broker, no Redis, no gRPC, no service mesh. Each actor opens one long-lived TCP socket to the learner on startup (_connect_learner(), actor_proxy.py:85), serializes messages with the in-house protocol.py framing, and pushes one episode per completed combat.
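protocol.py's actual framing isn't reproduced here, but a typical length-prefixed scheme it could resemble is a 4-byte size header followed by a pickled body:

```python
import pickle
import struct

def _recv_exact(sock, n):
    """Read exactly n bytes, or None if the peer closed the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            return None
        buf += chunk
    return buf

def send_msg(sock, obj):
    """Frame a message: 4-byte big-endian length prefix plus pickled body."""
    body = pickle.dumps(obj)
    sock.sendall(struct.pack(">I", len(body)) + body)

def recv_msg(sock):
    """Read one framed message back off the socket."""
    header = _recv_exact(sock, 4)
    if header is None:
        return None
    (length,) = struct.unpack(">I", header)
    body = _recv_exact(sock, length)
    return None if body is None else pickle.loads(body)
```

Pickle over a trusted home LAN is fine; on anything shared you'd want a serializer that can't execute arbitrary code on load.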

Episodes up

Inside handle_connection() (actor_proxy.py:199), after each combat ends, the proxy builds a dict of {transitions, summary, completed_at} and calls push_rollout(ep_dict) (actor_proxy.py:113). That function grabs _learner_lock — the mutex that serializes all actor↔learner traffic on the shared socket — sends the message, and blocks on the learner's ack before releasing the lock. The ack carries the learner's current update count, which the proxy uses to decide whether it's time to pull weights.

The reason the mutex matters: a single proxy with multiple game threads pushes rollouts concurrently. Without the lock, two send_msg calls would interleave on the wire and the framing would fall apart. Lock, send, recv, unlock. One request at a time.
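The lock/send/recv/unlock shape can be sketched in a few lines. send_msg and recv_msg stand in for protocol.py's framing helpers, and the message shape and ack field name are assumptions, not the real wire format:

```python
import threading

_learner_lock = threading.Lock()  # serializes all traffic on the shared socket

def push_rollout(sock, episode, send_msg, recv_msg):
    """One request/response exchange on the shared learner socket.
    Holding the lock across both halves keeps frames from interleaving."""
    with _learner_lock:
        send_msg(sock, {"type": "rollout", "episode": episode})
        ack = recv_msg(sock)
    return ack["update_count"]  # used to decide when to pull weights
```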

Weights down

Every sixteenth completed episode, pull_weights() (actor_proxy.py:135) asks the learner for the current state dict. The learner replies with a payload that includes four vocab token lists (card, enemy, relic, power — these grow over time as new entities show up in training), the pickled state dict bytes, and the current stage_config. The proxy resizes its local embeddings to match the learner's vocabs (policy.resize_embeddings, line 164), loads the state dict, and writes out a fresh stage.json for the C# mod to pick up.
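The embedding resize is just row alignment. A pure-Python sketch of the logic only — the real policy.resize_embeddings works on torch tensors, and the append-only vocab growth is my reading of how the token lists behave:

```python
import random

def resize_embedding(rows, old_vocab, new_vocab, dim):
    """Grow an embedding table to a larger vocab, keeping learned rows.
    Assumes the vocab only ever grows by appending new tokens, so the
    old vocab is a prefix of the new one."""
    assert new_vocab[: len(old_vocab)] == old_vocab, "vocab must grow by appending"
    fresh = [[random.gauss(0.0, 0.02) for _ in range(dim)]
             for _ in range(len(new_vocab) - len(old_vocab))]
    return rows + fresh
```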

The weight transfer is the biggest message on the wire — a full state dict is in the megabytes — but it only happens every sixteen episodes per worker. Eleven workers averaging ~16 CPM each means a weight pull every minute or so per worker, staggered across machines. Aggregate learner-side weight-serve load is small.
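The arithmetic behind "small": ~175 fleet CPM divided by a pull every 16 episodes is about 11 weight pulls per minute across the fleet, which stays under 1 MB/s even at a few megabytes per state dict. A back-of-envelope helper (state-dict size is a placeholder; the post only says "in the megabytes"):

```python
def weight_serve_mb_per_min(fleet_cpm=175, pull_every=16, state_dict_mb=5.0):
    """Learner-side weight-serve bandwidth, fleet-wide, in MB/min."""
    pulls_per_min = fleet_cpm / pull_every  # weight pulls per minute, all workers
    return pulls_per_min * state_dict_mb
```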

What the fleet produces

Eleven rows, sorted top to bottom by current CPM:

stsBot7 37 cpm
stsBot4 27 cpm
stsBot0 25 cpm
stsBot10 25 cpm
stsBot6 20 cpm
stsBot1 14 cpm
stsBot2 14 cpm
stsBot9 13 cpm
stsBot3 0 cpm · stale 21m
stsBot5 0 cpm · fresh boot
stsBot8 0 cpm · offline ~11h

fleet ≈ 175 cpm · snapshot from data/workers.json

A few things to read off this. stsBot7 is the fastest laptop I've got and routinely runs at the top of the leaderboard. It's a desktop-replacement with better thermals than the rest — the CPM difference is almost entirely thermal. stsBot4, stsBot0, and stsBot10 are second-tier machines that hover in the mid-20s. stsBot1, stsBot2, and stsBot9 are older laptops that throttle harder under sustained load and sit in the low teens.

The three zeros each come from a different state. stsBot3 was connected at snapshot time but hadn't pushed an episode in over twenty minutes. Almost certainly the game UI got stuck on a deck-selection screen the mod couldn't auto-advance; the proxy is waiting on a combat that isn't coming. stsBot5 had just restarted and only logged 420 episodes before the snapshot, which is fewer than one PPO update's worth. stsBot8 is a laptop whose lid I closed the night before and didn't reopen; connected=false flipped once the learner's worker-stats deadline passed.

combats_per_min itself is a rolling 60-second window on timestamps, stored in a deque(maxlen=200) per worker (WorkerStats._timestamps, learner.py:155). The learner writes the worker table out to data/workers.json every five seconds (_write_workers_loop, learner.py:181), so the on-disk CPM is never more than five seconds stale.
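The rolling window can be sketched directly from that description. The deque bound and 60-second window match the post; the class shape and method names here are assumptions:

```python
import time
from collections import deque

class WorkerStats:
    """Per-worker rolling combats-per-minute over episode timestamps."""

    def __init__(self):
        self._timestamps = deque(maxlen=200)  # oldest entries fall off the left

    def record_episode(self, now=None):
        self._timestamps.append(time.monotonic() if now is None else now)

    def combats_per_min(self, now=None):
        """Count episodes whose timestamp falls inside the last 60 s."""
        now = time.monotonic() if now is None else now
        return sum(1 for t in self._timestamps if now - t <= 60.0)
```

A worker that stops producing (like stsBot3 above) decays to zero CPM within a minute without any explicit reset.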

Rolling out changes

One script handles fleet-wide code updates: scripts/main/restart_workers.bat. It reads workers.txt, iterates, and for each worker runs:

  1. taskkill by window title to stop the running proxy and games
  2. git fetch origin && git reset --hard origin/main to pull fresh code
  3. schtasks /create /tn STS2Worker ... then /run then /delete to re-launch run_worker.bat in a detached session

The schtasks trick is the weird one and is covered in the retrospective. The relevant line from restart_workers.bat:82:

ssh %SSH_USER%@!WIP_%%i! "schtasks /create /tn STS2Worker /tr \"cmd.exe /c cd /d %REMOTE_REPO%\scripts\worker ^& run_worker.bat\" /sc once /st 00:00 /f >nul 2>&1 && schtasks /run /tn STS2Worker >nul 2>&1 && schtasks /delete /tn STS2Worker /f >nul 2>&1"

Sequential, one worker at a time. Eleven workers take about a minute end-to-end. The loop is not parallelized — the errors are easier to read one at a time, and for a home LAN the serial cost is cheap enough not to matter.

On each worker, run_worker.bat (scripts/worker/run_worker.bat) reads worker_config.toml, installs Python dependencies, launches the actor proxy in one cmd window (titled STS2 Actor Proxy), and then launches actor.instances headless game clients spaced fifteen seconds apart. The fifteen-second spacing avoids a thundering-herd race on game startup.

Takeaways

01 One learner, N actor proxies over one TCP port. The port is the whole interface.
02 WEIGHT_PULL_EVERY=16 == EPISODES_PER_UPDATE=16. PPO's clip handles mild staleness.
03 workers.txt is the fleet registry. A flat name=ip file.
04 data/workers.json refreshes every 5s. Per-worker CPM, episodes, connected flag.
05 restart_workers.bat is the rollout mechanism. Sequential SSH + schtasks per machine.

Technical details

The whole distributed stack for this project is smaller than the file that describes the reward function. One script, one port, one config file per side, one scheduled task per rollout. The most telling thing about this setup is that the interesting parts live elsewhere — the learner's PPO loop, the encoder, the reward shaping. For the constraints this project runs under — eleven personal laptops on a home LAN, one learner on my desk — the compute layer's job is to get episodes into and weights out of the learner, and anything more structural than that would be ceremony over a TCP socket.