Distributed compute — eleven laptops, one TCP port, and a scheduled task
Eleven Windows laptops run an actor proxy each. One PPO learner runs on my desk. They talk over a single TCP port on the home LAN. This post is what the compute actually looks like today — numbers, configs, and the wire.
Eleven laptops. Ten connected. Eight producing. One PPO learner, one TCP port, one restart script. That's the whole distributed surface.
The sibling article over on /learned/distributed-deployment covers how I got the fleet to stay up at all — the schtasks trick, the kill-by-window-title pattern, the three per-worker incidents I kept running into. This post takes the fleet as given and walks through what it's actually doing right now: the configs on each side of the wire, the cadence that keeps them in sync, and the numbers coming out of it.
The fleet today
Each tile above is one Windows laptop running
actor_proxy.py (src/bot/actor_proxy.py)
against the learner on my desk. The IPs are static,
listed as name=ip entries in workers.txt — no
dynamic discovery, no service registry:
stsBot0=10.0.0.50
stsBot1=10.0.0.54
stsBot2=10.0.0.73
...
stsBot10=10.0.0.82
Eleven lines. The file's the truth; restart_workers.bat
reads it when I want to roll out a change, and each worker's
workers.txt.example-derived copy tells it where the
learner lives.
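Parsing that registry is trivial. A minimal Python sketch, purely illustrative — the real consumers are batch scripts, and load_workers is a made-up name:

def load_workers(path="workers.txt"):
    # Parse flat name=ip lines into a dict; skip blanks and comments.
    workers = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            name, _, ip = line.partition("=")
            workers[name.strip()] = ip.strip()
    return workers

# load_workers() -> {'stsBot0': '10.0.0.50', 'stsBot1': '10.0.0.54', ...}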
The CPM numbers on the tiles come from data/workers.json,
a file the learner rewrites every five seconds. Each entry is one
worker's current state: ip, connected,
episodes_total, combats_per_min,
last_seen_s. An abridged snapshot from earlier today (six of the eleven entries):
[
{"ip": "10.0.0.50", "connected": true, "episodes_total": 33622, "combats_per_min": 25, "last_seen_s": 1.0},
{"ip": "10.0.0.54", "connected": true, "episodes_total": 15793, "combats_per_min": 14, "last_seen_s": 3.0},
{"ip": "10.0.0.77", "connected": true, "episodes_total": 34687, "combats_per_min": 0, "last_seen_s": 1263.0},
{"ip": "10.0.0.70", "connected": true, "episodes_total": 41194, "combats_per_min": 27, "last_seen_s": 2.0},
{"ip": "10.0.0.84", "connected": true, "episodes_total": 39600, "combats_per_min": 37, "last_seen_s": 1.0},
{"ip": "10.0.0.60", "connected": false, "episodes_total": 7217, "combats_per_min": 0, "last_seen_s": 40049.0}
]
Three workers sit at zero CPM in that snapshot. stsBot3 is connected
but hasn't pushed an episode in 1263 seconds — about twenty-one
minutes — so its 60-second rolling CPM has decayed to zero. stsBot5
booted fresh from restart_workers.bat not long before
and only has 420 episodes on the counter. stsBot8 has been
disconnected for 40,049 seconds, which is roughly eleven hours — I
shut its lid yesterday and forgot. The other eight are producing.
Fleet CPM at that moment: 175. Fleet episodes-to-date at that snapshot: about 259,000 (the home-page tile is live and will read higher).
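All of those fleet numbers fall straight out of the file. A minimal aggregation sketch, using only the fields shown in the snapshot above:

import json

with open("data/workers.json") as f:
    workers = json.load(f)

# Fleet CPM is the sum of per-worker rolling rates; "producing" means
# connected with a nonzero rate.
fleet_cpm = sum(w["combats_per_min"] for w in workers)
episodes_total = sum(w["episodes_total"] for w in workers)
producing = sum(1 for w in workers if w["connected"] and w["combats_per_min"] > 0)

print(f"fleet: {fleet_cpm} cpm, {episodes_total} episodes, "
      f"{producing}/{len(workers)} producing")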
The configs that make this go
Three files set the cadence of the whole system. Two live on the worker side, one on the learner.
worker_config.toml, on each worker:

[learner]
host = "127.0.0.1"
port = 9998

[actor]
game_port = 9999
headless = true
instances = 3

The actor proxy's own constants, also worker-side:

WEIGHT_PULL_EVERY = 16
_learner_host = "127.0.0.1"
_learner_port = 9998

And the learner's PPO hyperparameters:

GAMMA = 0.99
GAE_LAMBDA = 0.95
CLIP_EPS = 0.2
VALUE_COEF = 0.5
ENTROPY_COEF = 0.04
LR = 3e-4
MAX_GRAD_NORM = 0.5
EPISODES_PER_UPDATE = 16
PPO_EPOCHS = 4
MINIBATCH_SIZE = 64

learner.host is the one value that changes per worker,
because on my desk the learner is loopback but on a worker it's the
LAN address of my desk machine. actor.instances is how
many headless game clients the proxy spawns;
actor.game_port is the port the proxy listens on for
those clients. Nothing ever reaches the rest of the network from that
port — it's worker-local.
WEIGHT_PULL_EVERY is the cadence that matters most.
Every sixteen completed episodes, the proxy blocks on a
pull_weights() call and loads the updated policy state
dict. Between pulls, it runs inference with whatever weights it has
locally.
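As control flow it's nothing fancier than a counter check after each push. A sketch of the shape: the stubs stand in for the real calls in actor_proxy.py, and as the next section notes, the real proxy keys the decision off the update count in the learner's ack.

WEIGHT_PULL_EVERY = 16

def push_rollout(ep_dict):
    """Stub for the real blocking send-and-ack (actor_proxy.py:113)."""

def pull_weights():
    """Stub for the real blocking weight fetch (actor_proxy.py:135)."""

_episodes_since_pull = 0

def on_combat_end(ep_dict):
    # Push every completed combat; every sixteenth, block on a weight
    # pull. Between pulls, inference runs on whatever weights are local.
    global _episodes_since_pull
    push_rollout(ep_dict)
    _episodes_since_pull += 1
    if _episodes_since_pull >= WEIGHT_PULL_EVERY:
        pull_weights()
        _episodes_since_pull = 0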
The interesting coincidence is EPISODES_PER_UPDATE and
WEIGHT_PULL_EVERY both being sixteen. The learner waits
for sixteen new episodes to show up on its rollout queue, runs four
PPO epochs over them in minibatches of sixty-four, then the next
actor that asks for weights gets the freshly updated policy. Across
eleven workers producing at mixed rates, some actors will pull
weights several updates behind and some will be right on the
leading edge. That's fine — PPO's clipped objective is what keeps
the stale-policy rollouts usable.
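On the learner side, one update is the textbook PPO cadence over that sixteen-episode batch. A sketch of the loop shape only; the queue, the transition unpacking, and the clipped-objective step are stand-ins, not the real learner.py:

import random

EPISODES_PER_UPDATE = 16
PPO_EPOCHS = 4
MINIBATCH_SIZE = 64

def update_loop(rollout_queue, transitions_of, ppo_step):
    # Block until sixteen fresh episodes arrive, flatten them into
    # transitions, then run four epochs of shuffled 64-transition
    # minibatches through the clipped PPO step. The next pull_weights()
    # request after this returns gets the updated policy.
    while True:
        episodes = [rollout_queue.get() for _ in range(EPISODES_PER_UPDATE)]
        data = [t for ep in episodes for t in transitions_of(ep)]
        for _ in range(PPO_EPOCHS):
            random.shuffle(data)
            for i in range(0, len(data), MINIBATCH_SIZE):
                ppo_step(data[i:i + MINIBATCH_SIZE])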
The wire
One TCP port, 9998. That's the whole distributed surface.
11x actor_proxy.py (workers, :9999 to their local games)
│ episodes (one per completed combat)
▼
TCP :9998 ◄── [the only distributed port]
│ weights (pulled every 16 episodes)
▼
learner.py (host PC, single GPU)
No message broker, no Redis, no gRPC, no service mesh. Each actor
opens one long-lived TCP socket to the learner on startup
(_connect_learner(), actor_proxy.py:85),
serializes messages with the in-house protocol.py
framing, and pushes one episode per completed combat.
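protocol.py itself isn't reproduced here, but length-prefixed framing is small enough to sketch. This version assumes a 4-byte big-endian length prefix and pickled payloads; the real framing may differ in both respects:

import pickle
import struct

def send_msg(sock, obj):
    # Frame = 4-byte big-endian length, then the pickled payload.
    payload = pickle.dumps(obj)
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_msg(sock):
    (length,) = struct.unpack(">I", _recv_exact(sock, 4))
    return pickle.loads(_recv_exact(sock, length))

def _recv_exact(sock, n):
    # recv() can return short reads; loop until n bytes or EOF.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf += chunk
    return buf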
Episodes up
Inside handle_connection()
(actor_proxy.py:199), after each combat ends, the proxy
builds a dict of
{transitions, summary, completed_at} and calls
push_rollout(ep_dict) (actor_proxy.py:113).
That function grabs _learner_lock — the mutex that
serializes all actor↔learner traffic on the shared socket — sends
the message, and blocks on the learner's ack before releasing the
lock. The ack carries the learner's current update count, which the
proxy uses to decide whether it's time to pull weights.
The reason the mutex matters: a single proxy with multiple game
threads pushes rollouts concurrently. Without the lock, two
send_msg calls would interleave on the wire and the
framing would fall apart. Lock, send, recv, unlock. One request at
a time.
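The critical section, sketched with the framing helpers above (the message shape is invented for illustration; the real push_rollout lives at actor_proxy.py:113):

import threading

_learner_lock = threading.Lock()

def push_rollout(sock, ep_dict):
    # Lock, send, recv, unlock: exactly one request in flight on the
    # shared socket, so frames from concurrent game threads never
    # interleave.
    with _learner_lock:
        send_msg(sock, {"type": "rollout", "episode": ep_dict})
        ack = recv_msg(sock)
    # The ack carries the learner's current update count; the proxy
    # compares it against the count at its last pull to decide when
    # to call pull_weights().
    return ack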
Weights down
Every sixteenth completed episode, pull_weights()
(actor_proxy.py:135) asks the learner for the current
state dict. The learner replies with a payload that includes four
vocab token lists (card, enemy, relic, power — these grow over time
as new entities show up in training), the pickled state dict bytes,
and the current stage_config. The proxy resizes its
local embeddings to match the learner's vocabs
(policy.resize_embeddings, line 164), loads the state
dict, and writes out a fresh stage.json for the C# mod
to pick up.
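Sketched with the same helpers and the same caveat: the reply's field names are invented for illustration, and only their contents (four vocab token lists, pickled state dict bytes, stage_config) come from the actual payload.

import json
import pickle

def pull_weights(sock, policy):
    with _learner_lock:
        send_msg(sock, {"type": "pull_weights"})
        reply = recv_msg(sock)

    # Grow local embedding tables to match the learner's vocabs; new
    # cards/enemies/relics/powers can appear between pulls.
    policy.resize_embeddings(
        reply["card_vocab"], reply["enemy_vocab"],
        reply["relic_vocab"], reply["power_vocab"],
    )
    policy.load_state_dict(pickle.loads(reply["state_dict"]))

    # Drop the current stage config where the C# mod looks for it.
    with open("stage.json", "w") as f:
        json.dump(reply["stage_config"], f)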
The weight transfer is the biggest message on the wire — a full state dict is in the megabytes — but it only happens every sixteen episodes per worker. Eleven workers averaging ~16 CPM each means a weight pull every minute or so per worker, staggered across machines. Aggregate learner-side weight-serve load is small.
What the fleet produces
Eleven rows, sorted top to bottom by current CPM:

[worker table: fleet ≈ 175 cpm · snapshot from data/workers.json]
A few things to read off this. stsBot7 is the fastest laptop I've got and routinely runs at the top of the leaderboard. It's a desktop-replacement with better thermals than the rest — the CPM difference is almost entirely thermal. stsBot4, stsBot0, and stsBot10 are second-tier machines that hover in the mid-20s. stsBot1, stsBot2, and stsBot9 are older laptops that throttle harder under sustained load and sit in the low teens.
The three zeros each come from a different state. stsBot3 was
connected at snapshot time but hadn't pushed an episode in over
twenty minutes. Almost certainly the game UI got stuck on a
deck-selection screen the mod couldn't auto-advance; the proxy is
waiting on a combat that isn't coming. stsBot5 had just restarted
and only logged 420 episodes before the snapshot, which is fewer
than one PPO update's worth. stsBot8 is a laptop whose lid I closed
the night before and didn't reopen; connected=false
flipped once the learner's worker-stats deadline passed.
combats_per_min itself is a rolling 60-second window on
timestamps, stored in a deque(maxlen=200) per worker
(WorkerStats._timestamps, learner.py:155).
The learner writes the worker table out to
data/workers.json every five seconds
(_write_workers_loop, learner.py:181), so
the on-disk CPM is never more than five seconds stale.
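That pattern is compact enough to show whole. A sketch consistent with the description, not the literal class at learner.py:155:

import time
from collections import deque

class WorkerStats:
    def __init__(self):
        self._timestamps = deque(maxlen=200)  # recent combat-end times
        self.episodes_total = 0

    def record_episode(self):
        self.episodes_total += 1
        self._timestamps.append(time.time())

    @property
    def combats_per_min(self):
        # Count timestamps inside the trailing 60-second window. With
        # no new episodes the count decays to zero within a minute,
        # which is why a stuck-but-connected worker reads 0 CPM.
        cutoff = time.time() - 60.0
        return sum(1 for t in self._timestamps if t >= cutoff)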
Rolling out changes
One script handles fleet-wide code updates:
scripts/main/restart_workers.bat. It reads
workers.txt, iterates, and for each worker runs:
- taskkill by window title to stop the running proxy and games
- git fetch origin && git reset --hard origin/main to pull fresh code
- schtasks /create /tn STS2Worker ..., then /run, then /delete, to re-launch run_worker.bat in a detached session
The schtasks trick is the weird one and is covered in the
retrospective. The relevant line from
restart_workers.bat:82:
ssh %SSH_USER%@!WIP_%%i! "schtasks /create /tn STS2Worker /tr \"cmd.exe /c cd /d %REMOTE_REPO%\scripts\worker ^& run_worker.bat\" /sc once /st 00:00 /f >nul 2>&1 && schtasks /run /tn STS2Worker >nul 2>&1 && schtasks /delete /tn STS2Worker /f >nul 2>&1"

Sequential, one worker at a time. Eleven workers take about a minute end-to-end. The loop is not parallelized — the errors are easier to read one at a time, and for a home LAN the serial cost is cheap enough not to matter.
On each worker, run_worker.bat
(scripts/worker/run_worker.bat) reads
worker_config.toml, installs Python dependencies,
launches the actor proxy in one cmd window (titled
STS2 Actor Proxy), and then launches
actor.instances headless game clients spaced fifteen
seconds apart. The fifteen-second spacing avoids a thundering-herd
race on game startup.
Takeaways
- WEIGHT_PULL_EVERY=16 == EPISODES_PER_UPDATE=16. PPO's clip handles mild staleness.
- workers.txt is the fleet registry. A flat name=ip file.
- data/workers.json refreshes every 5 s. Per-worker CPM, episodes, connected flag.
- restart_workers.bat is the rollout mechanism. Sequential SSH + schtasks per machine.

Technical details
- Workers: 11 Windows laptops, stsBot0..stsBot10, static IPs on 10.0.0.0/24
- Learner: single GPU on the host PC, TCP listener on 0.0.0.0:9998
- Actor-side games: worker_config.toml sets instances per worker (default 3, tuned per machine — 2 to 4 depending on thermals)
- Protocol: length-prefixed messages over a long-lived TCP socket, one mutex-protected request at a time per actor
- Telemetry cadence: WorkerStats.combats_per_min computed from a 60-second rolling window; data/workers.json rewritten every 5 s
- Rollout script: scripts/main/restart_workers.bat — reads workers.txt, SSH + schtasks per machine, sequential
The whole distributed stack for this project is smaller than the file that describes the reward function. One script, one port, one config file per side, one scheduled task per rollout. Most of what's interesting about this project lives elsewhere: the learner's PPO loop, the encoder, the reward shaping. For the constraints this project runs under — eleven personal laptops on a home LAN, one learner on my desk — the compute layer's job is to get episodes into and weights out of the learner, and anything more structural than that would be ceremony over a TCP socket.