Distributed Deployment on Windows — Death by a Thousand Papercuts
Windows is a hostile environment for distributed training, and every approach I tried had a non-obvious failure mode. The thing that eventually worked is weirder than the thing I’d have picked if I started over.
Eleven laptops. Eight of them boot, install, run. Three of them fought me. Those three are the reason this article exists, alongside the launch chain that holds the rest of them together — a chain whose weirdest link is the Windows task scheduler.
This post is about what it took to get that fleet running the same training job reliably, with one-key restarts from my desk. The short version: Microsoft’s officially recommended remote-execution path did not work for me, and the obvious Unix-shaped substitute did not work either. The thing that eventually did work uses a tool nobody would pick up for this purpose on day one.
What I tried first
PowerShell Remoting (WinRM / PSRemoting). Microsoft’s
officially recommended answer for running commands on remote Windows
machines — think of it as the Windows-native equivalent of
ssh user@host. One
Enable-PSRemoting on each worker and you’re supposed
to be done.
I got it working on one machine in a day. Then I tried to scale.
PIN-only login accounts can’t authenticate to WinRM. Most of my laptops use Windows Hello PIN for login, with no password set. PSRemoting wants one of three things: a password I don’t have; Kerberos with a domain join (the corporate-network identity stack where every machine trusts a central server), which my home network doesn’t offer; or CredSSP, a credential-forwarding mode that only works once you carve out a list of hardening exemptions. Every workaround I looked at had to be applied by hand on all eleven machines. That’s eleven points of failure for a setup step. I gave PSRemoting a few more days and then cut it.
The lesson I keep drawing out of this: the “officially correct” answer on Windows has opinions about how the rest of your environment is set up. Personal laptops with PIN login violate those opinions.
Iteration 2: OpenSSH for Windows
Windows ships an OpenSSH server now. Add-WindowsCapability -Online -Name OpenSSH.Server~~~~0.0.1.0,
start the service, set it to automatic, done. Public key auth sidesteps
the password problem entirely — copy a public key into the
worker’s authorized_keys and ssh user@worker
just works.
I had reliable remote shell execution by the end of the afternoon.
Spawned processes die when the SSH session closes. ssh worker "start_game.bat" launches the game. The game
dies the second SSH disconnects. nohup and
setsid — the standard Unix incantations for
“keep running after I leave” — don’t exist on
Windows. start /b didn’t detach cleanly.
powershell Start-Process didn’t either. On Windows,
processes inherit the login session of their parent, and when the SSH
session ends, the session gets cleaned up with its children.
I was looking for a way to tell Windows “start this and forget about me.” The usual Unix idioms didn’t work.
The stack that ended up working
Each link in the chain below earned its place by fixing one of the
papercuts above. The weird one is schtasks — the
Windows task scheduler — used not for scheduling but as a way to
launch a process in a session that will outlive me.
restart_workers.bat
│
▼
SSH
│
▼
schtasks ◄── [the weird link]
│
▼
cmd.exe
│
▼
run_worker.bat
╱ ╲
▼ ▼
proxy games (2–4)
Every link gets its own section below. The first one is the link the chain itself replaces; the rest defend their spot.
restart_workers.bat
A driver script on my desk. Reads workers.txt (a static
name=ip list — no dynamic discovery), iterates, and
fires one SSH command per worker. Kills old processes, pulls fresh
code, re-launches. Sequential, not parallel; eleven workers go in about
a minute.
I wrote this later than I should have. Two or three workers is tractable by hand. Past that, manual SSH becomes the dominant cost of any training-loop change you want to roll out.
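The driver loop described above is small enough to sketch. This is a minimal Python illustration of the same shape, not the actual restart_workers.bat: the function names, the `#` comment handling in workers.txt, and the SSH timeout are assumptions made for the example. Only the name=ip format and the one-command-per-worker, sequential behavior come from the description.

```python
import subprocess


def parse_workers(text):
    """Parse `name=ip` lines into a dict, preserving file order."""
    workers = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skipping blank lines and '#' comments is an assumption about the format.
        if not line or line.startswith("#"):
            continue
        name, sep, ip = line.partition("=")
        name, ip = name.strip(), ip.strip()
        if not sep or not name or not ip:
            raise ValueError(f"malformed workers.txt line: {raw!r}")
        workers[name] = ip
    return workers


def ssh_argv(user, ip, remote_command):
    """Build the argv for one non-interactive SSH invocation."""
    return [
        "ssh",
        "-o", "BatchMode=yes",     # fail instead of prompting for a password
        "-o", "ConnectTimeout=5",  # give up quickly on a dead worker
        f"{user}@{ip}",
        remote_command,
    ]


def restart_all(workers, user, remote_command, run=subprocess.run):
    """Run the command on each worker in turn; return {name: exit code}."""
    results = {}
    for name, ip in workers.items():
        completed = run(ssh_argv(user, ip, remote_command))
        results[name] = completed.returncode
    return results
```

Passing `run` as a parameter keeps the loop testable without a fleet: a stub can stand in for `subprocess.run` and record which commands would have been sent. Collecting exit codes per worker, rather than stopping at the first failure, matches the premise that any one machine may misbehave on its own.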
SSH
The remote-execution transport. Public key auth, which dodges the PIN-account problem. No password, no Kerberos, no CredSSP. It’s the same OpenSSH you use on Linux, carrying the same commands.
PSRemoting is the “right” answer in a domain environment. On personal machines with PIN login, it’s a non-starter. SSH key auth just works.
schtasks
This is the strange one. schtasks is the Windows task
scheduler — /create a one-shot task with
/sc once /st 00:00 /f, /run it,
/delete it. What I’m actually using it for has
nothing to do with scheduling:
schtasks /create /tn STS2Worker /tr "cmd.exe /c run_worker.bat" /sc once /st 00:00 /f
schtasks /run /tn STS2Worker
schtasks /delete /tn STS2Worker /f
Scheduled tasks are started by the Task Scheduler service, not by my shell, so they’re independent of the SSH session that created them. When I disconnect, the task keeps running. This is the one built-in Windows mechanism I found that detaches a process from the caller’s session reliably. It is not what the tool is meant for. It works.
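The create, run, delete sequence can be generated rather than typed. The sketch below is a hypothetical Python helper that builds the same three commands for any task name and joins them into one line to send over SSH. The function names are invented for illustration, and joining with `&&` assumes the worker's SSH shell is cmd.exe (the default for OpenSSH on Windows); a PowerShell default shell would need a different separator.

```python
def detached_launch_commands(task_name, command):
    """Return the three schtasks commands that launch `command` detached."""
    if not task_name or any(ch.isspace() or ch == '"' for ch in task_name):
        raise ValueError("task name must be non-empty, without spaces or quotes")
    if '"' in command:
        # Nested quoting through ssh and cmd.exe is out of scope for this sketch.
        raise ValueError("commands containing double quotes are not handled")
    return [
        f'schtasks /create /tn {task_name} /tr "{command}" /sc once /st 00:00 /f',
        f"schtasks /run /tn {task_name}",
        f"schtasks /delete /tn {task_name} /f",
    ]


def as_remote_command(commands):
    """Join with `&&` so each step runs only if the previous one succeeded."""
    return " && ".join(commands)
```

Calling `detached_launch_commands("STS2Worker", "cmd.exe /c run_worker.bat")` reproduces the three lines shown above. Rejecting quotes up front is a deliberate limitation: it keeps the helper from silently producing a command that cmd.exe would parse differently than intended.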
cmd.exe + run_worker.bat
The task scheduler launches cmd.exe, which launches
run_worker.bat. The batch file’s first meaningful
line is title STS2 Worker Main.
To stop a worker I use
taskkill /F /FI "WINDOWTITLE eq STS2 Worker Main".
Filtering by command line means escaping quotes through SSH;
filtering by window title is bulletproof. Setting the title
explicitly at the top of the script costs one line and saves a class
of fragile shell-quoting bugs.
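One way the title trick can still break is if the title in the batch file and the title in the kill command drift apart. The Python sketch below reads the title out of the batch file's text and builds the taskkill command from it, so there is a single source of truth. Both function names are hypothetical, and the rule for what counts as a "meaningful" line (skipping blanks, `rem`, `::`, and `@echo`) is an assumption.

```python
def declared_title(batch_text):
    """Return the title set by the first meaningful line, or None."""
    for raw in batch_text.splitlines():
        line = raw.strip()
        lowered = line.lower()
        # Blank lines, comments, and @echo are treated as not meaningful (assumption).
        if not line or lowered == "rem" or lowered.startswith(("rem ", "::", "@echo ")):
            continue
        if lowered.startswith("title "):
            return line[len("title "):].strip()
        return None  # the first real command is something other than `title`
    return None


def kill_by_title_command(title):
    """Build the taskkill command that matches a window title exactly."""
    if not title or '"' in title:
        raise ValueError("title must be non-empty and contain no double quotes")
    return f'taskkill /F /FI "WINDOWTITLE eq {title}"'
```

Returning `None` when the title is not the first meaningful line is intentional: a title set later in the script leaves a window during which the process cannot be found by name.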
proxy + games
run_worker.bat spawns one actor proxy process and two to
four headless game instances (three by default per
worker_config.toml). The games run combat encounters
against the current policy; the proxy aggregates their transitions
and streams them up to the learner. Weight sync happens in the other
direction, every 16 episodes.
The specific per-worker count (two to four) is tuned to per-machine thermals — some of the laptops throttle aggressively when all cores run hot, so the fleet is not uniform on that axis either.
Per-worker papercuts
The fleet-grid hero labeled three machines as incidents. Here is what happened to each.
Device Guard / WDAC is an enterprise code-integrity feature —
Windows’ “only run binaries I already trust” policy.
The game executable wasn’t signed in a way the local policy on
this machine was willing to accept, and the machine isn’t
org-joined — the WDAC policy was just enabled at some point, on
its own. I tried signature troubleshooting first and got nowhere. What
actually worked (documented in docs/worker_commands.md)
was disabling DEP/CFG process mitigations — the memory-protection
rules Windows applies to every process by default — for the game
binary, and turning off the hypervisor launch type — Windows’
virtualization-based security layer — then rebooting:
Set-ProcessMitigation -Name "SlayTheSpire2.exe" -Disable DEP,CFG
bcdedit /set hypervisorlaunchtype off
Acceptable for a personal laptop running modded game binaries. Not acceptable on a corporate machine.
After a redeploy, the mod build on this worker started failing with
134 compile errors, all “MegaCrit not found.”
dir showed sts2.dll at 0 bytes. The
patcher had started rewriting the DLL while a prior game instance
still had a file lock, and the truncated write landed on disk. Fix:
Steam → STS2 → Properties → Local Files → Verify
Integrity. Steam re-downloads the DLL; the patcher works normally on
the next run. Root-cause fix — making the patcher refuse to
run against a zero-byte target, or back up before writing — is
on my list and isn’t done.
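The root-cause fix described here (refuse a zero-byte target, back up before writing) might look roughly like the following. This is a sketch under assumptions, not the actual patcher: the function names, the `.bak` and `.tmp` suffixes, and the write-to-temp-then-replace strategy are all illustrative choices.

```python
import os
import shutil
from pathlib import Path


def backup_before_patch(target, suffix=".bak"):
    """Refuse a missing or zero-byte target, then copy it aside."""
    target = Path(target)
    if not target.is_file():
        raise FileNotFoundError(f"patch target not found: {target}")
    if target.stat().st_size == 0:
        raise ValueError(f"refusing to patch zero-byte file: {target}")
    backup = target.with_name(target.name + suffix)
    shutil.copy2(target, backup)  # copies contents and timestamps
    return backup


def write_patched(target, data):
    """Write to a sibling temp file, then swap it into place in one step."""
    target = Path(target)
    if not data:
        raise ValueError("refusing to write an empty patch result")
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_bytes(data)
    # If the swap fails, the original file is left untouched rather than truncated.
    os.replace(tmp, target)
```

The point of the temp file is that the original DLL is never opened for writing. If another process still holds it open, the replace step should raise instead of leaving a truncated file behind, though how a given game process shares its file handles is something to verify on the real machine.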
Static IP assignment on the learner side was inconsistent with the
workers.txt entry this worker was reading. Ordinary
DHCP-reservation drift — the router had silently handed the
machine a different address than the one I’d pinned. Unglamorous
to fix. Worth noting because it’s the second-place finish in
“things that were unique to one machine.”
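Address drift of this kind can be caught before a rollout with a quick reachability pass over the pinned addresses. The Python sketch below is a hypothetical preflight check: it reports which entries do not answer on the SSH port. The function names, port 22, and the two-second timeout are assumptions for illustration.

```python
import socket


def ssh_port_open(ip, port=22, timeout=2.0):
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def unreachable_workers(workers, probe=ssh_port_open):
    """Return the names whose pinned address does not answer the probe."""
    return [name for name, ip in workers.items() if not probe(ip)]
```

This only detects an address that nobody answers on. If the router has handed the old address to a different device that also runs an SSH server, the probe passes, so comparing host keys or hostnames would be the stronger check.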
Three incidents, three different root causes, three different fixes. There was no pattern across them except that each one was a local condition of one machine I had not looked at before that moment.
What I’d do first next time
SSH from day one, not PSRemoting. PSRemoting is the correct answer for machines in a domain. For personal laptops with PIN login, it’s a trap. SSH sidesteps the authentication cliff entirely.
Know about schtasks for process detachment.
This is the single most useful Windows trick I took out of the project.
Any time you need a process to survive the session that started it, a
one-shot scheduled task is the answer. It is ugly and it works.
Assume each worker will have one unique failure. Across eleven, I hit one WDAC block, one corrupted DLL, and one case of DHCP-reservation drift. Three of eleven, each different. Plan for per-machine triage; no uniform rollout is going to be uniform.
Build a one-key restart before the fleet gets past two or three. SSH-ing by hand is fine for a few machines. Past that, manual restarts become the dominant cost of every code change.
Set explicit window titles on your batch files. title STS2 Worker Main at the top of the script;
taskkill /F /FI "WINDOWTITLE eq …" to stop it.
Bulletproof, no quote-escaping, works over SSH.
Takeaways
- SSH, not PSRemoting: PIN accounts kill PSRemoting; key auth just works.
- schtasks for detachment: the only reliable way to outlive the SSH session.
- Assume each worker is a snowflake: plan for per-machine triage, not uniform rollout.
- Restart script before scaling: two workers manual is fine, eleven is not.
- Kill by window title: taskkill /F /FI "WINDOWTITLE eq …" beats command-line filters.
Technical details
- Workers: 11 laptops, mix of Windows 10 and 11, mix of PIN and password auth
- Network: home LAN with DHCP reservations, ~1ms latency, occasional WiFi flakiness
- Per-worker load: 2–4 headless game instances + 1 actor proxy (3 by default per worker_config.toml, tuned per machine for thermals)
- Launch chain: restart_workers.bat → SSH → schtasks → cmd.exe → run_worker.bat → proxy + games
- Worker discovery: static workers.txt (name=ip entries), no dynamic discovery
- Weight sync: actor proxy pulls fresh weights from the learner every 16 episodes
The final stack is not something I’d have designed up front. It’s something I arrived at by running into one wall after another and recording what got me past each one. That’s most of what “operational experience on Windows” turned out to mean for me on this project — a list of walls, and the specific shape of the hole I dug through each of them.