Distributed Deployment on Windows — Death by a Thousand Papercuts

What I Learned · 2026-04-18 · 9 min read
The fleet · 11 laptops

Windows is a hostile environment for distributed training, and every approach I tried had a non-obvious failure mode. The thing that eventually worked is weirder than the thing I’d have picked if I started over.

Eleven laptops. Eight of them boot, install, run. Three of them fought me. Those three are the reason this article exists, alongside the launch chain that holds the rest of them together — a chain whose weirdest link is the Windows task scheduler.

This post is about what it took to get that fleet running the same training job reliably, with one-key restarts from my desk. The short version: Microsoft’s officially recommended remote-execution path did not work for me, and the obvious Unix-shaped substitute did not work either. The thing that eventually did work uses a tool nobody would pick up for this purpose on day one.

What I tried first

Iteration 1: PowerShell Remoting (WinRM / PSRemoting)

This is Microsoft’s officially recommended answer for running commands on remote Windows machines, the Windows-native equivalent of ssh user@host. Run Enable-PSRemoting once on each worker and you’re supposed to be done.

I got it working on one machine in a day. Then I tried to scale.

Blocker

PIN-only login accounts can’t authenticate to WinRM. Most of my laptops use a Windows Hello PIN for login, with no password set. PSRemoting wants one of three things: a password I don’t have; Kerberos with a domain join (the corporate-network identity stack where every machine trusts a central server), which my home network doesn’t offer; or CredSSP, a credential-forwarding mode that only works with a long list of hardening exemptions. Every workaround I looked at had to be applied to all eleven machines. That’s eleven points of failure for a single setup step. I gave PSRemoting a few more days and then cut it.

The lesson I keep drawing out of this: the “officially correct” answer on Windows has opinions about how the rest of your environment is set up. Personal laptops with PIN login violate those opinions.

Iteration 2: OpenSSH for Windows

Windows ships an OpenSSH server now. Add-WindowsCapability -Online -Name OpenSSH.Server~~~~0.0.1.0, start the service, set it to automatic, done. Public key auth sidesteps the password problem entirely — copy a public key into the worker’s authorized_keys and ssh user@worker just works.

I had reliable remote shell execution by the end of the afternoon.

Blocker

Spawned processes die when the SSH session closes. ssh worker "start_game.bat" launches the game. The game dies the second SSH disconnects. nohup and setsid — the standard Unix incantations for “keep running after I leave” — don’t exist on Windows. start /b didn’t detach cleanly. powershell Start-Process didn’t either. On Windows, processes inherit the login session of their parent, and when the SSH session ends, the session gets cleaned up with its children.

I was looking for a way to tell Windows “start this and forget about me.” The usual Unix idioms didn’t work.

The stack that ended up working

Each link in the chain below earned its place by fixing one of the papercuts above. The weird one is schtasks — the Windows task scheduler — used not for scheduling but as a way to launch a process in a session that will outlive me.

Every link gets its own section below. The first one replaces the manual SSH sessions I started with; the rest defend their spot.

restart_workers.bat

A driver script on my desk. Reads workers.txt (a static name=ip list — no dynamic discovery), iterates, and fires one SSH command per worker. Kills old processes, pulls fresh code, re-launches. Sequential, not parallel; eleven workers go in about a minute.
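The loop described above can be sketched in a few lines. This is a Python illustration, not the batch file itself, and several details are assumptions: the `worker` login name, the `restart_worker.bat` remote command, and support for `#` comments in workers.txt are all hypothetical. `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```python
import subprocess
from typing import Callable, Dict, List


def parse_workers(text: str) -> Dict[str, str]:
    """Parse name=ip lines; blank lines and # comments (an assumption) are skipped."""
    workers: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, ip = line.partition("=")
        if not sep or not name.strip() or not ip.strip():
            raise ValueError(f"expected name=ip, got: {raw!r}")
        workers[name.strip()] = ip.strip()
    return workers


def ssh_argv(ip: str, remote_command: str, user: str = "worker") -> List[str]:
    """Build the argv for one non-interactive SSH call (user name is hypothetical)."""
    return [
        "ssh",
        "-o", "BatchMode=yes",     # fail instead of prompting for a password
        "-o", "ConnectTimeout=5",  # do not hang on a sleeping laptop
        f"{user}@{ip}",
        remote_command,
    ]


def restart_all(workers: Dict[str, str], remote_command: str,
                run: Callable = subprocess.run) -> Dict[str, int]:
    """Visit workers one at a time and record each SSH exit code."""
    results: Dict[str, int] = {}
    for name, ip in workers.items():
        results[name] = run(ssh_argv(ip, remote_command)).returncode
    return results
```

Calling `restart_all(parse_workers(open("workers.txt").read()), "restart_worker.bat")` would walk the fleet sequentially; recording a per-worker exit code makes it obvious which machine needs a second look.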

I wrote this later than I should have. Two or three workers is tractable by hand. Past that, manual SSH becomes the dominant cost of any training-loop change you want to roll out.

SSH

The remote-execution transport. Public key auth, which dodges the PIN-account problem. No password, no Kerberos, no CredSSP. It’s the same OpenSSH you use on Linux, carrying the same commands.

Why this link — SSH over PSRemoting

PSRemoting is the “right” answer in a domain environment. On personal machines with PIN login, it’s a non-starter. SSH key auth just works.

schtasks

This is the strange one. schtasks is the command-line front end to the Windows Task Scheduler. The recipe is to /create a one-shot task with /sc once /st 00:00 /f, /run it, then /delete it. What I’m actually using it for has nothing to do with scheduling:

schtasks /create /tn STS2Worker /tr "cmd.exe /c run_worker.bat" /sc once /st 00:00 /f
schtasks /run /tn STS2Worker
schtasks /delete /tn STS2Worker /f

Why this link — scheduled tasks run in their own session

They’re independent of the SSH session that scheduled them. When I disconnect, the task keeps running. This is the one built-in Windows mechanism I found that detaches a process from the caller’s session reliably. It is not what the tool is meant for. It works.
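The three schtasks calls above can be generated rather than typed. The sketch below is a Python illustration with hypothetical helper names. It assumes the worker’s SSH default shell is cmd.exe, so the steps can be chained with `&&`, and it rejects inputs that would break the single layer of quoting around /tr.

```python
from typing import List


def detached_launch_commands(task_name: str, command: str) -> List[str]:
    """Return the create/run/delete schtasks commands in the order the post uses them."""
    if not task_name or any(ch.isspace() or ch == '"' for ch in task_name):
        raise ValueError("task name must be non-empty with no spaces or quotes")
    if '"' in command:
        raise ValueError("a double quote in the command would break the /tr quoting")
    return [
        f'schtasks /create /tn {task_name} /tr "{command}" /sc once /st 00:00 /f',
        f"schtasks /run /tn {task_name}",
        f"schtasks /delete /tn {task_name} /f",
    ]


def as_one_remote_command(commands: List[str]) -> str:
    """Chain with cmd.exe's && so a step runs only if the previous one succeeded."""
    return " && ".join(commands)
```

Passing the joined string as a single SSH argument keeps the whole launch to one round trip per worker.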

cmd.exe + run_worker.bat

The task scheduler launches cmd.exe, which launches run_worker.bat. The batch file’s first meaningful line is title STS2 Worker Main.

Why this link — kill by window title, not PID or command line

To stop a worker I use taskkill /F /FI "WINDOWTITLE eq STS2 Worker Main". Filtering by command line means escaping nested quotes through SSH; filtering by window title needs one plain quoted string with nothing nested inside it. Setting the title explicitly at the top of the script costs one line and saves a class of fragile shell-quoting bugs.
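As a small illustration of why the title filter is easy to pass around, here is a hypothetical Python helper that builds the same taskkill line. The only quoting it needs is the single pair of double quotes around the filter, so the one thing it has to reject is a title that itself contains a double quote.

```python
def kill_by_title_command(title: str) -> str:
    """Build the taskkill line for a window title set with the batch `title` command."""
    if not title or '"' in title:
        raise ValueError("title must be non-empty and contain no double quotes")
    # /F forces termination; /FI applies the WINDOWTITLE filter.
    return f'taskkill /F /FI "WINDOWTITLE eq {title}"'
```

Handing the result to SSH as one argv element (for example through `subprocess.run([...])`) avoids a second round of quoting by the local shell.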

proxy + games

run_worker.bat spawns one actor proxy process and two to four headless game instances (three by default per worker_config.toml). The games run combat encounters against the current policy; the proxy aggregates their transitions and streams them up to the learner. Weight sync happens in the other direction, every 16 episodes.

The specific per-worker count (two to four) is tuned to per-machine thermals — some of the laptops throttle aggressively when all cores run hot, so the fleet is not uniform on that axis either.
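The numbers in this section (two to four games, three by default, a weight sync every 16 episodes) can be captured in a small validated structure. This is a Python sketch; the field names are hypothetical and are not the actual keys in worker_config.toml.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerConfig:
    games: int = 3                 # default game count stated in the post
    sync_every_episodes: int = 16  # weight-sync interval stated in the post

    def __post_init__(self) -> None:
        if not 2 <= self.games <= 4:
            raise ValueError("games must be between 2 and 4")
        if self.sync_every_episodes <= 0:
            raise ValueError("sync_every_episodes must be positive")


def should_sync_weights(episodes_done: int, cfg: WorkerConfig) -> bool:
    """True on every cfg.sync_every_episodes-th completed episode."""
    return episodes_done > 0 and episodes_done % cfg.sync_every_episodes == 0
```

Validating the count at load time means a typo in one laptop’s config fails at startup instead of quietly overheating the machine.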

Per-worker papercuts

Three machines in the fleet are flagged as incidents. Here is what happened to each.

stsBot3 — Windows Defender Application Control blocked the game executable

Device Guard / WDAC is an enterprise code-integrity feature: Windows’ “only run binaries I already trust” policy. The game executable wasn’t signed in a way the local policy on this machine was willing to accept. The machine isn’t org-joined; the WDAC policy was just enabled at some point, on its own. I tried signature troubleshooting first and got nowhere. What actually worked (documented in docs/worker_commands.md) was two changes followed by a reboot: disabling the DEP and CFG process mitigations (memory-protection rules Windows applies to processes by default) for the game binary, and turning off the hypervisor launch type, which disables Windows’ virtualization-based security layer:

Set-ProcessMitigation -Name "SlayTheSpire2.exe" -Disable DEP,CFG
bcdedit /set hypervisorlaunchtype off

Acceptable for a personal laptop running modded game binaries. Not acceptable on a corporate machine.

stsBot7 — sts2.dll was zero bytes

After a redeploy, the mod build on this worker started failing with 134 compile errors, all “MegaCrit not found.” dir showed sts2.dll at 0 bytes. The patcher had started rewriting the DLL while a prior game instance still had a file lock, and the truncated write landed on disk. Fix: Steam → STS2 → Properties → Local Files → Verify Integrity. Steam re-downloads the DLL; the patcher works normally on the next run. Root-cause fix — making the patcher refuse to run against a zero-byte target, or back up before writing — is on my list and isn’t done.
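The root-cause fix described above (refuse a zero-byte target, back up before writing) might look like the following. This is a Python sketch under stated assumptions, not the actual patcher: `safe_patch` and its `patch` callback are hypothetical names. It writes to a temporary file and swaps it in with `os.replace`, so a failed write never truncates the original.

```python
import os
import shutil
import tempfile
from typing import Callable


def safe_patch(target: str, patch: Callable[[bytes], bytes]) -> None:
    """Apply patch(bytes) -> bytes to target without ever leaving it truncated."""
    if os.path.getsize(target) == 0:
        raise RuntimeError(f"{target} is zero bytes; restore it before patching")
    with open(target, "rb") as fh:
        original = fh.read()
    patched = patch(original)
    if not patched:
        raise RuntimeError("patch produced empty output; target left untouched")

    shutil.copy2(target, target + ".bak")  # keep a copy of the last good file

    # Write next to the target so the final swap stays on one volume.
    directory = os.path.dirname(os.path.abspath(target))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(patched)
        os.replace(tmp, target)  # swap the complete file in, or raise
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

If a running game still holds a lock on the DLL, the swap should fail with an error and leave the original bytes in place, which is the behavior the stsBot7 incident was missing.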

stsBot9 — couldn’t resolve the learner’s hostname

Static IP assignment on the learner side was inconsistent with the workers.txt entry this worker was reading. Ordinary DHCP-reservation drift: the router had silently handed the machine a different address than the one I’d pinned. Unglamorous to fix. Worth noting because it is one more entry in “things that were unique to one machine.”
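Address drift like this is cheap to detect before a rollout. The Python sketch below is an assumption about how one might check it, not something the fleet currently runs: it tries a TCP connection to each pinned address on the SSH port and reports the names that do not answer. The `connect` parameter exists only so the function can be exercised without a network.

```python
import socket
from typing import Callable, Dict, List


def unreachable_workers(
    workers: Dict[str, str],
    port: int = 22,
    timeout: float = 2.0,
    connect: Callable = socket.create_connection,
) -> List[str]:
    """Return the names whose pinned address does not accept a TCP connection."""
    failed: List[str] = []
    for name, ip in workers.items():
        try:
            conn = connect((ip, port), timeout)
        except OSError:  # refused, timed out, or no route to host
            failed.append(name)
        else:
            conn.close()
    return failed
```

A name in the returned list only says the pinned address is not answering; the machine may have moved to a new lease or may simply be asleep.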

Three incidents, three different root causes, three different fixes. There was no pattern across them except that each one was a local condition of one machine I had not looked at before that moment.

What I’d do first next time

SSH from day one, not PSRemoting. PSRemoting is the correct answer for machines in a domain. For personal laptops with PIN login, it’s a trap. SSH sidesteps the authentication cliff entirely.

Know about schtasks for process detachment. This is the single most useful Windows trick I took out of the project. Any time you need a process to survive the session that started it, a one-shot scheduled task is the answer. It is ugly and it works.

Assume each worker will have one unique failure. Across eleven, I hit one WDAC block, one corrupted DLL, and one hostname-resolution issue. Three of eleven, each different. Plan for per-machine triage — no uniform rollout is going to be uniform.

Build a one-key restart before the fleet gets past two or three. SSH-ing by hand is fine for a few machines. Past that, manual restarts become the dominant cost of every code change.

Set explicit window titles on your batch files. title STS2 Worker Main at the top of the script; taskkill /F /FI "WINDOWTITLE eq …" to stop it. Bulletproof, no quote-escaping, works over SSH.

Takeaways

  1. SSH, not PSRemoting — PIN accounts kill PSRemoting; key auth just works.
  2. schtasks for detachment — the only reliable way to outlive the SSH session.
  3. Assume each worker is a snowflake — plan for per-machine triage, not uniform rollout.
  4. Restart script before scaling — two workers manual is fine, eleven is not.
  5. Kill by window title — taskkill /F /FI "WINDOWTITLE eq …" beats command-line filters.

The final stack is not something I’d have designed up front. It’s something I arrived at by running into one wall after another and recording what got me past each one. That’s most of what “operational experience on Windows” turned out to mean for me on this project — a list of walls, and the specific shape of the hole I dug through each of them.