What I Learned
Lessons from building a reinforcement learning system for a complex card game.
- A Second Head That Silently Corrupted PPO’s Ratio Math (2026-04-19)
I added a second decision-maker to the network and wired it into the training math the wrong way. Nothing blew up — a training metric just quietly drifted for weeks before I spotted the pattern. The fix was one line of code and a comment explaining why past-me got it wrong.
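For context, the "ratio math" in the title is PPO's importance-sampling ratio between the updated policy and the policy that collected the data. A minimal sketch of the standard clipped objective for a single sample (this is not the post's actual code, and the names are illustrative):

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    # Probability ratio between current and behavior policy,
    # computed in log space for numerical stability.
    ratio = math.exp(logp_new - logp_old)
    # Clipped surrogate: take the pessimistic minimum of the
    # unclipped and clipped terms.
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    return min(unclipped, clipped)
```

The bug class the teaser describes fits this shape: if a second head's log-probabilities are folded into `logp_new` but not `logp_old` (or vice versa), the ratio is consistently wrong yet still finite, so training limps along instead of crashing.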
- Reward Shaping: Every Iteration Got Gamed (2026-04-18)
The reward function is the single most important design decision in an RL system. Every version I wrote was exploited by the agent in some surprising way, and I had to iterate three times before landing on a reward that produced the behavior I actually wanted.
- Curriculum Learning: You Can’t Skip the Basics (2026-04-18)
I spent weeks trying to train an agent on the full game before giving up and building a curriculum. The curriculum version reached higher performance on the full game in days than my direct approach reached in weeks.
- Distributed Deployment on Windows — Death by a Thousand Papercuts (2026-04-19)
Windows is a hostile environment for distributed training, and every approach I tried had a non-obvious failure mode. The setup that eventually worked is weirder than anything I would have picked starting from scratch.
- Input Vector Design — One-Hot Was a Trap (2026-04-19)
How the input vector got from ~1500 dimensions to 583, and how I found the second trap by accident — not by design.
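The post's actual layout isn't given here, but the general trade-off behind a one-hot blow-up looks like this: encoding each card slot as its own one-hot vector multiplies dimensions, while a single multi-hot set-membership vector carries the same information when slot order doesn't matter. All numbers below are made up for illustration:

```python
def one_hot(index, size):
    # Standard one-hot: a zero vector with a single 1.0 at `index`.
    v = [0.0] * size
    v[index] = 1.0
    return v

# One-hot per slot grows fast: 3 hand slots x 52 cards = 156 dims,
# almost all zero. (Illustrative numbers, not the post's layout.)
hand = [3, 17, 44]
wide = [x for c in hand for x in one_hot(c, 52)]

# A 52-dim multi-hot vector encodes the same hand in a fraction of
# the space when the slots are an unordered set:
compact = [1.0 if c in set(hand) else 0.0 for c in range(52)]
```

Collapsing per-slot one-hots into set encodings like this is one common way a card-game input shrinks by a factor of two or three without losing information.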
- Monitoring: “Online” Is Not the Same as “Working” (2026-04-19)
I built a training dashboard early and thought I was monitoring my fleet. I was not. The dashboard happily reported 11 workers “Online” while 6 of them were dead for 47 minutes. Monitoring that doesn’t reflect reality is worse than no monitoring — it gives you false confidence.
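The failure mode described, a dashboard reading "Online" long after workers died, is typically caused by storing liveness as a flag set at connect time. The usual fix is to derive status from heartbeat age instead. A sketch with hypothetical names and an illustrative timeout (not the post's dashboard code):

```python
import time

HEARTBEAT_TIMEOUT_S = 90  # illustrative threshold

def worker_status(last_heartbeat_ts, now=None):
    """Derive liveness from heartbeat age instead of a stored flag.

    A worker that stops heartbeating flips to "Offline" once its last
    heartbeat is older than the timeout; a boolean set when the worker
    connected would keep reading "Online" forever.
    """
    now = time.time() if now is None else now
    age = now - last_heartbeat_ts
    return "Online" if age <= HEARTBEAT_TIMEOUT_S else "Offline"
```

With this shape, a dead worker can misreport for at most one timeout window, not 47 minutes.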