
Reward Shaping: Every Iteration Got Gamed

What I Learned · 2026-04-18 · 6 min read

The reward function is the single most important design decision in an RL system. Every version I wrote was exploited by the agent in some surprising way, and I had to iterate three times before landing on a reward that produced the behavior I actually wanted.

The Outcome

  • Final Stage 0 win rate: 70–85%, up from ~5% with v1
  • Reward versions shipped: 3, all 3 exploited by the agent
  • Turns per fight: down from 40+ to 4–8

I started with the cleanest, most ideologically pure reward function I could write. It did not work. Neither did the second one. The third one is what the agent trains on today. Each one broke in a different way, and each failure taught me something specific about why the intuition “just reward the right outcome” falls apart on long-horizon sequential decisions. What follows is each reward function I shipped, what the agent did with it, and what I replaced it with.

The training loop runs in a patched Godot build of Slay the Spire 2, with the agent making card-play and end-turn decisions inside actual combat encounters. A typical fight has 20–30 actions before the combat resolves, which turns out to matter a lot for the first version below.
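
In rough pseudocode, the per-episode loop has the usual shape; the names here (CombatEnv-style reset/step, choose_action) are illustrative only, and the real bridge into the patched Godot build is more involved. The point is just where a reward function plugs into a 20–30 action fight:

    # Sketch of the per-episode loop, with illustrative names only.
    # The real environment bridges into the patched Godot build; this just
    # shows where a reward function plugs in over a 20-30 action fight.
    def run_episode(env, policy, reward_fn):
        state = env.reset()                        # start of a combat encounter
        done = False
        total_reward = 0.0
        while not done:
            action = policy.choose_action(state)   # play a card or end the turn
            next_state, done = env.step(action)
            total_reward += reward_fn(state, next_state, done)
            state = next_state
        return total_reward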

Version 1

Win/loss terminal reward only

What I did
  • +1.0 on agent win
  • -1.0 on loss
  • 0.0 for every intermediate action

The “pure” reward formulation — no hand-engineering telling the agent how to play, just outcome. The appeal is unambiguity: there is no policy bias baked into the reward. The agent has to figure out what leads to wins.
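
As code, v1 was essentially this (a paraphrase, not the shipped reward.py; the state fields are illustrative):

    # v1: terminal-only reward. Zero signal on every intermediate action.
    def reward_v1(state, next_state, done):
        if not done:
            return 0.0
        return 1.0 if next_state.agent_won else -1.0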

What went wrong

The agent learned nothing useful after thousands of episodes. 98% of early episodes returned -1.0; the policy converged to essentially random card play. The gradient signal had to flow back through 20–30 timesteps via a value function — the part of the network that learns to estimate how good the current situation is — that did not yet know anything, and there was no positive signal to anchor the learning to.

~0% useful behavior

This is textbook credit assignment — figuring out which of those 20–30 actions caused the final outcome. At the start of training the agent wins maybe 1–2% of the time, so 98% of gradient updates push against -1.0 everywhere, with a value function that has no idea which actions were actually causal and which were incidental. The policy collapsed into randomness because that was the only signal the value head could fit to. Dense reward was the obvious next move — give the agent something to fit to on every step, not just at the end of the fight, so the value function has a chance to develop a sense of which states are better than which other states.
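
To see why the value function has nothing to anchor to, look at the discounted return targets it is asked to fit. With a terminal-only reward and a ~98% loss rate, almost every target at almost every step is a slightly discounted -1. A toy illustration, not training code:

    # Toy illustration: discounted return targets for a 25-step losing episode
    # under the v1 terminal-only reward.
    gamma = 0.99
    episode_length = 25
    terminal_reward = -1.0   # the outcome in ~98% of early episodes

    returns = [terminal_reward * gamma ** (episode_length - 1 - t)
               for t in range(episode_length)]
    # returns ≈ [-0.79, -0.79, ..., -0.99, -1.0]: nearly constant, so the
    # value head learns "everything is bad" and the policy gradient has no
    # contrast between good and bad actions to push on.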

Version 2

Per-step HP preservation

What I did
  • Lose HP → small negative, proportional to HP lost
  • Block incoming damage → 0.0
  • Enemy dies → small positive, proportional to enemy HP

The idea: give the agent a signal on every step about something it actually controls. Defensive plays get rewarded, damaging hits get penalized, kills get credit. The per-step density fixes credit assignment — now the value function has something to fit to between terminal rewards.
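
Paraphrased as code, again with illustrative state fields and coefficients (the bullets above only say "small" and "proportional"), and with the terminal win/loss reward kept from v1:

    # v2: dense per-step shaping around HP. Negative for HP lost, a small
    # positive for enemy HP removed, nothing for blocking.
    def reward_v2(state, next_state, done):
        hp_lost = state.player_hp - next_state.player_hp
        enemy_hp_removed = state.enemy_hp - next_state.enemy_hp
        reward = 0.0
        if hp_lost > 0:
            reward -= 0.1 * hp_lost / state.player_max_hp          # illustrative scale
        if enemy_hp_removed > 0:
            reward += 0.1 * enemy_hp_removed / state.enemy_max_hp  # illustrative scale
        if done:
            reward += 1.0 if next_state.agent_won else -1.0
        return reward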

What went wrong

The agent discovered a stalling exploit. Ending a turn in Slay the Spire 2 means the enemy attacks, and attacks cost HP. The HP-preservation reward made the agent deeply averse to ending turns at all. It learned to play out its whole hand — useless cards included — just to delay the end of turn.

40–50% but stalling

Win rate climbed from random to around 40–50% on Stage 0. Episode length climbed with it. Fights that a reasonable player finishes in 5–6 turns were running 40+. The agent was technically maximizing its per-turn reward — it was taking less damage per turn than a normal player would — but the fights dragged on so long that debuffs accumulated and it lost via attrition. This is textbook reward hacking: the reward function said “minimize HP loss per step,” and the stalling loop does minimize HP loss per step. The reward just did not say “minimize HP loss per fight.” Time was a free resource and the agent exploited it.

The signal that v2 was broken did not come from the win-rate graph — which actually looked plausible for a while, climbing from random into the 40s — it came from the episode-length graph, which started creeping and then spiking. That was useful to notice because it suggested the fix before I had even fully diagnosed the exploit: if episodes are long, make episodes expensive.

Version 3

Step damage + turn penalty + terminal

What I did
  • Step reward — -(damage_taken / max_hp) * 2.0 on every transition (see reward.py::step_reward); always ≤ 0
  • Turn penalty — -0.01 per turn end (reward.py::TURN_PENALTY)
  • Terminal — win scales with HP remaining: +1.0 + hp_remaining/max_hp (range +1.0 to +2.0); loss is -1.0

The turn penalty is small enough that the agent still prefers playing defense over taking damage, but large enough that indefinitely stalling becomes net-negative. It has to win efficiently, not just survive.
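
Putting the three bullets together, v3 looks roughly like this. It is a sketch following the formulas above, not the actual structure of reward.py, and the state fields and the turn_ended flag are assumed names:

    # v3 reward, as described in the bullets above.
    TURN_PENALTY = -0.01

    def reward_v3(state, next_state, done, turn_ended):
        # Step reward: always <= 0, scaled by the fraction of max HP lost.
        damage_taken = max(0, state.player_hp - next_state.player_hp)
        reward = -(damage_taken / state.player_max_hp) * 2.0

        # Small fixed cost for ending a turn, so indefinite stalling is net-negative.
        if turn_ended:
            reward += TURN_PENALTY

        # Terminal: win scales with HP remaining (+1.0 to +2.0); loss is -1.0.
        if done:
            if next_state.agent_won:
                reward += 1.0 + next_state.player_hp / next_state.player_max_hp
            else:
                reward += -1.0
        return reward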

What worked

Win rate climbed to 70–85% on Stage 0 and the agent stopped stalling. Episode length dropped from 40+ turns to the 4–8 turn range a human player would see.

70–85% Stage 0 · shipped

What I’d do first next time

Start with the dense + terminal hybrid from day one. Three components together:

  1. Small step-level rewards that encode immediate feedback (HP damage, blocking, enemy kills).
  2. A small penalty per turn to discourage stalling.
  3. A terminal reward for win/loss.

The insight I lacked initially: pure outcome-based rewards do not work for sequential decision problems with long horizons unless you already have a decent value function. The common advice that “you should let the agent figure it out from minimal reward” is a trap at this problem size. Every real RL system I have read about in production uses some form of reward shaping — the only debate is how much.

Anticipate reward hacking. When you add any reward component, immediately ask: how could an adversarial agent exploit this? The HP-preservation reward was exploitable because it did not account for the passage of time. If I had red-teamed it mentally for five minutes before training, the stalling loop would have been obvious. The agent was not being clever; the reward was being incomplete.

Monitor episode length as a first-class metric. Along with win rate and reward. Wildly long episodes are almost always a sign of reward exploitation or broken termination conditions. It was the episode-length graph that first told me v2 was broken — win rate looked fine for a while.
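
In practice this can be as simple as logging both numbers per episode and watching the rolling means side by side. A minimal sketch, not tied to any particular logging framework:

    # Track episode length alongside win rate so reward hacking shows up early.
    from collections import deque

    win_history = deque(maxlen=100)
    length_history = deque(maxlen=100)

    def log_episode(won, num_turns):
        win_history.append(1.0 if won else 0.0)
        length_history.append(num_turns)
        win_rate = sum(win_history) / len(win_history)
        mean_length = sum(length_history) / len(length_history)
        # A creeping mean_length under a plausible-looking win_rate was the v2 tell.
        print(f"win_rate={win_rate:.2f}  mean_episode_length={mean_length:.1f}")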

Keep iteration cycles short. Each of these reward functions trained for an hour or two before I killed it. That was deliberate. There is a real temptation, when you have committed a few hours of GPU time to a run, to let it keep going to justify the sunk cost. I try to resist that. If the reward is broken, more compute makes it more broken, not less.


Key takeaways

  1. Dense reward beats sparse reward for long-horizon tasks with random initial policies. Credit assignment is too hard with only a terminal reward; the value function cannot bootstrap itself out of the cold start.
  2. Every reward signal can be exploited. Stare at your reward function and ask what a lazy, greedy agent would do to maximize it. Whatever loophole you find, the agent will find it faster.
  3. Watch episode length. Regressions in episode length often reveal reward hacking before win rate does. It is the earliest visible signal that the agent has found a degenerate optimum.
  4. Iterate fast, measure everything. Each version trained for only an hour or two before I decided it was broken. The iteration loop matters more than per-iteration perfection — you learn more from three one-hour runs than from one carefully-tuned six-hour run, because each run surfaces a different mistake, and every distinct mistake is something the next version can fix.
  5. The reward is not the code that computes it; it is the policy it produces. The only real test of a reward function is what the agent does under it. The prose version of the reward — what I thought I was asking for — is almost always more forgiving than the actual math version the agent gets to see.