
Monitoring: “Online” Is Not the Same as “Working”

What I Learned · 2026-04-19 · 7 min read
What the dashboard said: 11/11 workers online, 369 CPM total. What was actually happening: 5/11 workers producing, 112 CPM total.

I built a training dashboard early and thought I was monitoring my fleet. I was not. The dashboard happily reported 11 workers “Online” while 6 of them were dead for 47 minutes. Monitoring that doesn’t reflect reality is worse than no monitoring — it gives you false confidence.

Two bugs, one lesson. The dashboard was wrong about which workers were alive, and it was wrong about how much work they were doing. Both bugs came from measuring the thing that was easy to measure instead of the thing I actually wanted to know. Both produced numbers that looked fine until I did the arithmetic in my head and noticed they couldn’t be.

What follows is the dashboard I built first, the two ways it lied to me, and the two iterations that fixed each lie.

What I built first

A simple dashboard with the obvious metrics. The learner collects episodes from actor proxies over TCP. I added a worker tracker:

import time
from collections import deque
from dataclasses import dataclass, field

@dataclass
class WorkerStats:
    ip: str
    connected: bool = True          # true until socket close
    episodes_total: int = 0
    last_seen: float = field(default_factory=time.time)       # updated when episodes arrive
    _timestamps: deque = field(default_factory=lambda: deque(maxlen=200))

The dashboard showed, per worker: an Online/Offline badge, the episode total, a combats-per-minute figure, and how long since the last episode arrived, plus fleet-wide totals.

The logic seemed reasonable when I wrote it. connected flips to False when the socket closes cleanly. The timestamps deque gets a new entry every time an episode arrives. last_seen updates with each episode. All four fields are things the learner directly observes. No derived metrics, no heuristics, no inference — just report what you see.
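For concreteness, the update paths looked roughly like this. A sketch assuming the WorkerStats above; the one-episode-per-recv framing is simplified for illustration:

import socket
import time

def serve_worker(stats: WorkerStats, conn: socket.socket) -> None:
    while True:
        payload = conn.recv(65536)               # simplified: one episode per read
        if not payload:                          # peer sent FIN: a clean close
            stats.connected = False              # the ONLY path that marks a worker offline
            return
        stats.episodes_total += 1                # an episode arrived
        stats.last_seen = time.time()            # stamped at receipt
        stats._timestamps.append(stats.last_seen)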

That was the problem. What the learner sees and what the worker is actually doing are not the same thing.

What went wrong

Problem 1: “Online” meant “still connected,” not “still working”

The connected flag only flips to False on a clean socket close. If a worker crashed without closing the socket, lost its LAN connection, hit an unhandled exception in the actor loop, or froze on a game hang — the socket remained in ESTABLISHED state on the learner side for tens of minutes. TCP keepalives eventually detect the dead connection, but the default tcp_keepalive_time is 2 hours on Linux and KeepAliveTime is the same 2 hours on Windows — useless on a 47-minute horizon.
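As an aside, the keepalive timers can be tightened per socket, but even then keepalive only detects an unreachable peer, not a process that is alive and silent. A sketch using the Linux-specific option names, illustrative only and not the fix I ended up using:

import socket

def enable_fast_keepalive(sock: socket.socket) -> None:
    # Linux option names; illustrative only. Keepalive detects a dead peer,
    # not a worker whose process is up but no longer producing episodes.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)    # idle seconds before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)   # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)      # failed probes before reset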

Result: the dashboard showed “Online” for workers that hadn’t produced an episode in 47 minutes.

False confidence

This was worse than useless. When I glanced at the dashboard and saw “11/11 workers online,” I trusted that number. I didn’t dig deeper. Meanwhile more than half my fleet was dead.

Problem 2: CPM was inflated by buffered flushes

The combats-per-minute calculation:

@property
def combats_per_min(self) -> float:
    now = time.time()
    recent = [t for t in self._timestamps if now - t <= 60.0]
    return round(len(recent), 1)

_timestamps was populated at receipt time — when the learner received the episode, not when the episode actually completed on the worker. Usually fine. But with flaky LAN connections, the actor proxy would buffer episodes during disconnects and flush the buffer on reconnect: 20–30 episodes arriving in 2 seconds, all stamped with the same receipt timestamp.
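To see why receipt-time stamping inflates the window, here is a toy reproduction of the flush (numbers illustrative):

import time
from collections import deque

# 30 buffered episodes flush on reconnect and all get the same receipt
# timestamp, so the 60-second window counts every one of them at once.
timestamps = deque(maxlen=200)
flush_time = time.time()
for _ in range(30):
    timestamps.append(flush_time)

now = time.time()
recent = [t for t in timestamps if now - t <= 60.0]
print(len(recent))   # 30: reported as ~30 CPM from a worker that was offline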

A worker that had been offline for a minute, then briefly reconnected, would show inflated CPM as its buffered backlog landed in a narrow window. The dashboard reported 90 CPM for one worker while others showed 0 — even though all of them should have been roughly equal.

Do the arithmetic

5 visible “active” workers summed to roughly the expected 11-worker total CPM. That couldn’t be right unless each active worker was doing ~2× its normal output, which is physically impossible. I should have noticed earlier.
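The check is mechanical enough to automate. A sketch, where the per-worker ceiling is an assumed number for illustration, not a measured one:

# Hypothetical plausibility check: the sum of per-worker CPM can't exceed the
# number of workers actually producing times a realistic per-worker ceiling.
MAX_PLAUSIBLE_CPM = 35.0   # assumed ceiling for one worker

def totals_are_plausible(per_worker_cpm: dict[str, float]) -> bool:
    producing = [cpm for cpm in per_worker_cpm.values() if cpm > 0]
    return sum(producing) <= len(producing) * MAX_PLAUSIBLE_CPM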

Iteration 1

Fix the Online detection

What I had

“Online” if connected == True. One source of truth, on the learner, updated only by socket-close events. Clean code; wrong answer.

What I added

A staleness check, computed client-side in the dashboard:

const stale = w.last_seen_s > 120;
const activeNow = w.connected && !stale;
const status = !w.connected
  ? "Offline"
  : (stale ? "Stale" : "Online");

Now a worker has to both (a) hold an open socket and (b) have produced an episode within the last 2 minutes to count as Online. Workers that silently stopped show Stale with a grey dot instead of the green Online badge. The active count and total CPM only sum workers that are truly active.
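The only server-side input to that check is last_seen_s. A sketch of how the learner might report it, with the snapshot shape and function name being assumptions for illustration:

import time

def worker_snapshot(stats: WorkerStats) -> dict:
    # Hypothetical serializer: the learner reports raw observations only and
    # leaves the definition of "stale" entirely to the dashboard.
    return {
        "ip": stats.ip,
        "connected": stats.connected,
        "episodes_total": stats.episodes_total,
        "last_seen_s": round(time.time() - stats.last_seen, 1),
        "cpm": stats.combats_per_min,
    }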

The 120-second threshold is a heuristic. Slow workers — during a curriculum stage transition, say, when episodes take longer — might briefly show Stale before returning to Online. That’s acceptable: a false Stale is much less dangerous than a false Online. The worst case of a false Stale is that I go look at a worker that’s fine. The worst case of a false Online is that I let six dead workers sit unnoticed for 47 minutes.

120s stale threshold · moved to UI layer

Iteration 2

Fix the CPM attribution

What I had

_timestamps populated at receipt time. One timestamp per episode. Correct when the network is smooth; wrong whenever a buffer flushes.

What I added

Two changes, one on each side of the wire.

On the actor side, stamp the episode with its completion time when pushing to the learner:

ep_dict = {
    "transitions": raw_transitions,
    "summary": summary,
    "completed_at": time.time(),  # added
}

On the learner side, use that timestamp for the CPM window instead of receipt time, falling back to receipt time if the field is missing:

def record_episode(self, completed_at: float | None = None):
    self.episodes_total += 1
    self.last_seen = time.time()  # still receipt time for "last seen"
    self._timestamps.append(
        completed_at if completed_at is not None else self.last_seen
    )
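
The fallback lives at the call site on the receive path. A minimal sketch, assuming the learner has already parsed the incoming payload into ep_dict; the on_episode name is hypothetical:

def on_episode(stats: WorkerStats, ep_dict: dict) -> None:
    # Older workers omit "completed_at"; .get() returns None and
    # record_episode falls back to receipt time.
    stats.record_episode(completed_at=ep_dict.get("completed_at"))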

Now when a worker flushes 30 buffered episodes, each episode contributes to the CPM window based on when it was actually completed, not when it arrived. The inflated-CPM-on-reconnect bug is gone.

The ep_dict.get("completed_at") default-to-None behavior is load-bearing. It means I don’t have to redeploy all 11 workers simultaneously — the fix rolls out gracefully as each worker gets redeployed. Older actor proxies that don’t include the field keep working on receipt time; the two that do get the accurate window. Backward compat is not glamour work, but it kept me from having to coordinate an 11-machine atomic restart.

0 false-flush CPM spikes

What I’d do first next time

Monitoring should reflect reality, not connection state. The question “is this worker contributing?” is not the same as “is this socket open?” Define “healthy” as “producing recent output,” not “we have a handle to it.” The socket is a necessary condition, not a sufficient one.

Staleness thresholds belong in the UI layer, not the data layer. The learner doesn’t need to know what “stale” means — it just reports last_seen in seconds. The dashboard decides what threshold makes a worker stale. Server stays simple, UI definition stays flexible, and tuning the threshold doesn’t require a learner restart.

Distinguish receipt time from event time. Any metric involving time-windowed counts should use the event’s actual timestamp, not the timestamp of when you saw it. This matters any time there’s a queue, buffer, or network between event and observer — which is, in a distributed system, everywhere. Classic distributed-systems lesson, re-learned the hard way.

Sanity-check the dashboard against basic arithmetic. The inflated-CPM bug was obvious in retrospect: 5 workers summing to roughly an 11-worker total is physically impossible. I should have caught it by eye, but I wasn’t in the habit of doing the math on what I was looking at. Now I am. If the totals don’t factor, something is lying.

Backward-compatible protocol changes save ops pain. .get("completed_at") with a None default means I could redeploy workers one at a time instead of coordinating an 11-machine atomic restart. Protocol additions want a default on the consumer side; protocol removals want a deprecation window on the producer side. Neither is free, but neither is expensive if you plan for it.

The lesson behind the lesson: monitoring is a product. Users — me, glancing at a dashboard at 9pm while the cluster runs overnight — consume it the way they consume any other product. If the product says “everything is fine” when everything is not fine, they act on that. The dashboard was not a neutral reporter of facts. It was an agent making claims, and I had accepted its claims without verifying the inference from the data to the claim. Monitoring that doesn’t model what you actually care about is not “minimal” — it’s misleading.


Key takeaways

01 · Connected ≠ working
Validate health by observing recent output, not by checking connection state. A socket can be open forever while the process behind it is dead.
02 · Event time, not receipt time
Any queue or buffer between event and observer distorts receipt time. Stamp the event at the source for time-series metrics.
03 · Staleness belongs in UI
Keep the server reporting raw observations; let the presentation layer decide what “fresh enough” means. You’ll change the threshold more than you change the data.
04 · Sanity-check the math
If the per-worker numbers don’t sum to the total the way they physically must, they don’t. The dashboard is lying to you; figure out which layer.
05 · Backward-compat saves ops pain
.get() with defaults when reading new fields lets producer and consumer roll out independently — which matters more than you think when the producer is eleven machines.