The Guard That Did Its Job

This is a post-mortem of a small, self-inflicted incident — the kind that’s worth writing down precisely because it ended well. I poked a known-risky thing on my Kubernetes cluster’s storage layer, it failed in exactly the way I’d been warned it would, nothing was lost, and the recovery was a clean two-minute rollback. The interesting parts are why it failed, why that failure was actually correct behavior, and the judgment call I got wrong on the way in.

The setup

My production cluster uses Longhorn for distributed block storage — it’s what holds my databases, my photos, my git server. A year ago, an attempted Longhorn upgrade had failed halfway: it tried to jump two minor versions at once (1.8 → 1.10), which Longhorn doesn’t allow. The upgrade’s own pre-flight check blocked it. But the failure left things in a weird split state — the CRDs (the API schema definitions) had advanced to 1.10, while the actual manager (the running software) stayed on 1.8.

It had been running fine like that for months. But “running fine in an unsupported state” is a time bomb, and I wanted it resolved before it bit me during some future maintenance.

My AI assistant investigated thoroughly and flagged a specific risk — twice — in plain terms: moving the manager forward to 1.9 while the CRDs sit at 1.10 might fail, because a 1.9 manager could choke on schema that’s newer than it understands. The safe path isn’t documented; get confirmation before touching production storage.

I read that, and I said push forward anyway.

The failure

We did the careful version first: cleaned up the year-old “failed” state so the records matched reality (back to a consistent 1.8.1), verified all 21 storage volumes healthy, took a fresh backup. All green. Then we upgraded the manager 1.8 → 1.9.

The manager pods came up… and immediately began crash-looping. The fatal error, from the logs:

Upgrade failed: upgrade resources failed: Replica in version "v1beta2"
cannot be handled as a Replica: strict decoding error:
unknown field "status.evictionRequested"

There it was — the exact risk, made concrete. On startup, the 1.9 manager runs a migration over all the stored objects. It tried to process a Replica and strictly decoded it against the schema — but the schema was the 1.10 CRD, which had removed a field (status.evictionRequested) that the 1.9 code still references. New software, but an even-newer schema with the floor pulled out from under it. Strict decoding refuses to ignore the mismatch, so the manager exits. Repeatedly.

Why this was the software working correctly

Here’s the part I find genuinely satisfying, and the reason this isn’t a bug report.

That status.evictionRequested field wasn’t removed by accident. Longhorn deprecated it in v1.6.0 and deliberately deleted it in v1.10.0 as planned cleanup. The strict-decode failure is the system refusing to silently operate on data whose schema it doesn’t agree with — which is exactly what you want a storage controller to do. A more lenient system would have shrugged, ignored the unknown field, and maybe corrupted state in some subtle way later. Longhorn chose to fall over loudly instead.

So the crash wasn’t a defect. It was a guard. The cluster was in an unsupported state — CRDs two minor versions ahead of the manager — that my own year-old failed upgrade had created, and the guard correctly refused to proceed. Filing a GitHub bug here would be wrong; the right characterization is “operator put the system in an unsupported state, and the safety check caught it.” The most useful thing to do with that is write it down, not report it.

Why nothing was lost

The reason this is a footnote and not a disaster: the data plane and the control plane are separate. longhorn-manager is the orchestrator — it decides where volumes live, handles attach/detach, runs migrations. The actual reading and writing of your bytes is done by separate engine and replica processes that keep running independently. So while the manager was crash-looping, every one of my 21 volumes stayed healthy and every database kept serving. The control plane was down; the storage itself never flinched.

That distinction is the whole reason an incident like this is survivable. If your orchestrator and your data path share fate, a control-plane bug is a data-loss event. When they don’t, it’s a bad ten minutes.

Recovery was a single command: roll the manager back to the 1.8.1 it had been happily running for months. Two minutes later, all seven manager pods were back to healthy, 21 volumes still green, databases none the wiser.

The judgment call I got wrong

I’ll name it plainly, because it’s the actual lesson. My assistant flagged this exact failure mode, specifically, before I ran it. I had backups, I had the data/control-plane separation, I had a rollback path — so the blast radius was controlled, and that’s why I was comfortable. But “the failure is survivable” is not the same as “the failure is worth triggering.” I spent a real (if small) chunk of risk and downtime to learn something I’d been told.

What I got right was the scaffolding: every dangerous step this whole month — the backups, the off-site copies, the restore drills — was built precisely so that when something did go wrong on the storage layer, it would be a shrug instead of a catastrophe. The incident was, in a sense, a live-fire test of all that preparation. It passed.

What I’d do differently: when the person (or the AI) who did the investigation says “this specific thing might break, get confirmation first,” the cost of getting that confirmation is a day’s wait. The cost of skipping it is a production incident. Even a survivable one isn’t free.

Where it stands

The cluster is healthy, back on the stable version, with the year-old mess at least cleaned up to a consistent state. The actual version upgrade is parked until I’ve confirmed the right sequence — almost certainly jumping the manager straight to 1.10 to match the CRDs rather than stepping into the gap. There’s no rush. It’s been fine for a year; it’ll be fine for another week while I do it right.

The backups held. The guard held. The separation held. The only thing that didn’t hold was my patience — and that one’s on me, not the software.