The Backup That Lived on What It Protected

After building out proper backups for my databases, I asked the obvious follow-up: is the cluster itself backed up? Not the apps — the cluster. Two pieces: the Rancher control plane, and etcd, which holds every workload definition, secret, and scrap of config in my production cluster.

Rancher was fine — a daily backup to the NAS, swept off-site. But etcd had a quiet problem. RKE2 was dutifully snapshotting it every few hours… onto each control-plane node’s own local disk. Nowhere else. The backup of the cluster was living on the very disks it was meant to save me from. If a node’s disk died, its snapshots died with it. A backup that shares fate with the thing it protects isn’t really a backup.

So: ship etcd snapshots off-site to Backblaze B2. Simple enough. Except it taught me two things worth writing down.

Don’t hand-edit what a robot owns

My instinct was to SSH in and add the S3 config to the RKE2 config file. Good thing I checked first — the cluster is Rancher-provisioned, and rancher-system-agent owns that file. Hand-edit it and the agent reconciles it right back, or worse, you get a drift fight you didn’t sign up for.

The right move was to set it at the source: a cloud credential in Rancher’s store, referenced from the cluster’s provisioning object. Rancher then pushed the config to all three control-plane nodes itself — cleanly, no node ever leaving the Ready state, etcd quorum never at risk. Configure the system through its own front door, not the back.
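
A minimal sketch of what "set it at the source" can look like, assuming the etcd S3 settings sit under spec.rkeConfig.etcd.s3 on Rancher's provisioning Cluster object and point at a cloud credential by name. The bucket, endpoint, region, and credential name below are hypothetical placeholders, not the values from this cluster, and the field names should be checked against your Rancher version before use.

```python
import json


def etcd_s3_patch(bucket, endpoint, region, credential_name, folder=""):
    """Build a merge patch for the etcd S3 section of a provisioning Cluster.

    Field names are an assumption based on Rancher's provisioning Cluster
    schema; verify them against your Rancher version.
    """
    s3 = {
        "bucket": bucket,
        "endpoint": endpoint,
        "region": region,
        # Reference to a credential in Rancher's store, so no keys live here.
        "cloudCredentialName": credential_name,
    }
    if folder:
        s3["folder"] = folder
    return {"spec": {"rkeConfig": {"etcd": {"s3": s3}}}}


# Hypothetical placeholder values, for illustration only.
patch = etcd_s3_patch(
    bucket="example-etcd-snapshots",
    endpoint="s3.us-west-004.backblazeb2.com",
    region="us-west-004",
    credential_name="cattle-global-data:cc-example",
)
print(json.dumps(patch, indent=2))
```

The point of the shape is that the patch targets the cluster object Rancher manages, not a file on a node, so rancher-system-agent does the rollout and there is nothing for it to reconcile away.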

When the tool won’t take your nice key

I’d been giving each backup job a tightly-scoped cloud key — locked to a single folder, blind to everything else. I tried the same for etcd. RKE2 rejected it: Access Denied.

Turns out RKE2’s S3 client insists on checking whether the bucket exists before it’ll write, and that check needs a bucket-level permission my folder-scoped key didn’t have. You can’t always make the tool meet your security model. So I gave etcd its own dedicated bucket and a key scoped to that whole bucket. Same isolation — a leaked etcd key still can’t touch anything else — just at bucket granularity instead of folder. The principle survived; the implementation bent to fit the tool.
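
The failure and the fix can be sketched with a toy permission model. This is not Backblaze's or RKE2's actual authorization logic, just an illustration of the reasoning: a key restricted to a folder cannot answer a question about the bucket as a whole, while a key scoped to one dedicated bucket can, and still reaches nothing else. All names here (Key, allowed, upload_snapshot, the bucket names) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Toy model only: real providers evaluate permissions with their own rules.
BUCKET_LEVEL_OPS = {"bucket_exists"}
OBJECT_LEVEL_OPS = {"put_object", "get_object"}


@dataclass(frozen=True)
class Key:
    bucket: str
    prefix: Optional[str] = None  # None means the key covers the whole bucket


def allowed(key, op, bucket, object_name=""):
    """Return True if the key may perform op on the given bucket/object."""
    if bucket != key.bucket:
        return False
    if op in BUCKET_LEVEL_OPS:
        # A folder-restricted key cannot act on the bucket as a whole.
        return key.prefix is None
    if op in OBJECT_LEVEL_OPS:
        return key.prefix is None or object_name.startswith(key.prefix)
    return False


def upload_snapshot(key, bucket, object_name):
    """Mimic a client that checks the bucket exists before writing."""
    if not allowed(key, "bucket_exists", bucket):
        return "Access Denied"
    if not allowed(key, "put_object", bucket, object_name):
        return "Access Denied"
    return "uploaded"


folder_key = Key(bucket="shared-backups", prefix="etcd/")
bucket_key = Key(bucket="etcd-only")

print(upload_snapshot(folder_key, "shared-backups", "etcd/snap-1"))  # Access Denied
print(upload_snapshot(bucket_key, "etcd-only", "snap-1"))            # uploaded
print(upload_snapshot(bucket_key, "shared-backups", "etcd/snap-1"))  # Access Denied
```

The third call is the isolation property that matters: the dedicated-bucket key works for etcd and is useless against any other bucket.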

Proof, not hope

As always, the last step is the only one that counts: I triggered a snapshot using the freshly-reconciled config and watched a 36 MB etcd snapshot land in B2. Then I wrote down the restore command, because a backup you don’t know how to restore is just a file with good intentions.
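
The same "did it actually land" check can be written down as code rather than eyeballed. A small sketch, assuming you already have a bucket listing as (name, size, last-modified) tuples from whatever S3 client you use; the function name, the 12-hour threshold, and the object names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def newest_snapshot_is_fresh(objects, now, max_age=timedelta(hours=12), min_bytes=1):
    """Check that the most recent object in a listing is recent and non-empty.

    objects: iterable of (name, size_bytes, last_modified) tuples.
    """
    objects = list(objects)
    if not objects:
        return False
    # Pick the entry with the latest modification time.
    _name, size, modified = max(objects, key=lambda o: o[2])
    return size >= min_bytes and (now - modified) <= max_age


# Hypothetical listing, for illustration only.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
listing = [
    ("snapshot-a", 36_000_000, now - timedelta(hours=30)),
    ("snapshot-b", 36_000_000, now - timedelta(hours=3)),
]
print(newest_snapshot_is_fresh(listing, now))
```

It only proves a recent file exists off-site. It says nothing about whether that file restores, which is why the restore command still gets written down.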

Three layers now: rebuild Rancher, rebuild the cluster from etcd, restore the data inside it — each with an off-site copy that doesn’t share a disk, a NAS, or a fate with what it protects.

Now I can finally go do the thing this whole detour was blocking: actually patching the cluster.