Who Pushes the Button?

I have an AI agent (Claude Code, or CC) that helps run my homelab. It has its own workstation, kubectl access to both clusters, SSH into every node. So a fair question started nagging at me: in the age of agents, is Ansible dead?

I run a 7-node Kubernetes cluster plus a management box. Keeping them patched — OS packages, kernels, MicroK8s, RKE2, Rancher, Longhorn — is the kind of repetitive, careful, multi-step work that you’d think an agent should just do. Ansible has been the tool for it: a serial: 1 playbook that drains a node, runs apt upgrade, reboots, uncordons, moves to the next. Deterministic, boring, reliable.

But I have a thing now that can read release notes, reason about upgrade order, watch a rollout, and notice when something’s off. Why am I still maintaining YAML playbooks?

So I asked CC to actually think it through — not give me a hot take, but fan out a small swarm of sub-agents to analyze each update domain against my real infrastructure, run an adversarial “why is this dangerous” pass, and synthesize an architecture. Read-only. No touching anything.

What came back changed how I think about where agents belong.

The verdict: Ansible isn’t dead, it’s the wrong layer to kill

Cluster updates are the worst possible place to hand an LLM the steering wheel for the actual mutations. That domain wants determinism, idempotency, version pinning, dry-run, and rollback — exactly the things a freeform agent is bad at. You can’t git diff two agent runs. You can’t replay one. And the failure modes are brutal: drain the wrong node and you can lose etcd quorum; reboot the box running Rancher and you cut off the very API access you’d use to diagnose it.

But the agent layer isn’t useless. It’s at a different altitude. The winning pattern is:

The agent orchestrates and adjudicates. The deterministic tooling executes.

The agent owns judgment and glue: read the release notes, check for breaking changes, decide the order, run the pre-flight, watch the rollout, interpret failures, decide go/no-go, write the summary.
The deterministic layer (Ansible, Fleet, helm, RKE2’s System Upgrade Controller) does every actual mutation, behind explicit gates, with a written rollback.

Ansible doesn’t die. It becomes the hands — and the audit artifact. The agent becomes the eyes and the brain that decides when, and whether, to use them. The thing that gets retired isn’t Ansible. It’s me, babysitting each step.

The boundary, drawn

Here’s the split the analysis landed on, domain by domain:

Domain	Agent owns	Stays deterministic
Worker OS patching	drift report, drain-risk triage, between-node go/no-go, post-run verify	the `serial:1` drain → upgrade → reboot → uncordon loop
etcd / control-plane	etcd health check between members, HOLD if unhealthy	one-at-a-time reboots with a hard quorum gate
RKE2 version	research target vs support matrix, watch the roll	pinned version bump, committed as GitOps
Rancher	verify backup, flag blockers, recommend go/abort	pinned `helm upgrade` + staged rollback
Longhorn	go/no-go health report	human-run, version-pinned, hooks enabled

The rule across every row: the agent may prepare, gate, and propose. The irreversible act — draining past a disruption budget, rebooting, bumping a version — is executed by the deterministic layer against a human-approved, diffable plan.

Then the audit found real landmines

This is where it stopped being theoretical. Because the sub-agents read my actual setup, they didn’t just design — they found things. Live, today, in my homelab:

The dangerous one. My Ansible inventory had all seven cluster nodes in one kubernetes_workers group — including the three that run etcd and the control plane. The serial: 1 worker playbook would happily drain and reboot an etcd member with no quorum check. A three-member etcd cluster tolerates exactly one node down. A routine patch run, on a slightly slow day, could have taken the whole production API offline. That had been sitting there, latent, for over a year.

The footgun. The playbook’s management-host step targeted localhost. Fine when you run it on the management box. But CC moved to its own workstation months ago — run it from there and localhost is the wrong machine. It would apt upgrade the workstation and try to refresh a MicroK8s that isn’t even installed on it.

The quiet rot. The management host’s update step only ever printed “reboot required” — it never rebooted. Result: that box was running a months-old kernel with three newer ones installed-but-unbooted, reboot-required flag set the whole time.

The drift. My own notes said the cluster ran RKE2 v1.34.2. It was actually on v1.34.4. One worker was on a different kernel than its peers, despite a note claiming they were all aligned. The map had quietly diverged from the territory.

None of this came from an agent “deciding” to fix things. It came from an agent reading carefully and reporting what it saw — exactly the judgment layer, doing exactly its job. A linter would never have caught the etcd grouping; it’s not a syntax error, it’s a semantic landmine that requires understanding what those nodes do.

The architecture: who stands where

The fix that fell out of this is mostly about placement. There’s a bootstrapping paradox at the center of cluster self-management: a thing can’t cleanly reboot the ground it’s standing on. An updater pod that drains its own node kills itself. A playbook that reboots the host it’s running on severs its own connection mid-run.

The answer is to put the control point outside the thing it manages.

┌───────────────────────────────────────────────────────────┐
│                     Homelab Network                        │
│                                                            │
│   ┌─────────────┐                                          │
│   │   ccmint    │   external control node                  │
│   │ (CC + the   │   - outside both clusters AND hp01       │
│   │  agent)     │   - runs the playbooks over SSH          │
│   └──────┬──────┘   - can reboot anything without          │
│          │            killing itself                       │
│          │ drives (SSH / kubectl)                          │
│    ┌─────┴───────────────────────┬─────────────────────┐   │
│    ▼                             ▼                     ▼   │
│ ┌──────────┐        ┌──────────────────┐     ┌──────────┐  │
│ │ ubuntu-  │        │  etcd / control  │     │  workers │  │
│ │  hp01    │        │  plane  (x3)     │     │   (x4)   │  │
│ │ MicroK8s │        │  ── quorum gate, │     │ serial:1 │  │
│ │ Rancher  │        │     by hand      │     │ drain →  │  │
│ │ (mgmt)   │        │     for now      │     │ patch →  │  │
│ └──────────┘        └──────────────────┘     │ reboot → │  │
│                                              │ uncordon │  │
│                                              └──────────┘  │
└───────────────────────────────────────────────────────────┘
        the agent thinks here ▲ ... deterministic tools act ▼

So, concretely, what I built today:

The workstation is the external control node. It’s outside both clusters and outside the management host, so it can drain or reboot anything — including the management box — without rebooting the machine running the playbook. The agent thinks here; the deterministic tools execute from here.
The inventory now separates etcd/control-plane from workers. Two groups. The worker playbook physically cannot touch an etcd member. The landmine is defused at the structural level, not by remembering to be careful.
Two playbooks, split by the reboot paradox. Worker updates and management-host updates are separate, both driven from the external node so nothing reboots itself out from under the runner.
Drain failures halt the run — except the one known, documented exception (a storage disruption-budget timeout I understand and tolerate by name). The old playbook swallowed all drain failures silently. Now an unexpected one stops everything for a human to look.
Uncordon waits for the node to actually report Ready. The old version uncordoned blindly, which would happily schedule workloads onto a node that came back broken.
Source of truth lives in Git. The playbooks now live in a versioned repo. The old reality was two divergent copies on two machines — exactly the drift that lets a fix on one box never reach the other.

What stays manual — on purpose

The phasing matters as much as the architecture. The plan is crawl, walk, run:

Crawl: a read-only pre-flight advisor that checks node health, etcd quorum, storage, version drift, and posts a go/no-go verdict. Zero execution. Pure upside. (It would have flagged every one of the landmines above.)
Walk: gated execution of the lowest-risk domain only — worker OS patching — where a human approves the pinned plan and a fixed runner executes it.
Run: expand to etcd nodes with the hard quorum gate, then version bumps via GitOps — but only after the safe stuff has run cleanly many times.

And some things stay human-driven indefinitely: the etcd/control-plane reboots, the Rancher upgrades (a bad one is a double outage with no peer to fall back to), the Longhorn version jumps. The agent advises on all of them, forever. It never holds the trigger.

That’s the whole philosophy in one line: the agent can prepare the button and tell you whether to push it. For the dangerous ones, you push it.

The meta moment

Yes — CC designed this, found the landmines in my own infrastructure, fixed the playbooks, set up the external control node, committed it to Git, and wrote this post about it. There’s the recursion again.

But notice which parts it did. It read, reasoned, proposed, and documented. It restructured config files — text, reversible, reviewable. What it did not do is drain a node or reboot a host on its own initiative. Every actual mutation to the running cluster is still gated behind a deterministic tool and, for the dangerous tier, behind me.

That’s not a limitation I’m working around. That’s the design. The most useful thing my AI did this week wasn’t pushing a button. It was reading carefully enough to find the button I didn’t know was armed — and then building a system where pushing it is safe.

Ansible’s not dead. It just finally has someone smart enough to know when not to run it.