Kubernetes, the boring way — Axio Intelligence

Kubernetes is not a problem. Kubernetes set up by someone who’s no longer on the team is a problem.

Most of the k8s clusters we inherit have the same two flaws: they were assembled from a hundred good blog posts, and the person who assembled them is gone. The cluster runs, mostly. Upgrades terrify everyone. Nobody can quite explain why ingress works the way it does.

Here’s the stack we default to when we bootstrap a cluster, and the rules of thumb that come with it. The goal is a cluster a team of eight can run without a dedicated platform engineer.

The base

Managed control plane. EKS, GKE, or AKS. Self-managed control planes are a discipline. Most teams don’t need that discipline.
One cluster per environment, per region. Not one cluster with namespaces for prod/staging/dev. Blast radius and IAM boundaries matter too much to leave to namespaces.
Node pools split by workload. A general pool, a memory-intensive pool, a spot pool for stateless workloads that tolerate interruption. Use Karpenter on EKS, or the provider’s built-in cluster autoscaler elsewhere. Don’t pre-size for peak.

Bootstrap with IaC, then never touch it manually

The cluster, node groups, IAM, networking, KMS keys — all in Terraform/OpenTofu. The in-cluster bits — controllers, ingress, monitoring agents — installed via a bootstrap pipeline that the IaC kicks off.

After bootstrap, the only thing that changes the cluster is GitOps. kubectl apply from a laptop is forbidden. If you can’t do it through a PR, you don’t do it.

GitOps with ArgoCD or Flux

We use ArgoCD more often because the UI is helpful when you’re explaining the system to a new team member. Flux is great if you prefer everything to be CRD-driven and don’t need the dashboard.

The structure we like:

gitops/
  apps/                # one folder per app, with kustomize overlays per env
    api/
      base/
      overlays/
        staging/
        prod/
  platform/            # cluster-wide things
    cert-manager/
    external-dns/
    ingress-nginx/
    kube-prometheus/

ArgoCD watches the apps/ and platform/ paths. Engineers ship by opening a PR against the overlays. Promotions between environments are PRs, not buttons.
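
As a rough illustration of how the layout above maps onto ArgoCD, here is a minimal Python sketch that builds an Application manifest for one app and one environment. The repo URL, the main branch, the argocd namespace, the default project, and deploying each app into a namespace named after it are all assumptions made for the example. It also assumes each cluster runs its own ArgoCD, which is why the destination is the in-cluster API address.

```python
import json

# Hypothetical values for illustration; none of these come from a real setup.
REPO_URL = "https://git.example.com/acme/gitops.git"
IN_CLUSTER = "https://kubernetes.default.svc"  # the cluster ArgoCD itself runs in


def app_manifest(app: str, env: str) -> dict:
    """Build an ArgoCD Application pointing at apps/<app>/overlays/<env>."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"{app}-{env}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": REPO_URL,
                "targetRevision": "main",
                "path": f"apps/{app}/overlays/{env}",
            },
            # Assumption: each app deploys into a namespace named after it.
            "destination": {"server": IN_CLUSTER, "namespace": app},
            # Automated sync with selfHeal reverts manual drift back to Git.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


if __name__ == "__main__":
    # JSON is valid YAML, so this output can be committed as a manifest.
    print(json.dumps(app_manifest("api", "prod"), indent=2))
```

With selfHeal enabled, a change made by hand in the cluster is reverted to whatever Git says, which is one way to enforce the "PRs only" rule rather than just state it.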

Ingress: pick one, stop arguing

ingress-nginx is the boring choice and the right choice for most teams. AWS Load Balancer Controller if you specifically want ALBs per ingress. Don’t run three ingress controllers because three different teams had opinions in the same week.

If you genuinely need a service mesh — and most teams don’t — Linkerd. It has the lowest operational tax of any mesh we’ve run.

Secrets

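Don’t use Kubernetes Secrets as your source of truth. Keep secrets in a real secret manager (AWS Secrets Manager, GCP Secret Manager, or Vault) and sync them into the cluster with External Secrets Operator, which materializes them as ordinary Kubernetes Secret objects for pods to consume. This way: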

Rotation happens in one place
Audit logs are real
Compromising a cluster exposes only the secrets synced into it, not everything in the store
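
For a sense of what the sync looks like in practice, here is a minimal Python sketch that builds an ExternalSecret manifest. The store name, secret path, key names, and one-hour refresh interval are hypothetical, and the v1beta1 API version is an assumption; check which version your installed operator serves.

```python
import json


def external_secret(name: str, namespace: str, store: str,
                    remote_key: str, keys: list[str]) -> dict:
    """Build an ExternalSecret that copies selected properties of one remote
    secret into a Kubernetes Secret with the same name."""
    return {
        # Assumption: the installed operator serves v1beta1.
        "apiVersion": "external-secrets.io/v1beta1",
        "kind": "ExternalSecret",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # How often the operator re-reads the remote secret.
            "refreshInterval": "1h",
            "secretStoreRef": {"name": store, "kind": "ClusterSecretStore"},
            # Name of the Kubernetes Secret the operator creates and owns.
            "target": {"name": name},
            "data": [
                {"secretKey": k, "remoteRef": {"key": remote_key, "property": k}}
                for k in keys
            ],
        },
    }


if __name__ == "__main__":
    # All names below are made up for the example.
    manifest = external_secret(
        name="api-db",
        namespace="api",
        store="aws-secrets-manager",
        remote_key="prod/api/db",
        keys=["username", "password"],
    )
    print(json.dumps(manifest, indent=2))
```

The manifest holds only references to the secret, never its value, so it is safe to keep in the GitOps repo next to the rest of the app.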

Observability

The default we ship:

Metrics: Prometheus + Grafana, deployed via kube-prometheus-stack. One Grafana per cluster, federated up if you have many.
Logs: Loki or whatever your team is already paying for (Datadog, Splunk, etc.). Don’t run your own Elasticsearch unless logs are your business.
Traces: OpenTelemetry Collector → your tracing backend of choice.

The point isn’t the tools. The point is that on day one, every new service automatically gets metrics, logs, and traces without the team having to opt in.
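
One possible way to get the "no opt-in" behavior for metrics is a single cluster-wide ServiceMonitor that scrapes any Service carrying a shared label, with that label added in each app's kustomize base. The Python sketch below builds such a manifest. The label monitoring=enabled, the port name metrics, the 30-second interval, and the release name are assumptions. With default chart values, kube-prometheus-stack typically only picks up ServiceMonitors labeled with the Helm release name, so check your values before relying on this.

```python
import json


def catch_all_service_monitor(release: str, namespace: str = "monitoring") -> dict:
    """Build a ServiceMonitor that scrapes labeled Services in every namespace."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": "default-app-metrics",
            "namespace": namespace,
            # Assumption: Prometheus selects ServiceMonitors by this label.
            "labels": {"release": release},
        },
        "spec": {
            # Look for matching Services in all namespaces, not just this one.
            "namespaceSelector": {"any": True},
            # Hypothetical label that every app's base adds to its Service.
            "selector": {"matchLabels": {"monitoring": "enabled"}},
            # Assumes each Service exposes a port named "metrics".
            "endpoints": [{"port": "metrics", "path": "/metrics", "interval": "30s"}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(catch_all_service_monitor("kube-prometheus-stack"), indent=2))
```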

Upgrades

The single most important question we ask a team that has Kubernetes: “what’s your upgrade story?”

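If the answer is “we’ll figure it out when we have to,” they’re going to have a bad time. Kubernetes ships a new minor version roughly every four months, and upstream only patches the three most recent minors, so a cluster left alone for much more than a year is running a release that no longer gets security fixes.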

Our default:

Cadence: upgrade one minor version per quarter.
Process: dev → staging → prod, one week apart each, with a checklist.
Documentation: every upgrade gets a one-page runbook of what changed and what broke.
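
The cadence and process above are simple enough to sketch in a few lines of Python. This is an illustration only: the version numbers and start date are made up, and the function names are hypothetical. It moves exactly one minor version at a time, which matches the one-minor-per-quarter rule, and spaces the environments one week apart.

```python
from datetime import date, timedelta

# Order in which environments are upgraded.
ENVIRONMENTS = ["dev", "staging", "prod"]


def next_minor(version: str) -> str:
    """Return the next minor version, e.g. '1.29' -> '1.30'."""
    major, minor = version.split(".")[:2]
    return f"{major}.{int(minor) + 1}"


def rollout_plan(current: str, start: date) -> list[tuple[str, str, date]]:
    """Plan a one-minor upgrade: dev, then staging, then prod, a week apart."""
    target = next_minor(current)
    return [
        (env, target, start + timedelta(weeks=i))
        for i, env in enumerate(ENVIRONMENTS)
    ]


if __name__ == "__main__":
    # Example inputs are invented for the sketch.
    for env, target, day in rollout_plan("1.29", date(2025, 1, 6)):
        print(f"{day.isoformat()}  {env:<8} upgrade to {target}")
```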

The first upgrade is painful. The third one is boring. That’s the goal.

Hardening: enough to sleep at night

Pod Security Standards (restricted for app namespaces)
Network policies that default-deny ingress between namespaces
IRSA on EKS, or Workload Identity on GKE and AKS, instead of node-level IAM
Audit logging shipped off-cluster
A schedule for re-running CIS benchmarks

You don’t have to be CKS-certified to run a safe cluster. You do have to do these five things.
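
The first two items on the list can be expressed as a pair of manifests per app namespace. The Python sketch below builds them; the namespace name and policy name are hypothetical. The NetworkPolicy allows ingress only from pods in the same namespace, which has the effect of denying traffic from every other namespace. It only does anything if the cluster's CNI plugin enforces NetworkPolicy.

```python
import json


def restricted_namespace(name: str) -> dict:
    """Namespace labeled so Pod Security Admission enforces 'restricted'."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {
                "pod-security.kubernetes.io/enforce": "restricted",
                "pod-security.kubernetes.io/warn": "restricted",
            },
        },
    }


def deny_cross_namespace_ingress(namespace: str) -> dict:
    """NetworkPolicy that only admits ingress from the same namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "deny-from-other-namespaces", "namespace": namespace},
        "spec": {
            # An empty podSelector applies the policy to every pod here.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            # A bare podSelector in 'from' matches pods in this namespace only.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }


if __name__ == "__main__":
    for manifest in (restricted_namespace("api"), deny_cross_namespace_ingress("api")):
        print(json.dumps(manifest, indent=2))
```

A policy like this also blocks traffic from the ingress controller's namespace, so anything exposed through ingress needs an additional allow rule for that namespace.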

What this gets you

A cluster a small team can run for a year without major incidents. Upgrades that take an afternoon instead of a sprint. A platform that new engineers can understand from a README, not from interviewing the founder.

If your Kubernetes setup currently requires the founder to operate it, that’s the kind of cleanup we do.