Infrastructure-as-code that survives the team that wrote it

We’ve inherited a lot of Terraform. The pattern is depressingly consistent: a repo started by a founder or an early platform engineer who knew exactly what they meant, then accreted layers as the team grew, then became something nobody wanted to touch.

It’s never the tool’s fault. It’s the architecture. Here’s what we’ve found makes the difference between IaC that ages well and IaC that becomes a museum.

The single biggest mistake: one-state-to-rule-them-all

A surprising number of teams keep everything — networking, databases, IAM, apps, secrets — in a single Terraform state. It works for a while. Then:

A plan takes 4 minutes
An apply locks the state for everyone
One person’s change to a Lambda blocks the database team
A terraform destroy becomes impossible to reason about

The fix is boring: split state by blast radius. Things that change together stay together; things that don’t, don’t.

A typical layout we use:

stacks/
  platform/         # VPC, subnets, IAM baseline, KMS, route53
  data/             # RDS, Redis, S3 buckets, replication
  shared-services/  # observability, secrets, service mesh
  apps/
    api/
    worker/
    web/

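Each stack has its own state. Cross-stack dependencies go through terraform_remote_state or, better, a small set of well-known SSM parameters or Secrets Manager entries. The parameter route is better because terraform_remote_state needs read access to the producing stack’s entire state snapshot, while a parameter exposes only the values you chose to publish. Apps don’t know how the VPC is built; they just ask for the subnet IDs.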
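As a minimal sketch of the SSM approach, the platform stack publishes the subnet IDs and an app stack reads them back. The parameter path, the local name, and the `aws_subnet.private` resource (assumed to be declared with `count` elsewhere in the platform stack) are illustrative assumptions, not names from a real repo.

```hcl
# --- stacks/platform (producer) ---
# Assumes aws_subnet.private is defined with `count` in this stack.
resource "aws_ssm_parameter" "private_subnet_ids" {
  name  = "/platform/network/private_subnet_ids" # hypothetical path convention
  type  = "StringList"
  value = join(",", aws_subnet.private[*].id)
}

# --- stacks/apps/api (consumer) ---
# Reads the published value without touching the platform stack's state.
data "aws_ssm_parameter" "private_subnet_ids" {
  name = "/platform/network/private_subnet_ids"
}

locals {
  # StringList parameters come back as one comma-separated string.
  private_subnet_ids = split(",", data.aws_ssm_parameter.private_subnet_ids.value)
}
```

The AWS provider marks `value` on this data source as sensitive, so the IDs are redacted in plan output. The parameter path becomes the contract between stacks: the platform team can rebuild the VPC however it likes as long as the path keeps resolving.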

This one change, splitting state, has typically cut plan times by around 10x in the repos we’ve worked on, and it lets multiple engineers work in parallel without stepping on each other.

Modules, not abstractions

The second pattern that hurts is the urge to wrap everything in a custom module. “We don’t want people writing raw aws_instance resources.” So now there’s a homegrown company_ec2_instance module that’s a thin wrapper around the AWS one — with three opinions baked in and twelve variables that nobody understands.

A year later, somebody needs to do something the module doesn’t support. They open the module, get confused, copy-paste-modify it inline, and now you have three slightly different versions of the same thing.

Our heuristic: write modules where you have a real, repeated pattern that benefits from being named — a microservice, a Postgres cluster with the team’s standard backup config, an EKS node group with the right tags. Don’t write modules to “wrap” upstream resources. The upstream provider is fine.
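To make the distinction concrete, here is a rough sketch of a module that names a pattern rather than wrapping a resource: a Postgres instance with a standard backup configuration. The module path, variable names, and every concrete value (retention days, storage size, instance class, username) are assumptions chosen for illustration.

```hcl
# modules/postgres/main.tf (hypothetical path)

variable "name" {
  description = "Identifier for the database instance."
  type        = string
}

variable "instance_class" {
  description = "RDS instance class."
  type        = string
  default     = "db.t4g.medium"
}

variable "subnet_group_name" {
  description = "Existing DB subnet group to place the instance in."
  type        = string
}

resource "aws_db_instance" "this" {
  identifier           = var.name
  engine               = "postgres"
  instance_class       = var.instance_class
  allocated_storage    = 50
  db_subnet_group_name = var.subnet_group_name

  username                    = "app_admin"
  manage_master_user_password = true # RDS stores the password in Secrets Manager

  # The opinions this module exists to encode (illustrative values):
  backup_retention_period   = 14
  storage_encrypted         = true
  deletion_protection       = true
  skip_final_snapshot       = false
  final_snapshot_identifier = "${var.name}-final"
}

output "endpoint" {
  value = aws_db_instance.this.endpoint
}
```

Three inputs and a handful of fixed opinions is about the right size. Once a module like this grows a variable for every upstream argument, it has turned back into the thin wrapper described above.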

State backends + remote execution

If your team has more than two engineers, your state belongs in a remote backend with locking. S3 + DynamoDB works fine. Terraform Cloud (now HCP Terraform), Spacelift, or env0 work better: they give you per-PR plan visibility and a permission model.
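For the S3 + DynamoDB option, a per-stack backend block might look like the sketch below. The bucket name, table name, key, and region are placeholders. The DynamoDB table is assumed to already exist with a string partition key named `LockID`, which is what the S3 backend expects.

```hcl
# stacks/platform/backend.tf (placeholder names throughout)
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"           # assumed, pre-created bucket
    key            = "stacks/platform/terraform.tfstate" # one key per stack
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"                   # table with a "LockID" string key
    encrypt        = true
  }
}
```

Giving each stack its own `key` is what keeps the state split from the layout above: a lock taken while applying one stack does not block another. Newer Terraform releases also offer S3-native lock files, so check the backend documentation for the version you run.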

Whatever you pick: nobody applies from their laptop. Period.

The mental shift is: the IaC repo isn’t a script you run, it’s the source of truth that an applier acts on. PRs run plans. Merges run applies. State doesn’t live on anyone’s machine.

The pre-merge `plan` is your code review

Every PR should post a plan output as a comment. Atlantis, Terraform Cloud, and Spacelift all do this out of the box. Once you have it, your reviewers can actually see what’s about to happen — not just what the HCL says, but what AWS is about to do.

In our experience, this single feedback loop catches more would-be incidents than any policy-as-code framework. It’s also how junior engineers learn what their changes mean.

Drift detection isn’t optional

Manual changes in the console will happen. Somebody will be on-call at 2am, fix a thing, and forget to bring it back into code. Without drift detection, your IaC is silently lying to you, and the next person to plan in that area is going to get a surprise.

Cheap version: a nightly scheduled job that runs terraform plan -detailed-exitcode (exit code 2 means the plan found changes) and posts to Slack if anything’s drifted. Better version: drift detection in Terraform Cloud / Spacelift with auto-remediation policies.

Policy-as-code, applied lightly

OPA / Sentinel are great. They’re also great at turning your platform team into the bottleneck if you pile on too many rules.

The policies that have paid for themselves in our work:

No public S3 buckets without an explicit :public tag
No security groups allowing ingress from 0.0.0.0/0 on non-HTTPS ports
No resources without an owner tag
No RDS without backups enabled

That’s it. Four guardrails that catch real mistakes. Save the rest for after the team is mature enough to want them.
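The rules above belong in OPA or Sentinel, but the owner-tag rule can also be backed up in plain Terraform. The sketch below is a complement to a policy engine, not a replacement for one: it requires a non-empty `owner` variable (a hypothetical name) and applies it through the AWS provider’s `default_tags`. The region is a placeholder.

```hcl
variable "owner" {
  description = "Team that owns the resources in this stack (hypothetical variable)."
  type        = string

  validation {
    condition     = length(trimspace(var.owner)) > 0
    error_message = "Set owner to a non-empty team name."
  }
}

provider "aws" {
  region = "us-east-1" # placeholder

  # Applied to the taggable resources this provider manages.
  default_tags {
    tags = {
      owner = var.owner
    }
  }
}
```

This makes the compliant path the default one, so the policy check mostly catches resources created outside the standard provider configuration. Not every resource type honors `default_tags`, so the policy is still worth keeping as a backstop.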

Handoff is part of the work

When we finish an IaC rewrite, we don’t just hand over a repo. We hand over:

A README that explains the layout in two pages
A “how to do common things” doc — add a service, add a queue, rotate a secret, bootstrap a new env
An onboarding PR that a new engineer can use to learn by doing
A decision log (docs/decisions/) explaining why things are the way they are

The test for a good IaC codebase is: can a new senior engineer ship a safe change in their first week, without asking anyone? If yes, you’re done. If no, the work isn’t finished.