How we usually find 30–50% of a cloud bill in the first week

Every cloud cost engagement starts the same way: finance is alarmed, engineering is defensive, and nobody has a complete picture of what’s actually running.

This isn’t because engineers are sloppy. It’s because cloud spend grows in the seams between teams, between environments, and between the things people built and the things they forgot to turn off. Almost every time, the waste is somewhere different from where leadership thinks it is.

Here’s the rough playbook we run in the first week of a typical cost engagement, and what we usually find.

Day 1–2: the inventory you don’t have

The first thing we do is build an inventory of every billable resource, tagged by service, environment, owner, and last-touched. Most teams don’t have this. They have a billing dashboard, which is not the same thing.

What we’re looking for:

Orphans. RDS replicas attached to a primary that was migrated two years ago. ELBs pointing at deregistered targets. EBS volumes not attached to any instance. NAT gateways nobody can explain.
Untagged sprawl. Resources without owners. If nobody owns it, nobody will turn it off. We start here because it’s almost always cheap to delete.
Forgotten environments. A staging cluster that’s been at production scale since a load test in 2023. A dev account someone spun up for a POC three roles ago.

This inventory work isn’t glamorous. It is reliably 5–15% of the bill, just from things that should have been turned off and weren’t.
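
As a rough illustration of the inventory pass (not the exact tooling used on an engagement), the sketch below flags orphaned, under-tagged, and stale resources in a list of records. The record fields (`type`, `attached_to`, `tags`, `last_touched`, `monthly_cost`), the required tag keys, and the 90-day staleness threshold are all assumptions made for the example. In practice the records would be assembled from your provider's APIs and billing export.

```python
from datetime import datetime, timedelta, timezone

# Assumed tag keys, mirroring the inventory columns described above.
REQUIRED_TAGS = {"service", "environment", "owner"}
# Assumed cutoff for "nobody has touched this in a long time".
STALE_AFTER = timedelta(days=90)
# Resource types that only make sense when attached to something (hypothetical labels).
ATTACHABLE_TYPES = {"ebs_volume", "load_balancer", "db_replica"}


def classify(resource, now):
    """Return a list of flags for one inventory record (empty if it looks fine)."""
    flags = []
    if resource["type"] in ATTACHABLE_TYPES and resource.get("attached_to") is None:
        flags.append("orphan")
    missing = REQUIRED_TAGS - set(resource.get("tags", {}))
    if missing:
        flags.append("untagged:" + ",".join(sorted(missing)))
    if now - resource["last_touched"] > STALE_AFTER:
        flags.append("stale")
    return flags


def flagged_spend(inventory, now):
    """Sum the monthly cost of every resource that carries at least one flag."""
    return sum(r["monthly_cost"] for r in inventory if classify(r, now))


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical records, for illustration only.
    inventory = [
        {"id": "vol-example", "type": "ebs_volume", "attached_to": None,
         "tags": {}, "last_touched": now - timedelta(days=400), "monthly_cost": 80.0},
        {"id": "i-example", "type": "instance", "attached_to": None,
         "tags": {"service": "api", "environment": "prod", "owner": "payments"},
         "last_touched": now - timedelta(days=3), "monthly_cost": 900.0},
    ]
    for r in inventory:
        print(r["id"], classify(r, now))
    print("flagged monthly spend:", flagged_spend(inventory, now))
```

A flag is a prompt for a conversation with an owner, not a delete order. The point of the pass is to put a dollar figure next to the things nobody can explain.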

Day 3–4: the workloads that are wrong-sized

This is where engineering teams usually get tense, because it feels like a critique. It’s not. Workloads are almost always over-provisioned because the cost of being wrong in the other direction is a 3am page.

We look at p95 CPU, p95 memory, and request patterns over 30 days for the top spenders. We’re trying to answer:

Are you paying for the worst case all the time? Most workloads have a 2–4x spread between average and peak. Autoscaling — real autoscaling, not “we have an ASG configured” — closes most of this.
Is your compute the right shape? Memory-optimized instances running CPU-bound workloads. GPU instances doing CPU work. ARM workloads on x86 because Graviton wasn’t a thing when the instance type was picked.
Is your storage tier right? Moving EBS volumes from gp2 to gp3 alone usually saves about 20% on storage. S3 standard for cold archival data. Logs in CloudWatch instead of S3 with a lifecycle policy.
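
To make the "paying for the worst case" question concrete, here is a minimal sketch of the arithmetic, assuming you already have 30 days of utilization samples for a workload (for example, CPU cores in use per sampling interval). The function names, the nearest-rank percentile method, and the 30% headroom are assumptions for the example, not a fixed recommendation.

```python
def percentile(samples, pct):
    """Nearest-rank percentile. `pct` is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct/100 * n, which avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def peak_to_average(samples):
    """Ratio of the highest sample to the mean; the 'spread' a fixed fleet pays for."""
    return max(samples) / (sum(samples) / len(samples))


def suggested_capacity(samples, provisioned, headroom=0.3):
    """Capacity covering p95 usage plus headroom, never above what is provisioned today."""
    target = percentile(samples, 95) * (1 + headroom)
    return min(provisioned, target)


if __name__ == "__main__":
    # Hypothetical samples: mostly idle with a short daily peak.
    cpu_cores_used = [2.0] * 90 + [6.0] * 10
    print("p95:", percentile(cpu_cores_used, 95))
    print("peak/avg:", round(peak_to_average(cpu_cores_used), 2))
    print("suggested cores:", suggested_capacity(cpu_cores_used, provisioned=16))
```

A number like this is a starting point for the conversation with the team that owns the workload. It says nothing about latency targets or failover capacity, which is why the review still needs someone who knows what the service does.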

This is also where we usually find Kubernetes clusters running at 18% utilization with no horizontal pod autoscaler, no cluster autoscaler, and node groups sized for a peak that happens 4 hours a week.

Day 5: the commitment + spot conversation

Once we know what your actual baseline is, we can talk about reserved instances, savings plans, and spot. Having this conversation before the baseline is known is what gets people stuck on 3-year commits for the wrong instance families.

Rough heuristic: anything that runs 24/7 should be on a savings plan. Anything stateless and tolerant of interruption should be on spot. Anything with predictable bursts can usually move to scheduled scaling.
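
The heuristic above can be written down as a small decision function. This is a sketch under stated assumptions: the `Workload` fields are hypothetical, "24/7" is taken literally as 168 hours a week, and the order of the checks (spot first, then savings plan, then scheduled scaling) is our choice for the example, since the heuristic itself does not say which rule wins when several apply.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    hours_per_week: float         # hours the workload actually runs
    stateless: bool
    interruption_tolerant: bool
    predictable_bursts: bool


def purchase_option(w):
    """Map a workload to a purchasing approach using the rough heuristic."""
    if w.stateless and w.interruption_tolerant:
        return "spot"
    if w.hours_per_week >= 168:   # runs around the clock
        return "savings_plan"
    if w.predictable_bursts:
        return "scheduled_scaling"
    return "on_demand"            # fallback: nothing in the heuristic applies


if __name__ == "__main__":
    # Hypothetical workloads, for illustration only.
    fleet = [
        Workload("batch-render", 40, True, True, False),
        Workload("primary-db", 168, False, False, False),
        Workload("nightly-reports", 20, False, False, True),
    ]
    for w in fleet:
        print(w.name, "->", purchase_option(w))
```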

For Kubernetes specifically: Karpenter on spot, with a small on-demand base, will typically take a node bill down 50–70% with almost no operational change.
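
For a sense of where a number in that range can come from, here is a back-of-the-envelope sketch. It models only the price difference between spot and on-demand capacity; the 20% on-demand base and the 70% spot discount are illustrative assumptions, not quoted prices, and real spot discounts vary by instance type, region, and time. Any gain from tighter bin-packing is left out.

```python
def blended_savings(on_demand_share, spot_discount):
    """Fraction of an all-on-demand node bill saved by moving the rest to spot.

    on_demand_share: fraction of node capacity kept on-demand (0 to 1).
    spot_discount:   assumed spot discount relative to on-demand (0 to 1).
    """
    if not 0 <= on_demand_share <= 1:
        raise ValueError("on_demand_share must be between 0 and 1")
    if not 0 <= spot_discount <= 1:
        raise ValueError("spot_discount must be between 0 and 1")
    return (1 - on_demand_share) * spot_discount


if __name__ == "__main__":
    # Illustrative inputs only: 20% on-demand base, spot assumed 70% cheaper.
    saved = blended_savings(on_demand_share=0.2, spot_discount=0.7)
    print(f"node bill reduction: {saved:.0%}")
```

The larger the on-demand base you keep for safety, the smaller the reduction, which is why the size of that base is worth deciding per cluster rather than by default.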

What we ship at end of week one

A short document with three sections:

Cut now. Things we’d delete or resize today. Usually 15–25% of the bill, often more.
Cut next. Architectural changes that take a few weeks but compound. Spot migration, Graviton, storage tiering, data egress.
Stop the bleed. Guardrails so the waste doesn’t grow back. Budget alerts per team, anomaly detection, tag enforcement, a quarterly review cadence.
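
As one example of a guardrail, here is a minimal sketch of spend anomaly detection over a team's daily cost series. It is a simple trailing-window rule, not the method of any particular cloud provider's anomaly service; the 7-day window, the threshold of 3 standard deviations, and the 1% noise floor are assumptions chosen for the example.

```python
from statistics import mean, pstdev


def spend_anomalies(daily_spend, window=7, threshold=3.0):
    """Return indexes of days whose spend exceeds the trailing mean by `threshold` deviations."""
    flagged = []
    for i in range(window, len(daily_spend)):
        history = daily_spend[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        # Floor the deviation at 1% of the mean so a perfectly flat history
        # does not flag trivial day-to-day noise.
        limit = mu + threshold * max(sigma, 0.01 * mu)
        if daily_spend[i] > limit:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Hypothetical daily spend in dollars for one team.
    spend = [100] * 7 + [101, 180, 100]
    print("anomalous day indexes:", spend_anomalies(spend))
```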

The handoff isn’t “here’s a slide deck.” It’s PRs against your Terraform, a tagging policy, a budget dashboard, and a list of “next” items prioritized by dollars saved per engineering hour.
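
Ranking by dollars saved per engineering hour is simple enough to sketch. The item names, savings figures, and hour estimates below are hypothetical and exist only to show the ordering; the field names are assumptions for the example.

```python
def prioritize(items):
    """Sort backlog items by monthly dollars saved per engineering hour, best first."""
    for item in items:
        if item["engineering_hours"] <= 0:
            raise ValueError(f"{item['name']}: engineering_hours must be positive")
    return sorted(
        items,
        key=lambda item: item["monthly_savings"] / item["engineering_hours"],
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical backlog, for illustration only.
    backlog = [
        {"name": "spot migration", "monthly_savings": 10000, "engineering_hours": 80},
        {"name": "delete unattached volumes", "monthly_savings": 2000, "engineering_hours": 4},
        {"name": "gp2 to gp3", "monthly_savings": 300, "engineering_hours": 1},
    ]
    for item in prioritize(backlog):
        rate = item["monthly_savings"] / item["engineering_hours"]
        print(f"{item['name']}: ${rate:.0f} per hour")
```

The biggest absolute saving does not necessarily come first in this ordering. In the hypothetical backlog above, the two quick cleanups outrank the spot migration even though the migration saves the most per month.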

What we don’t do

We don’t run optimization tools that promise to do this in the background. They’re a great way to get a 5% improvement and feel like you’re done. The actual savings, the 30–50%, come from looking at workloads as an engineer who understands what they do, not as a script looking at metrics.

If your cloud bill has quietly become someone’s full-time problem, this is the kind of thing we do.
