Tags: migration, AWS, GCP, Postgres, infrastructure

Migrating clouds without the migration freeze

Most cloud migrations stall because the team treats them as a single cutover event. Here's how we run them as a continuous series of safe, reversible steps — usually with zero production downtime.

Brice Ayres /

The reason most cloud migrations slip is that they’re scoped as an event. A weekend. A holiday. A window where the team can take a deep breath, flip the switch, and pray.

We don’t run them that way anymore. After enough migrations, the pattern is clear: the projects that finish on time are the ones where there is no cutover weekend at all. The new system runs alongside the old one for weeks, taking progressively more traffic, until the old system has nothing left to do and you turn it off on a Tuesday afternoon.

Here’s how we structure it.

Step 1: forget about the destination for a minute

Before we talk about which cloud to move to, we map what you actually have.

This sounds obvious. It isn’t. Most teams have a partial picture: the services they ship are well-understood, but the queues, jobs, dashboards, and integrations that grew up around them are not. We’ve inherited migrations that were 80% done when someone discovered a Lambda that nobody owned was pushing critical reports to a third-party SFTP server.

The deliverable for this phase is a single diagram and a CSV. The diagram shows every running thing and what it talks to. The CSV lists every billable resource and its owner. If nobody owns it, that’s a finding.
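
As a rough sketch of how the CSV turns into a list of findings: the column names and sample rows below are made up for illustration, and a real export would come from your billing or tagging tooling.

```python
import csv
import io

# Hypothetical inventory export. Column names and rows are assumptions
# for this sketch, not a format the process prescribes.
INVENTORY_CSV = """resource_id,type,monthly_cost_usd,owner
i-0a1,ec2-instance,212.40,payments-team
report-pusher,lambda,3.10,
orders-db,rds-postgres,540.00,platform-team
legacy-queue,sqs,0.80,
"""


def unowned_resources(csv_text):
    """Return inventory rows whose owner column is empty or whitespace."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if not (row.get("owner") or "").strip()]


findings = unowned_resources(INVENTORY_CSV)
for row in findings:
    print(f"FINDING: {row['resource_id']} ({row['type']}) has no owner")
```

Cheap resources deserve the same scrutiny as expensive ones here: the orphaned Lambda in the story above would have shown up as a line item costing a few dollars a month.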

Step 2: pick the smallest viable first slice

The most common failure mode is “let’s lift-and-shift everything, then optimize later.” This works on whiteboards. It fails in practice because you’re moving everything at once and you have to coordinate everyone at once.

Instead, we pick a slice. Usually it’s:

  • One stateless service
  • Its database (or a logical subset)
  • The CI pipeline that ships it
  • The monitoring + on-call wiring

The slice has to be small enough to ship in 2–3 weeks, and complete enough that you can actually run it in production on the new cloud. The point isn’t to migrate this service — it’s to prove out every supporting piece (IaC, CI, secrets, observability, network) on something low-risk.

Step 3: dual-write, dual-read, then cut

For services with state, we almost always go through a dual-write phase:

  1. Dual-write. New writes go to both the old and new databases. Reads still come from the old one. We run this until the backfill of historical data is complete and row-level checks show the two systems are consistent.
  2. Shadow read. Reads start hitting both systems. We compare results, log discrepancies, and keep returning only the old system’s data to users. This surfaces subtle behavior differences (character encoding, sort order, NULL handling) before any user can notice them.
  3. Flip reads. Move the read source. The old database is now a hot backup.
  4. Stop writing to the old. Now you can decommission on your own time.
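
The four phases above can be sketched as a thin wrapper around the data access layer. This is a minimal illustration, not a production implementation: `MigratingStore` and `Phase` are hypothetical names, plain dicts stand in for the two databases, and handling of a failed second write is left out.

```python
from enum import Enum


class Phase(Enum):
    DUAL_WRITE = 1   # write old + new, read old
    SHADOW_READ = 2  # write old + new, read both, serve old
    FLIP_READS = 3   # write old + new, serve new (old is the hot backup)
    NEW_ONLY = 4     # old store is no longer written


class MigratingStore:
    """Hypothetical wrapper; plain dicts stand in for the two databases."""

    def __init__(self, old, new, phase=Phase.DUAL_WRITE):
        self.old = old
        self.new = new
        self.phase = phase
        self.discrepancies = []  # (key, old_value, new_value) tuples

    def write(self, key, value):
        if self.phase is not Phase.NEW_ONLY:
            self.old[key] = value
        self.new[key] = value

    def read(self, key):
        if self.phase is Phase.DUAL_WRITE:
            return self.old.get(key)
        if self.phase is Phase.SHADOW_READ:
            old_value = self.old.get(key)
            new_value = self.new.get(key)
            if old_value != new_value:
                self.discrepancies.append((key, old_value, new_value))
            return old_value  # users still only see the old system's answer
        return self.new.get(key)


old_db = {"tags:1": ["a", "B"]}
new_db = {"tags:1": ["B", "a"]}  # simulated sort-order difference
store = MigratingStore(old_db, new_db, Phase.SHADOW_READ)

store.write("user:2", "Ada")     # lands in both stores
served = store.read("tags:1")    # old answer served, mismatch logged

store.phase = Phase.DUAL_WRITE   # rollback is a phase change, not a restore
```

Keeping the phase as a single piece of state is one way to make each step reversible: moving back a phase only changes which store answers reads, and no data has to be restored.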

For Postgres specifically: logical replication via pglogical or AWS DMS, paired with row-level consistency checks. For MongoDB, change streams. For event-sourced systems, replay from your event log and dual-publish.
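
One way to picture a row-level consistency check, independent of the replication tool: hash a canonical form of each row on both sides and compare by primary key. The sketch below assumes rows arrive as dicts with an `id` key; `row_digest` and `diff_tables` are hypothetical helpers, and a real check would stream rows from both databases in batches instead of holding them in memory.

```python
import hashlib
import json


def row_digest(row):
    """Hash a canonical JSON form of the row so column order does not matter."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_tables(old_rows, new_rows, key="id"):
    """Compare two row sets by primary key and report where they disagree."""
    old = {row[key]: row_digest(row) for row in old_rows}
    new = {row[key]: row_digest(row) for row in new_rows}
    return {
        "missing_in_new": sorted(old.keys() - new.keys()),
        "extra_in_new": sorted(new.keys() - old.keys()),
        "mismatched": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


old_rows = [
    {"id": 1, "email": "a@example.com", "note": None},
    {"id": 2, "email": "b@example.com", "note": "vip"},
    {"id": 3, "email": "c@example.com", "note": None},
]
new_rows = [
    {"id": 1, "email": "a@example.com", "note": ""},  # NULL became empty string
    {"email": "b@example.com", "note": "vip", "id": 2},
]

report = diff_tables(old_rows, new_rows)
print(report)
```

Hashing the whole row is deliberately strict: a NULL that became an empty string counts as a mismatch, which is the kind of difference worth seeing before reads move over.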

The point is that at every step, you can roll back to the previous state in seconds, not hours.

Step 4: traffic shifting at the edge

Once a service has its new home, we shift traffic gradually at the load balancer or DNS layer. Usually:

  • 1% for an hour
  • 10% for a day
  • 50% for a few days
  • 100%

If anything goes sideways, the rollback is moving the weight back. There is no “let’s revert the deploy and restore from backup.”
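
To make the weight mechanics concrete, here is a small sketch of percentage-based routing with a rollback path. In practice the weights live in the load balancer or DNS configuration; this only models the logic. The stage values come from the list above, while `routes_to_new`, `next_weight`, and the choice to drop straight to 0% on an unhealthy signal are assumptions for illustration.

```python
import hashlib

STAGES = [1, 10, 50, 100]  # percent of traffic on the new system


def routes_to_new(request_id, weight_percent):
    """Hash the request into a bucket 0-99; buckets below the weight go to new.

    The bucketing is deterministic, so a request that is on the new system
    at 10% stays there when the weight rises to 50%.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < weight_percent


def next_weight(current, healthy):
    """Advance one stage when healthy; otherwise move all traffic back."""
    if not healthy:
        return 0
    for stage in STAGES:
        if stage > current:
            return stage
    return 100


weight = 0
weight = next_weight(weight, healthy=True)   # 1
weight = next_weight(weight, healthy=True)   # 10
weight = next_weight(weight, healthy=False)  # back to 0, no restore needed
```

Dwell times (an hour, a day, a few days) are left out on purpose: how long to hold each stage is a judgment call that depends on traffic patterns, not something the routing logic should hard-code.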

This is also where we catch the long-tail issues you couldn’t catch in staging: cold caches, region-specific latency, IAM policies that work for the test account but not the production one.

Step 5: decommission, then prove it

A migration isn’t done when the new system is taking 100% of traffic. It’s done when the old one is off, the bill is gone, and the IaC for it is deleted from the repo.

We typically wait two weeks after 100% before we tear things down. After that, the old AWS account gets a terraform destroy, its IAM credentials get revoked, and the stale DNS records get pruned. The PR to remove the old code is sometimes the most satisfying one in the whole project.

What this costs in time

A reasonable benchmark for a single-product company with a moderately complex backend (a dozen services, two databases, a queue or two, a frontend, and CI/CD): 8–14 weeks, with the team continuing to ship features the whole time.

What blows that estimate is almost always the same thing: undocumented dependencies discovered halfway through. The inventory step is the cheapest insurance you can buy.

If you’re staring at a migration plan that has a single weekend cutover on it, we’d recommend re-scoping it. There’s no reason to take that risk.
