Before you refactor, observe: the SRE reflex that saves sprints

A service is slow. A service falls over twice a week and nobody knows why. The team reflex, almost always, is: “we need to rewrite it.” Someone pulls out a whiteboard, proposes a new architecture, and three sprints later you have a brand-new service with the same problems.

The SRE reflex is the opposite: before you refactor, observe. Not out of dogma, but out of experience.

The hidden cost of refactoring without data

Refactoring without data means making three blind bets:

That you’ve identified the right problem. The service is slow? Slow where? On which endpoint? For which kind of request? At what time of day?
That the new architecture will solve that specific problem. Without a baseline, you won’t even know if it’s better afterwards.
That no new problem will appear. Refactor = re-bug. If you didn’t instrument the old service, you have nothing to compare the new one against.

Three bets, three chances to be wrong. In practice, you almost always lose at least one.

What “observe” concretely means

Observing isn’t “add a few logs.” It’s shipping an observability sub-project that answers precise questions before you touch any business code.

For a slow service, I want at minimum:

Latency per endpoint, per percentile (p50, p95, p99). The mean lies.
Response time decomposition: how much in the DB, how much in external calls, how much in the application logic. Distributed tracing if possible.
Error rate per endpoint and per status code. A bare “5xx” count says very little; 502, 504, and 500 tell different stories.
Resource saturation: CPU, RAM, file descriptors, DB connections.
Dependency latency measured client-side, not just the metrics the dependency itself exposes.
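To make the first and last items concrete, here is a minimal sketch in Python using only the standard library. The `timed` helper, the `samples` store, and the endpoint and component names are hypothetical, and the latencies are synthetic; a real service would send these measurements to its metrics backend rather than keep them in memory.

```python
import statistics
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store: (endpoint, component) -> durations in ms.
samples = defaultdict(list)


@contextmanager
def timed(endpoint, component):
    """Record how long the wrapped block took, as seen from the caller's side."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        samples[(endpoint, component)].append(elapsed_ms)


def summarize(durations_ms):
    """Return the mean alongside p50, p95 and p99 for a list of durations."""
    cuts = statistics.quantiles(durations_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(durations_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Synthetic data: 90 fast requests (20 ms) and 10 slow ones (2000 ms).
latencies = [20.0] * 90 + [2000.0] * 10
report = summarize(latencies)

# Wrapping a dependency call measures it client-side (the body is a stand-in).
with timed("/orders", "db"):
    pass
```

With this synthetic data the mean comes out at 218 ms, a latency that no request actually experienced, while p50 is 20 ms and p99 is 2000 ms. That gap is what “the mean lies” looks like in numbers.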

For an unstable service:

Structured logs with a propagated request ID to reconstruct a complete journey
Continuous profiling when feasible (Pyroscope, or a profiler built into the runtime)
Pre-built incident dashboards, not invented during the 3 a.m. crisis
Replays of problematic requests in pre-prod
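As an illustration of the first item, the sketch below emits JSON log lines that all carry the same request ID, using Python’s standard `logging` and `contextvars` modules. The logger name `orders`, the `request_id` field, and the `handle_request` function are assumptions made for the example; in a real service the incoming ID would typically come from a request header and be forwarded to downstream calls.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the ID of the request currently being handled (hypothetical name).
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the current request ID."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        })


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("orders")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


def handle_request(logger, incoming_id=None):
    """Reuse the caller's ID when present, otherwise mint one, then log."""
    token = request_id_var.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("request received")
        logger.info("querying database")
        return request_id_var.get()
    finally:
        # Restore the previous value so IDs never leak between requests.
        request_id_var.reset(token)


stream = io.StringIO()
logger = build_logger(stream)
rid = handle_request(logger, incoming_id="req-123")
lines = [json.loads(line) for line in stream.getvalue().splitlines()]
```

Every entry in `lines` carries the same `request_id`, so filtering the logs on that one field is enough to reconstruct the journey of a single request.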

The “ah” moment

Every time I’ve taken the trouble to do this, the actual diagnosis has been different from the assumed one.

The “slow” service was slow because one endpoint out of 47 was carrying 80% of traffic and running an N+1 query pattern against the DB. The rest of the service was fine.
The service that “fell over” fell over because a monthly cron job was blowing up memory at a specific moment. Nobody suspected the cron because nobody saw it in the graphs.
The service “saturated” on CPU wasn’t CPU-saturated: it was blocked on DB locks because a running migration was taking ten times longer than expected.

None of these would have been fixed by a generic refactor. All of them were resolved with a targeted, surgical fix that took a few hours of development work.
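For readers who haven’t met the N+1 pattern from the first case, here is a small sketch of how counting queries makes it visible. It uses Python’s built-in `sqlite3` module with an in-memory database; the `orders` and `items` tables and the 50 rows are invented for the illustration and are not taken from the service described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE items (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
""")
conn.executemany(
    "INSERT INTO orders (id, customer) VALUES (?, ?)",
    [(i, f"customer-{i}") for i in range(1, 51)],
)
conn.executemany(
    "INSERT INTO items (order_id, sku) VALUES (?, ?)",
    [(i, f"sku-{i}") for i in range(1, 51)],
)
conn.commit()

# Record every statement SQLite runs from here on.
statements = []
conn.set_trace_callback(statements.append)


def count_selects(fn):
    """Run fn and return how many SELECT statements it issued."""
    statements.clear()
    fn()
    return sum(1 for s in statements if s.lstrip().upper().startswith("SELECT"))


def n_plus_one():
    # One query for the list, then one more query per row.
    orders = conn.execute("SELECT id FROM orders").fetchall()
    for (order_id,) in orders:
        conn.execute(
            "SELECT sku FROM items WHERE order_id = ?", (order_id,)
        ).fetchall()


def single_join():
    # The same data fetched in a single round trip.
    conn.execute(
        "SELECT o.id, i.sku FROM orders o JOIN items i ON i.order_id = o.id"
    ).fetchall()


n_plus_one_queries = count_selects(n_plus_one)
join_queries = count_selects(single_join)
```

The loop version issues one query per order on top of the initial one, while the join issues a single query. A per-endpoint query count like this is the kind of signal that turns “the service is slow” into “this endpoint runs an N+1”.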

When to refactor anyway

Observing doesn’t mean never refactoring. It means refactoring with data and a measurable goal. A legitimate refactor looks like:

“The p99 on this endpoint is at 4s, we’re targeting 400 ms; we’ve traced 90% of the time to the serialisation of a giant blob loaded on every request, so we’re going to cache it.”
“We have 12 incidents a month tied to shared-DB coupling between service A and B, so we’re going to isolate the schemas.”
“Cloud costs doubled in six months, we’ve traced 70% of it to verbose logs we never use, so we’re cutting them.”

Three examples, three refactors that work. Why? Because data precedes action, and the target is measurable.

The counter-argument

“But we don’t have time to instrument, it’s urgent.”

It’s precisely when it’s urgent that you instrument first. A sprint lost to observing is cheaper than three sprints lost to refactoring the wrong service. If observability takes two days, those two days are the most profitable you’ll spend on the project.

And if observability is so absent that you need several weeks to put it in place, then that is the real project. Not the refactor.

The one-line summary

When a service is broken, the first delivery should never be business code. It should be the ability to see where it’s broken. The rest follows naturally.