Pratik Kate 👨🏻‍💻

Jun 27, 2025 • 3 min read

Code Blue Thinking for Engineers On-Call 📟

How to Approach Incidents with the Precision of an ER Doctor

Disclaimer: The example in this article is purely hypothetical. All scenarios, features, systems, and observations are fictional and are not drawn from my current professional work, though they are written to resemble real incidents.

It Starts With a Page at 2:17 AM

You’re asleep. Dead tired. Then the pager buzzes.

A feature, let’s call it Feature-X, has been in early access for just a few hours. Nothing dramatic. Just a limited rollout to 5% of users. But something’s wrong. Alarms are going off. Errors are surging. Telemetry dashboards are redlining.

You check your phone. The alert says:

🚨 Feature-X Activation Failure Rate > 65% (Threshold: 5%)

You sigh, rub your eyes, and reach for your laptop. You’ve been paged to the ER.

As a Platform Solutions Engineer, your role isn’t just to write code or configure systems. Tonight, you’re the on-call doctor for a critical system in distress.


The ER (Emergency Response) Mindset: Triage Under Pressure

You don’t have full context. You’re not fully awake. But you don’t panic.

Just like an emergency physician facing an unfamiliar patient in critical condition, you begin with triage:

  • What’s failing?

  • Where’s the most damage?

  • Can it be stabilized?

  • What logs are vital signs, and which are just noise?

This isn’t about heroics. It’s about protocol, calm decision-making, and smart use of limited time.

Apply The 5Ws Approach

  1. Who is affected? A small segment of users behind a feature flag.

  2. What is broken? A workflow that triggers downstream actions isn’t completing.

  3. When did it begin? Roughly 20 minutes after the rollout of a backend toggle.

  4. Where is it happening? Mostly in high-latency environments.

  5. Why is it failing? Working theory: premature timeouts triggered by overly strict retry policies.
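One way to keep the 5Ws honest at 2 AM is to write them down as a small record and see which questions are still open. The sketch below is only an illustration: the FiveWs class and its field names are hypothetical, and the answers are copied from the fictional scenario above.

```python
from dataclasses import dataclass, fields


@dataclass
class FiveWs:
    """Hypothetical triage record; the field names are illustrative only."""
    who: str = ""
    what: str = ""
    when: str = ""
    where: str = ""
    why: str = ""  # a working theory, not a conclusion

    def unanswered(self):
        """Return the names of the questions that still have no answer."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Answers taken from the fictional incident; "why" is left open on purpose.
triage = FiveWs(
    who="Small segment of users behind a feature flag",
    what="Workflow that triggers downstream actions is not completing",
    when="Roughly 20 minutes after the rollout of a backend toggle",
    where="Mostly high-latency environments",
)
print(triage.unanswered())
```

Leaving "why" empty until there is evidence is a deliberate choice here: it keeps a working theory from being mistaken for a diagnosis.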


Diagnosing the Patient: A Structured Approach

Drawing from real-world medical frameworks like SOAP (Subjective, Objective, Assessment, Plan), here’s how the platform “diagnosis” flows.

1. Subjective (Symptoms)

You start with what telemetry is telling you:

  • Feature-X isn’t completing successfully.

  • Specific user flows are timing out.

  • Users are retrying and then abandoning mid-process.

2. Objective (Vitals)

The hard data confirms it:

  • Latency spikes.

  • Backend dependency unresponsive.

  • Logs show repeated “no acknowledgment” errors.

3. Assessment (Diagnosis)

Correlation points to a recently pushed config that reduced the timeout from 1500 ms to 300 ms. In lab conditions it worked. In the real world, not so much.

The system is failing because it’s not giving its helper services enough time to respond. It’s cutting the oxygen too early, if you will.
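To see why a timeout that passes in the lab can fail in the field, here is a rough sketch. Both latency lists are invented for illustration, as is the failure_rate helper; only the 1500 ms and 300 ms values come from the scenario.

```python
# Hypothetical response times (ms) for the backend dependency.
LAB_LATENCIES_MS = [80, 95, 110, 120, 140, 150, 170, 180, 200, 220]
FIELD_LATENCIES_MS = [120, 180, 260, 310, 350, 420, 480, 560, 700, 900]


def failure_rate(latencies_ms, timeout_ms):
    """Fraction of calls that would exceed the timeout and be cut off."""
    timed_out = sum(1 for latency in latencies_ms if latency > timeout_ms)
    return timed_out / len(latencies_ms)


# Compare the old timeout (1500 ms) with the new one (300 ms).
for timeout_ms in (1500, 300):
    lab = failure_rate(LAB_LATENCIES_MS, timeout_ms)
    field = failure_rate(FIELD_LATENCIES_MS, timeout_ms)
    print(f"timeout={timeout_ms}ms lab={lab:.0%} field={field:.0%}")
```

With these made-up numbers, the lab sample never notices the change, while most of the slower field calls are cut off at 300 ms.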

4. Plan (Treatment)

  • Immediate rollback of the timeout config.

  • Temporary alert silencing to prevent noise fatigue.

  • Post-incident checklist created to assess user impact.
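The rollback step is easiest when every config change keeps the previous version around. The following is a minimal sketch under that assumption; ConfigHistory is a hypothetical in-memory stand-in, not the API of any real config system.

```python
class ConfigHistory:
    """Hypothetical in-memory config store that remembers prior versions."""

    def __init__(self, initial):
        self._versions = [dict(initial)]

    @property
    def current(self):
        # Return a copy so callers cannot mutate the stored version.
        return dict(self._versions[-1])

    def apply(self, changes):
        """Record a new version with the given keys overridden."""
        self._versions.append({**self._versions[-1], **changes})

    def rollback(self):
        """Drop the latest version and return the one before it."""
        if len(self._versions) < 2:
            raise RuntimeError("nothing to roll back to")
        self._versions.pop()
        return self.current


config = ConfigHistory({"timeout_ms": 1500})
config.apply({"timeout_ms": 300})  # the change under suspicion
restored = config.rollback()       # stabilise first, investigate later
print(restored)
```

The point of the sketch is the ordering: restoring the last known-good state does not require understanding the failure first.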

Just like in a trauma bay, stabilisation comes first. Then comes diagnosis. Then comes systemic prevention.


What Makes This Job Like Being an ER Doctor?

  • Unknowns: You’re rarely given full context up front. You rely on instincts and data.

  • Urgency: Incidents can’t wait until morning. They escalate. You need to act.

  • Human Impact: Behind every failed feature is a user trying to get something done.

  • Team Coordination: You page specialists (infra, backend, client) just like calling in surgeons or neurologists.

  • Aftercare: Fixing the symptom isn’t enough. You plan to prevent recurrence.


Post-Op: Learning From the Incident

By 5:00 AM, the feature is back to normal. Errors have dropped. Your heart rate too.

But the job isn’t over. You open a doc to write the postmortem, not to assign blame, but to document what the system was trying to tell you before it flatlined.

You note:

  • Config changes must be tested in high-latency environments.

  • Observability gaps in the dependency’s retry logic need to be closed.

  • Need for better alert granularity to catch degradation earlier.
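As one possible reading of "better alert granularity", the sketch below adds a warning tier below the paging threshold. The 5% paging threshold comes from the alert in the story; the 2% warning tier, the severity names, and the classify function are assumptions made up for this example.

```python
# Tiers are checked from most to least severe.
THRESHOLDS = [
    (0.05, "page"),  # from the story: page when failures exceed 5%
    (0.02, "warn"),  # assumed earlier tier to surface slow degradation
]


def classify(failures, attempts):
    """Map a failure count to a severity label based on the failure rate."""
    if attempts == 0:
        return "ok"  # no traffic, nothing to judge
    rate = failures / attempts
    for threshold, severity in THRESHOLDS:
        if rate > threshold:
            return severity
    return "ok"


print(classify(65, 100))  # the 2:17 AM situation
print(classify(3, 100))   # degradation that would not have paged anyone
```

A warning tier like this would not have prevented the incident, but it might have put the degradation in front of someone before it crossed the paging line.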

Because like every ER team, you don’t just treat the patient; you make the hospital better.


This Is Fiction, But It Feels Familiar

This story is made up. But if you’ve been on-call, you know the feeling. You’re not just fixing systems, you’re responding to emergencies, stabilising chaos, and preventing future breakdowns.

You’re an on-call engineer, but when the lights go red and the alerts flood in, you become something more: 🩺 The ER Doctor.
