Case Studies — How OnCallReady Resolves Real Incidents in Seconds

Network / Egress · 38s MTTR · Series B fintech · 180 engineers · US-East-1

Network degradation cascade — 2:47am

Datadog fired a P1 network latency alert. The agent correlated it with a deploy about 15 minutes earlier, identified a bad egress rule, reverted it, and posted an incident summary to Slack. No engineer was paged.

Agent timeline

T+0s — 02:47:03

P1 alert received from Datadog

Trigger: p99 egress latency 4,200ms (threshold 500ms) on us-east-1 payment-gateway

T+3s — 02:47:06

Agent queries deploy log

Last deploy: payment-gateway v2.14.1 at 02:31:44 by CI pipeline (PR #4187)

T+7s — 02:47:10

Diff against baseline network config

Found: egress rule added in v2.14.1 routes all outbound traffic through new VPC NAT gateway not yet attached to transit gateway

T+12s — 02:47:15

Confirmed impact scope

Checked 6 downstream services — 3 degraded (checkout, fraud-check, kyc-verify), 3 unaffected. Breach estimated at $1,100/min at current error rate.

T+19s — 02:47:22

Executed egress rule revert

Applied pre-rollback runbook: reverted egress rule to v2.13.x baseline via Terraform apply (targeting network module only)

T+31s — 02:47:34

Validated recovery

p99 latency returned to 180ms. All 3 degraded downstream services confirmed healthy via synthetic probes.

T+38s — 02:47:41

Posted incident summary to #incidents-prod Slack

Summary: root cause, actions taken, services affected, duration, PR flagged for review. Ticket auto-created in Jira with full trace.
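The "Validated recovery" step in the timeline above amounts to a gate: latency back under the alert threshold and every previously degraded service passing its probe. A minimal sketch of such a gate, assuming probe results arrive as a simple name-to-boolean mapping (the function name and data shape are illustrative, not OnCallReady's actual interface):

```python
from typing import Mapping


def recovery_confirmed(p99_ms: float, threshold_ms: float, probes: Mapping[str, bool]) -> bool:
    """True only if latency is under the alert threshold and every probe passes."""
    # An empty probe set is treated as "not confirmed" rather than vacuously healthy.
    return p99_ms < threshold_ms and bool(probes) and all(probes.values())


# Values from the timeline: threshold 500ms, p99 back at 180ms, three services probed.
probes = {"checkout": True, "fraud-check": True, "kyc-verify": True}
print(recovery_confirmed(180, 500, probes))   # True
print(recovery_confirmed(4200, 500, probes))  # False: latency still above threshold
```

Requiring both conditions keeps the incident open if latency recovers while a downstream service is still failing.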

Agent log excerpt

[02:47:06] QUERY deploy_log WHERE service='payment-gateway' AND ts > NOW()-INTERVAL '30m'
[02:47:06] FOUND v2.14.1 deployed 02:31:44 — 15m before alert
[02:47:07] DIFF network/egress.tf v2.13.x → v2.14.1
[02:47:08] DELTA +route: 0.0.0.0/0 via nat-gateway-07f3c8 (not attached to tgw-prod)
[02:47:10] SCOPE services affected: checkout(ERR 23%), fraud-check(TIMEOUT 41%), kyc-verify(TIMEOUT 38%)
[02:47:22] EXEC terraform apply -target=module.network -var egress_rule=baseline
[02:47:29] APPLY complete — 1 resource updated
[02:47:34] VALID p99=180ms ✓ checkout=OK ✓ fraud-check=OK ✓ kyc-verify=OK ✓
[02:47:41] CLOSE incident RESOLVED — duration 38s — 0 humans paged
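The first log lines describe a lookback query: find the latest deploy of the alerting service inside a 30-minute window before the alert. A rough sketch of that correlation step, assuming deploy records are available as plain objects (the `Deploy` type, the older v2.13.9 record, and the calendar date are placeholders; only the times and v2.14.1 come from the log):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class Deploy:
    service: str
    version: str
    deployed_at: datetime


def latest_deploy_before(
    deploys: Iterable[Deploy],
    service: str,
    alert_at: datetime,
    window: timedelta = timedelta(minutes=30),
) -> Optional[Deploy]:
    """Return the most recent deploy of `service` inside the lookback window, if any."""
    candidates = [
        d for d in deploys
        if d.service == service and alert_at - window <= d.deployed_at <= alert_at
    ]
    return max(candidates, key=lambda d: d.deployed_at, default=None)


# The date is a placeholder; the case study only gives times of day.
alert_at = datetime(2026, 1, 1, 2, 47, 3)
deploys = [
    Deploy("payment-gateway", "v2.14.1", datetime(2026, 1, 1, 2, 31, 44)),
    Deploy("payment-gateway", "v2.13.9", datetime(2026, 1, 1, 1, 5, 0)),  # hypothetical
]
suspect = latest_deploy_before(deploys, "payment-gateway", alert_at)
elapsed = alert_at - suspect.deployed_at
print(suspect.version, elapsed)  # v2.14.1 0:15:19
```

A deploy inside the window is only a suspect. In the incident above, the config diff is what confirmed the cause.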

Agent reasoning excerpt

"The latency spike pattern — sudden onset, all outbound traffic affected, no change in inbound traffic — pointed to an egress routing problem rather than an application regression. I checked the deploy log from 2:31am and saw the egress rule change introduced in PR #4187. The diff confirmed a new NAT gateway route that wasn't wired into the transit gateway yet. This is a known misconfiguration class — the fix is a targeted Terraform revert, not a full service rollback. I confirmed impact scope across downstream services first to make sure nothing worse was happening before I applied the change. Recovery validated within 5s of apply completing."

Without OnCallReady (PagerDuty + on-call engineer)

With PagerDuty

2:47am — PagerDuty fires page

2:52am — Engineer sees page, joins incident bridge (+5min)

3:04am — Deploy log identified as root cause (+12min)

3:19am — Egress rule identified (+15min)

3:31am — Revert executed, services recover (+12min)

44 minutes total — ~$48k in potential SLA breach
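The ~$48k figure follows from the $1,100/min estimate in the agent timeline multiplied by the 44-minute manual response. A back-of-the-envelope sketch, assuming the burn rate stays constant for the whole window (a simplification, since real error rates shift during an incident):

```python
def sla_exposure_usd(duration_seconds: float, burn_rate_per_min: float) -> float:
    """Estimated SLA exposure for an incident of the given length at a constant burn rate."""
    return duration_seconds / 60 * burn_rate_per_min


BURN_RATE_PER_MIN = 1_100  # from the agent's impact estimate

manual = sla_exposure_usd(44 * 60, BURN_RATE_PER_MIN)  # 44-minute manual response
automated = sla_exposure_usd(38, BURN_RATE_PER_MIN)    # 38-second automated response

print(f"manual:    ${manual:,.0f}")     # manual:    $48,400
print(f"automated: ${automated:,.0f}")  # automated: $697
```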

With OnCallReady

2:47am — Alert received by agent

2:47am — Deploy log queried automatically

2:47am — Root cause identified (egress diff)

2:47am — Revert applied, validated

2:47am — Slack summary posted

38 seconds total — SLA breach avoided ($42k saved)

38s

MTTR

$42k

SLA breach avoided

0

engineers woken

3

downstream services recovered

"We had a 38-second incident that would have been a 44-minute incident. Our on-call engineer found out at 9am when they read the Slack summary. That's the whole product right there."

— VP Engineering, Series B fintech (anonymized)

Memory / OOM · 24s to stabilization · Post-Series A SaaS · 95 engineers · GKE

Runaway pod OOM-killing payment service

Prometheus fired a memory pressure alert at 11:14pm. The agent diffed the new deploy against baseline, identified a memory leak in a newly deployed worker, scaled replicas to buy time, and opened a GitHub PR with the proposed fix.

Agent timeline

T+0s — 23:14:07

P1 alert received from Prometheus

Trigger: payment-worker-v3 pod memory 94% of limit (3.8/4.0 GiB), OOMKill events: 3 in last 90s

T+2s — 23:14:09

Diff pod spec vs baseline

payment-worker-v3 deployed 47 minutes ago (22:27, build pipeline). Previous stable version: payment-worker-v2 (7 days uptime, no OOM events).

T+6s — 23:14:13

Memory growth profile analysis

Heap growth rate: +22MB/min, linear slope. Pattern matches unbounded event listener accumulation — not a traffic spike. Confirmed not a node-level issue (co-located pods unaffected).

T+9s — 23:14:16

Immediate stabilization: scale replicas

Applied kubectl scale: payment-worker-v3 replicas 2→6. New pods start cold with clean heap while leaking pods continue draining queue.

T+24s — 23:14:31

Service stabilized — OOMKill rate drops to 0

Queue depth returning to baseline. Error rate on payment processing: 0.0%. Replicas handling load correctly.

T+41s — 23:14:48

GitHub PR opened with proposed fix

PR #4203: "fix: remove unbounded EventEmitter in payment-worker processQueue() — add listener cleanup on job completion". Diff targeted, test coverage note included.
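The fix described in the PR title is listener cleanup on job completion. The real change lives in the JavaScript worker and is not shown in the case study. The pattern can be illustrated with a Python analog, where `Emitter` is a tiny stand-in written for this example (not Node's EventEmitter) and `process_job` is a hypothetical handler:

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Listener = Callable[[dict], None]


class Emitter:
    """Minimal event emitter used only to illustrate the leak pattern."""

    def __init__(self) -> None:
        self._listeners: DefaultDict[str, List[Listener]] = defaultdict(list)

    def on(self, event: str, fn: Listener) -> None:
        self._listeners[event].append(fn)

    def off(self, event: str, fn: Listener) -> None:
        self._listeners[event].remove(fn)

    def emit(self, event: str, payload: dict) -> None:
        for fn in list(self._listeners[event]):
            fn(payload)

    def listener_count(self, event: str) -> int:
        return len(self._listeners[event])


def process_job(emitter: Emitter, job: dict, results: List[int]) -> None:
    """Register a per-job listener and always remove it when the job finishes."""

    def on_done(payload: dict) -> None:
        results.append(payload["id"])

    emitter.on("done", on_done)
    try:
        emitter.emit("done", job)
    finally:
        # Without this line every job leaves one listener behind, so memory
        # grows with the number of jobs processed.
        emitter.off("done", on_done)


emitter = Emitter()
results: List[int] = []
for job_id in range(1000):
    process_job(emitter, {"id": job_id}, results)

print(emitter.listener_count("done"))  # 0
```

The `try`/`finally` ensures cleanup also runs when a job raises, which is the case a naive fix tends to miss.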

Agent log excerpt

[23:14:09] DEPLOY payment-worker-v3 → 22:27:14 build #8841 (PR #4198 merged 22:19)
[23:14:10] DIFF src/workers/paymentWorker.js +47 lines vs v2
[23:14:11] FLAG processQueue(): EventEmitter created per job, removeListener() never called
[23:14:12] HEAP growth_rate=+22MB/min slope=linear (not spike) → leak confirmed
[23:14:13] RISK OOMKill in ~8min at current rate → 3 pods active → payment processing failure
[23:14:16] SCALE kubectl scale deploy/payment-worker-v3 --replicas=6
[23:14:31] STABLE OOMKill_rate=0 payment_errors=0.0% queue_depth=normal
[23:14:48] PR github.com/org/repo/pull/4203 — fix: unbounded EventEmitter cleanup

Agent reasoning excerpt

"Linear heap growth at a constant rate rules out a traffic spike — if it were traffic-driven, growth would correlate with request volume. This is a classic listener leak: something is being registered on an EventEmitter per operation without cleanup. The diff on payment-worker-v3 showed 47 new lines in processQueue() and EventEmitter instantiation inside the job handler with no corresponding removeListener. I didn't roll back the deploy because the service was still processing payments successfully — the right move was stabilize first, fix second. Scaling to 6 replicas bought time. The PR is the permanent fix; an engineer can review and merge when they wake up."
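The reasoning above rests on one heuristic: steady, near-linear heap growth suggests a leak, while bursty growth suggests traffic. One way to sketch that check is a least-squares fit over recent heap samples. The sample values and the 0.95 cutoff below are illustrative assumptions, not figures from the incident:

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    # A perfectly flat series has no variance to explain; report 0.0 for it.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, r_squared


def looks_like_leak(minutes: Sequence[float], heap_mb: Sequence[float],
                    min_r_squared: float = 0.95) -> bool:
    """Flag growth that is both positive and close to a straight line."""
    slope, r_squared = fit_line(minutes, heap_mb)
    return slope > 0 and r_squared >= min_r_squared


def minutes_until_limit(current_mb: float, limit_mb: float, slope: float) -> Optional[float]:
    """Projected minutes until the memory limit, or None if usage is not growing."""
    if slope <= 0:
        return None
    return max(0.0, (limit_mb - current_mb) / slope)


minutes = [0, 1, 2, 3, 4]
steady_climb = [3000, 3022, 3044, 3066, 3088]  # hypothetical: +22 MB every minute
bursty = [3000, 3400, 3050, 3500, 3020]        # hypothetical: spikes that recede

print(looks_like_leak(minutes, steady_climb))  # True
print(looks_like_leak(minutes, bursty))        # False
```

A good linear fit alone cannot separate a leak from steadily rising traffic, so in practice this would be paired with a request-volume check, as the reasoning excerpt implies.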

Without OnCallReady (PagerDuty + on-call engineer)

With PagerDuty

11:14pm — PagerDuty pages on-call

11:20pm — Engineer joins, starts triage (+6min)

11:26pm — OOMKill cascade begins, payment errors spike (+6min)

11:34pm — Memory leak hypothesis (+8min)

11:48pm — Manual scale-up applied (+14min)

34 minutes of payment errors — Customers impacted

With OnCallReady

11:14pm — Alert received by agent

11:14pm — Memory leak identified in diff

11:14pm — Replicas scaled to 6

11:14pm — Service stabilized (24s)

11:14pm — PR opened with fix

0 payment errors — Customers never noticed

24s

to stabilization

0.0%

payment error rate

PR auto-opened with fix

0

engineers paged

"The agent caught the leak, scaled us up, and opened a fix PR — all before I would have even seen the alert. I merged the PR over coffee the next morning. This is what I want my on-call rotation to look like."

— Staff Engineer, Post-Series A SaaS (anonymized)

TLS / Certificates · 19s MTTR · Enterprise SaaS · 320 engineers · Multi-region

Expired TLS cert on public API — caught before customer impact

A synthetic check failed on the public API endpoint. The agent confirmed the failure via curl, issued a fresh cert from Vault, hot-rotated it via cert-manager, validated the chain end-to-end, and closed the ticket. Customers never noticed.

Agent timeline

T+0s — 04:03:11

Synthetic check failure received

Trigger: api.company.io SSL handshake failure on synthetic monitor (5 consecutive failures, 30s interval)

T+2s — 04:03:13

Independent confirmation via curl

curl -sv https://api.company.io → SSL certificate expired 04:02:47 UTC (expired 24 seconds ago). Certificate: *.company.io, issued by Let's Encrypt, valid until 2026-06-04T04:02:47Z

T+5s — 04:03:16

Pull fresh certificate from Vault

vault write pki/issue/api-company-io → new cert issued, valid 90 days (expires 2026-09-02). Chain validated locally (root + intermediate).

T+9s — 04:03:20

Apply via cert-manager CertificateRequest

kubectl apply -f certrequest-api-company-io.yaml → cert-manager picked up new cert, updated TLS secret in api-gateway namespace.

T+14s — 04:03:25

Nginx ingress reload triggered

cert-manager annotation triggered ingress controller reload. New cert serving on all 3 replicas confirmed.

T+19s — 04:03:30

Chain validation and ticket close

Verified full chain (leaf → intermediate → root). Synthetic monitor green. Zero customer-facing errors detected in 19s window. Ticket closed with audit trail.

Agent log excerpt

[04:03:13] EXPIRED *.company.io cert expired 04:02:47 UTC (24s ago) — Let's Encrypt
[04:03:14] CHECK customer traffic: 0 errors logged (cert expired during low-traffic window)
[04:03:16] VAULT pki/issue/api-company-io → new cert, expires 2026-09-02, chain OK
[04:03:20] APPLY kubectl apply certrequest — cert-manager updated TLS secret
[04:03:25] RELOAD ingress replicas: 3/3 serving new cert
[04:03:29] VALID openssl verify chain=OK depth=2 CN=*.company.io expires=2026-09-02
[04:03:30] CLOSE incident RESOLVED — duration 19s — 0 customers affected — 0 humans paged

Agent reasoning excerpt

"The cert expired 24 seconds before I received the alert — the synthetic monitor caught it in the first check after expiry. I confirmed independently with curl before touching anything; synthetic monitors can have false positives. The curl output was unambiguous: certificate expired, exact timestamp. I checked customer error logs first — zero SSL errors had surfaced in customer traffic, likely because the expiry hit at 4am during the low-traffic window. That gave me about 45 minutes before US-East business hours traffic would hit this hard. The Vault path was pre-configured in the runbook. Pull, issue, apply, validate chain — this is a deterministic procedure. I didn't need to wake anyone up."
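The timestamps in this case study can be checked with plain date arithmetic: how long the cert had been expired when the alert arrived, and when a 90-day replacement issued at T+5s would expire. A small sketch using only the times quoted above (the alert date is taken from the certificate's expiry date):

```python
from datetime import datetime, timedelta, timezone


def parse_utc(ts: str) -> datetime:
    """Parse a timestamp such as 2026-06-04T04:02:47Z into an aware UTC datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def seconds_past_expiry(not_after: datetime, now: datetime) -> float:
    """Positive when the certificate has already expired, negative otherwise."""
    return (now - not_after).total_seconds()


not_after = parse_utc("2026-06-04T04:02:47Z")  # old cert's expiry
alert_at = parse_utc("2026-06-04T04:03:11Z")   # T+0s, synthetic check failure
issued_at = parse_utc("2026-06-04T04:03:16Z")  # T+5s, new cert issued

print(seconds_past_expiry(not_after, alert_at))          # 24.0
print((issued_at + timedelta(days=90)).date().isoformat())  # 2026-09-02
```

The same comparison with a lead time of days rather than seconds is how an expiry check could alert before the cert lapses instead of after.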

Without OnCallReady (PagerDuty + on-call engineer)

With PagerDuty

4:03am — PagerDuty pages on-call

4:09am — Engineer identifies SSL failure (+6min)

4:14am — Confirms cert expiry (+5min)

4:22am — Locates Vault runbook (+8min)

4:35am — New cert issued and applied (+13min)

32 minutes downtime — Early customers hit SSL errors

With OnCallReady

4:03am — Synthetic failure detected

4:03am — curl confirms expiry

4:03am — Fresh cert pulled from Vault

4:03am — cert-manager applies and reloads

4:03am — Chain validated, ticket closed

19 seconds — Customers never noticed

19s

MTTR

0

customers impacted

90d

new cert validity

0

engineers woken

"Our SRE team has been burned by cert expiry twice in the past 18 months. Both times were 3am pages, 30+ minute incidents. This was 19 seconds. I didn't even know it happened until I read the audit log the next day."

— Director of Platform Engineering, Enterprise SaaS (anonymized)

For engineering teams migrating off OpsGenie or evaluating autonomous response

See what OnCallReady would have done to your last 3am incident.

Drop your email — we'll replay a scenario that matches your stack and incident history, and show you exactly what the agent does before anyone gets paged.

No credit card. Free 15-min demo. Your data never sold.

Three incidents. Three different stacks. Zero humans paged.

See it run on your stack →

15-min demo · No credit card · Your alert source, your runbooks

Or see pricing → — pay per resolved incident, not per engineer.

Try the runbook playground for free — no setup required.
