Real incidents. Real resolutions.

How OnCallReady resolves real incidents —
in seconds, with no human in the loop.

Three anonymized case studies from production environments. Full timelines, agent reasoning, and the counterfactual: what would have happened with PagerDuty and a sleeping engineer.

38s
Network cascade resolution
24s
OOM payment service
19s
Expired TLS on public API
0
humans paged across all three
Network / Egress 38s MTTR Series B fintech · 180 engineers · US-East-1

Network degradation cascade — 2:47am

DataDog fired a P1 network latency spike. Agent correlated with a deploy 16 minutes earlier, identified a bad egress rule, rolled back, and posted an incident summary to Slack. No engineer was paged.
T+0s — 02:47:03
P1 alert received from DataDog
Trigger: p99 egress latency 4,200ms (threshold 500ms) on us-east-1 payment-gateway
T+3s — 02:47:06
Agent queries deploy log
Last deploy: payment-gateway v2.14.1 at 02:31:44 by CI pipeline (PR #4187)
T+7s — 02:47:10
Diff against baseline network config
Found: egress rule added in v2.14.1 routes all outbound traffic through new VPC NAT gateway not yet attached to transit gateway
T+12s — 02:47:15
Confirmed impact scope
Checked 6 downstream services — 3 degraded (checkout, fraud-check, kyc-verify), 3 unaffected. Breach estimated at $1,100/min at current error rate.
T+19s — 02:47:22
Executed egress rule revert
Applied pre-rollback runbook: reverted egress rule to v2.13.x baseline via Terraform apply (targeting network module only)
T+31s — 02:47:34
Validated recovery
p99 latency returned to 180ms. All 3 degraded downstream services confirmed healthy via synthetic probes.
T+38s — 02:47:41
Posted incident summary to #incidents-prod Slack
Summary: root cause, actions taken, services affected, duration, PR flagged for review. Ticket auto-created in Jira with full trace.
[02:47:06] QUERY deploy_log WHERE service='payment-gateway' AND ts > NOW()-INTERVAL '30m'
[02:47:06] FOUND v2.14.1 deployed 02:31:44 — 16m before alert
[02:47:07] DIFF network/egress.tf v2.13.x → v2.14.1
[02:47:08] DELTA +route: 0.0.0.0/0 via nat-gateway-07f3c8 (not attached to tgw-prod)
[02:47:10] SCOPE services affected: checkout(ERR 23%), fraud-check(TIMEOUT 41%), kyc-verify(TIMEOUT 38%)
[02:47:22] EXEC terraform apply -target=module.network -var egress_rule=baseline
[02:47:29] APPLY complete — 1 resource updated
[02:47:34] VALID p99=180ms ✓ checkout=OK ✓ fraud-check=OK ✓ kyc-verify=OK ✓
[02:47:41] CLOSE incident RESOLVED — duration 38s — 0 humans paged
Agent reasoning excerpt
"The latency spike pattern — sudden onset, all outbound traffic affected, no change in inbound traffic — pointed to an egress routing problem rather than an application regression. I checked the deploy log from 2:31am and saw the egress rule change introduced in PR #4187. The diff confirmed a new NAT gateway route that wasn't wired into the transit gateway yet. This is a known misconfiguration class — the fix is a targeted Terraform revert, not a full service rollback. I confirmed impact scope across downstream services first to make sure nothing worse was happening before I applied the change. Recovery validated within 5s of apply completing."
With PagerDuty
2:47am — PagerDuty fires page
2:52am — Engineer sees page, joins incident bridge (+5min)
3:04am — Deploy log identified as root cause (+12min)
3:19am — Egress rule identified (+15min)
3:31am — Revert executed, services recover (+12min)
44 minutes total — ~$48k in potential SLA breach
With OnCallReady
2:47am — Alert received by agent
2:47am — Deploy log queried automatically
2:47am — Root cause identified (egress diff)
2:47am — Revert applied, validated
2:47am — Slack summary posted
38 seconds total — SLA breach avoided ($42k saved)
38s
MTTR
$42k
SLA breach avoided
0
engineers woken
3
downstream services recovered

"We had a 38-second incident that would have been a 44-minute incident. Our on-call engineer found out at 9am when they read the Slack summary. That's the whole product right there."

— VP Engineering, Series B fintech (anonymized)
Memory / OOM 24s to stabilization Post-Series A SaaS · 95 engineers · GKE

Runaway pod OOM-killing payment service

Prometheus fired a memory pressure alert at 11:14pm. Agent diffs against baseline, identifies a memory leak in a newly-deployed worker, scales replicas while the patch is prepared, and opens a GitHub PR with the proposed fix.
T+0s — 23:14:07
P1 alert received from Prometheus
Trigger: payment-worker-v3 pod memory 94% of limit (3.8/4.0 GiB), OOMKill events: 3 in last 90s
T+2s — 23:14:09
Diff pod spec vs baseline
payment-worker-v3 deployed 47 minutes ago (23:27 build pipeline). Previous stable version: payment-worker-v2 (7 days uptime, no OOM events).
T+6s — 23:14:13
Memory growth profile analysis
Heap growth rate: +22MB/min, linear slope. Pattern matches unbounded event listener accumulation — not a traffic spike. Confirmed not a node-level issue (co-located pods unaffected).
T+9s — 23:14:16
Immediate stabilization: scale replicas
Applied kubectl scale: payment-worker-v3 replicas 2→6. New pods start cold with clean heap while leaking pods continue draining queue.
T+24s — 23:14:31
Service stabilized — OOMKill rate drops to 0
Queue depth returning to baseline. Error rate on payment processing: 0.0%. Replicas handling load correctly.
T+41s — 23:14:48
GitHub PR opened with proposed fix
PR #4203: "fix: remove unbounded EventEmitter in payment-worker processQueue() — add listener cleanup on job completion". Diff targeted, test coverage note included.
[23:14:09] DEPLOY payment-worker-v3 → 23:27:14 build #8841 (PR #4198 merged 23:19)
[23:14:10] DIFF src/workers/paymentWorker.js +47 lines vs v2
[23:14:11] FLAG processQueue(): EventEmitter created per job, removeListener() never called
[23:14:12] HEAP growth_rate=+22MB/min slope=linear (not spike) → leak confirmed
[23:14:13] RISK OOMKill in ~8min at current rate → 3 pods active → payment processing failure
[23:14:16] SCALE kubectl scale deploy/payment-worker-v3 --replicas=6
[23:14:31] STABLE OOMKill_rate=0 payment_errors=0.0% queue_depth=normal
[23:14:48] PR github.com/org/repo/pull/4203 — fix: unbounded EventEmitter cleanup
Agent reasoning excerpt
"Linear heap growth at a constant rate rules out a traffic spike — if it were traffic-driven, growth would correlate with request volume. This is a classic listener leak: something is being registered on an EventEmitter per operation without cleanup. The diff on payment-worker-v3 showed 47 new lines in processQueue() and EventEmitter instantiation inside the job handler with no corresponding removeListener. I didn't roll back the deploy because the service was still processing payments successfully — the right move was stabilize first, fix second. Scaling to 6 replicas bought time. The PR is the permanent fix; an engineer can review and merge when they wake up."
With PagerDuty
11:14pm — PagerDuty pages on-call
11:20pm — Engineer joins, starts triage (+6min)
11:26pm — OOMKill cascade begins, payment errors spike (+6min)
11:34pm — Memory leak hypothesis (+8min)
11:48pm — Manual scale-up applied (+14min)
34 minutes of payment errors — Customers impacted
With OnCallReady
11:14pm — Alert received by agent
11:14pm — Memory leak identified in diff
11:14pm — Replicas scaled to 6
11:14pm — Service stabilized (24s)
11:14pm — PR opened with fix
0 payment errors — Customers never noticed
24s
to stabilization
0%
payment error rate
1
PR auto-opened with fix
0
engineers paged

"The agent caught the leak, scaled us up, and opened a fix PR — all before I would have even seen the alert. I merged the PR over coffee the next morning. This is what I want my on-call rotation to look like."

— Staff Engineer, Post-Series A SaaS (anonymized)
TLS / Certificates 19s MTTR Enterprise SaaS · 320 engineers · Multi-region

Expired TLS cert on public API — caught before customer impact

A synthetic check failed on the public API endpoint. Agent confirms via curl, pulls a fresh cert from Vault, hot-rotates via cert-manager, validates the chain end-to-end, and closes the ticket. Customers never noticed.
T+0s — 04:03:11
Synthetic check failure received
Trigger: api.company.io SSL handshake failure on synthetic monitor (5 consecutive failures, 30s interval)
T+2s — 04:03:13
Independent confirmation via curl
curl -sv https://api.company.io → SSL certificate expired 04:02:47 UTC (expired 24 seconds ago). Certificate: *.company.io, issued by Let's Encrypt, valid until 2026-06-04T04:02:47Z
T+5s — 04:03:16
Pull fresh certificate from Vault
vault read pki/issue/api-company-io → new cert issued, valid 90 days (expires 2026-09-02). Chain validated locally (root + intermediate).
T+9s — 04:03:20
Apply via cert-manager CertificateRequest
kubectl apply -f certrequest-api-company-io.yaml → cert-manager picked up new cert, updated TLS secret in api-gateway namespace.
T+14s — 04:03:25
Nginx ingress reload triggered
cert-manager annotation triggered ingress controller reload. New cert serving on all 3 replicas confirmed.
T+19s — 04:03:30
Chain validation and ticket close
Verified full chain (leaf → intermediate → root). Synthetic monitor green. Zero customer-facing errors detected in 19s window. Ticket closed with audit trail.
[04:03:13] EXPIRED *.company.io cert expired 04:02:47 UTC (24s ago) — Let's Encrypt
[04:03:14] CHECK customer traffic: 0 errors logged (cert expired during low-traffic window)
[04:03:16] VAULT pki/issue/api-company-io → new cert, expires 2026-09-02, chain OK
[04:03:20] APPLY kubectl apply certrequest — cert-manager updated TLS secret
[04:03:25] RELOAD ingress replicas: 3/3 serving new cert
[04:03:29] VALID openssl verify chain=OK depth=2 CN=*.company.io expires=2026-09-02
[04:03:30] CLOSE incident RESOLVED — duration 19s — 0 customers affected — 0 humans paged
Agent reasoning excerpt
"The cert expired 24 seconds before I received the alert — the synthetic monitor caught it in the first check after expiry. I confirmed independently with curl before touching anything; synthetic monitors can have false positives. The curl output was unambiguous: certificate expired, exact timestamp. I checked customer error logs first — zero SSL errors had surfaced in customer traffic, likely because the expiry hit at 4am during the low-traffic window. That gave me about 45 minutes before US-East business hours traffic would hit this hard. The Vault path was pre-configured in the runbook. Pull, issue, apply, validate chain — this is a deterministic procedure. I didn't need to wake anyone up."
With PagerDuty
4:03am — PagerDuty pages on-call
4:09am — Engineer identifies SSL failure (+6min)
4:14am — Confirms cert expiry (+5min)
4:22am — Locates Vault runbook (+8min)
4:35am — New cert issued and applied (+13min)
32 minutes downtime — Early customers hit SSL errors
With OnCallReady
4:03am — Synthetic failure detected
4:03am — curl confirms expiry
4:03am — Fresh cert pulled from Vault
4:03am — cert-manager applies and reloads
4:03am — Chain validated, ticket closed
19 seconds — Customers never noticed
19s
MTTR
0
customers impacted
90d
new cert validity
0
engineers woken

"Our SRE team has been burned by cert expiry twice in the past 18 months. Both times were 3am pages, 30+ minute incidents. This was 19 seconds. I didn't even know it happened until I read the audit log the next day."

— Director of Platform Engineering, Enterprise SaaS (anonymized)

Three incidents. Three different stacks. Zero humans paged.

See it run on your stack →

15-min demo · No credit card · Your alert source, your runbooks

Or see pricing → — pay per resolved incident, not per engineer.

See it run on your stack — book a 15-min demo.

Bring your worst 3am incident. We'll show you how the agent handles it.

Book demo →