Memory Exhaustion Remediation

When memory usage spikes to dangerous levels, OnCallReady profiles running processes, drops reclaimable caches, selectively restarts memory-leaking services, and confirms system stability — no pager needed.

Avg Resolution: 34s
Severity: Critical
Success Rate: 96%
Humans Paged: 0

Trigger Conditions

/memory.*(high|exhausted|critical|leak|oom|out.of|pressure|9[0-9]%)/i

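Triggers on alerts in which the word "memory" is followed by a severity signal. Covers OOM killer events, usage percentages from 90% to 99%, swap exhaustion, and memory pressure warnings from any monitoring tool, provided the alert text mentions memory ahead of the keyword. Typical: "Memory usage 94% on api-server-2", "High memory pressure detected". A bare "OOM killer triggered" does not match the pattern as written unless the message also mentions memory.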
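As a quick check of the pattern above, here is a minimal Python sketch that runs it against a few sample alert texts. The `matches_runbook` helper and the `samples` list are illustrative assumptions, not part of OnCallReady; only the regular expression itself comes from this page.

```python
import re

# The trigger pattern from above; the /i flag becomes re.IGNORECASE in Python.
TRIGGER = re.compile(
    r"memory.*(high|exhausted|critical|leak|oom|out.of|pressure|9[0-9]%)",
    re.IGNORECASE,
)

def matches_runbook(alert_text: str) -> bool:
    """Hypothetical helper: True if the alert text would match the trigger pattern."""
    return TRIGGER.search(alert_text) is not None

samples = [
    "Memory usage 94% on api-server-2",
    "High memory pressure detected",
    "OOM killer triggered",
]
for text in samples:
    # Print the match result next to each sample alert.
    print(matches_runbook(text), "|", text)
```

The first two samples match; the third does not, because the pattern needs "memory" to appear before the keyword.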

What the Agent Does

1

Profile memory consumers

Lists top processes by RSS and VSZ. Identifies processes that have grown beyond their baseline allocation. Flags potential memory leaks based on rate-of-growth over the past 30 minutes.
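The rate-of-growth check can be sketched roughly as follows. This is an illustration under stated assumptions, not OnCallReady's actual logic: the `ProcessSample` type, the `flag_leak_suspects` function, and the 1.0 GB growth threshold are all hypothetical, and the sample figures are taken from the example incident log on this page.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ProcessSample:
    name: str
    rss_start_gb: float  # RSS at the start of the 30-minute window
    rss_now_gb: float    # RSS at the end of the window

def flag_leak_suspects(samples: List[ProcessSample], min_growth_gb: float = 1.0) -> List[str]:
    """Return names of processes whose RSS grew by at least min_growth_gb over the window.

    The threshold is an assumed value chosen for illustration only.
    """
    return [s.name for s in samples if s.rss_now_gb - s.rss_start_gb >= min_growth_gb]

window = [
    ProcessSample("api-worker", rss_start_gb=1.4, rss_now_gb=2.8),  # +1.4 GB in 30 min
    ProcessSample("postgres", rss_start_gb=0.9, rss_now_gb=0.9),    # stable
]
print(flag_leak_suspects(window))
```

A real profiler would also compare against each service's baseline allocation; this sketch only covers the growth-over-window part.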

2

Drop reclaimable caches

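Runs sync; echo 3 > /proc/sys/vm/drop_caches (root required) to flush dirty pages to disk and then drop the kernel page cache along with reclaimable dentries and inodes. No application data is lost, because only clean, already-written cache is discarded. Expect briefly slower disk reads while the cache warms up again.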
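For readers who want the same operation outside a shell, here is a minimal Python sketch. The `drop_caches` function name and its `path` parameter are assumptions made for illustration; the `path` argument exists only so the function can be pointed at a scratch file for a dry run. On a real Linux host the default path requires root.

```python
import os

def drop_caches(mode: str = "3", path: str = "/proc/sys/vm/drop_caches") -> None:
    """Flush dirty pages, then ask the kernel to drop clean caches.

    mode "1" drops the page cache, "2" drops dentries and inodes, "3" drops both.
    """
    if mode not in {"1", "2", "3"}:
        raise ValueError("mode must be '1', '2' or '3'")
    os.sync()  # write dirty pages to disk first, like the leading `sync`
    with open(path, "w") as f:
        f.write(mode + "\n")
```

Calling `drop_caches()` with the defaults corresponds to the shell one-liner quoted in this step. Passing `path=os.devnull` exercises the code path without touching the kernel.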

3

Restart memory-leaking services

Gracefully restarts services consuming memory above their configured ceiling. Performs rolling restart if multiple instances exist to maintain availability during remediation.
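One way to picture a rolling restart is the sketch below: restart one instance, wait for it to report healthy, and only then move on. The `rolling_restart` function and the `restart` and `is_healthy` callables are hypothetical stand-ins; how OnCallReady actually restarts services is not described on this page.

```python
from typing import Callable, Iterable, List

def rolling_restart(
    instances: Iterable[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Restart instances one at a time; stop at the first one that fails to come back.

    Returns the instances that were restarted and confirmed healthy.
    """
    done: List[str] = []
    for instance in instances:
        restart(instance)
        if not is_healthy(instance):
            break  # leave the remaining instances untouched so they keep serving
        done.append(instance)
    return done

# Dry run with stand-in callables (no real services are touched).
log: List[str] = []
restarted = rolling_restart(
    ["api-worker-1", "api-worker-2"],
    restart=lambda name: log.append(f"restart {name}"),
    is_healthy=lambda name: True,
)
print(restarted)
```

Stopping at the first unhealthy instance is a design choice in this sketch: it keeps the untouched instances serving traffic while the failure is escalated.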

4

Verify service health post-restart

Polls health endpoints for restarted services. Confirms they are serving traffic within 10 seconds. Waits for memory metrics to stabilize below 75%.
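The polling behaviour can be approximated with a deadline loop like the one below. This is a sketch, not the product's implementation: `wait_until_healthy`, the injected `check` callable, and the 0.5-second polling interval are assumptions. Only the 10-second window comes from the step description.

```python
import time
from typing import Callable

def wait_until_healthy(
    check: Callable[[], bool],
    timeout_s: float = 10.0,
    interval_s: float = 0.5,
) -> bool:
    """Poll check() until it returns True or the timeout would be exceeded."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() + interval_s > deadline:
            return False  # another sleep would overrun the deadline
        time.sleep(interval_s)

# Dry run: the stand-in check fails twice, then succeeds.
responses = iter([False, False, True])
print(wait_until_healthy(lambda: next(responses), timeout_s=1.0, interval_s=0.01))
```

In practice `check` would wrap an HTTP request to the service's health endpoint and return True on a 200 response.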

5

Log and close or escalate

If memory is back under control, resolves the incident with full remediation detail. If memory remains critical after restart, escalates immediately with process profile and restart logs.

Example Incident Log

incident-5103 · memory-exhaustion · api-server-2
[01:52:14] ALERT Memory usage 94% on api-server-2
[01:52:14] Matched runbook: memory-exhaustion
[01:52:15] Profiling top consumers...
[01:52:16] node api-worker (2.8 GB, +1.4 GB/30min — leak suspected)
[01:52:16] postgres (0.9 GB, stable)
[01:52:17] Dropping kernel page caches
[01:52:18] Reclaimed 0.6 GB from page cache
[01:52:18] Gracefully restarting api-worker (PID 18823)
[01:52:24] api-worker restarted (RSS: 412 MB)
[01:52:25] Polling health endpoint /health...
[01:52:26] Service healthy · HTTP 200
[01:52:28] ✓ RESOLVED Memory: 94% → 51% · Duration: 14s
[01:52:28] On-call team: undisturbed. Memory leak flagged for dev review.

Stop getting paged for memory spikes

OnCallReady handles memory exhaustion before it becomes an outage. See a live resolution.