High

CPU Spike Remediation

When CPU usage goes critical, OnCallReady finds the runaway process, throttles or terminates the offender, triggers horizontal scaling if the load is legitimate, and verifies the system recovers — all before your phone rings.

Avg Resolution

41s

Severity

High

Success Rate

94%

Humans Paged

Trigger Conditions

Fires on alert payloads that indicate high CPU utilization or a runaway process. Handles load-average spikes, per-process CPU ceiling breaches, and cloud autoscaler warnings. Typical alerts: "CPU at 98% on worker-node-4", "High load average: 24.8", "Process consuming 340% CPU".
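
As a rough illustration of how alert text like the examples above could be recognized, here is a minimal Python sketch. It assumes alerts arrive as plain strings; the pattern table and the `match_cpu_alert` function are hypothetical names for illustration, not OnCallReady's actual matcher.

```python
import re

# Hypothetical pattern table: one regex per alert kind, each capturing one number.
CPU_ALERT_PATTERNS = {
    "utilization": re.compile(r"\bCPU at (\d+(?:\.\d+)?)%", re.IGNORECASE),
    "load_average": re.compile(r"\bload average:\s*(\d+(?:\.\d+)?)", re.IGNORECASE),
    "process_cpu": re.compile(r"\bprocess consuming (\d+(?:\.\d+)?)% CPU", re.IGNORECASE),
}


def match_cpu_alert(message):
    """Return (kind, value) for the first pattern that matches, or None."""
    for kind, pattern in CPU_ALERT_PATTERNS.items():
        found = pattern.search(message)
        if found:
            return kind, float(found.group(1))
    return None


if __name__ == "__main__":
    for text in ("CPU at 98% on worker-node-4", "High load average: 24.8"):
        print(text, "->", match_cpu_alert(text))
```

A message that matches none of the patterns returns `None`, which a caller could treat as "not a CPU alert" and route to a different runbook.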

What the Agent Does

Identify top CPU consumers

Samples CPU usage across all processes for 5 seconds and flags any process above the 80% per-core threshold. Distinguishes legitimate load spikes (a traffic burst) from runaway processes (an infinite loop or livelock).
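
The sampling step can be pictured as a diff between two snapshots of cumulative CPU time. The sketch below assumes each snapshot is a dictionary mapping PID to CPU seconds consumed so far (on Linux, such values could be derived from `/proc/<pid>/stat`). The function name and the snapshot format are assumptions made for illustration.

```python
def top_cpu_consumers(before, after, interval_s=5.0, threshold_pct=80.0):
    """Compare two {pid: cumulative CPU seconds} snapshots taken interval_s apart.

    Returns [(pid, percent_of_one_core)] for processes above threshold_pct,
    highest first. 100% means one fully busy core, so a multi-threaded
    process can exceed 100%.
    """
    consumers = []
    for pid, end_seconds in after.items():
        start_seconds = before.get(pid)
        if start_seconds is None:
            continue  # process appeared mid-window; no baseline to diff against
        pct = (end_seconds - start_seconds) / interval_s * 100.0
        if pct > threshold_pct:
            consumers.append((pid, round(pct, 1)))
    return sorted(consumers, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    before = {29471: 100.0, 812: 50.0}
    after = {29471: 117.0, 812: 50.5}
    print(top_cpu_consumers(before, after))
```

Measuring over a window rather than taking a single instantaneous reading keeps a brief burst from being mistaken for sustained load.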

Classify the spike type

If the spike comes from a single process that has held high CPU for an abnormally long time, classifies it as runaway. If the spike is spread across workers with normal request patterns, classifies it as a traffic burst and triggers scaling instead.
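
One way to express this decision in code is shown below. This is a simplified sketch: it approximates "a single process" as one process holding at least half of the sampled CPU, and "abnormal duration" as a runtime above a fixed limit. The `dominance` and `max_normal_runtime_min` values are assumptions, and request patterns are not modeled at all.

```python
from dataclasses import dataclass


@dataclass
class ProcessSample:
    pid: int
    name: str
    cpu_pct: float      # percent of one core over the sampling window
    runtime_min: float  # minutes the process has been running


def classify_spike(samples, per_core_threshold=80.0,
                   max_normal_runtime_min=10.0, dominance=0.5):
    """Return 'runaway', 'traffic_burst', or 'unknown' for a set of samples."""
    hot = [s for s in samples if s.cpu_pct > per_core_threshold]
    if not hot:
        return "unknown"  # nothing above threshold; leave it to a human
    total = sum(s.cpu_pct for s in samples)
    top = max(hot, key=lambda s: s.cpu_pct)
    # Runaway: one process dominates AND has been busy for unusually long.
    if top.cpu_pct / total >= dominance and top.runtime_min > max_normal_runtime_min:
        return "runaway"
    return "traffic_burst"


if __name__ == "__main__":
    samples = [ProcessSample(29471, "report-generator", 340.0, 18.0),
               ProcessSample(812, "nginx", 12.0, 900.0)]
    print(classify_spike(samples))
```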

Throttle or terminate runaway process

For a runaway process: applies cpulimit throttling first, which is non-destructive. If CPU stays above the threshold after 10 seconds, sends SIGTERM. Logs the process command line and user for post-incident review.
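
The throttle-then-terminate escalation might look roughly like the following. The sketch assumes the `cpulimit` CLI is installed and uses its `-p` (PID) and `-l` (limit) options. The `remediate_runaway` function, the injected `read_cpu` callable, and the 80% default threshold are illustrative assumptions. Logging of the command line and user is left out.

```python
import os
import signal
import subprocess
import time


def throttle(pid, limit_pct=50):
    """Start cpulimit against pid (assumes the cpulimit binary is on PATH)."""
    return subprocess.Popen(["cpulimit", "-p", str(pid), "-l", str(limit_pct)])


def remediate_runaway(pid, read_cpu, threshold_pct=80.0, grace_s=10.0,
                      throttle_fn=throttle, kill_fn=os.kill, sleep_fn=time.sleep):
    """Throttle first; send SIGTERM only if CPU is still high after grace_s.

    read_cpu is a caller-supplied function returning current CPU percent.
    Returns 'throttled' or 'terminated'.
    """
    throttle_fn(pid)
    sleep_fn(grace_s)
    if read_cpu() > threshold_pct:
        kill_fn(pid, signal.SIGTERM)  # polite shutdown request, not SIGKILL
        return "terminated"
    return "throttled"
```

Passing the throttle, kill, and sleep functions as parameters lets the escalation logic be exercised with stand-ins, without touching a real process.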

Trigger horizontal scale (if traffic burst)

For legitimate load: calls the cluster autoscaler or container orchestrator API to add instances. Waits for the new instances to become healthy before marking the load as balanced.
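
Because the scaling API differs by platform, the sketch below treats it as two caller-supplied functions: `request_scale`, which asks the orchestrator for a target instance count, and `count_healthy`, which reports how many instances currently pass health checks. Both names, along with the timeout and polling interval, are hypothetical placeholders rather than a real autoscaler API.

```python
import time


def scale_out_and_wait(request_scale, count_healthy, target,
                       timeout_s=300.0, poll_s=5.0,
                       clock=time.monotonic, sleep_fn=time.sleep):
    """Request `target` instances, then poll until that many are healthy.

    Returns True once count_healthy() reaches target, or False on timeout.
    """
    request_scale(target)
    deadline = clock() + timeout_s
    while clock() < deadline:
        if count_healthy() >= target:
            return True
        sleep_fn(poll_s)
    return False
```

Returning False on timeout gives the caller a clear signal to escalate instead of waiting indefinitely for capacity that never arrives.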

Verify CPU recovery

Monitors CPU for up to 30 seconds post-action. Confirms usage drops below 70%. Resolves the incident with an action summary. Escalates if CPU remains elevated despite intervention.
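
A verification loop along these lines is sketched below. It assumes a caller-supplied `read_cpu` function and a 5-second polling interval, and it returns as soon as a reading falls below the target. That early return is an assumption of this sketch, not documented behavior.

```python
import time


def verify_recovery(read_cpu, window_s=30.0, target_pct=70.0,
                    poll_s=5.0, sleep_fn=time.sleep):
    """Poll CPU for up to window_s seconds after remediation.

    Returns ('resolved', last_reading) once CPU is below target_pct,
    or ('escalate', last_reading) if it is still elevated when the window ends.
    """
    elapsed = 0.0
    last = read_cpu()
    while last >= target_pct and elapsed < window_s:
        sleep_fn(poll_s)
        elapsed += poll_s
        last = read_cpu()
    outcome = "resolved" if last < target_pct else "escalate"
    return outcome, last
```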

Example Incident Log

incident-5389 · cpu-spike · worker-node-4

[04:31:02] ALERT CPU at 97% on worker-node-4

[04:31:02] → Matched runbook: cpu-spike

[04:31:04] Sampling CPU consumers (5s window)...

[04:31:09] report-generator (PID 29471) — 340% CPU · 18min runtime · no output

[04:31:09] → Classified: runaway process (infinite loop suspected)

[04:31:10] → Applying cpulimit to PID 29471 (50% ceiling)

[04:31:20] CPU after throttle: 82% — still above threshold

[04:31:21] → Sending SIGTERM to PID 29471 (report-generator)

[04:31:22] ✓ Process terminated

[04:31:32] Monitoring CPU recovery...

[04:31:43] ✓ RESOLVED CPU: 97% → 31% · Duration: 41s

[04:31:43] On-call: undisturbed. Runaway PID logged for dev team.

Auto-resolve CPU spikes before they page you

Whether it's a runaway job or a traffic burst, OnCallReady handles it. See a live demo.
