When CPU usage goes critical, OnCallReady finds the runaway process, throttles or terminates it, triggers horizontal scaling if the load is legitimate, and verifies that the system recovers, all before your phone rings.
Fires on CPU-related alert payloads that indicate high utilization or runaway processes. Handles load average spikes, per-process CPU ceiling breaches, and cloud autoscaler warnings. Typical alerts: "CPU at 98% on worker-node-4", "High load average: 24.8", "Process consuming 340% CPU".
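As a rough illustration of how alert text like the three examples above could be matched, here is a minimal sketch. The patterns, the kind labels, and the function name match_cpu_alert are assumptions made for this example, not OnCallReady's actual trigger rules.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns, derived only from the three example alerts above.
_PATTERNS = [
    ("process_cpu", re.compile(r"process consuming\s+(\d+(?:\.\d+)?)%\s*cpu", re.I)),
    ("load_average", re.compile(r"load average:?\s*(\d+(?:\.\d+)?)", re.I)),
    ("host_cpu", re.compile(r"cpu at\s+(\d+(?:\.\d+)?)%", re.I)),
]


def match_cpu_alert(message: str) -> Optional[Tuple[str, float]]:
    """Return (kind, value) for a CPU-related alert message, or None if it does not match."""
    for kind, pattern in _PATTERNS:
        found = pattern.search(message)
        if found:
            return kind, float(found.group(1))
    return None
```

A message that matches none of the patterns returns None, so unrelated alerts (disk, memory) would fall through to other handlers.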
Samples CPU usage across all processes for 5 seconds. Identifies processes above the 80% per-core threshold. Differentiates legitimate load spikes (traffic bursts) from runaway processes (infinite loops, busy-wait or livelock conditions).
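The sampling step can be sketched as two snapshots of cumulative CPU time taken a few seconds apart. The version below assumes a Linux host and reads /proc directly. The function names are hypothetical, and only the 5-second interval and 80% threshold come from the description above.

```python
import os
from typing import Dict


def read_cpu_seconds() -> Dict[int, float]:
    """Cumulative CPU seconds (user + system) per PID, read from Linux /proc."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    usage: Dict[int, float] = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as stat_file:
                # The command name may contain spaces, so split after the last ')'.
                fields = stat_file.read().rsplit(")", 1)[1].split()
            # utime and stime are fields 14 and 15 of the stat line.
            utime, stime = int(fields[11]), int(fields[12])
        except (OSError, IndexError, ValueError):
            continue  # the process exited mid-read or the line was malformed
        usage[int(entry)] = (utime + stime) / ticks_per_second
    return usage


def cpu_percent(before: Dict[int, float], after: Dict[int, float],
                interval_seconds: float) -> Dict[int, float]:
    """Percent of one core used by each PID present in both snapshots."""
    return {
        pid: 100.0 * (after[pid] - before[pid]) / interval_seconds
        for pid in after
        if pid in before
    }


def over_threshold(percent_by_pid: Dict[int, float],
                   threshold: float = 80.0) -> Dict[int, float]:
    """Keep only the PIDs above the per-core threshold."""
    return {pid: pct for pid, pct in percent_by_pid.items() if pct > threshold}
```

In use, one would call read_cpu_seconds(), sleep for 5 seconds, call it again, and pass both snapshots to cpu_percent with interval_seconds=5.0. Multi-threaded processes can exceed 100% because the figure is relative to a single core.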
If the spike comes from a single process that has held high CPU for an abnormally long time, classifies it as a runaway. If the spike is distributed across workers with normal request patterns, classifies it as a traffic burst and triggers scaling instead.
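The decision rule above might look like the following sketch. The ProcessSample fields, the 60-second cutoff for "abnormal" duration, and the "unknown" fallback are illustrative assumptions. The description does not state how duration or request patterns are actually measured.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProcessSample:
    pid: int
    cpu_percent: float   # percent of one core
    hot_seconds: float   # how long the process has stayed above the threshold
    is_worker: bool      # True if it belongs to the request-serving worker pool


def classify_spike(samples: List[ProcessSample],
                   request_pattern_normal: bool,
                   cpu_threshold: float = 80.0,
                   max_hot_seconds: float = 60.0) -> str:
    """Return "runaway", "traffic_burst", or "unknown" for a set of samples."""
    hot = [s for s in samples if s.cpu_percent > cpu_threshold]
    # One process, hot for longer than expected: treat as a runaway.
    if len(hot) == 1 and hot[0].hot_seconds > max_hot_seconds:
        return "runaway"
    # Several workers hot at once while requests look normal: treat as a burst.
    if len(hot) > 1 and all(s.is_worker for s in hot) and request_pattern_normal:
        return "traffic_burst"
    # Anything else is ambiguous and left for a human (assumed fallback).
    return "unknown"
```

Returning "unknown" rather than guessing keeps the sketch from throttling or scaling when neither pattern clearly applies.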
For a runaway: applies cpulimit throttling first (non-destructive). If CPU stays above the threshold after 10 seconds, sends SIGTERM. Logs the process's command line and user for post-incident review.
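A minimal sketch of the throttle-then-terminate sequence follows. It assumes the cpulimit binary is installed and uses its -p (PID) and -l (limit percent) options. The 50% limit, the function names, and the injected read_cpu callable are hypothetical. Only the 10-second wait, the threshold check, and the SIGTERM step come from the description above.

```python
import logging
import os
import signal
import subprocess
import time
from typing import Callable

log = logging.getLogger("runaway")


def start_cpulimit(pid: int, limit_percent: int) -> None:
    """Start cpulimit against the PID (assumes cpulimit is on PATH)."""
    subprocess.Popen(["cpulimit", "-p", str(pid), "-l", str(limit_percent)])


def send_sigterm(pid: int) -> None:
    os.kill(pid, signal.SIGTERM)


def remediate_runaway(pid: int, cmdline: str, user: str,
                      read_cpu: Callable[[int], float],
                      throttle: Callable[[int, int], None] = start_cpulimit,
                      terminate: Callable[[int], None] = send_sigterm,
                      sleep: Callable[[float], None] = time.sleep,
                      threshold: float = 80.0,
                      limit_percent: int = 50,   # assumed value
                      grace_seconds: float = 10.0) -> str:
    """Throttle first; send SIGTERM only if CPU is still above the threshold."""
    # Record who ran what, for post-incident review.
    log.info("runaway pid=%d user=%s cmdline=%s", pid, user, cmdline)
    throttle(pid, limit_percent)
    sleep(grace_seconds)
    if read_cpu(pid) > threshold:
        terminate(pid)
        return "terminated"
    return "throttled"
```

Passing throttle, terminate, and sleep as parameters is a design choice for this sketch: it lets the sequence be exercised with fakes, without touching a real process.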
For legitimate load: calls the cluster autoscaler or container orchestrator API to add instances. Waits for the new nodes to become healthy before marking the load as balanced.
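The scale-out step reduces to "request instances, then poll health until ready or timed out". The sketch below keeps the orchestrator abstract: request_instances and is_healthy are hypothetical callables standing in for whichever autoscaler or orchestrator API is in use, and the timeout and poll interval are assumed values.

```python
import time
from typing import Callable, List


def scale_out_and_wait(request_instances: Callable[[int], List[str]],
                       is_healthy: Callable[[str], bool],
                       count: int,
                       timeout_seconds: float = 300.0,   # assumed value
                       poll_seconds: float = 5.0,        # assumed value
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> bool:
    """Request new instances and return True once all of them report healthy."""
    instance_ids = request_instances(count)
    deadline = clock() + timeout_seconds
    while True:
        if all(is_healthy(instance_id) for instance_id in instance_ids):
            return True
        if clock() >= deadline:
            return False  # caller decides whether to retry or escalate
        sleep(poll_seconds)
```

A False return means the new capacity never became healthy within the timeout, which would be a natural point to escalate rather than mark the load as balanced.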
Monitors CPU for 30 seconds after the action. Confirms that usage drops below 70%. Resolves the incident with an action summary. Escalates if CPU remains elevated despite intervention.
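The verification step can be sketched as a short polling loop. The 30-second window and 70% target come from the description above. The 5-second poll interval, the function name, and the choice to judge recovery by the final reading are assumptions for this example.

```python
import time
from typing import Callable, List


def verify_recovery(read_cpu: Callable[[], float],
                    window_seconds: float = 30.0,
                    poll_seconds: float = 5.0,     # assumed value
                    target_percent: float = 70.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll CPU over the window; return "resolved" or "escalate"."""
    readings: List[float] = []
    elapsed = 0.0
    while elapsed < window_seconds:
        sleep(poll_seconds)
        elapsed += poll_seconds
        readings.append(read_cpu())
    # Assumption: recovery is judged by the last reading in the window.
    if readings and readings[-1] < target_percent:
        return "resolved"
    return "escalate"
```

A stricter variant could require several consecutive readings below the target before resolving, at the cost of a longer wait.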
Whether it's a runaway job or a traffic burst, OnCallReady handles it. See a live demo.