When a service crashes or stops responding, OnCallReady detects the failure, collects crash evidence, performs a safe restart, verifies full health — and does it all before the first human would even be aware there was a problem.
Fires on any service availability alert. Handles health check failures, process crashes detected by the OS, container exit events, and synthetic monitoring failures. Typical: "Service payment-api is down", "Health check failing on /health", "Process exited with code 1", "Container restart loop detected".
Probes the health endpoint from 3 different network locations to rule out monitoring false positives. Confirms failure before taking any action. Prevents unnecessary restarts of healthy services.
Collects last 100 log lines, crash dump if present, and process exit code. Stores evidence for post-incident review before the restart clears the runtime state.
Drains the unhealthy instance from the load balancer backend pool before restart. Prevents in-flight requests from hitting a service in the process of coming up.
Issues a graceful restart (SIGTERM → wait 5s → SIGKILL if still running). If restart fails twice, waits 10s before attempting again. Prevents crash-loop storms from cascading.
Polls health endpoint until it returns 200 for 3 consecutive checks. Re-adds service to load balancer. Confirms traffic is flowing. Closes incident with full timeline.
OnCallReady restarts, verifies, and re-routes — all before you'd finish reading the alert. See it live.