Critical

Service Restart & Recovery

When a service crashes or stops responding, OnCallReady detects the failure, collects crash evidence, performs a safe restart, verifies full health — and does it all before the first human would even be aware there was a problem.

Avg Resolution
17s
Severity
Critical
Success Rate
99%
Humans Paged
0

Trigger Conditions

/service.*(down|crash|restart|fail|unavailable|not.respond|unreachable)/i

Fires on any service availability alert. Handles health check failures, process crashes detected by the OS, container exit events, and synthetic monitoring failures. Typical: "Service payment-api is down", "Health check failing on /health", "Process exited with code 1", "Container restart loop detected".

What the Agent Does

1

Confirm service is actually down

Probes the health endpoint from 3 different network locations to rule out monitoring false positives. Confirms failure before taking any action. Prevents unnecessary restarts of healthy services.

2

Capture pre-restart evidence

Collects last 100 log lines, crash dump if present, and process exit code. Stores evidence for post-incident review before the restart clears the runtime state.

3

Remove from load balancer rotation

Drains the unhealthy instance from the load balancer backend pool before restart. Prevents in-flight requests from hitting a service in the process of coming up.

4

Restart with exponential backoff

Issues a graceful restart (SIGTERM → wait 5s → SIGKILL if still running). If restart fails twice, waits 10s before attempting again. Prevents crash-loop storms from cascading.

5

Verify health and re-add to rotation

Polls health endpoint until it returns 200 for 3 consecutive checks. Re-adds service to load balancer. Confirms traffic is flowing. Closes incident with full timeline.

Example Incident Log

incident-5718 · service-restart · payment-api
[02:03:17] ALERT Service payment-api is not responding
[02:03:17] Matched runbook: service-restart
[02:03:18] Confirming failure from 3 probe locations...
[02:03:19] us-east-1 probe: timeout · us-west-2: timeout · eu-west-1: timeout
[02:03:19] Confirmed down · Capturing crash evidence
[02:03:20] Last log: "FATAL: Unhandled exception in payment processor" (02:03:12)
[02:03:20] Removing payment-api from LB rotation
[02:03:21] Drained from load balancer
[02:03:21] Restarting payment-api
[02:03:26] Process started (PID 31094)
[02:03:27] Polling /health... 200 OK · 200 OK · 200 OK
[02:03:29] Re-adding to load balancer
[02:03:34] ✓ RESOLVED payment-api serving traffic · Duration: 17s
[02:03:34] On-call: undisturbed. Crash dump saved for dev review.

Services that fix themselves in seconds

OnCallReady restarts, verifies, and re-routes — all before you'd finish reading the alert. See it live.