Critical

Service Restart & Recovery

When a service crashes or stops responding, OnCallReady detects the failure, collects crash evidence, performs a safe restart, verifies full health — and does it all before the first human would even be aware there was a problem.

Avg Resolution

17s

Severity

Critical

Success Rate

99%

Humans Paged

Trigger Conditions

Fires on any service availability alert. Handles health check failures, process crashes detected by the OS, container exit events, and synthetic monitoring failures. Typical: "Service payment-api is down", "Health check failing on /health", "Process exited with code 1", "Container restart loop detected".

What the Agent Does

Confirm service is actually down

Probes the health endpoint from 3 different network locations to rule out monitoring false positives. Confirms failure before taking any action. Prevents unnecessary restarts of healthy services.

Capture pre-restart evidence

Collects last 100 log lines, crash dump if present, and process exit code. Stores evidence for post-incident review before the restart clears the runtime state.

Remove from load balancer rotation

Drains the unhealthy instance from the load balancer backend pool before restart. Prevents in-flight requests from hitting a service in the process of coming up.

Restart with exponential backoff

Issues a graceful restart (SIGTERM → wait 5s → SIGKILL if still running). If restart fails twice, waits 10s before attempting again. Prevents crash-loop storms from cascading.

Verify health and re-add to rotation

Polls health endpoint until it returns 200 for 3 consecutive checks. Re-adds service to load balancer. Confirms traffic is flowing. Closes incident with full timeline.

Example Incident Log

incident-5718 · service-restart · payment-api

[02:03:17] ALERT Service payment-api is not responding

[02:03:17] → Matched runbook: service-restart

[02:03:18] Confirming failure from 3 probe locations...

[02:03:19] ✗ us-east-1 probe: timeout · ✗ us-west-2: timeout · ✗ eu-west-1: timeout

[02:03:19] → Confirmed down · Capturing crash evidence

[02:03:20] Last log: "FATAL: Unhandled exception in payment processor" (02:03:12)

[02:03:20] → Removing payment-api from LB rotation

[02:03:21] ✓ Drained from load balancer

[02:03:21] → Restarting payment-api

[02:03:26] ✓ Process started (PID 31094)

[02:03:27] Polling /health... 200 OK · 200 OK · 200 OK

[02:03:29] → Re-adding to load balancer

[02:03:34] ✓ RESOLVED payment-api serving traffic · Duration: 17s

[02:03:34] On-call: undisturbed. Crash dump saved for dev review.

Services that fix themselves in seconds

OnCallReady restarts, verifies, and re-routes — all before you'd finish reading the alert. See it live.

See it live → ← All runbooks