Available now

Prometheus + OnCallReady: AI Incident Response for Prometheus Alerts

OnCallReady registers as an Alertmanager receiver, intercepting Prometheus alerts before they reach PagerDuty — executing the fix autonomously and posting a silence or annotation back so your dashboards stay accurate and your phone stays quiet.

How it works

Alertmanager routes matching alerts to OnCallReady's webhook receiver. OnCallReady resolves the incident, then calls the Alertmanager API to create a silence so the alert doesn't re-fire during remediation.

┌──────────────┐ ┌─────────────────────┐ ┌──────────────────────────────┐ Prometheus │ │ Alertmanager │ │ OnCallReady fires rule │───▶│ routes to receiver │───▶│ ① Ingest alert │ │ │ │ ② Match runbook └──────────────┘ └─────────────────────┘ │ ③ Execute fix (28s avg) ④ POST silence to AM ⑤ Annotate Grafana dashboard └──────────────────────────────┘

Signal → Action table

Alertmanager alert name Runbook triggered Autonomous action
NodeDiskSpaceRunningLow Disk Full Remediation Purge logs & tmp, verify free space, post silence for 2h
NodeMemoryUsageHigh Memory Exhaustion Drop page cache, restart OOM candidates, confirm stable
HighCPULoad CPU Spike Identify offending process, throttle or kill, scale if needed
PostgresConnectionsNearMax DB Connection Pool Terminate idle connections, rolling restart of pool manager
ServiceDown Service Restart & Recovery Drain from LB, restart with health check gate, re-add to rotation
TLSCertExpiringSoon SSL Certificate Renewal ACME renewal, cert deploy, web server reload

Setup in 3 minutes

Add OnCallReady as a receiver in your Alertmanager config. The route below sends all non-critical alerts to OnCallReady first, escalating only if the resolution fails.

alertmanager.yml
# alertmanager.yml route: group_by: ['alertname', 'cluster'] group_wait: 30s group_interval: 5m repeat_interval: 4h receiver: 'oncallready' routes: # Critical alerts still page humans immediately - match: severity: critical receiver: 'pagerduty-critical' receivers: - name: 'oncallready' webhook_configs: - url: 'https://oncallready.polsia.app/api/alerts' send_resolved: true http_config: bearer_token: 'YOUR_API_KEY' # OnCallReady posts silence back to Alertmanager on success. # Failed resolutions auto-escalate to your fallback receiver.

What stays on-call

OnCallReady handles the automatable incidents. These still page a human:

Related

Most Prometheus teams also use Grafana for dashboards. Connect both:

Cut your Alertmanager noise in half

Most teams resolve 60–80% of alerts automatically within the first week.