Available now

Prometheus + OnCallReady: AI Incident Response for Prometheus Alerts

OnCallReady registers as an Alertmanager receiver, intercepting Prometheus alerts before they reach PagerDuty — executing the fix autonomously and posting a silence or annotation back so your dashboards stay accurate and your phone stays quiet.

How it works

Alertmanager routes matching alerts to OnCallReady's webhook receiver. OnCallReady resolves the incident, then calls the Alertmanager API to create a silence so the alert doesn't re-fire during remediation.

┌──────────────┐ ┌─────────────────────┐ ┌──────────────────────────────┐ │ Prometheus │ │ Alertmanager │ │ OnCallReady │ │ fires rule │───▶│ routes to receiver │───▶│ ① Ingest alert │ │ │ │ │ │ ② Match runbook │ └──────────────┘ └─────────────────────┘ │ ③ Execute fix (28s avg) │ │ ④ POST silence to AM │ │ ⑤ Annotate Grafana dashboard │ └──────────────────────────────┘

Signal → Action table

Alertmanager alert name	Runbook triggered	Autonomous action
NodeDiskSpaceRunningLow	Disk Full Remediation	Purge logs & tmp, verify free space, post silence for 2h
NodeMemoryUsageHigh	Memory Exhaustion	Drop page cache, restart OOM candidates, confirm stable
HighCPULoad	CPU Spike	Identify offending process, throttle or kill, scale if needed
PostgresConnectionsNearMax	DB Connection Pool	Terminate idle connections, rolling restart of pool manager
ServiceDown	Service Restart & Recovery	Drain from LB, restart with health check gate, re-add to rotation
TLSCertExpiringSoon	SSL Certificate Renewal	ACME renewal, cert deploy, web server reload

Setup in 3 minutes

Add OnCallReady as a receiver in your Alertmanager config. The route below sends all non-critical alerts to OnCallReady first, escalating only if the resolution fails.

alertmanager.yml

# alertmanager.yml route: group_by: ['alertname', 'cluster'] group_wait: 30s group_interval: 5m repeat_interval: 4h receiver: 'oncallready' routes: # Critical alerts still page humans immediately - match: severity: critical receiver: 'pagerduty-critical' receivers: - name: 'oncallready' webhook_configs: - url: 'https://oncallready.polsia.app/api/alerts' send_resolved: true http_config: bearer_token: 'YOUR_API_KEY' # OnCallReady posts silence back to Alertmanager on success. # Failed resolutions auto-escalate to your fallback receiver.

What stays on-call

OnCallReady handles the automatable incidents. These still page a human:

Alerts tagged severity: critical — route these to your existing escalation path
Infrastructure events with no matching runbook — escalated with full Prometheus label context
Runbook actions that fail on first attempt — immediate escalation with diagnostic data
Multi-node cluster degradation where fixing one node could destabilize the quorum
Any alert where the fix would require a deploy or config change — those need a human decision

Most Prometheus teams also use Grafana for dashboards. Connect both:

Grafana → Datadog → Slack → Runbook library → Live demo →

Cut your Alertmanager noise in half

Most teams resolve 60–80% of alerts automatically within the first week.

View pricing → Watch live demo