Queue Backlog Remediation

A growing message queue usually points to stalled workers or poison-pill messages. OnCallReady diagnoses the cause, restarts stalled consumers, moves poison messages to a dead-letter queue, and scales processors to drain the backlog before it cascades into an outage.

Avg Resolution

31s

Severity

High

Success Rate

95%

Humans Paged

Trigger Conditions

Triggers on queue depth threshold alerts from SQS, RabbitMQ, Kafka, Redis queues, or custom monitoring. Typical: "SQS queue depth 4,200 messages", "Queue consumer lag: 8,000 events", "Kafka consumer group falling behind", "Job queue backlog exceeds threshold".

What the Agent Does

Measure queue depth and consumer rate

Checks current queue depth and consumer throughput over the last 5 minutes, then determines whether the queue is still growing (consumers are down or too slow) or has stabilized (a traffic burst that consumers are absorbing).
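As an illustration only, the growing-versus-stable decision can be sketched from a window of depth samples. The `Sample` type, the `classify_trend` name, and the 5% dead band are assumptions made for this sketch, not OnCallReady's actual logic.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: float    # seconds since the start of the window
    depth: int  # messages waiting in the queue


def classify_trend(samples, tolerance=0.05):
    """Classify a window of depth samples as 'growing', 'draining', or 'stable'.

    tolerance is an assumed dead band: a relative change in depth smaller
    than this fraction over the window is treated as stable.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    first, last = samples[0], samples[-1]
    # Relative change in depth across the window; max() avoids dividing by zero.
    change = (last.depth - first.depth) / max(first.depth, 1)
    if change > tolerance:
        return "growing"
    if change < -tolerance:
        return "draining"
    return "stable"


# A 5-minute window (300 s) in which depth keeps climbing.
window = [Sample(0, 4000), Sample(150, 4400), Sample(300, 4821)]
print(classify_trend(window))  # growing
```

Comparing only the first and last samples keeps the sketch short; a real check would likely smooth over the whole window to avoid reacting to a single noisy reading.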

Check consumer health

Verifies consumer worker processes are alive and processing. Detects stalled consumers (0 messages/min despite queue depth), crashed workers, and poison-pill messages blocking the consumer.
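A minimal sketch of that health check, assuming the agent can read a consumer's throughput and the age of its last heartbeat. The function name, the state labels, and the 60-second heartbeat timeout are hypothetical.

```python
def consumer_state(depth, msgs_per_min, heartbeat_age_s, heartbeat_timeout_s=60):
    """Return 'crashed', 'stalled', or 'healthy' for one consumer.

    heartbeat_timeout_s is an assumed cutoff, not a documented default.
    """
    # No recent heartbeat: the worker process is presumed dead.
    if heartbeat_age_s > heartbeat_timeout_s:
        return "crashed"
    # Alive but processing nothing while messages are waiting.
    if depth > 0 and msgs_per_min == 0:
        return "stalled"
    return "healthy"


# Last heartbeat 4 minutes ago with a deep queue and zero throughput.
print(consumer_state(depth=4821, msgs_per_min=0, heartbeat_age_s=240))  # crashed
```

Checking the heartbeat first separates a dead process, which needs a restart, from a live one that is stuck, which may be blocked on a poison-pill message.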

Restart stalled consumers

For stalled workers, performs a rolling restart. For a poison-pill message (the same message retried more than 5 times), moves the message to the dead-letter queue (DLQ), restarts the consumer, and resumes processing.
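The poison-pill rule can be sketched as a filter on retry counts. The message shape (a dict with `id` and `retries` keys) and the `split_poison` helper are assumptions for illustration; real brokers expose retry or receive counts in their own ways.

```python
MAX_RETRIES = 5  # threshold from the step above: retried more than 5 times


def split_poison(messages, max_retries=MAX_RETRIES):
    """Split messages into (keep, dead_letter) lists by retry count."""
    keep, dead_letter = [], []
    for msg in messages:
        if msg["retries"] > max_retries:
            dead_letter.append(msg)  # would be moved to the DLQ
        else:
            keep.append(msg)         # stays on the main queue
    return keep, dead_letter


batch = [
    {"id": "a1", "retries": 0},
    {"id": "b2", "retries": 9},  # retried well past the threshold
    {"id": "c3", "retries": 5},  # exactly 5 retries is still allowed
]
keep, dead_letter = split_poison(batch)
print([m["id"] for m in keep], [m["id"] for m in dead_letter])
```

Moving the message aside rather than deleting it keeps the payload available for root-cause analysis later.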

Scale up processors if needed

If the backlog is large (more than 1,000 messages) and consumers are healthy, increases the worker replica count to drain the queue faster. Scales back down automatically once the queue clears.
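One way to size the scale-up, shown as a rough sketch: pick enough replicas to drain the backlog within a target time. The 5-minute drain target, the replica cap of 10, and the `target_replicas` name are assumptions; the sketch also ignores new messages arriving during the drain.

```python
import math


def target_replicas(depth, current, per_replica_rate,
                    drain_target_min=5, max_replicas=10, large_backlog=1000):
    """Return the replica count needed to drain `depth` messages in time.

    per_replica_rate is messages per minute handled by one healthy replica.
    """
    if per_replica_rate <= 0:
        raise ValueError("per_replica_rate must be positive")
    # Small backlogs are left to the existing consumers.
    if depth <= large_backlog:
        return current
    needed = math.ceil(depth / (per_replica_rate * drain_target_min))
    # Never scale below the current count, and respect the cap.
    return max(current, min(needed, max_replicas))


# 4,821 messages, 2 replicas at 190 msg/min each (380 msg/min combined).
print(target_replicas(depth=4821, current=2, per_replica_rate=190))  # 6
```

The cap matters in practice because downstream dependencies, such as a database or an email provider, may not tolerate unbounded consumer concurrency.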

Confirm queue draining

Monitors queue depth every 5 seconds. Marks the incident resolved when depth falls back below the threshold and the consumer rate is positive. Escalates if depth keeps growing after intervention.
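The confirmation loop might look like the sketch below. The callables `read_depth` and `read_rate`, the limit of 12 checks, and the rule of escalating after 3 consecutive increases are all assumptions for illustration.

```python
import time


def watch_drain(read_depth, read_rate, threshold,
                interval_s=5, max_checks=12, sleep=time.sleep):
    """Poll the queue and return 'resolved' or 'escalate'."""
    previous = read_depth()
    growth_streak = 0
    for _ in range(max_checks):
        sleep(interval_s)
        depth, rate = read_depth(), read_rate()
        # Resolved: depth is back under the threshold and consumers are working.
        if depth < threshold and rate > 0:
            return "resolved"
        # Count consecutive polls where depth increased.
        growth_streak = growth_streak + 1 if depth > previous else 0
        if growth_streak >= 3:
            return "escalate"
        previous = depth
    # Still above the threshold after the polling budget: hand off to a human.
    return "escalate"


# Replay fixed depth readings without real waiting.
depths = iter([4821, 4440, 3200, 142])
print(watch_drain(lambda: next(depths), lambda: 380, threshold=1000,
                  sleep=lambda s: None))  # resolved
```

Passing `sleep` as a parameter lets the loop be tested instantly while production code keeps the real 5-second interval.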

Example Incident Log

incident-5487 · queue-backlog · email-send-queue

[14:22:09] ALERT Queue backlog: email-send-queue depth 4,821 messages

[14:22:09] → Matched runbook: queue-backlog

[14:22:10] Consumer rate: 0 msg/min (was 340 msg/min) — consumer stalled

[14:22:11] Checking email-worker health...

[14:22:12] → email-worker: last heartbeat 4m ago · suspected crash

[14:22:12] → Restarting email-worker (2 replicas)

[14:22:19] ✓ Workers healthy · Consumer rate: 380 msg/min

[14:22:19] → Queue draining (backlog: 4,821 → 4,440 → 3,200...)

[14:22:30] Scaling up: +2 replicas to accelerate drain

[14:22:40] ✓ RESOLVED Queue depth: 4,821 → 142 · Duration: 31s

[14:22:40] On-call: undisturbed. email-worker crash root cause logged.

Stop queue backlogs from cascading into outages

OnCallReady catches and clears queue issues before they impact users. See it live.
