Queue Backlog Remediation

A message queue that keeps growing usually points to stalled workers or poison-pill messages, although a traffic burst can look the same at first. OnCallReady diagnoses the cause, restarts stalled consumers, moves poison messages to a dead-letter queue, and scales processors to drain the backlog before it cascades into an outage.

Avg Resolution
31s
Severity
High
Success Rate
95%
Humans Paged
0

Trigger Conditions

/queue.*(backlog|depth|growing|backed.up|lag|consumer|stall)/i

Triggers on queue-depth threshold alerts from SQS, RabbitMQ, Kafka, Redis queues, or custom monitoring. Typical alerts: "SQS queue depth 4,200 messages", "Queue consumer lag: 8,000 events", "Kafka consumer group falling behind", "Job queue backlog exceeds threshold".
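
To see how the trigger pattern behaves, here is a minimal Python sketch that applies it to the sample alert titles. The helper name matches_runbook is hypothetical, and translating the /i flag to re.IGNORECASE is an assumption about how the pattern is evaluated.

```python
import re

# Pattern copied from the trigger above; the trailing /i becomes re.IGNORECASE.
TRIGGER = re.compile(
    r"queue.*(backlog|depth|growing|backed.up|lag|consumer|stall)",
    re.IGNORECASE,
)


def matches_runbook(alert_text: str) -> bool:
    """Return True if the alert text matches the queue-backlog trigger."""
    return TRIGGER.search(alert_text) is not None


samples = [
    "SQS queue depth 4,200 messages",
    "Queue consumer lag: 8,000 events",
    "Kafka consumer group falling behind",
    "Job queue backlog exceeds threshold",
]

for text in samples:
    print(matches_runbook(text), text)
```

Checked against these titles alone, the Kafka sample does not match, because the pattern requires the word "queue" before any of the keywords. It would match only if the alert text being tested also contains "queue" somewhere ahead of "consumer".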

What the Agent Does

1

Measure queue depth and consumer rate

Checks current queue depth and consumer throughput over the last 5 minutes, then determines whether the queue is still growing (consumers down or too slow) or has stabilized (a traffic burst that consumers are keeping up with).
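
One way to picture this check is the sketch below. It is illustrative only: the function name classify_backlog, the one-reading-per-minute sampling, and the 5% tolerance are assumptions, not OnCallReady's documented logic.

```python
def classify_backlog(depths, consumed_per_min, tolerance=0.05):
    """Classify a window of readings (oldest first).

    depths: queue depth, one reading per minute (assumed sampling rate).
    consumed_per_min: messages consumed in each of those minutes.
    Returns (trend, average consumer rate in msg/min).
    """
    change = depths[-1] - depths[0]
    # Treat small movements relative to the starting depth as noise.
    band = tolerance * max(depths[0], 1)
    avg_rate = sum(consumed_per_min) / len(consumed_per_min)

    if change > band:
        trend = "growing"
    elif change < -band:
        trend = "draining"
    else:
        trend = "stable"
    return trend, avg_rate


# A stalled consumer: depth climbs while throughput is zero.
print(classify_backlog([1000, 2000, 3000, 4000, 4821], [0, 0, 0, 0, 0]))
# A burst that consumers are absorbing: depth hovers, throughput is steady.
print(classify_backlog([4000, 4100, 3950, 4020, 4010], [340] * 5))
```

A growing trend combined with a near-zero rate points at the consumers, while a stable depth with a healthy rate suggests burst traffic that needs no restart.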

2

Check consumer health

Verifies that consumer worker processes are alive and processing. Detects stalled consumers (0 messages/min despite a non-empty queue), crashed workers, and poison-pill messages blocking the consumer.
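
The distinction between a crashed and a stalled worker can be sketched as follows. The Worker fields and the 60-second heartbeat timeout are hypothetical choices for illustration; the page does not state how heartbeats are collected or what timeout is used.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    seconds_since_heartbeat: float  # assumed to come from a health endpoint
    msgs_last_minute: int           # messages processed in the last minute


def worker_state(worker: Worker, queue_depth: int,
                 heartbeat_timeout: float = 60.0) -> str:
    """Return 'crashed', 'stalled', or 'healthy' for a single worker."""
    if worker.seconds_since_heartbeat > heartbeat_timeout:
        # No recent heartbeat: the process is probably gone.
        return "crashed"
    if queue_depth > 0 and worker.msgs_last_minute == 0:
        # Alive but idle while messages are waiting.
        return "stalled"
    return "healthy"


# Mirrors the example log: last heartbeat 4 minutes ago, nothing processed.
print(worker_state(Worker("email-worker", 240.0, 0), queue_depth=4821))
```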

3

Restart stalled consumers

For stalled workers: performs a rolling restart. For poison pills (the same message retried more than 5 times): moves the message to the dead-letter queue (DLQ), restarts the consumer, and resumes processing.
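
The poison-pill rule can be illustrated with an in-memory queue. This is a sketch under stated assumptions: the message layout (id, body, retries) and the drain function are hypothetical, and a real broker such as SQS or RabbitMQ would track delivery counts itself.

```python
from collections import deque

MAX_RETRIES = 5  # a message retried more than 5 times is treated as a poison pill


def drain(queue: deque, dlq: list, handler) -> list:
    """Process every message; park poison pills in dlq instead of retrying forever.

    Returns the ids of successfully processed messages, in order.
    """
    processed = []
    while queue:
        msg = queue.popleft()
        try:
            handler(msg["body"])
            processed.append(msg["id"])
        except Exception:
            if msg["retries"] > MAX_RETRIES:
                dlq.append(msg)        # retried more than 5 times: move to DLQ
            else:
                msg["retries"] += 1
                queue.append(msg)      # requeue for another attempt
    return processed


def handler(body: str) -> None:
    if body == "bad":
        raise ValueError("unparseable payload")


queue = deque(
    {"id": f"m{i}", "body": body, "retries": 0}
    for i, body in enumerate(["ok", "bad", "ok"], start=1)
)
dlq: list = []
print(drain(queue, dlq, handler), [m["id"] for m in dlq])
```

Because the failing message is moved aside rather than retried indefinitely, the healthy messages behind it are still processed.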

4

Scale up processors if needed

If the backlog is large (more than 1,000 messages) and consumers are healthy, increases the worker replica count to drain the queue faster. Scales back down automatically once the queue clears.
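
A scaling decision of this kind might look like the sketch below. Only the 1,000-message threshold comes from the step above; the function name target_replicas, the per-replica rate, the 10-minute drain target, and the replica cap are all assumptions made for illustration.

```python
import math

BACKLOG_THRESHOLD = 1000  # from the step above: scale only above this depth


def target_replicas(depth: int, consumers_healthy: bool, baseline: int,
                    rate_per_replica: float, drain_minutes: float = 10.0,
                    max_replicas: int = 8) -> int:
    """Return the replica count to run for the current backlog.

    rate_per_replica is in msg/min. Returns baseline when no scaling is
    warranted, which also covers scaling back down once the queue clears.
    """
    if not consumers_healthy or depth <= BACKLOG_THRESHOLD or rate_per_replica <= 0:
        return baseline
    # Replicas needed to clear the backlog within the drain target.
    needed = math.ceil(depth / (rate_per_replica * drain_minutes))
    return max(baseline, min(needed, max_replicas))


# 2 baseline replicas at roughly 190 msg/min each, as in the example log.
print(target_replicas(4821, True, baseline=2, rate_per_replica=190.0))
print(target_replicas(142, True, baseline=2, rate_per_replica=190.0))
```

Returning the baseline when consumers are unhealthy reflects the ordering of the steps: adding replicas does not help until stalled or crashed workers have been restarted.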

5

Confirm queue draining

Monitors queue depth every 5 seconds. Marks the incident resolved when depth falls back below the alert threshold and the consumer rate is positive. Escalates if depth keeps growing after intervention.
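
The confirmation loop can be sketched as a small polling function. The 5-second interval and the resolution condition come from the step above; the name confirm_draining, the limit of 12 checks, and escalating after 3 consecutive increases are assumptions. The depth and rate readers are passed in as callables so the sketch does not depend on any particular queue API.

```python
import time
from typing import Callable


def confirm_draining(read_depth: Callable[[], int],
                     read_rate: Callable[[], float],
                     threshold: int,
                     interval_s: float = 5.0,
                     max_checks: int = 12,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll the queue and return 'resolved' or 'escalate'."""
    previous = read_depth()
    growth_streak = 0
    for _ in range(max_checks):
        sleep(interval_s)
        depth = read_depth()
        if depth < threshold and read_rate() > 0:
            return "resolved"
        # Count consecutive checks where the backlog grew despite intervention.
        growth_streak = growth_streak + 1 if depth > previous else 0
        if growth_streak >= 3:
            return "escalate"
        previous = depth
    # Still above threshold after max_checks: hand off to a human.
    return "escalate"


# Replay the depths from the example log without actually sleeping.
readings = iter([4821, 4440, 3200, 142])
print(confirm_draining(lambda: next(readings), lambda: 380.0,
                       threshold=500, sleep=lambda _: None))
```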

Example Incident Log

incident-5487 · queue-backlog · email-send-queue
[14:22:09] ALERT Queue backlog: email-send-queue depth 4,821 messages
[14:22:09] Matched runbook: queue-backlog
[14:22:10] Consumer rate: 0 msg/min (was 340 msg/min) — consumer stalled
[14:22:11] Checking email-worker health...
[14:22:12] email-worker: last heartbeat 4m ago · suspected crash
[14:22:12] Restarting email-worker (2 replicas)
[14:22:19] Workers healthy · Consumer rate: 380 msg/min
[14:22:19] Queue draining (backlog: 4,821 → 4,440 → 3,200...)
[14:22:30] Scaling up: +2 replicas to accelerate drain
[14:22:40] ✓ RESOLVED Queue depth: 4,821 → 142 · Duration: 31s
[14:22:40] On-call: undisturbed. email-worker crash root cause logged.

Stop queue backlogs from cascading into outages

OnCallReady catches and clears queue issues before they impact users. See it live.