A growing message queue usually points to stalled workers or poison-pill messages. OnCallReady diagnoses the cause, restarts stalled consumers, moves poison messages to a dead-letter queue (DLQ), and scales processors to drain the backlog before it cascades into an outage.
Triggers on queue depth threshold alerts from SQS, RabbitMQ, Kafka, Redis queues, or custom monitoring. Typical alerts: "SQS queue depth 4,200 messages", "Queue consumer lag: 8,000 events", "Kafka consumer group falling behind", "Job queue backlog exceeds threshold".
Checks current queue depth and consumer throughput over the last 5 minutes. Calculates whether the queue is still growing (consumers are down or too slow) or has stabilized (a traffic burst that consumers are absorbing).
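The growing-versus-stable check can be pictured with a small sketch. This is an illustration only, not OnCallReady's actual logic: the function name `classify_trend` and the 5% `tolerance` are hypothetical, and the samples are assumed to be queue-depth readings taken across the 5-minute window, oldest first.

```python
from typing import Sequence


def classify_trend(depths: Sequence[int], tolerance: float = 0.05) -> str:
    """Classify depth samples (oldest first) as 'growing', 'draining', or 'stable'.

    `tolerance` is an assumed relative change below which the queue is
    treated as stable; it is not a value taken from the product.
    """
    if len(depths) < 2:
        raise ValueError("need at least two samples to compute a trend")
    first, last = depths[0], depths[-1]
    baseline = max(first, 1)  # avoid dividing by zero on an empty queue
    change = (last - first) / baseline
    if change > tolerance:
        return "growing"
    if change < -tolerance:
        return "draining"
    return "stable"
```

Comparing only the first and last samples keeps the sketch short; a real check would more likely fit a slope over all samples and compare it against consumer throughput.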
Verifies consumer worker processes are alive and processing. Detects stalled consumers (0 messages/min despite queue depth), crashed workers, and poison-pill messages blocking the consumer.
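As a rough sketch of the consumer health check, assuming each worker reports a liveness flag and a messages-per-minute rate (the `ConsumerStats` type and `diagnose` function are hypothetical names used only for illustration):

```python
from dataclasses import dataclass


@dataclass
class ConsumerStats:
    """Hypothetical per-worker snapshot; field names are assumptions."""
    name: str
    alive: bool
    messages_per_min: float


def diagnose(consumer: ConsumerStats, queue_depth: int) -> str:
    """Return 'crashed', 'stalled', or 'healthy' for one consumer."""
    if not consumer.alive:
        return "crashed"
    # Alive but processing nothing while messages are waiting: stalled.
    if queue_depth > 0 and consumer.messages_per_min == 0:
        return "stalled"
    return "healthy"
```

A consumer that is idle because the queue is empty is reported as healthy, which is why the queue depth is part of the check.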
For stalled workers: performs a rolling restart. For poison-pill messages (the same message retried more than 5 times): moves the message to the DLQ, restarts the consumer, and resumes processing.
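The poison-pill rule (retried more than 5 times, then moved to the DLQ) can be sketched against an in-memory queue. The message shape (`{"id": ..., "retries": ...}`) and the function name `quarantine_poison` are assumptions for illustration; a real broker would expose retry counts and dead-lettering through its own API.

```python
from collections import deque
from typing import Deque, Dict, List

MAX_RETRIES = 5  # threshold stated in the text: retried more than 5 times


def quarantine_poison(queue: Deque[Dict], dlq: List[Dict],
                      max_retries: int = MAX_RETRIES) -> int:
    """Move messages retried more than `max_retries` times to `dlq`.

    Mutates `queue` in place, preserves the order of the remaining
    messages, and returns how many messages were moved.
    """
    kept: Deque[Dict] = deque()
    moved = 0
    while queue:
        msg = queue.popleft()
        if msg["retries"] > max_retries:
            dlq.append(msg)
            moved += 1
        else:
            kept.append(msg)
    queue.extend(kept)
    return moved
```

Moving the message rather than deleting it keeps the payload available for later inspection while unblocking the consumer.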
If the backlog is large (>1000 messages) and consumers are healthy, increases the worker replica count to drain the queue faster. Scales back down automatically once the queue clears.
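One way to picture the scaling decision is a function that sizes the worker pool so the backlog drains within a target time. Everything beyond the 1000-message threshold is an assumption here: the name `target_replicas`, the 10-minute drain target, the replica cap, and the simplification that incoming traffic is ignored.

```python
import math


def target_replicas(depth: int, current: int, per_worker_rate: float,
                    baseline: int, drain_minutes: float = 10.0,
                    threshold: int = 1000, max_replicas: int = 20) -> int:
    """Suggest a replica count for healthy consumers.

    per_worker_rate is messages per minute per worker (assumed known).
    Incoming traffic is ignored, so this underestimates under load.
    """
    if per_worker_rate <= 0:
        raise ValueError("per_worker_rate must be positive")
    if depth == 0:
        return baseline  # queue cleared: scale back down
    if depth <= threshold:
        return current  # backlog is not large enough to act on
    needed = math.ceil(depth / (per_worker_rate * drain_minutes))
    # Never scale below the current count, and never past the cap.
    return max(current, min(needed, max_replicas))
```

For example, a 4,200-message backlog with workers handling 60 messages per minute needs 7 workers to drain in about 10 minutes under these assumptions.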
Monitors queue depth every 5 seconds. Marks the incident resolved when depth returns below the threshold and the consumer rate is positive. Escalates if depth continues growing after intervention.
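The resolve-or-escalate rule can be sketched as a pure function evaluated on each 5-second poll. The name `next_state` and the three state labels are hypothetical, and a real implementation would likely require several consecutive readings before escalating.

```python
def next_state(depth: int, previous_depth: int,
               threshold: int, consumer_rate: float) -> str:
    """Decide the incident state from two consecutive depth readings.

    Returns 'resolved', 'escalate', or 'monitoring'.
    """
    # Resolved: depth is back under the alert threshold and work is flowing.
    if depth < threshold and consumer_rate > 0:
        return "resolved"
    # Still growing after intervention: hand off to a human.
    if depth > previous_depth:
        return "escalate"
    return "monitoring"
```

Keeping the decision separate from the polling loop makes it easy to test without waiting on timers.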
OnCallReady catches and clears queue issues before they impact users. See it live.