Critical

Disk Full Remediation

When disk usage breaches threshold, OnCallReady identifies the top space consumers, rotates stale logs, purges temp files, and verifies the volume is healthy — before any human is paged.

Avg Resolution
28s
Severity
Critical
Success Rate
98%
Humans Paged
0

Trigger Conditions

/disk.*(full|usage|capacity|alert|critical|high|9[0-9]%)/i

Fires on any alert containing disk-related keywords paired with severity indicators. Matches Datadog, CloudWatch, Prometheus, PagerDuty, and custom webhook payloads. Typical triggers: "Disk usage at 92% on prod-web-03", "Volume /data is full", "CRITICAL: Disk capacity exceeded on api-server-1".

What the Agent Does

1

Parse alert metadata

Extracts host, volume path, and current usage percentage from the alert payload. Determines which filesystem is affected.

2

Identify top space consumers

Scans /var/log, /tmp, application cache directories, and container overlay layers. Ranks directories by size to target the highest-impact cleanup actions first.

3

Rotate and compress stale logs

Compresses log files older than 24 hours using gzip. Removes rotated logs beyond the 7-day retention window. Preserves current active log files.

4

Purge temporary files

Clears /tmp directories of files untouched for >6 hours. Removes dangling Docker images and stopped containers if Docker is present. Purges package manager caches.

5

Verify health and report

Re-checks disk usage. If below 80%, marks incident resolved and logs the bytes freed. If above 80%, escalates to on-call with full context and remediation log attached.

Example Incident Log

incident-4821 · disk-full · prod-web-03
[03:14:07] ALERT Disk usage at 93% on prod-web-03 (/dev/sda1)
[03:14:07] Matched runbook: disk-full
[03:14:08] Scanning top consumers on prod-web-03:/dev/sda1...
[03:14:09] Found: /var/log/nginx (14.2 GB), /tmp (3.8 GB), /var/cache/apt (1.1 GB)
[03:14:10] Rotating logs: /var/log/nginx/*.log older than 24h
[03:14:14] Compressed 847 log files → freed 11.3 GB
[03:14:15] Purging /tmp (files >6h old)
[03:14:17] Purged /tmp → freed 3.1 GB
[03:14:18] Clearing apt cache
[03:14:19] Freed 1.1 GB from package cache
[03:14:20] Verifying disk health...
[03:14:21] ✓ RESOLVED Disk usage: 93% → 58% · Freed: 15.5 GB · Duration: 28s
[03:14:21] On-call team: undisturbed. Audit log saved.

Auto-resolve disk alerts in your stack

Connect OnCallReady to your monitoring tools in minutes. No more 3am disk pages.