Critical

How to resolve disk full — automated runbook

When disk usage breaches the alert threshold, OnCallReady identifies the top space consumers, rotates stale logs, purges temp files, and verifies that the volume is healthy, all before any human is paged.

Avg Resolution: 28s

Severity: Critical

Success Rate: 98%

Humans Paged

Trigger Conditions

Fires on any alert that contains a disk-related keyword together with a severity indicator. Matches payloads from Datadog, CloudWatch, Prometheus, PagerDuty, and custom webhooks. Typical triggers: "Disk usage at 92% on prod-web-03", "Volume /data is full", "CRITICAL: Disk capacity exceeded on api-server-1".
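A minimal sketch of the keyword-plus-severity match described above, written as a POSIX shell function. The function name matches_disk_full and both pattern lists are illustrative assumptions; the product's actual matching rules are not documented here.

```shell
# Hypothetical matcher: succeeds (exit 0) only when the alert text contains
# BOTH a disk-related keyword AND a severity indicator. Case-insensitive.
# The two pattern lists are assumptions chosen to cover the sample triggers.
matches_disk_full() {
  printf '%s\n' "$1" | grep -Eqi 'disk|volume|filesystem' &&
    printf '%s\n' "$1" | grep -Eqi 'critical|full|exceeded|(9[0-9]|100)%'
}

# Usage: route the alert to the disk-full runbook when it matches.
if matches_disk_full "Disk usage at 92% on prod-web-03"; then
  echo "matched runbook: disk-full"
fi
```

Requiring both patterns keeps alerts that merely mention a disk, without any sign of urgency, from firing the runbook.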

What the Agent Does

Parse alert metadata

Extracts host, volume path, and current usage percentage from the alert payload. Determines which filesystem is affected.
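As an illustration of this parsing step, the sketch below pulls the usage percentage, host, and device out of an alert shaped like the sample triggers. The helper names (parse_usage, parse_host, parse_device) are hypothetical, and the regular expressions assume the "… at NN% on HOST (/dev/…)" wording; other payload formats would need their own patterns.

```shell
# Hypothetical parsers for alerts shaped like:
#   "Disk usage at 93% on prod-web-03 (/dev/sda1)"

# Print the usage percentage (digits only).
parse_usage() {
  printf '%s\n' "$1" | sed -nE 's/.* ([0-9]{1,3})%.*/\1/p'
}

# Print the host name that follows the last " on ".
parse_host() {
  printf '%s\n' "$1" | sed -nE 's/.* on ([A-Za-z0-9._-]+).*/\1/p'
}

# Print the path found inside parentheses, if any.
parse_device() {
  printf '%s\n' "$1" | sed -nE 's#.*\((/[^)]+)\).*#\1#p'
}

alert='Disk usage at 93% on prod-web-03 (/dev/sda1)'
echo "host=$(parse_host "$alert") usage=$(parse_usage "$alert") device=$(parse_device "$alert")"
```

Each helper prints nothing when its pattern does not match, so a caller can treat an empty result as "field not present in this alert".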

Identify top space consumers

Scans /var/log, /tmp, application cache directories, and container overlay layers. Ranks directories by size to target the highest-impact cleanup actions first.
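The ranking step can be approximated with du and sort. This is a sketch only: top_consumers is a hypothetical helper name, it looks at the immediate children of one directory, and it skips hidden entries because of the /* glob.

```shell
# Hypothetical helper: list the largest immediate children of a directory,
# biggest first. Sizes are in KiB; -x keeps du on a single filesystem.
top_consumers() {
  # $1: directory to scan; $2: number of entries to print (default 10)
  du -skx "$1"/* 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Example (not run here): top_consumers /var/log 5
```

Sorting on raw KiB values with sort -rn avoids depending on the human-readable sort -h option.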

Rotate and compress stale logs

Compresses log files older than 24 hours using gzip. Removes rotated logs beyond the 7-day retention window. Preserves current active log files.
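The rotation policy above (compress after 24 hours, delete after 7 days, leave active files alone) could be sketched as follows. rotate_stale_logs is a hypothetical name, and the sketch assumes GNU or BSD find (for -mmin and -delete) and that "older" means last-modified time.

```shell
# Hypothetical helper implementing the stated policy for one log directory.
rotate_stale_logs() {
  # $1: log directory
  # Compress *.log files not modified in the last 24 hours (1440 minutes).
  # Recently written (active) logs do not match and are left untouched.
  find "$1" -type f -name '*.log' -mmin +1440 -exec gzip {} +
  # Delete compressed archives older than 7 days (10080 minutes).
  # gzip keeps the original modification time, so age is still meaningful.
  find "$1" -type f -name '*.gz' -mmin +10080 -delete
}

# Example (not run here): rotate_stale_logs /var/log/nginx
```

Using -mmin rather than -mtime sidesteps find's whole-day truncation, where -mtime +1 only matches files at least two days old.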

Purge temporary files

Clears /tmp directories of files untouched for >6 hours. Removes dangling Docker images and stopped containers if Docker is present. Purges package manager caches.
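A rough sketch of the purge step, under two assumptions: "untouched" is interpreted as last-modified time, and the Docker cleanup is limited to stopped containers and dangling images as the text describes. The function names purge_stale_tmp and prune_docker are hypothetical.

```shell
# Hypothetical helper: delete regular files under a temp directory that have
# not been modified for longer than the given age.
purge_stale_tmp() {
  # $1: temp directory; $2: max age in minutes (default 360 = 6 hours)
  find "$1" -xdev -type f -mmin "+${2:-360}" -delete
}

# Hypothetical helper: only touches Docker when the CLI is installed.
prune_docker() {
  if command -v docker >/dev/null 2>&1; then
    docker container prune -f   # removes stopped containers
    docker image prune -f       # removes dangling images only
  fi
}

# Example (not run here): purge_stale_tmp /tmp && prune_docker
```

Directories are left in place on purpose; some applications expect their temp subdirectories to exist even when empty.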

Verify health and report

Re-checks disk usage. If usage is below 80%, marks the incident resolved and logs the bytes freed. Otherwise, escalates to the on-call engineer with full context and the remediation log attached.
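The resolve-or-escalate decision can be sketched with df and a threshold comparison. Both function names are hypothetical, and treating exactly 80% as "escalate" is an assumption; only the below-80% case is stated explicitly.

```shell
# Hypothetical helper: print the usage percentage (digits only) of the
# filesystem that holds the given path. df -P guarantees one line per entry.
usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: decide the outcome from a usage percentage.
verdict() {
  # $1: usage percentage; $2: threshold (default 80)
  if [ "$1" -lt "${2:-80}" ]; then
    echo resolved
  else
    echo escalate
  fi
}

# Example (not run here): verdict "$(usage_pct /var/log)"
```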

Example Incident Log

incident-4821 · disk-full · prod-web-03

[03:14:07] ALERT Disk usage at 93% on prod-web-03 (/dev/sda1)

[03:14:07] → Matched runbook: disk-full

[03:14:08] Scanning top consumers on prod-web-03:/dev/sda1...

[03:14:09] Found: /var/log/nginx (14.2 GB), /tmp (3.8 GB), /var/cache/apt (1.1 GB)

[03:14:10] → Rotating logs: /var/log/nginx/*.log older than 24h

[03:14:14] ✓ Compressed 847 log files → freed 11.3 GB

[03:14:15] → Purging /tmp (files >6h old)

[03:14:17] ✓ Purged /tmp → freed 3.1 GB

[03:14:18] → Clearing apt cache

[03:14:19] ✓ Freed 1.1 GB from package cache

[03:14:20] Verifying disk health...

[03:14:21] ✓ RESOLVED Disk usage: 93% → 58% · Freed: 15.5 GB · Duration: 14s

[03:14:21] On-call team: undisturbed. Audit log saved.

Symptoms

Disk usage at 93% on prod-web-03 (/dev/sda1)
CRITICAL: Volume /data is full on api-server-1
Datadog: system.disk.in_use > 0.9 for >5m on host prod-db-3
PagerDuty: Disk capacity exceeded on api-server-1 — CRITICAL
Prometheus: node_filesystem_avail_bytes < 2GB on /var/lib
CloudWatch: RDS FreeStorageSpace < 5GB on prod-pg-1

Root cause diagnostic tree

1. df -h — which filesystem is full?
   ├── /var/log partition → rotated logs not trimmed
   ├── /tmp partition → app temp files not cleaned
   └── root volume with /var/lib/docker → container overlay bloat

2. du -shx /* | sort -hr | head -10
   └── Locate the top consumer directory

3. Check file recency:
   ├── find /var/log -mtime +7 -ls
   └── If more than 5 GB of logs are older than 7 days → logrotate is likely failing or misconfigured

4. Capacity relief:
   ├── Compress with gzip + remove >7d logs
   ├── Prune Docker: docker system prune -af --filter "until=72h"
   └── Last resort: expand the EBS volume, then grow the partition and filesystem (growpart, then resize2fs for ext4 or xfs_growfs for XFS)
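Step 3 of the tree compares the volume of old logs against a 5 GB cutoff. A sketch of that check follows; stale_log_bytes is a hypothetical helper, and it relies on GNU find's -printf, so it will not work with BSD find as written.

```shell
# Hypothetical helper: total size in bytes of files older than N days.
# Note: find truncates ages to whole days, so "+7" means 8 days or more.
stale_log_bytes() {
  # $1: log directory; $2: minimum age in days
  find "$1" -type f -mtime "+$2" -printf '%s\n' 2>/dev/null |
    awk '{ sum += $1 } END { printf "%.0f\n", sum }'
}

# Example (not run here): flag a probable logrotate problem above 5 GiB.
#   limit=$((5 * 1024 * 1024 * 1024))
#   [ "$(stale_log_bytes /var/log 7)" -gt "$limit" ] && echo "check logrotate"
```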

Manual remediation steps

# 1. Identify what's eating the disk
df -h
sudo du -shx /* 2>/dev/null | sort -hr | head -10

# 2. Vacuum rotated journals
sudo journalctl --vacuum-time=7d

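# 3. Compress app logs untouched for 24h, then prune archives older than 7 days
#    (-mmin +1440 is used because -mtime +1 only matches files 2+ days old)
sudo find /var/log -type f -name "*.log" -mmin +1440 -exec gzip {} +
sudo find /var/log -type f -name "*.gz" -mtime +7 -delete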

# 4. Trim Docker overlay layers
sudo docker system prune -af --filter "until=72h"

# 5. Clear the package cache (run the one that matches your distro)
sudo apt-get clean      # Debian / Ubuntu
sudo yum clean all      # RHEL / CentOS / Amazon Linux

# 6. Verify usage is back below 80% (swap / for the affected mount point)
df -h /

How OnCallReady's agent handles this

Playground preview

Auto-remediation plan for disk full

Our agent connects to the affected host over SSH (or SSM), runs df -h and du -shx /* | sort -hr | head, and ranks the candidate cleanup actions by expected bytes freed. It vacuums systemd journals, gzips stale app logs, prunes Docker overlays, and re-checks usage, escalating only if disk usage is still at or above 80% after cleanup.

