Your monitoring alert fired. Load average is sitting at 47 on a 4-core server and your SSH session feels like it's running through wet concrete. Before you do anything dramatic — and definitely before you reboot a production server at 2 AM — take a breath. High load average is a symptom, not a diagnosis. This guide will help you figure out what's actually going on.
What load average actually means
Run uptime and you'll see three numbers:
14:32:41 up 42 days, 3:17, 2 users, load average: 3.42, 2.91, 2.15
Those three numbers are the average number of processes in a runnable or uninterruptible state over the last 1 minute, 5 minutes, and 15 minutes. That's it. It's a queue length, not a percentage.
Here's what trips people up: on Linux (unlike other Unix systems), load average counts both processes waiting for CPU and processes waiting for I/O. A server pegged on a slow disk will show a high load average even if your CPUs are mostly idle. This is the single most important thing to understand about load average on Linux — and the reason "high load" requires actual investigation before you know what to do about it.
First commands to run
When you land on a high-load server, run these in order. Each one narrows down where to look next.
Step 1 — Check how many cores you have
nproc
# or for full CPU topology
lscpu | grep -E '^CPU\(s\)|^Core|^Thread|^Socket'
This sets your baseline. A load of 8.0 means very different things on a 2-core VM versus a 32-core bare metal box.
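To make that comparison concrete, the 1-minute figure can be normalized by core count straight from `/proc/loadavg` (a minimal sketch; per-core load near or above 1.0 means the machine is at or past full utilization on average):

```shell
# Read the 1-minute load average and divide by the core count
read load1 _ < /proc/loadavg
cores=$(nproc)
awk -v l="$load1" -v c="$cores" \
  'BEGIN { printf "1-min load %.2f over %d cores = %.2f per core\n", l, c, l / c }'
```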
Step 2 — Look at top
top
The header line is what you're after first:
%Cpu(s): 87.3 us, 6.2 sy, 0.0 ni, 3.1 id, 2.8 wa, 0.0 hi, 0.3 si
Two columns tell the story:
- `id` (idle) — if this is near zero, your CPUs are saturated. You have a CPU-bound problem.
- `wa` (I/O wait) — if this is high (above 10–20%), processes are spending their time waiting for disk or network I/O. You have an I/O-bound problem.
Both can be high at the same time on a busy server, but usually one dominates. That's your fork in the road.
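If you want this check scriptable rather than interactive, the same counters `top` reads can be sampled from `/proc/stat` directly. A minimal sketch (irq/softirq/steal time is ignored for brevity, so the percentages are approximate):

```shell
# Sample the aggregate "cpu" line of /proc/stat twice, one second apart.
# Fields after "cpu": user nice system idle iowait irq softirq steal ...
read -r _ u1 n1 s1 i1 w1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 _ < /proc/stat
awk -v du=$((u2 - u1)) -v dn=$((n2 - n1)) -v ds=$((s2 - s1)) \
    -v di=$((i2 - i1)) -v dw=$((w2 - w1)) 'BEGIN {
  total = du + dn + ds + di + dw
  if (total > 0)
    printf "idle: %.1f%%  iowait: %.1f%%\n", 100 * di / total, 100 * dw / total
}'
```

Near-zero idle points you down the CPU-bound path below; high iowait points at the I/O-bound path.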
Step 3 — Check the process list
While still in top, press M to sort by memory, P to sort by CPU. Look for anything consuming an unexpectedly large share. A runaway process — backup job, log rotation script, cron task — is often the culprit and the easiest fix.
# If you prefer a non-interactive snapshot
ps aux --sort=-%cpu | head -15
If it's CPU-bound
Low idle, low I/O wait, one or more processes pinning the CPU. This is the cleaner case to diagnose.
# Which processes are eating CPU right now
ps aux --sort=-%cpu | head -10
# Per-CPU breakdown (useful on multi-core systems)
mpstat -P ALL 1 5
# Watch CPU usage per process over time
pidstat 1 10
Common CPU-bound causes and what to look for:
| Cause | What you'll see | Quick check |
|---|---|---|
| Runaway process / infinite loop | One process at 99%+ CPU continuously | `ps aux --sort=-%cpu \| head -5` |
| Legitimate high load (batch job, build) | Expected process using CPU, started recently | `ps aux --sort=-%cpu \| head -5` |
| Too many processes competing | Many processes each using 5–20%, `r` column in vmstat > core count | `vmstat 1 5` |
| Crypto / compression workload | openssl, gzip, tar showing high CPU | `ps aux \| grep -E 'gzip\|tar\|openssl'` |
The r column in vmstat is worth knowing — it shows how many processes are actively waiting for CPU time right now, not just over the last minute. If r consistently exceeds your core count, the CPU is genuinely saturated.
# r column = run queue length
vmstat -w 1 10
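A quick way to get that comparison without eyeballing `vmstat` output is to read `procs_running` from `/proc/stat` (a sketch; note the count includes the shell doing the reading, so expect at least 1):

```shell
# procs_running = tasks in a runnable state right now (instantaneous run queue)
cores=$(nproc)
running=$(awk '/^procs_running/ { print $2 }' /proc/stat)
echo "runnable tasks: $running, cores: $cores"
if [ "$running" -gt "$cores" ]; then
  echo "run queue exceeds core count: CPU is saturated right now"
fi
```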
If it's I/O-bound
This is trickier and more common than people expect, especially on database servers, log-heavy applications, and anything doing frequent small writes. High wa in top is your tell.
# Per-device I/O stats — the most useful tool here
iostat -xz 1 5
Look at these columns per device:
- `await` — average I/O response time in milliseconds. Under 10ms is healthy for spinning disk. Under 1ms for SSD. If you're seeing 200ms+ on an SSD, something is very wrong.
- `%util` — what percentage of time the device was busy. Above 80–90% consistently means the disk is saturated.
- `r/s`, `w/s` — reads and writes per second. High write rates combined with high `await` is a classic sign of a write bottleneck.
# Find which processes are doing the most I/O right now
iotop -o -b -n 3
# If iotop isn't installed
pidstat -d 1 5
wa in top is a per-CPU metric, and it only shows non-zero when that CPU has nothing else to do while waiting for I/O. A heavily loaded system might have low wa even with significant I/O pressure, because the CPUs are kept busy with other work. Don't rule out I/O problems just because wa looks low — check iostat directly.
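If `iostat` isn't installed, device utilization can be approximated straight from `/proc/diskstats`: field 13 is cumulative milliseconds the device spent doing I/O, so two samples one second apart give a rough `%util`. A bash sketch (partitions and whole disks both appear; only devices that did any I/O in the interval are printed):

```shell
# Approximate per-device %util by sampling /proc/diskstats twice
snap() { awk '$3 !~ /^(loop|ram)/ { print $3, $13 }' /proc/diskstats; }
before=$(snap); sleep 1; after=$(snap)
# delta ms of I/O time over a 1000 ms interval -> divide by 10 for a percentage
awk 'NR==FNR { t[$1] = $2; next }
     { d = $2 - t[$1]; if (d > 0) printf "%-10s %5.1f%% util\n", $1, d / 10 }' \
  <(echo "$before") <(echo "$after")
```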
Common I/O-bound causes
# Check if a specific directory is getting hammered
inotifywait -m -r /var/log # watch for file events (Ctrl+C to stop)
# Check for processes in uninterruptible sleep (state D)
# These are blocked waiting for I/O and count toward load average
ps aux | awk '$8 ~ /^D/ {print}'
# Count D-state processes
ps -eo state | grep -c '^D'
Processes in state D (uninterruptible sleep) are the ones actually inflating your load average. They're waiting for I/O that hasn't returned yet — often a sign of a slow or overloaded disk, NFS issues, or a dying drive.
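To see what a D-state process is actually blocked on, `ps` can print the kernel wait channel (`wchan`), which often names an NFS or block-layer function. A small sketch (on some kernels the column shows only `-` because wait channels are hidden):

```shell
# List D-state processes with the kernel function they're waiting in
ps -eo pid,state,wchan:32,comm --no-headers | awk '$2 ~ /^D/'
```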
Too many D-state processes
If you find a pile of D-state processes, the disk is almost certainly the problem. Check for hardware errors first:
# Kernel messages about disk errors
dmesg -T | grep -iE 'error|failed|timeout|reset|ata' | tail -30
# Check disk health (if smartmontools is installed)
smartctl -a /dev/sda | grep -E 'Reallocated|Pending|Uncorrectable|Temperature'
# Raw per-device I/O counters from the kernel (reads, writes, time spent on I/O)
cat /sys/block/sda/stat
If dmesg is full of ATA errors or timeout messages, you may have a failing drive. That's not a Linux problem to tune your way out of — that's a hardware replacement situation.
Don't forget swap
Memory pressure causes load average to spike in a way that's easy to misread as a CPU problem. When a system starts swapping heavily, disk I/O goes up, processes block waiting, and everything grinds. The CPU might look fine while the server is actually dying of swap exhaustion.
# Quick memory and swap check
free -h
# Is swap actively being used?
vmstat 1 5
# Watch the si (swap-in) and so (swap-out) columns
# Non-zero values = active swapping = bad
# Which processes are using swap
for f in /proc/*/status; do
awk '/^(Name|VmSwap)/{printf "%s ",$2}' "$f"
echo
done | sort -k2 -rn | head -10
Is it getting better or worse?
The three load average numbers (1, 5, 15 minute) tell you the direction of travel, which matters as much as the absolute value.
| Pattern | What it means | Urgency |
|---|---|---|
| 1min > 5min > 15min | Load is rising — something is getting worse right now | High — investigate immediately |
| 1min < 5min < 15min | Load is dropping — the worst may be over | Medium — still find the cause |
| All three roughly equal | Sustained load — has been like this for a while | Medium — likely a configuration or capacity issue |
| 1min spike, 15min normal | Short burst — batch job, cron task, traffic spike | Low — check cron logs and move on |
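The patterns above can be turned into a one-shot check by reading `/proc/loadavg` directly (a sketch; it only compares the three averages and doesn't know your core count):

```shell
# Compare the 1, 5, and 15 minute averages to classify the trend
read m1 m5 m15 _ < /proc/loadavg
awk -v a="$m1" -v b="$m5" -v c="$m15" 'BEGIN {
  if (a > b && b > c)      print "rising:  " a " > " b " > " c
  else if (a < b && b < c) print "falling: " a " < " b " < " c
  else                     print "steady/mixed: " a "  " b "  " c
}'
```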
The 5-minute checklist
When you need to move fast and don't have time to read the whole article (we've all been there):
# 1. How many cores do I have?
nproc
# 2. What's the load relative to core count?
uptime
# 3. CPU-bound or I/O-bound?
top # check %id and %wa columns
# 4. What processes are using the most CPU?
ps aux --sort=-%cpu | head -10
# 5. Is disk I/O the problem?
iostat -xz 1 3
# 6. Any processes blocked on I/O?
ps aux | awk '$8 ~ /^D/ {print}'
# 7. Is swap involved?
free -h && vmstat 1 3
# 8. Any kernel errors?
dmesg -T | grep -iE 'error|oom|killed' | tail -20
Finally, figure out what changed. Check cron logs (`grep CRON /var/log/syslog | tail -20`), look at recently modified files (`find /var /tmp -newer /proc/1 -type f 2>/dev/null | head -20`), and check for any deployments or config changes in the last hour. More often than not, something changed — it didn't just spontaneously break.
Preventing it long-term
Diagnosing high load during an incident is reactive. Once things are stable, it's worth setting up something proactive so you're not flying blind next time:
- Install `sysstat` — it runs `sar` on a cron schedule and keeps 28 days of historical CPU, I/O, and memory data. Invaluable for answering "was load always this high at 3 AM?" after the fact.
- Set load average alerts at 2× core count — anything above that deserves attention. At 4× core count, something is probably very wrong.
- Know your baseline — a server with a load of 2.0 at peak hours might be perfectly normal. You can't recognize abnormal without knowing what normal looks like.
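As a starting point, the 2× rule above can be wired into a tiny check suitable for cron or a monitoring wrapper (a sketch; the threshold and output format are assumptions to adapt to your alerting):

```shell
# Exit non-zero when the 1-minute load exceeds 2x the core count
cores=$(nproc)
read load1 _ < /proc/loadavg
awk -v l="$load1" -v t="$((cores * 2))" 'BEGIN {
  if (l > t) { printf "WARN: load %.2f exceeds threshold %d\n", l, t; exit 1 }
  printf "OK: load %.2f under threshold %d\n", l, t
}'
```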
# Install sysstat for historical data
apt install sysstat # Debian/Ubuntu
yum install sysstat # RHEL/CentOS
# View yesterday's CPU stats at hourly intervals
# (data files live under /var/log/sysstat/ on Debian/Ubuntu, /var/log/sa/ on RHEL)
sar -u -f /var/log/sysstat/sa$(date -d yesterday +%d)
# Live per-device I/O stats (5 samples, 1 second apart)
sar -d 1 5