Your monitoring alert fired. Load average is sitting at 47 on a 4-core server and your SSH session feels like it's running through wet concrete. Before you do anything dramatic — and definitely before you reboot a production server at 2 AM — take a breath. High load average is a symptom, not a diagnosis. This guide will help you figure out what's actually going on.

What load average actually means

Run uptime and you'll see three numbers:

 14:32:41 up 42 days,  3:17,  2 users,  load average: 3.42, 2.91, 2.15

Those three numbers are the average number of processes in a runnable or uninterruptible state over the last 1 minute, 5 minutes, and 15 minutes. That's it. It's a queue length, not a percentage.

Here's what trips people up: on Linux (unlike other Unix systems), load average counts both processes waiting for CPU and processes waiting for I/O. A server pegged on a slow disk will show a high load average even if your CPUs are mostly idle. This is the single most important thing to understand about load average on Linux — and the reason "high load" requires actual investigation before you know what to do about it.
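
You can see both populations side by side with vmstat: the r column counts processes using or waiting for a CPU, and the b column counts processes blocked in uninterruptible sleep, which on Linux usually means waiting on I/O. Both feed the load average.

# r = runnable (using or waiting for CPU), b = blocked in uninterruptible sleep (usually I/O)
vmstat 1 3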

The golden rule: divide load average by the number of CPU cores. A load of 4.0 on a single-core machine is a disaster. The same number on a 16-core machine means the CPUs are 25% busy — totally fine. Use nproc to check your core count.
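
If you want the division done for you, a quick one-liner that reads the 1-minute figure from /proc/loadavg looks like this:

# 1-minute load average divided by the number of cores
awk -v cores="$(nproc)" '{printf "load per core: %.2f\n", $1 / cores}' /proc/loadavg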

First commands to run

When you land on a high-load server, run these in order. Each one narrows down where to look next.

Step 1 — Check how many cores you have

nproc
# or for full CPU topology
lscpu | grep -E '^CPU\(s\)|^Core|^Thread|^Socket'

This sets your baseline. A load of 8.0 means very different things on a 2-core VM versus a 32-core bare metal box.

Step 2 — Look at top

top

The header line is what you're after first:

%Cpu(s): 87.3 us,  6.2 sy,  0.0 ni,  3.1 id,  2.8 wa,  0.0 hi,  0.3 si,  0.0 st

Two columns tell the story:

  • id (idle) — if this is near zero, your CPUs are saturated. You have a CPU-bound problem.
  • wa (I/O wait) — if this is high (above 10–20%), processes are spending their time waiting for disk or network I/O. You have an I/O-bound problem.

Both can be high at the same time on a busy server, but usually one dominates. That's your fork in the road.
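
If you'd rather capture that header non-interactively, say for a ticket or to diff a few samples, batch mode works on a default top configuration:

# Grab the same CPU breakdown without the interactive UI
top -bn1 | grep '%Cpu'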

Step 3 — Check the process list

While still in top, press M to sort by memory, P to sort by CPU. Look for anything consuming an unexpectedly large share. A runaway process — backup job, log rotation script, cron task — is often the culprit and the easiest fix.

# If you prefer a non-interactive snapshot
ps aux --sort=-%cpu | head -15

If it's CPU-bound

Low idle, low I/O wait, one or more processes pinning the CPU. This is the cleaner case to diagnose.

# Which processes are eating CPU right now
ps aux --sort=-%cpu | head -10

# Per-CPU breakdown (useful on multi-core systems)
mpstat -P ALL 1 5

# Watch CPU usage per process over time
pidstat 1 10

Common CPU-bound causes and what to look for:

  • Runaway process / infinite loop: one process pinned at 99%+ CPU continuously. Quick check: ps aux --sort=-%cpu | head -5
  • Legitimate high load (batch job, build): an expected process using CPU, started recently. Quick check: ps aux --sort=-%cpu | head -5
  • Too many processes competing: many processes each using 5–20%, and the r column in vmstat exceeds the core count. Quick check: vmstat 1 5
  • Crypto / compression workload: openssl, gzip, or tar showing high CPU. Quick check: ps aux | grep -E 'gzip|tar|openssl'

The r column in vmstat is worth knowing — it shows how many processes are running or waiting to run right now, not averaged over the last minute. If r consistently exceeds your core count, the CPU is genuinely saturated.

# r column = run queue length
vmstat -w 1 10
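
As a rough filter, you can have awk flag only the samples where the run queue exceeds your core count. This is a sketch that assumes the default vmstat layout, where r is the first column:

# Print only the samples where the run queue (r) exceeds the core count
vmstat 1 10 | awk -v cores="$(nproc)" 'NR > 2 && $1 > cores {print "CPU saturated:", $0}'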

If it's I/O-bound

This is trickier and more common than people expect, especially on database servers, log-heavy applications, and anything doing frequent small writes. High wa in top is your tell.

# Per-device I/O stats — the most useful tool here
iostat -xz 1 5

Look at these columns per device:

  • await — average I/O response time in milliseconds. Under 10ms is healthy for a spinning disk, under 1ms for an SSD. If you're seeing 200ms+ on an SSD, something is very wrong.
  • %util — what percentage of time the device was busy. Above 80–90% consistently means the disk is saturated.
  • r/s, w/s — reads and writes per second. High write rates combined with high await is a classic sign of a write bottleneck.

# Find which processes are doing the most I/O right now
iotop -o -b -n 3

# If iotop isn't installed
pidstat -d 1 5

Dirty secret about I/O wait: wa in top is a per-CPU metric, and it only shows non-zero when that CPU has nothing else to do while waiting for I/O. A heavily loaded system might have low wa even with significant I/O pressure, because the CPUs are kept busy with other work. Don't rule out I/O problems just because wa looks low — check iostat directly.
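
On kernels 4.20 and newer with pressure stall information enabled, /proc/pressure/io gives a more direct view: it reports the share of time tasks were stalled on I/O, regardless of whether the CPUs were otherwise busy.

# Pressure stall information for I/O (kernel 4.20+, PSI enabled)
cat /proc/pressure/io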

Common I/O-bound causes

# Check if a specific directory is getting hammered (needs inotify-tools)
inotifywait -m -r /var/log   # watch for file events (Ctrl+C to stop)

# Check for processes in uninterruptible sleep (state D)
# These are blocked waiting for I/O and count toward load average
ps aux | awk '$8 ~ /^D/ {print}'

# Count D-state processes
ps -eo state | grep -c '^D'

Processes in state D (uninterruptible sleep) are the ones actually inflating your load average. They're waiting for I/O that hasn't returned yet — often a sign of a slow or overloaded disk, NFS issues, or a dying drive.
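
To see not only which processes are stuck in D state but roughly where in the kernel they're blocked, the wchan field helps. A small sketch:

# D-state processes with the kernel function they're currently blocked in
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'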

Too many D-state processes

If you find a pile of D-state processes, the disk is almost certainly the problem. Check for hardware errors first:

# Kernel messages about disk errors
dmesg -T | grep -iE 'error|failed|timeout|reset|ata' | tail -30

# Check disk health (if smartmontools is installed)
smartctl -a /dev/sda | grep -E 'Reallocated|Pending|Uncorrectable|Temperature'

# Raw I/O counters for the device (completed reads/writes, sectors, time spent on I/O)
cat /sys/block/sda/stat

If dmesg is full of ATA errors or timeout messages, you may have a failing drive. That's not a Linux problem to tune your way out of — that's a hardware replacement situation.

Don't forget swap

Memory pressure causes load average to spike in a way that's easy to misread as a CPU problem. When a system starts swapping heavily, disk I/O goes up, processes block waiting, and everything grinds. The CPU might look fine while the server is actually dying of swap exhaustion.

# Quick memory and swap check
free -h

# Is swap actively being used?
vmstat 1 5
# Watch the si (swap-in) and so (swap-out) columns
# Non-zero values = active swapping = bad

# Which processes are using swap (VmSwap values are in kB)
for f in /proc/*/status; do
  awk '/^(Name|VmSwap)/{printf "%s ",$2}' "$f"
  echo
done | sort -k2 -rn | head -10

The three load average numbers (1, 5, 15 minute) tell you the direction of travel, which matters as much as the absolute value.

  • 1min > 5min > 15min: load is rising, something is getting worse right now. Urgency: high, investigate immediately.
  • 1min < 5min < 15min: load is dropping, the worst may be over. Urgency: medium, still find the cause.
  • All three roughly equal: sustained load, it has been like this for a while. Urgency: medium, likely a configuration or capacity issue.
  • 1min spike, 15min normal: short burst from a batch job, cron task, or traffic spike. Urgency: low, check cron logs and move on.
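
To watch which way the numbers are moving while you work, something as simple as this is enough:

# Re-print the load averages every 10 seconds
watch -n 10 cat /proc/loadavg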

The 5-minute checklist

When you need to move fast and don't have time to read the whole article (we've all been there):

# 1. How many cores do I have?
nproc

# 2. What's the load relative to core count?
uptime

# 3. CPU-bound or I/O-bound?
top   # check %id and %wa columns

# 4. What processes are using the most CPU?
ps aux --sort=-%cpu | head -10

# 5. Is disk I/O the problem?
iostat -xz 1 3

# 6. Any processes blocked on I/O?
ps aux | awk '$8 ~ /^D/ {print}'

# 7. Is swap involved?
free -h && vmstat 1 3

# 8. Any kernel errors?
dmesg -T | grep -iE 'error|oom|killed' | tail -20

When load is high but you can't find the cause: check whether a cron job ran recently (grep CRON /var/log/syslog | tail -20), look for recently modified files (find /var /tmp -type f -mmin -60 2>/dev/null | head -20), and check for any deployments or config changes in the last hour. More often than not, something changed — it didn't just spontaneously break.

Preventing it long-term

Diagnosing high load during an incident is reactive. Once things are stable, it's worth setting up something proactive so you're not flying blind next time:

  • Install sysstat — it runs sar data collection on a schedule and keeps historical CPU, I/O, and memory data (retention is set by the HISTORY option in its config, typically 7 or 28 days depending on the distro). Invaluable for answering "was load always this high at 3 AM?" after the fact.
  • Set load average alerts at 2× core count — anything above that deserves attention. At 4× core count, something is probably very wrong. A minimal threshold-check sketch is at the end of this section.
  • Know your baseline — a server with a load of 2.0 at peak hours might be perfectly normal. You can't recognize abnormal without knowing what normal looks like.

# Install sysstat for historical data
apt install sysstat        # Debian/Ubuntu
yum install sysstat        # RHEL/CentOS

# View yesterday's CPU stats (on RHEL-family systems the files live in /var/log/sa/ instead)
sar -u -f /var/log/sysstat/sa$(date -d yesterday +%d)

# Live per-device I/O stats, five 1-second samples
sar -d 1 5
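
For the 2× alert threshold mentioned above, a minimal check you could wire into cron or an existing alerting script might look like this (the alerting action itself is just a placeholder echo):

# Print a warning if the 1-minute load exceeds 2x the core count
cores=$(nproc)
load1=$(awk '{print $1}' /proc/loadavg)
if awk -v l="$load1" -v c="$cores" 'BEGIN { exit !(l > 2 * c) }'; then
  echo "WARNING: 1-minute load $load1 exceeds 2x core count ($cores)"
fi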
