Why does Ollama say out of memory when I have 8GB free VRAM?

The KV cache for your context window, a layer-offloading misconfiguration, or a multi-GPU misallocation bug can trigger the error even when nvidia-smi reports free memory. Start with the diagnosis flowchart below.

Is the multi-GPU OOM bug fixed?

Partially. GitHub issues #8377 and #6382 address layer misallocation across multiple cards. Check your Ollama version and the issue tracker to see if your setup is covered. If not, set `num_gpu` conservatively.

What's the difference between context length and layer offloading?

Context length (your prompt plus the response, set by `num_ctx`) determines the size of the KV cache; layer offloading (`num_gpu`) determines how many model layers are placed on the GPU. Both draw from the same VRAM pool. Reduce context, reduce layers, or add VRAM.

Can I fix this with settings instead of buying a bigger GPU?

Often yes—start with the diagnosis steps. But if the model exceeds your card's VRAM at all quantizations, the fix is a larger card or a smaller model. See 'When the fix is an upgrade.'

Fixing Ollama's 'CUDA Error: Out of Memory' — Even When You Have Free VRAM

Ollama throws CUDA Error: Out of Memory and your GPU has gigabytes of free VRAM staring back at you. The error is real, but the diagnosis is not obvious. The missing gigabytes are tied up in the KV cache for your context window, incorrect layer-offload settings, or the multi-GPU misallocation bug that has been documented in Ollama’s issue tracker since 2024. Only rarely is the honest answer: the model does not fit.

This guide walks through the diagnosis in order of likelihood and the fix for each. It is heavy on settings and light on guesswork.

The core conflict: KV cache + layer offloading = VRAM ceiling

Before troubleshooting, understand why the error happens even with free VRAM.

When you run a model in Ollama:

Model weights load into VRAM (the bulk of the memory).
Layer offloading (the num_gpu parameter) decides how many layers live on the GPU vs. CPU.
KV cache for your context window (the tokens you provide + tokens the model generates) lives in the same VRAM pool.

The KV cache is the hidden cost. A 4,096-token context on a 7B model can consume 1–2 GB of VRAM depending on cache precision, and that figure grows linearly with context length. If you set num_gpu to offload many layers (trying to maximize speed) and use a long context window, both compete for the same VRAM pool. Ollama can allocate layers successfully but then fail to reserve space for the KV cache, and that is when you get the out-of-memory error with free VRAM on the card.
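The arithmetic behind that estimate can be sketched as follows. This is a rough model, assuming a standard transformer cache that stores one key vector and one value vector per layer, per KV head, per token. The function name is illustrative, and the shape used (32 layers, 32 KV heads, head dimension 128) is that of Llama 2 7B. Real usage runs somewhat higher because of compute buffers and allocator overhead.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_element):
    """Estimate KV cache size: one key and one value vector per layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element * n_tokens

GIB = 1024 ** 3

# Llama 2 7B shape: 32 layers, 32 KV heads, head dimension 128.
fp16 = kv_cache_bytes(4096, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_element=2)
int8 = kv_cache_bytes(4096, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_element=1)

print(f"4,096 tokens, 16-bit cache: {fp16 / GIB:.1f} GiB")  # 2.0 GiB
print(f"4,096 tokens, 8-bit cache:  {int8 / GIB:.1f} GiB")  # 1.0 GiB
```

Doubling the context doubles the result, which is why a 32K context that looks harmless in a settings menu can cost more VRAM than the model weights.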

This is not a bug when it happens this way; it is a resource ceiling being hit from two directions at once. But when the same OOM happens with genuinely free VRAM (verified via nvidia-smi), the multi-GPU misallocation bug is the likely culprit.

Diagnosis flowchart: where the missing memory went

Use this table and the steps below to pinpoint your root cause.

Symptom	Likely cause	First check
High context length, OOM error	KV cache too large	Reduce `num_ctx` in your Modelfile or client options
`num_gpu` set high, OOM error	Too many layers offloaded	Reduce `num_gpu` or test default behavior
Multiple GPUs, OOM on one card while others sit free	Multi-GPU misallocation bug	Verify Ollama version; check GitHub #8377, #6382
Small model, modest context, OOM	Model exceeds VRAM at all quants	Downquantize or switch models

Step 1: Check your actual VRAM usage with nvidia-smi

Open a terminal and run:

watch -n 0.5 nvidia-smi

Run your Ollama query in another window. Watch the memory column in real time. You are looking for:

Memory used climbing to the card’s limit (12GB, 24GB, etc.) with no free pool. This means VRAM is genuinely exhausted. The model + KV cache exceed what fits.
Memory used staying well below the limit but the OOM error still fires. This is the misallocation symptom and points to the multi-GPU bug (step 4).
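If you would rather log the numbers than watch them, nvidia-smi can emit machine-readable CSV. The sketch below parses that output into per-card figures. The function names are illustrative, and the sample text stands in for real output so the parsing can be tried on a machine without a GPU.

```python
import subprocess

# Query flags for one CSV row per GPU, values in MiB with no header or units.
QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_memory(csv_text):
    """Parse nvidia-smi CSV rows into dicts with used, total, and free MiB per card."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        gpus.append({"index": index, "used_mib": used,
                     "total_mib": total, "free_mib": total - used})
    return gpus

def read_gpu_memory():
    """Run nvidia-smi once and return per-card memory figures."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)

# Made-up sample: card 0 nearly full, card 1 mostly idle.
sample = "0, 23512, 24576\n1, 3120, 24576\n"
for gpu in parse_gpu_memory(sample):
    print(f"GPU {gpu['index']}: {gpu['free_mib']} MiB free of {gpu['total_mib']}")
```

On a machine with an NVIDIA driver installed, calling read_gpu_memory() in a loop while the query runs gives the same per-card view as the watch command.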

Step 2: Reduce context length first

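The simplest fix: cap the context window Ollama allocates. The relevant setting is num_ctx; num_predict only limits how many tokens are generated and does not shrink the KV cache. Ollama's own default context is modest (2,048 tokens in older releases, larger in newer ones), but a Modelfile, client, or front end can raise it far higher. Test with an explicit, modest value from inside an interactive session:

ollama run llama2
>>> /set parameter num_ctx 2048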

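Or in a client that exposes the context parameter (e.g., Python’s ollama library):

import ollama

# Talks to the local Ollama server at its default address.
client = ollama.Client()

response = client.generate(
    model="llama2",
    prompt="your prompt here",
    stream=False,
    # num_ctx sets the context window, and with it the KV cache size.
    options={"num_ctx": 2048},
)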

If the error disappears, your KV cache was the culprit. You can then gradually increase num_ctx to find the sweet spot for your hardware.
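Finding that sweet spot is a simple upward sweep. The sketch below is one way to structure it, assuming memory use only grows with context size. The function name and the try_context callback are hypothetical, and a stand-in callback replaces the real request so the logic can be run anywhere.

```python
def largest_working_context(sizes, try_context):
    """Return the largest context size for which try_context(size) succeeds.

    Sizes are tried in ascending order and the sweep stops at the first failure,
    on the assumption that memory use only grows with context length.
    """
    best = None
    for size in sorted(sizes):
        if not try_context(size):
            break
        best = size
    return best

# Stand-in for a real request: pretend anything above 8,192 tokens runs out of memory.
def fake_try_context(size):
    return size <= 8192

print(largest_working_context([2048, 4096, 8192, 16384, 32768], fake_try_context))  # 8192
```

In practice, try_context could wrap a generate call that passes the size as the num_ctx option and returns False when the call raises an error; which exception to catch depends on your client and its version.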

Step 3: Check layer offloading (num_gpu parameter)

Ollama’s num_gpu parameter controls how many model layers run on the GPU. By default, Ollama estimates how many layers will fit and offloads that many. When the estimate is too optimistic for your context length, or the layers land unevenly across cards, the load fails.

Check what Ollama is currently doing. Create a simple Modelfile:

FROM llama2
PARAMETER num_gpu 0

and run:

ollama create test-model -f Modelfile
ollama run test-model

Setting num_gpu 0 forces all layers to CPU (slow, but safe for testing). If the error vanishes, you have found your culprit: the offloading strategy.

Then test incrementally:

FROM llama2
PARAMETER num_gpu 10

Re-run ollama create with the updated Modelfile and try again. Increment until the error returns. The highest stable value is your safe num_gpu setting for that model and context length.
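Incrementing one step at a time works, but each attempt means reloading the model. A binary search reaches the same answer in fewer reloads. The sketch below assumes that if a given layer count fits, every smaller count fits too. The function name and the loads_ok callback are hypothetical; a stand-in predicate replaces the real create-and-run cycle.

```python
def highest_stable_num_gpu(total_layers, loads_ok):
    """Binary-search the largest num_gpu in [0, total_layers] for which loads_ok is True.

    Assumes that if a given layer count fits, every smaller count fits too.
    Returns None if even num_gpu=0 fails.
    """
    if not loads_ok(0):
        return None
    low, high = 0, total_layers  # invariant: loads_ok(low) is True
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if loads_ok(mid):
            low = mid
        else:
            high = mid - 1
    return low

# Stand-in predicate: pretend 22 layers is the most that fits on this card.
print(highest_stable_num_gpu(32, lambda n: n <= 22))  # 22
```

For a 32-layer model this takes about five or six load attempts instead of up to 32. The result holds only for the context length you tested with, so repeat the search if you change it.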

For multi-GPU setups: The offloading math changes. See step 4 below.

Step 4: Diagnose the multi-GPU misallocation bug

If you have two or more GPUs and nvidia-smi shows free VRAM on card 0 or card 1 while Ollama reports OOM, the bug is likely in play.

GitHub issues to consult:

#8377 — Layer misallocation across multiple GPUs; Ollama sometimes piles layers onto one card while leaving others underused.
#6382 — CUDA OOM on multi-GPU setups despite free VRAM elsewhere.
#14632 — Ongoing reports of OOM with free VRAM in multi-GPU configurations (as of early 2026).

These issues remain open or partially addressed. The workaround is conservative layer offloading:

FROM llama2
PARAMETER num_gpu 10

Set num_gpu to a low value (10–15 layers, depending on your model size) and test. This prevents Ollama from over-allocating to a single card. If that does not work, try num_gpu 1 (minimal offloading) to confirm the bug.

If minimal offloading works, you are hitting the misallocation bug. Watch the GitHub issues for a fix, or file a new issue with:

Your nvidia-smi output during the OOM error
Ollama version (ollama --version)
The model and context you are using
Your exact Modelfile

How to fix each diagnosis

Fix 1: Reduce KV cache by lowering context

Add this to your Modelfile or client call:

FROM llama2
PARAMETER num_ctx 2048

This caps the context window at 2,048 tokens, which bounds the size of the KV cache. (num_predict, by contrast, limits only the length of the response and does not reduce the cache.) For chat, also cap the history sent to the model. Many clients have a setting like context_window or max_history_tokens; set it to 2,048 or lower and test.

Trade-off: Shorter context means the model “forgets” older parts of the conversation. For many local use cases, 2,048 tokens is plenty.

Fix 2: Reduce num_gpu layers offloaded

FROM llama2
PARAMETER num_gpu 15

Start conservative (10–20 layers, depending on your model—smaller models have fewer total layers). Test and increment until you hit the ceiling. You are trading speed for stability: fewer offloaded layers means more CPU fallback, which is slower but more stable.

If the model is small (7B or 8B), try:

PARAMETER num_gpu -1

The -1 setting, which is also Ollama's default, lets Ollama estimate how many layers fit. The estimate may leave some speed on the table compared with a hand-tuned value, but it avoids the guesswork.

Fix 3: Multi-GPU workaround

If you have two GPUs and are hitting the misallocation bug:

Verify the bug is in play: Run nvidia-smi during the OOM error and confirm free VRAM on one card and high usage on another.
Set conservative offloading:
```
PARAMETER num_gpu 15
```
Check your CUDA_VISIBLE_DEVICES to ensure Ollama sees both cards:
```
CUDA_VISIBLE_DEVICES=0,1 ollama serve
```
Monitor the issue tracker — this is a known bug and may be patched in a future Ollama release. Check #8377 and #6382 for fix status.

Fix 4: The honest branch — the model does not fit

If you have:

Set num_gpu to 0 (CPU-only, no GPU acceleration) and the error persists.
Reduced num_ctx to a small value such as 512 tokens.
Confirmed with nvidia-smi that VRAM is genuinely exhausted during the error.

Then the model exceeds your VRAM at that quantization. The fixes are:

Use a smaller quantization: switch from Q8_0 (8 bits) to Q4_K_M (4 bits). See "Which quant should I download?" for the trade-offs.
Use a smaller model — 7B instead of 13B, or 13B instead of 70B. The VRAM calculator estimates the footprint for each.
Add VRAM — a second GPU or a larger card.

Do not buy hardware if you have not tried the quantization path first. A smaller quantization often buys you room without spending money.

Understanding the multi-GPU misallocation bug

This bug deserves detail because it is the most confusing symptom.

In a two-GPU setup (e.g., two RTX 3090s), Ollama should distribute model layers across both cards. If you have 60 layers in your model, card 0 might load 30 and card 1 might load 30, splitting the load.

The bug: Ollama sometimes loads layers 0–50 onto card 0 and fewer onto card 1, leaving card 1 underutilized. Then when the KV cache grows, card 0 runs out of VRAM while card 1 has gigabytes free. You see the OOM error even though the system has plenty of free memory.

Why it is hard to diagnose: nvidia-smi shows free memory; Ollama reports out-of-memory. Both are true from their own perspective, but you end up chasing a phantom problem because you are looking at total free VRAM instead of per-card VRAM.
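The gap between the two views is easy to reproduce with made-up numbers. In the sketch below, the function names and the figures are hypothetical: card 0 is nearly full, card 1 is mostly idle, and the layers on card 0 need another 2,000 MiB for their share of the KV cache.

```python
def fits_in_total(required_mib, free_per_card_mib):
    """Naive check: compare the requirement with free memory summed over all cards."""
    return required_mib <= sum(free_per_card_mib)

def fits_on_card(required_mib, free_per_card_mib, card):
    """Per-card check: the allocation has to fit on the device that needs it."""
    return required_mib <= free_per_card_mib[card]

# Hypothetical free memory per card, in MiB.
free = [1064, 21456]
needed_on_card_0 = 2000

print(fits_in_total(needed_on_card_0, free))    # True: plenty of memory in aggregate
print(fits_on_card(needed_on_card_0, free, 0))  # False: card 0 cannot hold it
```

The first check is what you see when you glance at total free VRAM; the second is what the allocator actually faces.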

Current status:

#8377 — Reports the bug and proposes layer-distribution fixes (open as of 2026-06-29, some partial patches).
#6382 — Multi-GPU OOM despite free VRAM; discussion of root causes and workarounds.
#14632 — Continuing reports; no complete fix yet.

Check your Ollama version against these issues. If your version is mentioned as fixed, you should not hit the bug. If not, use the conservative num_gpu workaround above.

When the fix is an upgrade

Be honest with yourself. If you have:

A 12 GB RTX 3060 trying to run a 13B model at Q8_0 (which needs ~14 GB just for weights).
A single 8 GB card trying to run anything larger than a 7B Q4_K_M model.
A genuine VRAM need (weights + KV cache + offloading headroom) that exceeds your card’s capacity at all reasonable quantizations.

Then no setting adjustment will help. The fix is one of:

Downquantize — use Q4_K_M instead of Q8_0 or Q6_K. Smaller footprint, slightly lower quality.
Downsize the model — switch to a 7B instead of 13B, or 13B instead of 70B. Smaller model, faster too.
Add VRAM — if you are a few gigabytes short, a second used GPU (like a used RTX 3090) fills the gap. If the model fundamentally exceeds 24 GB, you are in 70B+ territory and the economics shift. Check out rent vs. buy for GPU break-even to see whether renting makes sense.

The VRAM calculator and "Which quant should I download?" are your friends here. Use them before deciding you need to upgrade.
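For a quick sanity check before opening a calculator, the weight footprint is roughly parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate with an illustrative function name. The 8.5 bits per weight for Q8_0 follows from its block layout, while the Q4_K_M figure is an approximate average that varies by model. It covers weights only, so add the KV cache and some headroom on top.

```python
def weight_footprint_gb(n_params_billion, bits_per_weight):
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return n_params_billion * bits_per_weight / 8

Q8_0_BPW = 8.5     # 32 int8 weights plus one 16-bit scale per block
Q4_K_M_BPW = 4.85  # approximate average; the exact mix varies by model

for name, bpw in [("Q8_0", Q8_0_BPW), ("Q4_K_M", Q4_K_M_BPW)]:
    print(f"13B at {name}: ~{weight_footprint_gb(13, bpw):.1f} GB")
```

By this estimate a 13B model at Q8_0 already exceeds a 12 GB card on weights alone, while the same model at Q4_K_M leaves a few gigabytes for the KV cache.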

Checking what is actually happening

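Create a test Modelfile with explicit settings:

FROM llama2
PARAMETER num_ctx 2048
PARAMETER num_gpu 10

The load-time messages are written by the server process, so debug logging has to be enabled there rather than on the client. Stop any running Ollama server, then start it with OLLAMA_DEBUG set (if Ollama runs as a system service, set the variable in the service environment instead):

OLLAMA_DEBUG=1 ollama serve

In a second terminal, build and run the test model:

ollama create test-model -f Modelfile
ollama run test-model "test prompt"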

Watch the logs. You will see:

How many layers are being offloaded.
Whether both GPUs (if you have two) are being used.
Memory allocation during load and inference.

Share these logs (without sensitive data) if you file a GitHub issue.

Bottom line

Ollama’s “out of memory” error with free VRAM is a real problem, but it almost always traces back to one of four causes:

KV cache inflated by a long context: reduce num_ctx.
Too many layers offloaded: lower num_gpu.
Multi-GPU misallocation bug: use conservative offloading and check the GitHub issues.
Model genuinely does not fit: downquantize or downsize the model.

Start with the diagnosis flowchart. Test each fix in order (context, then offloading, then multi-GPU workaround, then quantization). If you reach the fourth branch and find you have a genuine VRAM shortfall, that is useful information to have before you decide to buy hardware. The tool ecosystem (VRAM calculator, build planner) exists to keep you from buying wrong.