Fixing Ollama's 'CUDA Error: Out of Memory' — Even When You Have Free VRAM
Ollama throws CUDA Error: Out of Memory and your GPU has gigabytes of free VRAM staring back at you. The error is real, but the diagnosis is not obvious. The missing gigabytes are tied up in the KV cache for your context window, incorrect layer-offload settings, or the multi-GPU misallocation bug that has been documented in Ollama’s issue tracker since 2024. Only rarely is the honest answer: the model does not fit.
This guide walks through the diagnosis in order of likelihood and the fix for each. It is heavy on settings and light on guesswork.
The core conflict: KV cache + layer offloading = VRAM ceiling
Before troubleshooting, understand why the error happens even with free VRAM.
When you run a model in Ollama:
- Model weights load into VRAM (the bulk of the memory).
- Layer offloading (the
num_gpuparameter) decides how many layers live on the GPU vs. CPU. - KV cache for your context window (the tokens you provide + tokens the model generates) lives in the same VRAM pool.
The KV cache is the hidden cost. A 4,096-token context on a 7B model can consume 1–2 GB of VRAM depending on precision, and that memory compounds as your context grows. If you set num_gpu to offload many layers (trying to maximize speed) and use a long context window, both compete for the same VRAM pool. Ollama can allocate layers successfully but fail to reserve space for the KV cache — and that is when you get the out-of-memory error with free VRAM on the card.
This is not a bug when it happens this way; it is a resource ceiling being hit from two directions at once. But when the same OOM happens with genuinely free VRAM (verified via nvidia-smi), the multi-GPU misallocation bug is the likely culprit.
Diagnosis flowchart: where the missing memory went
Use this table and the steps below to pinpoint your root cause.
| Symptom | Likely cause | First check |
|---|---|---|
| High context length, OOM error | KV cache too large | Reduce context_length in Ollama.modelfile |
num_gpu set high, OOM error | Too many layers offloaded | Reduce num_gpu or test default behavior |
| Multiple GPUs, OOM on one card while others sit free | Multi-GPU misallocation bug | Verify Ollama version; check GitHub #8377, #6382 |
| Small model, modest context, OOM | Model exceeds VRAM at all quants | Downquantize or switch models |
Step 1: Check your actual VRAM usage with nvidia-smi
Open a terminal and run:
watch -n 0.5 nvidia-smi
Run your Ollama query in another window. Watch the memory column in real time. You are looking for:
- Memory used climbing to the card’s limit (12GB, 24GB, etc.) with no free pool. This means VRAM is genuinely exhausted. The model + KV cache exceed what fits.
- Memory used staying well below the limit but the OOM error still fires. This is the misallocation symptom and points to the multi-GPU bug (step 3).
Step 2: Reduce context length first
The simplest fix: cap the context window your client sends to Ollama. If you are using the Ollama chat CLI, the default context is very large (often 2,048+ tokens). Test with a much shorter context:
ollama run llama2 --num_predict 512
Or in a client that exposes the context parameter (e.g., python’s ollama library):
response = client.generate(
model="llama2",
prompt="your prompt here",
stream=False,
options={"num_predict": 512}
)
If the error disappears, your KV cache was the culprit. You can then gradually increase num_predict to find the sweet spot for your hardware.
Step 3: Check layer offloading (num_gpu parameter)
Ollama’s num_gpu parameter controls how many model layers run on the GPU. By default, Ollama tries to offload as many layers as possible, and this is where things break under certain conditions.
Check what Ollama is currently doing. Create a simple Modelfile:
FROM llama2
PARAMETER num_gpu 0
and run:
ollama create test-model -f Modelfile
ollama run test-model
Setting num_gpu 0 forces all layers to CPU (slow, but safe for testing). If the error vanishes, you have found your culprit: the offloading strategy.
Then test incrementally:
FROM llama2
PARAMETER num_gpu 10
Restart and try again. Increment until the error returns. The highest stable value is your safe num_gpu setting for that model and context length.
For multi-GPU setups: The offloading math changes. See step 4 below.
Step 4: Diagnose the multi-GPU misallocation bug
If you have two or more GPUs and nvidia-smi shows free VRAM on card 0 or card 1 while Ollama reports OOM, the bug is likely in play.
GitHub issues to consult:
- #8377 — Layer misallocation across multiple GPUs; Ollama sometimes piles layers onto one card while leaving others underused.
- #6382 — CUDA OOM on multi-GPU setups despite free VRAM elsewhere.
- #14632 — Ongoing reports of OOM with free VRAM in multi-GPU configurations (as of early 2026).
These issues remain open or partially addressed. The workaround is conservative layer offloading:
FROM llama2
PARAMETER num_gpu 10
Set num_gpu to a low value (10–15 layers, depending on your model size) and test. This prevents Ollama from over-allocating to a single card. If that does not work, try num_gpu 1 (minimal offloading) to confirm the bug.
If minimal offloading works, you are hitting the misallocation bug. Watch the GitHub issues for a fix, or file a new issue with:
- Your
nvidia-smioutput during the OOM error - Ollama version (
ollama --version) - The model and context you are using
- Your exact Modelfile
How to fix each diagnosis
Fix 1: Reduce KV cache by lowering context
Add this to your Modelfile or client call:
FROM llama2
PARAMETER num_predict 512
This limits the model to a 512-token response window. For chat, also cap the history sent to the model. Many clients have a setting like context_window or max_history_tokens — set it to 2,048 or lower and test.
Trade-off: Shorter context means the model “forgets” older parts of the conversation. For many local use cases, 2,048 tokens is plenty.
Fix 2: Reduce num_gpu layers offloaded
FROM llama2
PARAMETER num_gpu 15
Start conservative (10–20 layers, depending on your model—smaller models have fewer total layers). Test and increment until you hit the ceiling. You are trading speed for stability: fewer offloaded layers means more CPU fallback, which is slower but more stable.
If the model is small (7B or 8B), try:
PARAMETER num_gpu -1
The -1 setting lets Ollama estimate the safe number. It is slower than manually optimizing but avoids the guesswork.
Fix 3: Multi-GPU workaround
If you have two GPUs and are hitting the misallocation bug:
- Verify the bug is in play: Run
nvidia-smiduring the OOM error and confirm free VRAM on one card and high usage on another. - Set conservative offloading:
PARAMETER num_gpu 15 - Check your CUDA_VISIBLE_DEVICES to ensure Ollama sees both cards:
CUDA_VISIBLE_DEVICES=0,1 ollama serve - Monitor the issue tracker — this is a known bug and may be patched in a future Ollama release. Check #8377 and #6382 for fix status.
Fix 4: The honest branch — the model does not fit
If you have:
- Set
num_gputo 0 (CPU-only, no GPU acceleration) and the error persists. - Reduced
num_predictto 100 tokens. - Confirmed with
nvidia-smithat VRAM is genuinely exhausted during the error.
Then the model exceeds your VRAM at that quantization. The fixes are:
- Use a smaller quantization — switch from Q8_0 (8 bits) to Q4_K_M (4 bits). See which quant should I download for the trade-offs.
- Use a smaller model — 7B instead of 13B, or 13B instead of 70B. The VRAM calculator estimates the footprint for each.
- Add VRAM — a second GPU or a larger card.
Do not buy hardware if you have not tried the quantization path first. A smaller quantization often buys you room without spending money.
Understanding the multi-GPU misallocation bug
This bug deserves detail because it is the most confusing symptom.
In a two-GPU setup (e.g., two RTX 3090s), Ollama should distribute model layers across both cards. If you have 60 layers in your model, card 0 might load 30 and card 1 might load 30, splitting the load.
The bug: Ollama sometimes loads layers 0–50 onto card 0 and fewer onto card 1, leaving card 1 underutilized. Then when the KV cache grows, card 0 runs out of VRAM while card 1 has gigabytes free. You see the OOM error even though the system has plenty of free memory.
Why it is hard to diagnose: NVIDIA-smi shows the free memory; Ollama reports out-of-memory. Both are true from their perspective, but you end up chasing a phantom problem because you are looking at total free VRAM instead of per-card VRAM.
Current status:
- #8377 — Reports the bug and proposes layer-distribution fixes (open as of 2026-06-29, some partial patches).
- #6382 — Multi-GPU OOM despite free VRAM; discussion of root causes and workarounds.
- #14632 — Continuing reports; no complete fix yet.
Check your Ollama version against these issues. If your version is mentioned as fixed, you should not hit the bug. If not, use the conservative num_gpu workaround above.
When the fix is an upgrade
Be honest with yourself. If you have:
- A 12 GB RTX 3060 trying to run a 13B model at Q8_0 (which needs ~14 GB just for weights).
- A single 8 GB card trying to run anything larger than a 7B Q4_K_M model.
- A genuine VRAM need (weights + KV cache + offloading headroom) that exceeds your card’s capacity at all reasonable quantizations.
Then no setting adjustment will help. The fix is one of:
- Downquantize — use Q4_K_M instead of Q8_0 or Q6_K. Smaller footprint, slightly lower quality.
- Downsize the model — switch to a 7B instead of 13B, or 13B instead of 70B. Smaller model, faster too.
- Add VRAM — if you are a few gigabytes short, a second used GPU (like a used RTX 3090) fills the gap. If the model fundamentally exceeds 24 GB, you are in 70B+ territory and the economics shift. Check out rent vs. buy for GPU break-even to see whether renting makes sense.
The VRAM calculator and which quant should I download are your friends here. Use them before deciding you need to upgrade.
Checking what is actually happening
Create a test Modelfile with explicit settings:
FROM llama2
PARAMETER num_predict 512
PARAMETER num_gpu 10
Run Ollama in verbose mode to see what happens during load:
OLLAMA_DEBUG=1 ollama run test-model "test prompt"
Watch the logs. You will see:
- How many layers are being offloaded.
- Whether both GPUs (if you have two) are being used.
- Memory allocation during load and inference.
Share these logs (without sensitive data) if you file a GitHub issue.
Bottom line
Ollama’s “out of memory” error with free VRAM is a real problem, but it almost always traces back to one of four causes:
- KV cache inflated by a long context — reduce
num_predict. - Too many layers offloaded — lower
num_gpu. - Multi-GPU misallocation bug — use conservative offloading and check the GitHub issues.
- Model genuinely does not fit — downquantize or downsize the model.
Start with the diagnosis flowchart. Test each fix in order (context, then offloading, then multi-GPU workaround, then quantization). If you hit the third branch and find you have a genuine VRAM shortfall, that is honest information and worth using before you decide to buy hardware. The tool ecosystem (VRAM calculator, build planner) exists to keep you from buying wrong.