Why does my fourth GPU sit idle when I set tensor-parallel-size=3?

Tensor-parallel-size must divide the model's attention-head count evenly. A Llama 3 70B model has 64 attention heads, and 64 is not divisible by 3, so tensor-parallel-size=3 cannot shard the heads evenly and the runtime rejects the configuration. Even if it loaded, a value of 3 would claim only three of your four cards. Set tensor-parallel-size=4 instead (64 ÷ 4 = 16 heads per GPU) and all four GPUs work.

Should I use NVLink for vLLM tensor parallelism?

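NVLink helps significantly on A100 or H100 clusters, where every tensor-parallel synchronization runs over the high-bandwidth link. On consumer cards the picture is mixed: the RTX 3090 accepts an NVLink bridge between two cards, while the RTX 4090 has no NVLink connector and always synchronizes over PCIe. Tensor parallelism still works over PCIe, just with more communication overhead per token. Whether the bridge is worth it depends on your model size and VRAM: single-GPU serving is simpler when the model fits on one card, but a 70B model does not fit on one 3090 at all, so 2×3090 is how it runs in the first place, with or without NVLink. See the sizing section and our [NVLink guide](/gpus/is-nvlink-worth-it/).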

Why does my vLLM pod OOM on the second request in a long conversation?

KV cache grows with context length. If you have allocated almost all 24 GB to the model weights, the remaining space for KV cache fills fast on longer conversations. Budget 4–5 GB of headroom per GPU for driver overhead, PyTorch initialization, and KV-cache buffer growth. A 24 GB card with a 19 GB model leaves 5 GB. After roughly 1 GB of driver and runtime overhead, that is enough for about one 4K-token context by the rule of thumb in the budget section; two concurrent requests will OOM.

Do I need a specific CUDA version for vLLM?

vLLM requires CUDA ≥12.1 (officially). Verify that your host driver (≥525) is new enough to run the CUDA 12 runtime before deploying. Docker containers bundle their own CUDA toolkit, so the host does not need a matching toolkit installed, but the container still uses the host's driver: an old driver cannot run a newer container CUDA.

Is tensor parallelism worth it for a single 7B model?

No. Tensor parallelism has per-request overhead and requires inter-GPU communication. For a single 7B model (roughly 14 GB at FP16, 7 GB at Q8, 3.5 GB at Q4) that fits on one card, single-GPU serving is simpler and faster. Tensor parallelism shines when the model does not fit on one card, for example a 70B model on 2×24GB cards.

vLLM Multi-GPU Setup: Tensor Parallelism Without the Idle-GPU Mistake

The most expensive mistake in vLLM multi-GPU setup is not hardware — it is misconfiguring tensor parallelism and watching one or more GPUs sit idle while requests time out. The second most expensive is running out of KV-cache memory mid-conversation and crashing the serving pod without warning.

This guide is for anyone running local LLM inference on two or more consumer GPUs using vLLM. It covers the actual deployment decisions: When does tensor parallelism matter? How many GPUs do you actually need for your model? What are the two configuration gotchas that burn beginners? And how do you ship it in Docker without scrambling at deploy time?

If you are new to vLLM itself, start with how to run LLMs locally and Ollama vs llama.cpp vs vLLM to understand where vLLM sits in the serving landscape. This page assumes you have vLLM installed and are ready to configure multi-GPU workloads.

What tensor parallelism actually does — and what it does not

Tensor parallelism splits a model’s weights across multiple GPUs, so each card holds a slice of every layer: a subset of the attention heads and a slice of the feed-forward weights. On a request, all GPUs compute in parallel, then synchronize over the interconnect (PCIe or NVLink). For a 7B model that fits on a single GPU, tensor parallelism adds overhead and latency; the single-GPU path is simpler and faster. For a 70B model that needs two cards, tensor parallelism is how the model runs at all.

Crucially: tensor parallelism is not a free throughput multiplier. A naive expectation is “2 GPUs = 2× tok/s.” Reality is more nuanced.

Single user, long context: Two 3090s via tensor parallelism will decode faster than one 3090, because each card holds and reads only half of the weights for every token. The speedup is real but sublinear (maybe 1.4–1.6× depending on model and context length) because inter-GPU communication over PCIe costs bandwidth and latency. Without NVLink, this is the ceiling. (For a deeper dive on NVLink economics, see is NVLink worth it?)
Multiple concurrent users (batching): vLLM’s continuous batching engine can queue many requests and amortize the tensor-parallel overhead. This is where multi-GPU serving shines — not for single-user speed, but for throughput under load.
Inference time overhead: Each token-generation step requires an all-gather (or all-reduce) synchronization across GPUs. This is fast on NVLink (high-bandwidth) but noticeably slower on PCIe. Plan for ~10–20% overhead on consumer cards without NVLink.

For a single-user, single-request workload (chat), the value of two cards is VRAM capacity, not speed. For a production serving pod handling concurrent requests, multi-GPU is about throughput per dollar.

Mistake 1: Tensor-parallel-size must divide the model’s attention-head count

This is the most common configuration error, and it leaves GPUs idle.

A Llama 3 70B model has 64 attention heads. If you set tensor-parallel-size=3, the runtime cannot divide 64 heads evenly across 3 shards and refuses to load the model. The divisors of 64 are 1, 2, 4, 8, 16, 32, and 64. If you have 4 GPUs, set tensor-parallel-size=4. If you have 3 GPUs and a 64-head model, you cannot use all three without modifying the model architecture: you are limited to tensor-parallel-size=2, and the third card sits idle or serves other workloads.

Model	Attention Heads	Valid tensor-parallel-size values
Llama 2 7B	32	1, 2, 4, 8, 16, 32
Llama 3 8B	32	1, 2, 4, 8, 16, 32
Llama 3 70B	64	1, 2, 4, 8, 16, 32, 64
Mistral 7B	32	1, 2, 4, 8, 16, 32
Phi 3 3.8B	32	1, 2, 4, 8, 16, 32

Check your model before buying hardware. If you plan a 2-GPU setup and want a model with 48 attention heads (divisors: 1, 2, 3, 4, 6, 8, 12, 16, 24, 48), tensor-parallel-size=2 works cleanly. If you want 4 GPUs with a 64-head model, you are golden. If you want 3 GPUs, you need a model whose head count is divisible by 3, and many modern models are not. Check num_attention_heads in the config.json in the model repo, or ask, before committing GPU budget.

To check a model’s head count:

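# Download the model config and inspect the JSON
# For a Hugging Face model (meta-llama repos are gated, so pass an access token;
# HF_TOKEN is assumed to hold yours, and the -hf repo is the one with config.json):
curl -sL -H "Authorization: Bearer $HF_TOKEN" \
  https://huggingface.co/meta-llama/Llama-2-70b-hf/raw/main/config.json | jq .num_attention_heads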
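
Once you have the head count, the divisibility rule is easy to check in code. The following is a minimal sketch, not part of vLLM: the function name valid_tp_sizes and the inline config fragment are illustrative assumptions.

```python
import json


def valid_tp_sizes(num_attention_heads, max_gpus):
    """Return every tensor-parallel size up to max_gpus that divides the head count."""
    return [n for n in range(1, max_gpus + 1) if num_attention_heads % n == 0]


# Assumption: a trimmed-down config.json like the one in a Hugging Face model repo.
config = json.loads('{"num_attention_heads": 64}')
heads = config["num_attention_heads"]

print(valid_tp_sizes(heads, 4))  # sizes usable on a 4-GPU box with a 64-head model
print(valid_tp_sizes(48, 4))     # a 48-head model additionally allows 3
```

If the number of GPUs you own is missing from the returned list, at least one card will go unused for that model.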

Mistake 2: KV-cache headroom — 4–5 GB per GPU is not optional

The second cluster of failures happens in production, after deployment. A single request works fine. A second request arrives, context grows, and the pod crashes with OutOfMemoryError even though you checked the model weights fit in VRAM.

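The culprit is the KV cache: the cached key and value tensors for each token in the context window. For a 70B model with a 4K context on two 24GB cards, the KV cache alone can consume ~4–6 GB per card when the model keeps one key/value head per attention head; models with grouped-query attention (Llama 3 70B has 8 key/value heads for its 64 attention heads) need proportionally less. If you have allocated 23 GB to weights and offloading, almost no headroom remains.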
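
To see where a figure like that comes from, the KV-cache size can be estimated from the model shape. This is a rough sketch using the standard per-token formula (keys plus values, for every layer and key/value head); the function names are illustrative, and the shapes below (80 layers, head dimension 128) are assumptions you should replace with the values from your model's config.json.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values for one token across all layers (FP16 = 2 bytes per value)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib_per_gpu(context_tokens, concurrent_seqs, tp_size, **shape):
    """KV cache per GPU, assuming tensor parallelism splits it evenly across cards."""
    total = kv_cache_bytes_per_token(**shape) * context_tokens * concurrent_seqs
    return total / tp_size / 2**30


# Assumed shapes: 8 KV heads (grouped-query attention) vs 64 (one per attention head).
gqa = dict(num_layers=80, num_kv_heads=8, head_dim=128)
mha = dict(num_layers=80, num_kv_heads=64, head_dim=128)

# Two concurrent 4K-token sequences on two GPUs.
print(kv_cache_gib_per_gpu(4096, 2, 2, **gqa))
print(kv_cache_gib_per_gpu(4096, 2, 2, **mha))
```

The gap between the two results is why the number of key/value heads matters as much as the parameter count when you budget context length.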

Budget formula:

Model weights: Look up the unquantized size (e.g., Llama 3 70B = ~140 GB), multiply by your quantization factor (FP16 = 1×, Q8 = 0.5×, Q4 = 0.25×), then divide by the number of cards.
Driver + PyTorch overhead: ~0.5–1 GB per GPU, non-negotiable.
KV-cache buffer per card: depends on max_num_seqs and max_model_len (context window). A rough heuristic: 1 GB per 1K tokens of context, per GPU, for concurrent requests. A 4K context with 2 concurrent requests = ~8 GB per card.
Headroom: Always allocate 1–2 GB extra per card to absorb runtime variance and avoid OOM crashes on edge cases.
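
The budget formula above can be written as a small calculator. This is a sketch under the article's own rules of thumb (1 GB of overhead per card, 1 GB of KV cache per 1K tokens per GPU, 1 GB of headroom); the function name and defaults are illustrative, not a vLLM API.

```python
QUANT_FACTOR = {"fp16": 1.0, "q8": 0.5, "q4": 0.25}


def vram_budget_per_card(fp16_size_gb, quant, num_gpus, card_gb=24.0,
                         context_k=4, concurrent=1,
                         overhead_gb=1.0, headroom_gb=1.0):
    """Apply the budget formula to one card and report whether it fits."""
    weights = fp16_size_gb * QUANT_FACTOR[quant] / num_gpus
    kv_cache = 1.0 * context_k * concurrent  # heuristic: 1 GB per 1K tokens per GPU
    spare = card_gb - weights - overhead_gb - kv_cache - headroom_gb
    return {"weights_gb": weights, "kv_cache_gb": kv_cache,
            "spare_gb": spare, "fits": spare >= 0}


# A ~140 GB (FP16) model at Q4 on two 24 GB cards, one 4K-context request.
print(vram_budget_per_card(140, "q4", num_gpus=2))
# The same setup with two concurrent 4K requests.
print(vram_budget_per_card(140, "q4", num_gpus=2, concurrent=2))
```

A negative spare_gb means something has to give: a smaller quantization, a shorter context, fewer concurrent requests, or more cards.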

Example: 2×24GB RTX 3090 serving Llama 3 70B (~140 GB at FP16):

Model weights at Q8: 140 GB × 0.5 = 70 GB, or 35 GB per card. This does not fit on a 24 GB card.
Quantize to Q4 instead: 140 GB × 0.25 = 35 GB, or 17.5 GB per card. That leaves 6.5 GB per card.
Driver overhead: 1 GB per card.
KV-cache buffer: by the heuristic above, one 4K-context request needs ~4 GB per card. Two concurrent 4K requests would need ~8 GB and do not fit.
Remaining: 6.5 - 1 - 4 = ~1.5 GB of headroom. Workable for a single 4K request, but tight.

If you run single-request serving (no batching), you can reduce KV-cache headroom, but do not eliminate it. vLLM will still allocate runtime buffers, and long conversations will grow context steadily.

Prerequisites: driver, CUDA, Docker, Container Toolkit

Before deploying vLLM on multiple GPUs, verify that your stack is at minimum version:

Component	Minimum	Tested (2025–2026)
NVIDIA Driver	525	535+
CUDA Toolkit	12.1	12.1–12.4
Docker	23.0	24.0+
NVIDIA Container Toolkit	1.14	1.14+
vLLM	—	0.4.0–0.6.0

Driver ≥525 is not negotiable. Older drivers cannot run the CUDA 12 runtime that vLLM is built against. On Ubuntu:

nvidia-smi  # Check current driver version
# If < 525, update:
sudo apt update
sudo apt install nvidia-driver-535  # Or latest available
sudo reboot
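
If you manage several machines, the same driver check can be scripted. The sketch below shells out to nvidia-smi and compares the major version against the minimum from the table; the helper names are illustrative, and it returns an empty list rather than failing when nvidia-smi is not installed.

```python
import subprocess

MIN_DRIVER_MAJOR = 525  # minimum driver version from the prerequisites table


def driver_major(version):
    """Extract the major version, e.g. '535.104.05' -> 535."""
    return int(version.strip().split(".")[0])


def driver_ok(version, minimum=MIN_DRIVER_MAJOR):
    return driver_major(version) >= minimum


def installed_driver_versions():
    """One driver version string per GPU, or [] if nvidia-smi is unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return [line.strip() for line in out.splitlines() if line.strip()]


if __name__ == "__main__":
    for version in installed_driver_versions():
        print(version, "OK" if driver_ok(version) else "too old, upgrade first")
```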

CUDA ≥12.1: vLLM containers bundle CUDA, so your host CUDA version matters less. But if you are running vLLM directly (not containerized), verify:

nvcc --version  # Check CUDA toolkit version
# If absent or old:
# Install from https://developer.nvidia.com/cuda-downloads

Container Toolkit ≥1.14: Allows Docker containers to access NVIDIA GPUs. Install or upgrade:

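# Ubuntu / Debian (the nvidia-container-toolkit package replaces the deprecated nvidia-docker2)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker  # register the NVIDIA runtime with Docker
sudo systemctl restart docker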

Verify:

docker run --rm --gpus all nvidia/cuda:12.1.0-runtime-ubuntu22.04 nvidia-smi

All GPUs should appear. If not, the Container Toolkit is not wired correctly.

How many GPUs do you actually need for your model?

The straightforward answer is determined by VRAM and the tensor-parallel-size divisibility constraint.

1 GPU (single 24GB card like RTX 3090 or 4090):

7B–13B models: easy.
34B models: tight, requires Q4 or aggressive offloading.
70B models: do not fit.

2 GPUs (2×24GB, e.g., dual RTX 3090):

7B–13B models: no need for tensor parallelism; single-GPU serving is simpler. Use the second card for batching or other workloads.
34B models: tensor-parallel-size=2, comfortable.
70B models: tensor-parallel-size=2 at Q4 works, but the KV-cache budget is tight.
Models above 70B: weights crowd out the KV cache, especially for long contexts.

4 GPUs (4×24GB):

70B models: tensor-parallel-size=4 if the model’s attention-head count is divisible by 4 (64 is). Plenty of room for concurrent requests.
70B–140B models: worth serious consideration at Q4; run the budget formula first.

3 GPUs: Avoid. Head counts rarely divide by 3. You will waste a card or hit configuration errors.
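
The sizing rules above combine two constraints: the weights plus a reserve must fit on each card, and the GPU count must divide the head count. A minimal sketch of that search follows; the function name is illustrative and the 6 GB reserve per card (overhead, a 4K-context KV cache, and headroom) is an assumption taken from the budget section.

```python
def smallest_gpu_count(weights_gb, num_heads, card_gb=24.0, reserve_gb=6.0, max_gpus=8):
    """Smallest GPU count that divides num_heads and leaves reserve_gb free per card.

    weights_gb is the total size of the (already quantized) weights.
    Returns None if no count up to max_gpus satisfies both constraints.
    """
    for n in range(1, max_gpus + 1):
        if num_heads % n == 0 and weights_gb / n + reserve_gb <= card_gb:
            return n
    return None


print(smallest_gpu_count(14, 32))   # 7B at FP16, 32 heads
print(smallest_gpu_count(35, 64))   # 70B at Q4, 64 heads
print(smallest_gpu_count(70, 64))   # 70B at Q8, 64 heads: 3 is skipped, 64 % 3 != 0
```

Note how the search never returns 3 for a 64-head model, which is the reason a three-card rig rarely pays off.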

For a production setup under load (multiple concurrent users), the sweet spot for consumer hardware is 2×24GB RTX 3090 (see the dual-3090 build guide for the full rig) or upgrade to datacenter hardware. If you can rent GPUs, a RunPod 2×A100 or 2×H100 pod is a good way to test the math before buying.

Configuring vLLM: Docker setup guide

The cleaner, more reliable path is to run vLLM in Docker. The official image includes all CUDA bindings and avoids local CUDA version mismatches.

Step 1: Pull the vLLM image

docker pull vllm/vllm-openai:latest
# or a specific version:
docker pull vllm/vllm-openai:v0.6.0

Step 2: Download your model

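# Example: a 4-bit quantized 70B checkpoint in a format vLLM loads (for example AWQ or GPTQ).
# QUANT_REPO is a placeholder; set it to the quantized repository you have chosen.
QUANT_REPO="your-org/your-quantized-70b-model"
huggingface-cli download "$QUANT_REPO" --local-dir ./models/llama-70b-q4
# Or use a full-precision model (about 140 GB at FP16, far more than 2×24 GB of VRAM;
# gated repo, run `huggingface-cli login` first):
huggingface-cli download meta-llama/Llama-2-70b-hf --local-dir ./models/Llama-2-70b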

Step 3: Run vLLM with tensor parallelism

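# --ipc=host gives the tensor-parallel worker processes access to host shared memory.
# --served-model-name sets the name that clients pass as "model" in API requests.
docker run -d \
  --gpus all \
  --ipc=host \
  -v "$(pwd)/models:/models" \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model /models/Llama-2-70b \
  --served-model-name Llama-2-70b \
  --tensor-parallel-size 2 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85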

Key flags:

--tensor-parallel-size 2: Use 2 GPUs (must divide model’s attention-head count).
--max-model-len 4096: Maximum context length. Adjust based on VRAM headroom.
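--gpu-memory-utilization 0.85: Fraction of each GPU’s VRAM that vLLM may claim for weights, activations, and the KV cache. 0.85 leaves 15% headroom (~3.6 GB on a 24 GB card). Conservative but safe.
--dtype float16 (optional, not set above): Use FP16 (halves VRAM vs FP32). With the default, auto, vLLM already picks FP16 for FP32 and FP16 checkpoints and BF16 for BF16 checkpoints.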
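
To get a feel for what --gpu-memory-utilization leaves for the KV cache, the arithmetic can be sketched as follows. This is an approximation, not vLLM's actual profiling logic: the function name is illustrative and the 1 GB activation allowance is an assumption.

```python
def kv_pool_gb(card_gb, gpu_memory_utilization, weights_per_card_gb, activation_gb=1.0):
    """Rough KV-cache pool per card: the claimed fraction minus weights and activations."""
    budget = card_gb * gpu_memory_utilization
    return budget - weights_per_card_gb - activation_gb


# 24 GB card, 17.5 GB of weights per card (70B at Q4 over two GPUs).
for util in (0.85, 0.90, 0.95):
    print(util, round(kv_pool_gb(24, util, 17.5), 2))
```

Raising the fraction enlarges the KV-cache pool but shrinks the margin left for the driver and anything else on the card, which is the trade-off behind the conservative 0.85.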

Step 4: Query the API

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-2-70b",
    "prompt": "What is the capital of France?",
    "max_tokens": 100,
    "temperature": 0.7
  }'
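
The same request can be sent from Python with only the standard library. This is a minimal sketch assuming the server from Step 3 is listening on localhost:8000 and serves the model under the name Llama-2-70b; the helper names are illustrative.

```python
import json
import urllib.request


def build_completion_request(prompt, model="Llama-2-70b",
                             base_url="http://localhost:8000",
                             max_tokens=100, temperature=0.7):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)["choices"][0]["text"]


if __name__ == "__main__":
    try:
        print(complete("What is the capital of France?"))
    except OSError as error:  # covers connection refused and HTTP errors
        print("Server not reachable:", error)
```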

Common configuration mistakes and fixes

Mistake	Symptom	Fix
`tensor-parallel-size` not divisible by model’s attention-head count	vLLM fails to load model or hangs	Check model config.json, set divisible tensor-parallel-size
Insufficient KV-cache headroom	OOM on 2nd concurrent request, pod crashes	Reduce `--max-model-len` or `--gpu-memory-utilization`, or apply the headroom formula above
`--tensor-parallel-size` left at its default of 1	vLLM sees all GPUs but loads the model on only one	Set `--tensor-parallel-size` to the number of GPUs to shard across; check `CUDA_VISIBLE_DEVICES` and `--gpus all` in Docker
Old driver (< 525)	CUDA errors, PTX compilation fails	Upgrade to driver ≥525
vLLM inside container sees 0 GPUs	Container runs but CPU-only	Verify the NVIDIA Container Toolkit is installed and `docker run --gpus all` works with the test image

Bottom line

vLLM makes multi-GPU serving accessible, but it is not an “install and forget” tool. The two gotchas, tensor-parallel-size divisibility and KV-cache headroom, are easy to get right if you check them before deployment. A production setup should:

Verify model’s attention-head count before buying GPUs.
Budget VRAM: model weights + 1 GB driver + KV-cache buffer + 1 GB headroom per card.
Run a dry run with Docker and a ~4K context to confirm no OOM.
Deploy with --gpu-memory-utilization < 0.9 to absorb surprises.

For most single-user or small-team inference, a 2×24GB RTX 3090 with tensor-parallel-size=2 is the practical sweet spot. If you need to test before buying, rent a multi-GPU pod on RunPod or another provider for a few dollars an hour.

Who This Is NOT For

This guide is for local, self-hosted vLLM deployments on consumer or small datacenter hardware. It is not the right guide if:

You are using a managed serving platform (Together.ai, Replicate, Anyscale). They handle tensor parallelism and VRAM budgeting behind the scenes. This guide is for when you own the infrastructure.
You are training or fine-tuning. That workload has different VRAM and communication patterns. vLLM is an inference engine; for training, see the appropriate fine-tuning docs.
You have only one GPU and want to use tensor parallelism. Single-GPU serving is simpler, lower-latency, and does not require inter-GPU communication. Use it.
You want to switch between single and multi-GPU serving without restarting. You can switch, but the weight sharding and captured CUDA graphs differ between configurations, so every change of tensor-parallel-size means a server restart; plan for that downtime.