vLLM Multi-GPU Setup: Tensor Parallelism Without the Idle-GPU Mistake
The most expensive mistake in vLLM multi-GPU setup is not hardware — it is misconfiguring tensor parallelism and watching one or more GPUs sit idle while requests time out. The second most expensive is running out of KV-cache memory mid-conversation and crashing the serving pod without warning.
This guide is for anyone running local LLM inference on two or more consumer GPUs using vLLM. It covers the actual deployment decisions: When does tensor parallelism matter? How many GPUs do you actually need for your model? What are the two configuration gotchas that burn beginners? And how do you ship it in Docker without scrambling at deploy time?
If you are new to vLLM itself, start with how to run LLMs locally and Ollama vs llama.cpp vs vLLM to understand where vLLM sits in the serving landscape. This page assumes you have vLLM installed and are ready to configure multi-GPU workloads.
What tensor parallelism actually does — and what it does not
Tensor parallelism splits a model’s weights across multiple GPUs, so each card holds a slice of every layer: a subset of the attention heads and a slice of the feed-forward weights. On each request, all GPUs compute in parallel, then synchronize over the interconnect (PCIe or NVLink). For a 7B model that fits on a single GPU, tensor parallelism adds overhead and latency; the single-GPU path is simpler and faster. For a 70B model that needs two cards, tensor parallelism is how the model runs at all.
Crucially: tensor parallelism is not a free throughput multiplier. A naive expectation is “2 GPUs = 2× tok/s.” Reality is more nuanced.
- Single user, long context: Two 3090s via tensor parallelism will decode faster than one 3090, because each card computes fewer attention heads. The speedup is real but sublinear — maybe 1.4–1.6× depending on model and context length — because inter-GPU communication (over PCIe) costs bandwidth. Without NVLink, this is the ceiling. (For a deeper dive on NVLink economics, see is NVLink worth it?)
- Multiple concurrent users (batching): vLLM’s continuous batching engine can queue many requests and amortize the tensor-parallel overhead. This is where multi-GPU serving shines — not for single-user speed, but for throughput under load.
- Inference time overhead: Each token-generation step requires an all-gather (or all-reduce) synchronization across GPUs. This is fast on NVLink (high-bandwidth) but noticeably slower on PCIe. Plan for ~10–20% overhead on consumer cards without NVLink.
For a single-user, single-request workload (chat), the value of two cards is VRAM capacity, not speed. For a production serving pod handling concurrent requests, multi-GPU is about throughput per dollar.
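To make the weight-splitting idea concrete, here is a toy sketch in plain Python. It is not how vLLM implements tensor parallelism; the function names and the tiny matrix are illustrative assumptions. Each simulated GPU multiplies the input by its own slice of the output columns, and concatenating the partial results (the step that costs interconnect bandwidth on real hardware) reproduces the single-device answer.

```python
def matvec(x, weight_cols):
    """Multiply row vector x by a matrix stored as a list of columns."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in weight_cols]


def shard_columns(weight_cols, tp_size):
    """Split the output columns evenly across tp_size simulated GPUs."""
    if len(weight_cols) % tp_size != 0:
        raise ValueError("column count must be divisible by tp_size")
    per_rank = len(weight_cols) // tp_size
    return [weight_cols[r * per_rank:(r + 1) * per_rank] for r in range(tp_size)]


def tensor_parallel_matvec(x, weight_cols, tp_size):
    """Compute each shard independently, then gather the partial outputs."""
    shards = shard_columns(weight_cols, tp_size)
    partials = [matvec(x, shard) for shard in shards]  # per-GPU work
    return [v for part in partials for v in part]      # stands in for all-gather


x = [1.0, 2.0]
W_cols = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, -1.0]]  # 4 output columns

full = matvec(x, W_cols)
split = tensor_parallel_matvec(x, W_cols, tp_size=2)
print(full, split)
```

The same even-split requirement that makes `shard_columns` raise on a non-divisible size is the reason tensor-parallel-size has to divide the head count.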
Mistake 1: Tensor-parallel-size must divide the model’s attention-head count
This is the most common configuration error, and it leaves GPUs idle.
A Llama 3 70B model has 64 attention heads. If you set tensor-parallel-size=3, vLLM cannot divide 64 heads evenly across 3 shards and will refuse to load the model. The valid divisors of 64 are 1, 2, 4, 8, 16, 32, and 64. If you have 4 GPUs, set tensor-parallel-size=4. If you have 3 GPUs and a 64-head model, you cannot use all three without modifying the model architecture: you are limited to tensor-parallel-size=2, and the third card sits idle or serves other workloads. Reaching tensor-parallel-size=4 means buying a fourth card.
| Model | Attention Heads | Valid tensor-parallel-size values |
|---|---|---|
| Llama 2 7B | 32 | 1, 2, 4, 8, 16, 32 |
| Llama 3 8B | 32 | 1, 2, 4, 8, 16, 32 |
| Llama 3 70B | 64 | 1, 2, 4, 8, 16, 32, 64 |
| Mistral 7B | 32 | 1, 2, 4, 8, 16, 32 |
| Phi 3 3.8B | 32 | 1, 2, 4, 8, 16, 32 |
Check your model before buying hardware. If you plan a 2-GPU setup and want a model with 48 attention heads (divisors: 1, 2, 3, 4, 6, 8, 12, 16, 24, 48), you can set tensor-parallel-size=2 cleanly. If you want 4 GPUs with a 64-head model, you are golden. If you want 3 GPUs, you need a model whose head count is divisible by 3, and the common head counts of 32 and 64 are not. Check the config.json in the model repo, or ask, before committing GPU budget.
To check a model’s head count:
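# Fetch the model's config.json from Hugging Face and read the head count.
# meta-llama repos are gated: accept the license on the model page and export HF_TOKEN first.
curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/meta-llama/Llama-2-70b-hf/raw/main/config.json | jq .num_attention_heads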
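The divisibility rule is easy to check offline once you know the head count. The sketch below is a minimal helper, not part of vLLM; the function names are made up for illustration, and `heads_from_config` assumes a locally downloaded config.json that uses the standard `num_attention_heads` key.

```python
import json


def valid_tp_sizes(num_heads, num_gpus=None):
    """Return every tensor-parallel size that divides num_heads evenly.

    If num_gpus is given, sizes larger than the GPU count are dropped.
    """
    sizes = [d for d in range(1, num_heads + 1) if num_heads % d == 0]
    if num_gpus is not None:
        sizes = [d for d in sizes if d <= num_gpus]
    return sizes


def heads_from_config(path):
    """Read the attention-head count from a local Hugging Face config.json."""
    with open(path) as f:
        return json.load(f)["num_attention_heads"]


print(valid_tp_sizes(64))               # all divisors of a 64-head model
print(valid_tp_sizes(64, num_gpus=3))   # what a 3-GPU box can actually use
print(valid_tp_sizes(48, num_gpus=4))   # a 48-head model on 4 GPUs
```

With three GPUs and 64 heads the largest usable size is 2, which is the idle-GPU situation described above.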
Mistake 2: KV-cache headroom — 4–5 GB per GPU is not optional
The second cluster of failures happens in production, after deployment. A single request works fine. A second request arrives, context grows, and the pod crashes with OutOfMemoryError even though you checked the model weights fit in VRAM.
The culprit is the KV cache — the cached key and value tensors for each token in the context window. For a 70B model with a 4K context on two 24GB cards, the KV cache alone can consume ~4–6 GB per card. If you have allocated 23 GB to weights and offloading, zero headroom remains.
Budget formula:
- Model weights: Look up the unquantized size (e.g., Llama 3 70B = ~140 GB), then multiply by your quantization factor (FP16 = 1×, Q8 = 0.5×, Q4 = 0.25×).
- Driver + PyTorch overhead: ~0.5–1 GB per GPU, non-negotiable.
- KV-cache buffer per card: depends on max_num_seqs and max_model_len (context window). A rough, conservative heuristic: ~1 GB per 1K tokens of context for each concurrent request, split across the tensor-parallel cards. A 4K context with 2 concurrent requests is ~8 GB in total, or ~4 GB per card on two cards.
- Headroom: Always allocate 1–2 GB extra per card to absorb runtime variance and avoid OOM crashes on edge cases.
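If you want something tighter than the 1 GB per 1K tokens heuristic, the raw KV-cache size can be computed from the architecture. The sketch below is an estimate under stated assumptions: FP16 cache entries (2 bytes), KV heads split evenly across the tensor-parallel group, and Llama 3 70B values of 80 layers, 8 KV heads, and head dimension 128, which you should confirm against the model's config.json. vLLM's real allocation is block-based and sized from --gpu-memory-utilization, so treat the result as a floor rather than what the server will reserve.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across the whole model."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_gib_per_card(context_len, num_seqs, tp_size,
                    num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV cache per GPU in GiB, assuming KV heads shard evenly across tp_size."""
    if num_kv_heads % tp_size != 0:
        raise ValueError("this sketch assumes num_kv_heads is divisible by tp_size")
    total = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes)
    total *= context_len * num_seqs
    return total / tp_size / 2**30


# Assumed Llama 3 70B architecture values (verify against config.json).
llama3_70b = dict(num_layers=80, num_kv_heads=8, head_dim=128)

per_token = kv_bytes_per_token(**llama3_70b)
per_card = kv_gib_per_card(context_len=4096, num_seqs=2, tp_size=2, **llama3_70b)
print(per_token, per_card)
```

For grouped-query models like this one the computed figure comes out well below the heuristic, which is why the heuristic is safe to budget with: it overestimates rather than underestimates.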
Example: 2×24GB RTX 3090 serving Llama 3 70B (~140 GB at FP16):
- Model weights at Q8: 140 GB × 0.5 = 70 GB, or 35 GB per card. This does not fit on a 24 GB card.
- Quantize to Q4 instead: 140 GB × 0.25 = 35 GB, or 17.5 GB per card. That leaves 6.5 GB per card.
- Driver overhead: 1 GB per card.
- KV-cache buffer: Plan for 4K context and 2 concurrent requests. ~4 GB per card.
- Remaining: 6.5 - 1 - 4 = ~1.5 GB buffer. Workable, but tight: it sits inside the 1–2 GB headroom target, so do not raise context length or concurrency without re-running the numbers.
If you run single-request serving (no batching), you can reduce KV-cache headroom, but do not eliminate it. vLLM will still allocate runtime buffers, and long conversations will grow context steadily.
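The budget formula is simple enough to script, which makes it easy to rerun whenever you change quantization, context length, or card count. This is a minimal sketch, not a vLLM utility; the function name and defaults are assumptions. It applies the quantization factor to the FP16 size (140 GB for a 70B model), then subtracts per-card overhead and KV-cache buffer.

```python
# Quantization factors relative to the FP16 size, as in the budget formula.
QUANT_FACTOR = {"fp16": 1.0, "q8": 0.5, "q4": 0.25}


def per_card_budget(fp16_size_gb, quant, num_cards, card_gb,
                    overhead_gb=1.0, kv_gb_per_card=4.0):
    """Return per-card weight size and what is left after overhead and KV cache."""
    weights = fp16_size_gb * QUANT_FACTOR[quant] / num_cards
    remaining = card_gb - weights - overhead_gb - kv_gb_per_card
    return {"weights_gb": weights, "remaining_gb": remaining, "fits": remaining >= 0}


q8 = per_card_budget(140, "q8", num_cards=2, card_gb=24)
q4 = per_card_budget(140, "q4", num_cards=2, card_gb=24)
print(q8)
print(q4)
```

On two 24 GB cards, Q8 is short by a wide margin, while Q4 fits with about 1.5 GB per card to spare, so any increase in context length or concurrency needs a fresh calculation.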
Prerequisites: driver, CUDA, Docker, Container Toolkit
Before deploying vLLM on multiple GPUs, verify that your stack meets these minimum versions:
| Component | Minimum | Tested (2025–2026) |
|---|---|---|
| NVIDIA Driver | 525 | 535+ |
| CUDA Toolkit | 12.1 | 12.1–12.4 |
| Docker | 23.0 | 24.0+ |
| NVIDIA Container Toolkit | 1.14 | 1.14+ |
| vLLM | — | 0.4.0–0.6.0 |
Driver ≥525 is not negotiable. CUDA 12.x requires a 525-series or newer driver; with anything older, the CUDA runtime inside vLLM fails to initialize. On Ubuntu:
nvidia-smi # Check current driver version
# If < 525, update:
sudo apt update
sudo apt install nvidia-driver-535 # Or latest available
sudo reboot
CUDA ≥12.1: vLLM containers bundle CUDA, so your host CUDA version matters less. But if you are running vLLM directly (not containerized), verify:
nvcc --version # Check CUDA toolkit version
# If absent or old:
# Install from https://developer.nvidia.com/cuda-downloads
Container Toolkit ≥1.14: Allows Docker containers to access NVIDIA GPUs. Install or upgrade:
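# Ubuntu / Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker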
Verify:
docker run --rm --gpus all nvidia/cuda:12.1.0-runtime-ubuntu22.04 nvidia-smi
All GPUs should appear. If not, the Container Toolkit is not wired correctly.
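For a scripted version of the same check, the sketch below parses the CSV output of `nvidia-smi --query-gpu=driver_version,name,memory.total --format=csv,noheader` and flags drivers below 525. The helper names are hypothetical, and the sample string is illustrative output, not a capture from a real machine. `query_gpus()` only works on a host with the NVIDIA driver installed.

```python
import subprocess

MIN_DRIVER_MAJOR = 525  # minimum from the prerequisites table


def parse_gpu_csv(text):
    """Parse 'driver, name, memory' lines into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        driver, name, memory = [field.strip() for field in line.split(",")]
        gpus.append({"driver": driver, "name": name, "memory": memory})
    return gpus


def driver_ok(version, minimum=MIN_DRIVER_MAJOR):
    """Compare the major component of a driver version against the minimum."""
    return int(version.split(".")[0]) >= minimum


def query_gpus():
    """Run nvidia-smi on the host and return the parsed GPU list."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=driver_version,name,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


# Illustrative sample only; call query_gpus() on a real host instead.
sample = (
    "535.104.05, NVIDIA GeForce RTX 3090, 24576 MiB\n"
    "535.104.05, NVIDIA GeForce RTX 3090, 24576 MiB\n"
)
gpus = parse_gpu_csv(sample)
print(len(gpus), all(driver_ok(g["driver"]) for g in gpus))
```

If the number of parsed GPUs is lower than the tensor-parallel size you plan to use, fix the driver or Container Toolkit setup before going further.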
How many GPUs do you actually need for your model?
The straightforward answer is determined by VRAM and the tensor-parallel-size divisibility constraint.
1 GPU (single 24GB card like RTX 3090 or 4090):
- 7B–13B models: easy.
- 34B models: tight, requires Q4 or aggressive offloading.
- 70B models: do not fit.
2 GPUs (2×24GB, e.g., dual RTX 3090):
- 7B–13B models: no need for tensor parallelism; single-GPU serving is simpler. Use the second card for batching or other workloads.
- 34B models: tensor-parallel-size=2, comfortable.
- 70B models: tensor-parallel-size=2, works.
- 70B+ models: begins to crowd VRAM for long contexts.
4 GPUs (4×24GB):
- 70B models: tensor-parallel-size=4 if the model's attention-head count is divisible by 4 (64 is). Plenty of room for concurrent requests.
- 70B–140B models: feasible with quantization; run the per-card VRAM budget before committing.
3 GPUs: Avoid. Head counts rarely divide by 3. You will waste a card or hit configuration errors.
For a production setup under load (multiple concurrent users), the sweet spot for consumer hardware is a pair of 24 GB RTX 3090s (see the dual-3090 build guide for the full rig); beyond that, move up to datacenter hardware. If you can rent GPUs, a RunPod 2×A100 or 2×H100 pod is a good way to test the math before buying.
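The two constraints in this section, VRAM and divisibility, can be combined into one search for the smallest GPU count that works. The sketch below is an assumption-laden estimate, not a sizing tool: the function name is made up, weights and KV cache are assumed to split evenly across cards, and the defaults (24 GB cards, 8 GB total KV cache, 1 GB overhead, 1 GB headroom) come from the budgeting rules in this guide.

```python
def smallest_fitting_tp(weights_gb, num_heads, card_gb=24.0, kv_total_gb=8.0,
                        overhead_gb=1.0, headroom_gb=1.0, max_gpus=8):
    """Return the smallest tensor-parallel size that divides the head count
    and fits the per-card budget, or None if nothing up to max_gpus works."""
    for tp in range(1, max_gpus + 1):
        if num_heads % tp != 0:
            continue  # e.g. 3 GPUs with a 64-head model
        per_card = weights_gb / tp + kv_total_gb / tp + overhead_gb + headroom_gb
        if per_card <= card_gb:
            return tp
    return None


print(smallest_fitting_tp(35.0, 64))               # 70B at Q4
print(smallest_fitting_tp(14.0, 32))               # 7B at FP16
print(smallest_fitting_tp(140.0, 64))              # 70B at FP16
print(smallest_fitting_tp(140.0, 64, max_gpus=4))  # 70B at FP16, 4 cards max
```

A result of `None` means the model does not fit at that quantization within the GPU limit, which is the signal to quantize further or rent larger cards.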
Configuring vLLM: Docker setup guide
The cleaner, more reliable path is to run vLLM in Docker. The official image includes all CUDA bindings and avoids local CUDA version mismatches.
Step 1: Pull the vLLM image
docker pull vllm/vllm-openai:latest
# or a specific version:
docker pull vllm/vllm-openai:v0.6.0
Step 2: Download your model
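# Example: Llama 3 70B Q4 variant (GGUF format from Hugging Face).
# Confirm the repo ID exists before relying on it; GGUF loading in vLLM is experimental.
huggingface-cli download Xwin-LM/Llama-3-70b-Q4_K_M --local-dir ./models/Llama-3-70b-Q4_K_M
# Or use a full-precision model (gated; the -hf repo carries the Transformers-format weights and config.json):
huggingface-cli download meta-llama/Llama-2-70b-hf --local-dir ./models/Llama-2-70b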
Step 3: Run vLLM with tensor parallelism
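# --ipc=host gives the tensor-parallel workers access to host shared memory, which PyTorch needs.
docker run -d \
  --gpus all \
  --ipc=host \
  -v ./models:/models \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model /models/Llama-2-70b \
  --tensor-parallel-size 2 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85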
Key flags:
- --tensor-parallel-size 2: Use 2 GPUs (must divide model’s attention-head count).
- --max-model-len 4096: Maximum context length. Adjust based on VRAM headroom.
- --gpu-memory-utilization 0.85: Fraction of each GPU's VRAM that vLLM may claim for weights, activations, and KV cache. 0.85 leaves 15% headroom (~3.6 GB on a 24 GB card). Conservative but safe.
- --dtype float16: Forces FP16 (halves VRAM vs FP32). It is omitted from the command above because the default, --dtype auto, already selects FP16 or BF16 based on the checkpoint.
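If you generate the launch command from a script, it is worth validating the flags before the container starts. The sketch below builds the server argument list from the flags discussed here and rejects a tensor-parallel size that does not divide the head count. The function name is hypothetical, and `num_heads` is something you supply from config.json; vLLM itself performs its own check at load time.

```python
def build_vllm_args(model_path, tp_size, num_heads,
                    max_model_len=4096, gpu_memory_utilization=0.85):
    """Return the vLLM server arguments after basic pre-flight checks."""
    if num_heads % tp_size != 0:
        raise ValueError(
            f"tensor-parallel-size {tp_size} does not divide {num_heads} heads"
        )
    if not 0 < gpu_memory_utilization <= 1:
        raise ValueError("gpu-memory-utilization must be in (0, 1]")
    return [
        "--model", model_path,
        "--tensor-parallel-size", str(tp_size),
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


args = build_vllm_args("/models/Llama-2-70b", tp_size=2, num_heads=64)
print(" ".join(args))
```

The returned list can be appended to a `docker run ... vllm/vllm-openai:latest` invocation, so a bad configuration fails in the script instead of in a half-started pod.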
Step 4: Query the API
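# The "model" field must match the name the server registered, which defaults to the --model value.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/Llama-2-70b",
    "prompt": "What is the capital of France?",
    "max_tokens": 100,
    "temperature": 0.7
  }'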
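The same request can be sent from Python with only the standard library. This is a minimal sketch assuming the server from Step 3 is listening on localhost:8000 and that the model is registered under the path passed to --model; the helper names are made up. `complete()` needs a running server, while `build_completion_request()` can be inspected offline.

```python
import json
import urllib.request


def build_completion_request(prompt, model, base_url="http://localhost:8000",
                             max_tokens=100, temperature=0.7):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, model, **kwargs):
    """Send the request and return the first completion's text."""
    req = build_completion_request(prompt, model, **kwargs)
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# The model name is an assumption: it must match what the server registered.
req = build_completion_request(
    "What is the capital of France?", model="/models/Llama-2-70b"
)
print(req.full_url)
```

A 404 or a model-not-found error from `complete()` usually means the `model` string does not match the served name.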
Common configuration mistakes and fixes
| Mistake | Symptom | Fix |
|---|---|---|
| tensor-parallel-size not divisible by model’s attention-head count | vLLM fails to load model or hangs | Check model config.json, set divisible tensor-parallel-size |
| Insufficient KV-cache headroom | OOM on 2nd concurrent request, pod crashes | Reduce --max-model-len or --gpu-memory-utilization, or re-budget with the headroom formula above |
| --tensor-parallel-size not set (defaults to 1) | vLLM sees all GPUs but uses only one | Set --tensor-parallel-size to the number of GPUs to shard across, and pass --gpus all to Docker |
| Old driver (< 525) | CUDA errors, PTX compilation fails | Upgrade to driver ≥525 |
| vLLM inside container sees 0 GPUs | Container runs but CPU-only | Verify the NVIDIA Container Toolkit is installed and docker run --gpus all works with the test image |
Bottom line
vLLM makes multi-GPU serving accessible, but it is not an “install and forget” tool. The two gotchas (tensor-parallel-size divisibility and KV-cache headroom) are easy to get right if you check them before deployment. A production setup should:
- Verify model’s attention-head count before buying GPUs.
- Budget VRAM: model weights + 1 GB driver + KV-cache buffer + 1 GB headroom per card.
- Run a dry run with Docker and a ~4K context to confirm no OOM.
- Deploy with --gpu-memory-utilization below 0.9 to absorb surprises.
For most single-user or small-team inference, a 2×24GB RTX 3090 with tensor-parallel-size=2 is the practical sweet spot. If you need to test before buying, rent a multi-GPU pod on RunPod or another provider for a few dollars an hour.
Who This Is NOT For
This guide is for local, self-hosted vLLM deployments on consumer or small datacenter hardware. It is not the right guide if:
- You are using a managed serving platform (Together.ai, Replicate, Anyscale). They handle tensor parallelism and VRAM budgeting behind the scenes. This guide is for when you own the infrastructure.
- You are training or fine-tuning. That workload has different VRAM and communication patterns. vLLM is an inference engine; for training, see the appropriate fine-tuning docs.
- You have only one GPU and want to use tensor parallelism. Single-GPU serving is simpler, lower-latency, and does not require inter-GPU communication. Use it.
- You want to switch between single and multi-GPU serving without rebuilding. You can, but the model compile and CUDA graph differ between configurations; plan for deployment restarts.