Why VRAM (and Memory Bandwidth) Matters More Than Compute for Local LLMs
The GPU market for local LLM inference is puzzling if you trust traditional specs. An older Tesla P40 with 24GB GDDR5 or AMD MI50 with 32GB HBM2 often outruns a newer RTX 4060 Ti with 16GB GDDR6. A used RTX 3090 at $500–$800 decodes more tokens per second than a new RTX 4070 Ti Super at the same price or higher. The reason is not hype or nostalgia; it is a fundamental property of how language models execute on GPUs.
Token generation is memory-bandwidth-bound, not compute-bound. When a model generates text, it reads the entire weight matrix from memory to produce each token. Compute happens fast; data movement is the bottleneck. This single fact explains why large VRAM and high memory bandwidth trump raw FLOPS for decoding, why old cards with fat VRAM beat new cards with thinner VRAM, and why the market for used enterprise and gaming GPUs is so vigorous. It also explains why Apple Silicon can feel slow despite impressive spec sheets — bandwidth, not compute power, sets the ceiling.
This is the spine of the used-GPU buying decision. If you understand it, you stop overpaying for raw teraflops and start buying the right constraints.
The three constraints: capacity, bandwidth, compute
Most buyers conflate these. They are separate problems.
Capacity (VRAM): Does the model fit in the card’s memory at all? A 7B model in FP16 (half precision, 2 bytes per parameter) is roughly 14GB (the ~2GB per billion parameters heuristic). At Q4_K_M quantization it is ~4GB. A 70B model is ~140GB at FP16 and ~35GB at a flat 4 bits per weight. If the VRAM is smaller than the model, the model either fails to load or offloads layers to system RAM over PCIe, and PCIe is slow enough (about 16 GB/s theoretical for a PCIe 3.0 x16 link, lower in practice) that throughput collapses. You cannot work around a capacity miss.
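The cost of a capacity miss can be put into rough numbers. The sketch below is a deliberately simple model, not a description of how any particular runtime schedules offloaded layers: it assumes every weight is read once per token, that the part of the model outside VRAM is read at a single slower rate, and it ignores compute, KV cache, and runtime overhead. The function name and the 16 GB/s spill rate are illustrative assumptions.

```python
def weight_read_ms(model_gb, vram_gb, vram_bw_gbs, spill_bw_gbs):
    """Estimate per-token weight-read time in milliseconds.

    Assumes every weight is read once per token, and that whatever does not
    fit in VRAM is read through a slower path at spill_bw_gbs.
    """
    in_vram_gb = min(model_gb, vram_gb)
    spilled_gb = max(model_gb - vram_gb, 0.0)
    seconds = in_vram_gb / vram_bw_gbs + spilled_gb / spill_bw_gbs
    return seconds * 1000.0


# A 20GB model fits in 24GB of VRAM; a 35GB model spills 11GB to the slow path.
fits_ms = weight_read_ms(20, 24, vram_bw_gbs=936, spill_bw_gbs=16)
spills_ms = weight_read_ms(35, 24, vram_bw_gbs=936, spill_bw_gbs=16)
print(f"fits:   {fits_ms:.1f} ms/token")
print(f"spills: {spills_ms:.1f} ms/token")
```

Under these assumptions the spilled case is more than thirty times slower per token, and almost all of that time goes to the 11GB that sits outside VRAM.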
Bandwidth (GB/s): Given that the model fits, how fast can you read its weights? An RTX 3090 has 936 GB/s of VRAM bandwidth. A Tesla P40 has 346 GB/s. An RTX 4060 Ti has 288 GB/s. Generating a token requires reading essentially all of the weights once (some implementations read parts of them more than once). More bandwidth means more tokens per second. This is why a P40 is much slower than a 3090 despite the same 24GB of VRAM: the bandwidth gap is direct and large.
Compute (FLOPS): The raw arithmetic. An RTX 4090 peaks at roughly 83 TFLOPS of FP32; an RTX 3090 peaks at roughly 36 TFLOPS. For inference, especially token generation, this gap means almost nothing. The ALUs finish their work in a fraction of the time it takes to stream the weights from memory. You are paying for idle compute.
The proof is empirical: community benchmarks (r/LocalLLaMA, llama.cpp GitHub issues, 2024–2025) show decode speed correlates strongly with bandwidth, weakly with FLOPS. An RTX 4070 Ti Super has more teraflops than a used RTX 3090 — but lower bandwidth. In practice, the 3090 decodes faster. That is not an anomaly; it is the normal state of affairs.
Why decode is memory-bandwidth-bound (not just asserted)
The reasoning is straightforward. Generating one token requires:
- Load the full weight matrix from VRAM (e.g., for a 7B model, ~14GB at FP16, ~4GB at Q4).
- Perform a matrix-vector multiply against the input (the token vector, 4KB).
- Compute softmax and sample the next token.
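The multiply and the sampling are fast; the weight read is slow. For a 4GB model at 936 GB/s (RTX 3090), reading the weights takes ~4.3ms per token. The arithmetic for the same token, about two floating-point operations per weight, takes a fraction of a millisecond at the card’s peak rate. The bottleneck is not compute.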
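The same arithmetic gives a bandwidth ceiling for any card. The sketch below assumes each token requires one full read of the weights and nothing else, so it is an upper bound rather than a prediction; the function names are illustrative, and the bandwidth figures are the ones quoted in this article.

```python
def weight_read_time_ms(model_gb, bandwidth_gbs):
    """Time to stream the full set of weights once, in milliseconds."""
    return model_gb / bandwidth_gbs * 1000.0


def decode_ceiling_tok_s(model_gb, bandwidth_gbs):
    """Upper bound on tokens per second if weight reads were the only cost."""
    return bandwidth_gbs / model_gb


MODEL_GB = 4.0  # roughly a 7B model at Q4_K_M
cards = {"RTX 3090": 936, "RTX 3060": 360, "Tesla P40": 346}

for name, bw in cards.items():
    read_ms = weight_read_time_ms(MODEL_GB, bw)
    ceiling = decode_ceiling_tok_s(MODEL_GB, bw)
    print(f"{name:10s} {read_ms:5.1f} ms/token  ceiling {ceiling:5.0f} tok/s")
```

Measured speeds land well below these ceilings because of KV-cache reads, kernel launch overhead, and imperfect memory access patterns, but the ordering of cards tends to follow the ceiling.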
Moreover, the compute-per-byte ratio (arithmetic intensity) for transformer decode is low. You read the full set of weights but do only one matrix-vector multiply per layer: a single pass, not a batched operation. Compare this to training (which reuses each weight across every token in a batch) or prefill (which processes a long sequence at once, amortizing the memory cost across many compute operations). Decode is inherently memory-bound.
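Arithmetic intensity can be estimated with a few lines. This is a simplified roofline-style sketch: it counts two floating-point operations per weight per token, counts only weight bytes as memory traffic, and compares the result with the ratio of peak compute to bandwidth (the point where a GPU stops being memory-bound). The function names are illustrative, and the 35.6 TFLOPS figure is the RTX 3090's published FP32 peak, an assumption brought in for the example.

```python
def arithmetic_intensity(tokens_in_flight, bytes_per_weight):
    """FLOPs performed per byte of weights read.

    Each weight contributes one multiply and one add per token, and is read
    from memory once regardless of how many tokens share that read.
    """
    return 2.0 * tokens_in_flight / bytes_per_weight


def ridge_point(peak_tflops, bandwidth_gbs):
    """FLOPs per byte a workload needs before the GPU becomes compute-bound."""
    return (peak_tflops * 1e12) / (bandwidth_gbs * 1e9)


decode = arithmetic_intensity(tokens_in_flight=1, bytes_per_weight=2.0)
prefill = arithmetic_intensity(tokens_in_flight=2048, bytes_per_weight=2.0)
ridge = ridge_point(peak_tflops=35.6, bandwidth_gbs=936)  # assumed 3090 FP32 peak

print(f"decode:  {decode:.0f} FLOP/byte")
print(f"prefill: {prefill:.0f} FLOP/byte")
print(f"ridge:   {ridge:.0f} FLOP/byte")
```

Single-token decode at FP16 sits far below the ridge point, so the card waits on memory; a 2,048-token prefill sits far above it, so the card waits on arithmetic.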
This is why an Apple M3 Max with 400 GB/s of unified memory bandwidth can still feel slower than an RTX 3090 at 936 GB/s for pure decode throughput, even with top-tier compute. Bandwidth is king. It is also why the base Apple M4 with 16GB, at 120 GB/s (less than a third of the M3 Max’s bandwidth), lands at only 18.4 tok/s (llama.cpp b9820) on Llama 3.1 8B Q4_K_M, measured 2026-06-27 by LocalRig. The M4 has more raw compute than many older GPUs, but less bandwidth, and bandwidth rules decode.
The VRAM and bandwidth heuristic: why old cards win
Older GPUs — especially enterprise and mining-era cards — were built with fat VRAM and wide memory buses. The RTX 3090 (2020) has 24GB GDDR6X at 936 GB/s. The Tesla P40 (2016, enterprise) has 24GB GDDR5 at 346 GB/s. The AMD MI50 (2018, datacenter) has 32GB HBM2 at 1,024 GB/s. These were designed when memory bandwidth was taken seriously.
Newer consumer cards cut VRAM and bus width to keep costs down. The RTX 4060 Ti has 16GB at 288 GB/s. The RTX 4070 has 12GB at 504 GB/s. These are efficient cards for gaming and professional graphics, where you work with smaller datasets per frame. For language models, the VRAM ceiling bites hard. A 13B model at Q4_K_M is ~8GB; the 4070’s 12GB leaves only about 4GB for the KV cache, runtime overhead, and longer contexts. You lose the freedom to run larger models or tune settings.
The result: a used RTX 3090 at $500–$800 delivers more decode throughput and more VRAM headroom than new consumer cards at the same price or higher. It is not nostalgia; it is the wide memory bus that newer mid-range designs gave up. Enterprise cards like the P40 and MI50 still sell steadily on the used market because they were built for tasks (like datacenter inference) that demand capacity and bandwidth, and that demand has not gone away.
For a detailed model-by-model buying guide, see Best GPU for Local LLM Inference. This page is the why behind those choices.
The ~2GB per billion parameters rule
Model size scales with parameter count. A rough heuristic used across the community:
At FP16 (half precision, 2 bytes per parameter): 1 billion parameters ≈ 2GB VRAM.
So:
- 7B model: ~14GB
- 13B model: ~26GB
- 32B model: ~64GB
- 70B model: ~140GB
Quantization compresses this. Q4_K_M (the most common balance of quality and size) stores roughly 0.6 bytes per parameter, cutting the requirement to a bit under a third of FP16:
- 7B Q4_K_M: ~4GB
- 13B Q4_K_M: ~8GB
- 70B Q4_K_M: ~40GB
Q8_0 (higher quality) takes ~1 byte per parameter, splitting the difference. The heuristic is not precise — it depends on the model architecture, runtime overhead, and context-window size — but it is consistent enough to drive VRAM sizing. Before picking a card, run this backward: decide what model and quantization you want, calculate the VRAM need, then find the card that fits with headroom.
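Running the heuristic backward is easy to script. The sketch below is a rough sizing aid, not a replacement for a real calculator: the bytes-per-parameter table is an assumption (0.6 for Q4_K_M is chosen to match the ~4GB and ~8GB figures above), and the 2GB default headroom for KV cache and runtime overhead is a placeholder you should adjust for your context length.

```python
# Approximate bytes stored per parameter; the Q4_K_M value is an assumption
# fitted to the ~4GB (7B) and ~8GB (13B) figures quoted in the text.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8_0": 1.0, "Q4_K_M": 0.6}


def weights_gb(params_billion, quant):
    """Approximate size of the weights in GB."""
    return params_billion * BYTES_PER_PARAM[quant]


def fits(params_billion, quant, vram_gb, headroom_gb=2.0):
    """True if weights plus a fixed headroom allowance fit in VRAM."""
    return weights_gb(params_billion, quant) + headroom_gb <= vram_gb


for quant in ("FP16", "Q8_0", "Q4_K_M"):
    size = weights_gb(13, quant)
    print(f"13B {quant:7s} ~{size:4.1f}GB  "
          f"12GB card: {fits(13, quant, 12)}  24GB card: {fits(13, quant, 24)}")
```

With these numbers a 13B model fits a 12GB card only at Q4_K_M, fits a 24GB card at Q8_0 as well, and does not fit either card at FP16.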
The VRAM Calculator does this for you. The Buying Framework walks the constraint logic.
Apple’s split: strong decode at scale, weak prefill
Apple Silicon is an interesting case. The M3 Max and M4 Max with large unified memory can fit models that do not fit on 24GB consumer GPUs. But the decode speed is not proportional to VRAM. Why?
Because decode speed tracks bandwidth, not capacity, and Macs have lower memory bandwidth than NVIDIA’s GDDR6X or enterprise HBM2. The M3 Max has 400 GB/s; the RTX 3090 has 936 GB/s. For pure token-generation speed on a model that fits both, the NVIDIA card wins. What unified memory buys is capacity: a Mac can hold a model too large for a 24GB card entirely in fast memory and decode it without the PCIe penalty of offloading. Prefill (reading the entire prompt at once) is the weak side. It is compute-bound when the GPU stays fed, and Apple GPUs have less raw compute than high-end NVIDIA cards, so long prompts (8K+ tokens) take noticeably longer to process. That is why Mac decode numbers alone are misleading: the token stream can look respectable while the wait before the first token is long.
For workloads that emphasize large models and short prompts (chat with a 70B-class model), Macs shine. For long-context reading (research, document analysis) and for sustained generation on models that fit in 24GB (coding, chat), NVIDIA cards are faster. This is why the 7B/8B hardware guide treats them as different tiers, not competitors.
Side-by-side: VRAM and bandwidth show the shape of the market
| GPU | Year | VRAM | Bandwidth | Est. decode speed (7B Q4) | Current price range | Notes |
|---|---|---|---|---|---|---|
| RTX 4090 | 2022 | 24GB | 1,008 GB/s | ~120–160 tok/s | ~$1,600 (new) | Fastest consumer card; same 24GB as the 3090 |
| RTX 3090 | 2020 | 24GB | 936 GB/s | ~80–110 tok/s | ~$500–$800 (used) | Community standard; value-for-throughput |
| Tesla P40 | 2016 | 24GB | 346 GB/s | ~30–50 tok/s | ~$200–$400 (used) | Lower bandwidth but cheap; good budget route |
| AMD MI50 | 2018 | 32GB | 1,024 GB/s | ~100–140 tok/s | ~$600–$900 (used) | Highest bandwidth; less available, ROCm support required |
| RTX 4070 Ti Super | 2024 | 16GB | 672 GB/s | ~60–80 tok/s | ~$700–$800 (new) | Newer, less VRAM; bandwidth-efficient but VRAM-constrained |
| RTX 3060 | 2021 | 12GB | 360 GB/s | ~40–60 tok/s | ~$200–$300 (new/used) | Budget floor; small context ceilings |
All throughput figures are community-cited (r/LocalLLaMA, llama.cpp threads, 2024–2025), not independently verified by LocalRig. Actual speeds depend on CUDA version, runtime, thermal state, and PCIe configuration. These are planning ranges.
Notice the pattern: VRAM and bandwidth, not compute or release year, predict decode speed. The 2016 P40 is slow because of low bandwidth, not because it is old. The 2024 4070 Ti Super is faster per watt but hits a VRAM ceiling that older 24GB cards do not. The AMD MI50, a 2018 design, still ranks among the fastest because it has the highest bandwidth of any card in this table.
When compute matters: prefill, not decode
Compute does matter, just not for the token stream you are waiting for. Prefill (the initial pass over the entire prompt) is compute-bound if the sequence is long and the GPU stays fed. An RTX 4090 with ~2× the compute of a 3090 will process a long prompt faster. But this is the minority of inference wall-time for interactive chat and small documents. Most time is spent generating tokens one by one, and that is bandwidth-bound.
This is why the phrase “token generation is memory-bandwidth-bound” is true, even though it sounds like a defect. It is not a defect; it is the shape of the problem. And understanding it flips the entire market: instead of chasing the newest compute, you buy the most VRAM and the highest bandwidth your budget allows, and decode speed becomes a known quantity you size for once rather than a bottleneck you chase with a new card every six months.
Bottom line
When you are shopping for a GPU to run local LLMs, the spec sheet will tempt you with FLOPS, TensorRT, new tensor cores, and year-over-year compute gains. Ignore almost all of it.
Ask three questions:
- Does the model fit? VRAM first. If it does not fit, nothing else matters.
- How fast will it generate tokens? Memory bandwidth. Higher bandwidth = faster decode.
- What do I pay for each GB/s? Divide the price by the bandwidth. A used 3090 often wins.
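The third question can be turned into a quick ranking. The sketch below uses the bandwidth figures from the table above and takes the midpoint of each quoted price range as the price; those midpoints are an assumption made for illustration, so substitute the actual listing prices you are comparing.

```python
# (price in USD, bandwidth in GB/s). Prices are midpoints of the ranges
# quoted in the table, which is an assumption made for this example.
cards = {
    "RTX 3090": (650, 936),
    "AMD MI50": (750, 1024),
    "Tesla P40": (300, 346),
}


def dollars_per_gbs(price_usd, bandwidth_gbs):
    """Price paid for each GB/s of memory bandwidth."""
    return price_usd / bandwidth_gbs


# Cheapest bandwidth first.
ranked = sorted(cards.items(), key=lambda item: dollars_per_gbs(*item[1]))

for name, (price, bw) in ranked:
    print(f"{name:10s} ${dollars_per_gbs(price, bw):.2f} per GB/s")
```

Price per GB/s only makes sense after the capacity question is settled: a cheap card that cannot hold the model is not a bargain at any ratio.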
Compute matters if you are training or running very long prompts at scale. For local inference on your machine — chat, coding, document work — bandwidth rules. That is why a 2020 card with fat memory often outperforms a 2024 card with thin memory, and why the used GPU market is thriving when new GPUs keep shrinking VRAM to hit price points.
Buy bandwidth, not hype. Your token stream will thank you.
See also
- Best GPU for Local LLM Inference — the product picks, by constraint
- What Is Quantization — how to calculate VRAM need from model size
- VRAM Calculator — plug in your model and see how much memory you need
- Why Prompt Processing is Slow on Mac — why Apple’s decode speed advantage is misunderstood
- Tesla P40 for Local LLM — the budget-to-mid-range enterprise card deep-dive