GPU VRAM Tiers for Local AI: What 8, 12, 16, 24, 48GB Actually Buy You
You walk into a GPU store (or eBay) with a model you want to run locally. You know you need enough VRAM, but how much is enough? “Get 24GB” is advice you hear everywhere, but it is not always the answer — and it overshoots the budget for someone running 7B models, while it undershoots the constraint for someone chasing 70B.
The problem is that “best GPU” and “biggest VRAM” are conflated. What you actually need is a tier — a VRAM capacity that serves a specific class of models at usable quantization, with the lowest-cost GPU card that hits that tier.
This guide cuts through the noise. It maps five VRAM tiers — 8GB, 12GB, 16GB, 24GB, 48GB — to the exact model sizes they serve, gives you the math to verify the fit, and names the best-value GPU per tier. If you have a model in mind, read the tier, find the card, and move on. If you do not have a model yet, pick the tier that matches your target and the tier selects the card.
The VRAM heuristic: ~2GB per 1B parameters, quartered by Q4
The math is straightforward enough that you never have to guess again.
Start here: A language model at full FP16 precision (16-bit floats) needs roughly 2 GB per 1 billion parameters. A 7B model at FP16 is ~14 GB. A 13B model is ~26 GB. A 70B model is ~140 GB. Those numbers are why we quantize — and why quantization level is the actual VRAM lever.
Q4 quantization cuts that by ~¾. If FP16 needs 2 GB/1B, then Q4 (4-bit quantization) needs roughly 0.5 GB/1B. So:
- 7B at Q4: ~3–5 GB (depending on the exact quantization variant)
- 13B at Q4: ~7–8 GB
- 32B at Q4: ~16–18 GB
- 70B at Q4: ~35–40 GB
Add 1–2 GB for runtime overhead and context cache. This is where your “safety margin” lives. You need room for the KV cache (the model’s attention memory for your conversation context), inference overhead, and a few generations without eviction.
This heuristic is community-cited (r/LocalLLaMA, llama.cpp benchmarks, 2024–2025) and not independently verified by LocalRig. Use it to plan; then check your specific model’s reported size before buying.
Tier-by-tier breakdown: model class → GPU
Here is what each VRAM tier unlocks:
8GB: niche tier (older 7B models, edge cases)
8GB handles older 7B models at aggressive quantization (Q2, Q3) or the rare very-compact 6B model at Q4. It is tight for anything newer, because modern 7B models and their quantization variants tend to land at 4–5 GB at Q4, leaving only 3–4 GB for context and overhead.
When to consider 8GB: You have a specific small model you have tested, your context is short (≤2K tokens), and you are price-constrained below ~$250. Otherwise, 12GB is the real floor.
Best card: RTX 3050 6GB (cheap, used ~$80–120). But you are bumping against the ceiling; the moment you want a larger model or longer context, you buy again. Not recommended unless the budget is absolutely fixed.
12GB: the entry point (7B–8B models)
12GB is where local LLM inference becomes genuinely practical. A 7B model at Q4_K_M (~4–5 GB) leaves ~6–7 GB for context (4K+ tokens easily) and overhead. An 8B model at Q4 (~4–5 GB) has similar headroom. You are not fighting the hardware.
13B models technically fit (Q4 is ~7–8 GB), but you have only ~3–4 GB free — workable, but no buffer for longer contexts or batching. If 13B is in your sights, jump to 16GB.
When to consider 12GB: Your model is 7B or smaller, your context is typical (≤4K tokens), and you are budget-constrained. It is the honest entry point.
Best-value card: RTX 3060 12GB — roughly $250–350 new, or ~$150–250 used. Delivers ~40–60 tok/s on a 7B model (community-cited, not independently verified by LocalRig). Low power (~170W), reliable, but understand the 12GB ceiling before you buy.
Used listings: eBay RTX 3060 12GB
16GB: the mid-model tier (13B–16B models)
16GB is the tier that unlocks the mid-size model class. A 13B model at Q4 (~7–8 GB) now has real headroom. A 14B model at Q4 (~7–8 GB) fits comfortably with ~7–8 GB for context and overhead. You can also run a 7B at Q8_0 (higher quality, ~8 GB) and still have room to breathe.
This is a real tier boundary, not a marketing increment. The jump from 12 to 16GB lets you drop the “13B is tight” caveat and run your mid-size model at full quality.
When to consider 16GB: Your target is a 13B–16B model, or you want a 7B model at Q8 quality with a large context window. This is the tier where you stop making trade-offs.
Best-value card: RTX 5060 Ti 16GB — new, ~$400–500. Delivers ~60–80 tok/s on a 7B model (community-cited, not independently verified by LocalRig). It is the current-gen sweet spot: 16GB at moderate power (~260W) and a price that beats older high-VRAM cards per tok/s. Also consider a used RTX 4060 Ti 16GB if available at ~$250–350 used.
Used RTX 4060 Ti: eBay
24GB: the standard (7B–32B models)
24GB is the tier where you stop fighting the hardware entirely. A 7B model at Q8_0 or higher quality is fully comfortable. A 13B model at Q8 fits. A 32B model at Q4 (~16–18 GB) has ~5–6 GB for context and overhead. You can also fit two smaller models, or run longer contexts without anxiety.
This is the tier the community converges on for “I want to run local LLMs and not think about VRAM again for a while.” Most 32B and smaller models simply fit. Beyond 32B, you run into the ceiling.
When to consider 24GB: You want to run a 32B model, or you want a 7B–13B model with high quality (Q8+) and long context. This is the last tier where single-card inference is standard.
Best-value card: Used RTX 3090 24GB — ~$500–800 used (eBay). Delivers 80–110 tok/s on a 7B Q4 model (community-cited, not independently verified by LocalRig). It is discontinued and only sold secondhand, but the VRAM-per-dollar is unbeaten. Community standard for this workload. New alternative: RTX 4090 24GB ($1,800–2,200 new) at ~120–160 tok/s (community-cited, not independently verified by LocalRig) — faster, but the price premium is steep for the same VRAM ceiling.
48GB: big models (70B and beyond)
48GB unlocks the 70B-class models at Q4 quantization (~35–40 GB), leaving ~8–10 GB for context and overhead. You cannot fit much larger than 70B without going dual-48GB or A100+ territory.
48GB comes as a single card or dual 24GB:
- Single A6000 48GB (~$2,500–4,000 used) — enterprise card, 384 GB/s bandwidth, handles 70B Q4 with ~100+ tok/s (community-cited, not independently verified by LocalRig). Overkill for inference if you have no production requirement, but if you are serious about 70B, this is the simplest path.
- Dual RTX 3090 (48GB total) — two 24GB cards communicate over PCIe, no linear speed scaling. You buy capacity, not 2× throughput. Practical only for 70B when single-card options are unavailable or when you need failover. See best GPU for local LLM for multi-GPU reality.
When to consider 48GB: Your model is 70B or larger, or you are serving multiple concurrent users and need the VRAM buffer. Otherwise, you are overspending.
Best-value card (used): Used RTX A6000 48GB — look for ~$2,500–3,500 used, depending on condition and market. Dual used 3090 alternative: eBay search (buy two) — roughly $500–800 each, so ~$1,000–1,600 total for 48GB, but lower bandwidth and no NVLink means you are not getting proportional speed.
Master reference table
| VRAM Tier | Example Models | Model Size Range | Best Quantization | Best-Value GPU | Price Range | Notes |
|---|---|---|---|---|---|---|
| 8GB | TinyLlama, older 6B | <6B | Q2, Q3 | RTX 3050 | ~$80–120 used | Niche; tight headroom. Not recommended unless budget is fixed. |
| 12GB | Llama 2 7B, Phi 2.7B | 7B–8B | Q4_K_M, Q8 | RTX 3060 | $150–250 used, $250–350 new | Sweet spot for 7B. 13B is technically possible but tight. |
| 16GB | Mistral 7B, Llama 2 13B | 13B–16B | Q4, Q8 | RTX 5060 Ti, RTX 4060 Ti 16GB | $250–350 used, $400–500 new | Unlocks mid-size models at full quality. Real tier boundary. |
| 24GB | Llama 2 13B, Mixtral 8x7B, Llama 3 32B | 7B–32B | Q8, Q4 | Used RTX 3090 | $500–800 used, $1,800+ new (4090) | Community standard. Fits most single-card inference workloads. |
| 48GB | Llama 2 70B, Mixtral 8x22B | 32B–70B | Q4, Q8 | Used A6000, Dual RTX 3090 | $2,500–3,500 (A6000) or $1,000–1,600 (2x3090) | Big models only. Dual cards buy capacity, not 2× speed. |
All tok/s figures are community-cited (r/LocalLLaMA, llama.cpp benchmarks, 2024–2025), not independently verified by LocalRig except the base Apple M4 number. Treat them as planning ranges.
How to pick your tier: the constraint logic
You have three levers; they always come in order:
-
What model do you want to run? Look it up, find its VRAM requirement in the table above or via the VRAM calculator, and pick the tier that fits it comfortably (with ~2 GB headroom for context). If your model is 13B–16B and you were thinking 12GB, jump to 16GB. Do not fight the boundary.
-
How much can you spend, and how long will this GPU last? 12GB is the absolute entry point for local LLMs at 2026 pricing. 24GB used (3090) is the long-term value play — it does not age out the moment you want a bigger model. 16GB is the middle ground. If you have the budget and intend to keep the GPU for 3+ years, 24GB makes more financial sense than upgrading from 12GB.
-
New or used? New cards (3060, 5060 Ti, 4090) carry warranty. Used cards (3090, A6000) carry risk but lower cost. If you are buying used, budget for repaste and assume no warranty. See best GPU for local LLM for detailed used-buying safety.
Do not optimize for “coolness factor” or raw FLOPS. Optimize for: Does my model fit? Is the price fair for this tier? Is there a clear upgrade path? If you answer yes to all three, you have picked right.
When to use the VRAM calculator
For exact sizing on your specific model, use the VRAM calculator. It accounts for quantization variants (Q4_K_M vs Q4_K_S vs Q8_0) and reports the exact byte count. The heuristic here (2 GB/1B FP16, quartered by Q4) is a planning tool; the calculator is the verification step.
Build and sizing context
This page is the VRAM tier selector — it tells you which GPU capacity you need. For the full decision tree (PSU, cooling, multi-GPU trade-offs, when to rent instead), see:
- The Local-AI Hardware Buying Framework — constraint logic
- Can I Run a 70B Model Locally? — big-model reality
- Build Planner — PSU, cooling, and system integration
Bottom line
VRAM tier is not about “bigger is always better.” It is about fit. Buy the tier your model needs, with 1–2 GB of headroom for context and overhead. A 12GB GPU running a 7B model will not suddenly become better by swapping it for a 24GB card — you are still capped by the model size and bandwidth. But a 12GB GPU struggling to fit a 13B model becomes usable the moment you move to 16GB.
The tiers are real boundaries. Honor them, and you buy once. Cross them, and you buy again in six months. Use the table, plug in your target model, and pick your tier. The cards follow from there.