How much VRAM do I need to run a 7B model?

12GB is the comfortable floor. A 7B model at Q4_K_M (high quality) needs ~4–5 GB, leaving room for context and runtime overhead. 8GB is technically possible at Q3 but tight; 12GB gives you usable headroom.

Can I run a 13B model on 12GB?

13B at Q4_K_M is ~7–8 GB, so it fits on 12GB with ~3–4 GB free. It works, but there is little buffer for large contexts or batching. 16GB is the honest recommendation for 13B as your primary model.

How many GB do I need for a 70B model?

At Q4 quantization, 70B models need roughly 35–40 GB. This requires either dual 24GB GPUs (~48 GB total) with tensor parallelism, or a single A6000 (48 GB) or H100 (80 GB). 48GB is the practical floor for 70B-class models.

What is the VRAM heuristic I should use to size my GPU?

Rule of thumb: ~2 GB per 1 billion parameters at FP16 precision. Q4 quantization cuts that to ~0.5 GB per 1B. Add 1–2 GB for runtime overhead and context cache. This gives you a starting point; verify against your target model's measured size.

Is a 16GB GPU better than 12GB, or is it just marketing?

The jump from 12 to 16GB is real. It lets you comfortably run 14B models at Q4, whereas 12GB at 13B leaves little headroom. It is the tier that unlocks the mid-size model class (13B–16B) without constant trade-offs.

GPU VRAM Tiers for Local AI: What 8, 12, 16, 24, 48GB Actually Buy You

You walk into a GPU store (or eBay) with a model you want to run locally. You know you need enough VRAM, but how much is enough? “Get 24GB” is advice you hear everywhere, but it is not always the answer — and it overshoots the budget for someone running 7B models, while it undershoots the constraint for someone chasing 70B.

The problem is that “best GPU” and “biggest VRAM” are conflated. What you actually need is a tier — a VRAM capacity that serves a specific class of models at usable quantization, with the lowest-cost GPU card that hits that tier.

This guide cuts through the noise. It maps five VRAM tiers — 8GB, 12GB, 16GB, 24GB, 48GB — to the exact model sizes they serve, gives you the math to verify the fit, and names the best-value GPU per tier. If you have a model in mind, read the tier, find the card, and move on. If you do not have a model yet, pick the tier that matches your target and the tier selects the card.

The VRAM heuristic: ~2GB per 1B parameters, quartered by Q4

The math is straightforward enough that you never have to guess again.

Start here: A language model at full FP16 precision (16-bit floats) needs roughly 2 GB per 1 billion parameters. A 7B model at FP16 is ~14 GB. A 13B model is ~26 GB. A 70B model is ~140 GB. Those numbers are why we quantize — and why quantization level is the actual VRAM lever.

Q4 quantization cuts that by ~¾. If FP16 needs 2 GB/1B, then Q4 (4-bit quantization) needs roughly 0.5 GB/1B. So:

7B at Q4: ~3–5 GB (depending on the exact quantization variant)
13B at Q4: ~7–8 GB
32B at Q4: ~16–18 GB
70B at Q4: ~35–40 GB

Add 1–2 GB for runtime overhead and context cache. This is where your “safety margin” lives. You need room for the KV cache (the model’s attention memory for your conversation context), inference overhead, and a few generations without eviction.

This heuristic is community-cited (r/LocalLLaMA, llama.cpp benchmarks, 2024–2025) and not independently verified by LocalRig. Use it to plan; then check your specific model’s reported size before buying.

Tier-by-tier breakdown: model class → GPU

Here is what each VRAM tier unlocks:

8GB: niche tier (older 7B models, edge cases)

8GB handles older 7B models at aggressive quantization (Q2, Q3) or the rare very-compact 6B model at Q4. It is tight for anything newer, because modern 7B models and their quantization variants tend to land at 4–5 GB at Q4, leaving only 3–4 GB for context and overhead.

When to consider 8GB: You have a specific small model you have tested, your context is short (≤2K tokens), and you are price-constrained below ~$250. Otherwise, 12GB is the real floor.

Best card: RTX 3050 6GB (cheap, used ~$80–120). But you are bumping against the ceiling; the moment you want a larger model or longer context, you buy again. Not recommended unless the budget is absolutely fixed.

12GB: the entry point (7B–8B models)

12GB is where local LLM inference becomes genuinely practical. A 7B model at Q4_K_M (~4–5 GB) leaves ~6–7 GB for context (4K+ tokens easily) and overhead. An 8B model at Q4 (~4–5 GB) has similar headroom. You are not fighting the hardware.

13B models technically fit (Q4 is ~7–8 GB), but you have only ~3–4 GB free — workable, but no buffer for longer contexts or batching. If 13B is in your sights, jump to 16GB.

When to consider 12GB: Your model is 7B or smaller, your context is typical (≤4K tokens), and you are budget-constrained. It is the honest entry point.

Best-value card: RTX 3060 12GB — roughly $250–350 new, or ~$150–250 used. Delivers ~40–60 tok/s on a 7B model (community-cited, not independently verified by LocalRig). Low power (~170W), reliable, but understand the 12GB ceiling before you buy.

Used listings: eBay RTX 3060 12GB

16GB: the mid-model tier (13B–16B models)

16GB is the tier that unlocks the mid-size model class. A 13B model at Q4 (~7–8 GB) now has real headroom. A 14B model at Q4 (~7–8 GB) fits comfortably with ~7–8 GB for context and overhead. You can also run a 7B at Q8_0 (higher quality, ~8 GB) and still have room to breathe.

This is a real tier boundary, not a marketing increment. The jump from 12 to 16GB lets you drop the “13B is tight” caveat and run your mid-size model at full quality.

When to consider 16GB: Your target is a 13B–16B model, or you want a 7B model at Q8 quality with a large context window. This is the tier where you stop making trade-offs.

Best-value card: RTX 5060 Ti 16GB — new, ~$400–500. Delivers ~60–80 tok/s on a 7B model (community-cited, not independently verified by LocalRig). It is the current-gen sweet spot: 16GB at moderate power (~260W) and a price that beats older high-VRAM cards per tok/s. Also consider a used RTX 4060 Ti 16GB if available at ~$250–350 used.

Used RTX 4060 Ti: eBay

24GB: the standard (7B–32B models)

24GB is the tier where you stop fighting the hardware entirely. A 7B model at Q8_0 or higher quality is fully comfortable. A 13B model at Q8 fits. A 32B model at Q4 (~16–18 GB) has ~5–6 GB for context and overhead. You can also fit two smaller models, or run longer contexts without anxiety.

This is the tier the community converges on for “I want to run local LLMs and not think about VRAM again for a while.” Most 32B and smaller models simply fit. Beyond 32B, you run into the ceiling.

When to consider 24GB: You want to run a 32B model, or you want a 7B–13B model with high quality (Q8+) and long context. This is the last tier where single-card inference is standard.

Best-value card: Used RTX 3090 24GB — ~$500–800 used (eBay). Delivers 80–110 tok/s on a 7B Q4 model (community-cited, not independently verified by LocalRig). It is discontinued and only sold secondhand, but the VRAM-per-dollar is unbeaten. Community standard for this workload. New alternative: RTX 4090 24GB ($1,800–2,200 new) at ~120–160 tok/s (community-cited, not independently verified by LocalRig) — faster, but the price premium is steep for the same VRAM ceiling.

48GB: big models (70B and beyond)

48GB unlocks the 70B-class models at Q4 quantization (~35–40 GB), leaving ~8–10 GB for context and overhead. You cannot fit much larger than 70B without going dual-48GB or A100+ territory.

48GB comes as a single card or dual 24GB:

Single A6000 48GB (~$2,500–4,000 used) — enterprise card, 384 GB/s bandwidth, handles 70B Q4 with ~100+ tok/s (community-cited, not independently verified by LocalRig). Overkill for inference if you have no production requirement, but if you are serious about 70B, this is the simplest path.
Dual RTX 3090 (48GB total) — two 24GB cards communicate over PCIe, no linear speed scaling. You buy capacity, not 2× throughput. Practical only for 70B when single-card options are unavailable or when you need failover. See best GPU for local LLM for multi-GPU reality.

When to consider 48GB: Your model is 70B or larger, or you are serving multiple concurrent users and need the VRAM buffer. Otherwise, you are overspending.

Best-value card (used): Used RTX A6000 48GB — look for ~$2,500–3,500 used, depending on condition and market. Dual used 3090 alternative: eBay search (buy two) — roughly $500–800 each, so ~$1,000–1,600 total for 48GB, but lower bandwidth and no NVLink means you are not getting proportional speed.

Master reference table

VRAM Tier	Example Models	Model Size Range	Best Quantization	Best-Value GPU	Price Range	Notes
8GB	TinyLlama, older 6B	<6B	Q2, Q3	RTX 3050	~$80–120 used	Niche; tight headroom. Not recommended unless budget is fixed.
12GB	Llama 2 7B, Phi 2.7B	7B–8B	Q4_K_M, Q8	RTX 3060	$150–250 used, $250–350 new	Sweet spot for 7B. 13B is technically possible but tight.
16GB	Mistral 7B, Llama 2 13B	13B–16B	Q4, Q8	RTX 5060 Ti, RTX 4060 Ti 16GB	$250–350 used, $400–500 new	Unlocks mid-size models at full quality. Real tier boundary.
24GB	Llama 2 13B, Mixtral 8x7B, Llama 3 32B	7B–32B	Q8, Q4	Used RTX 3090	$500–800 used, $1,800+ new (4090)	Community standard. Fits most single-card inference workloads.
48GB	Llama 2 70B, Mixtral 8x22B	32B–70B	Q4, Q8	Used A6000, Dual RTX 3090	$2,500–3,500 (A6000) or $1,000–1,600 (2x3090)	Big models only. Dual cards buy capacity, not 2× speed.

All tok/s figures are community-cited (r/LocalLLaMA, llama.cpp benchmarks, 2024–2025), not independently verified by LocalRig except the base Apple M4 number. Treat them as planning ranges.

How to pick your tier: the constraint logic

You have three levers; they always come in order:

What model do you want to run? Look it up, find its VRAM requirement in the table above or via the VRAM calculator, and pick the tier that fits it comfortably (with ~2 GB headroom for context). If your model is 13B–16B and you were thinking 12GB, jump to 16GB. Do not fight the boundary.
How much can you spend, and how long will this GPU last? 12GB is the absolute entry point for local LLMs at 2026 pricing. 24GB used (3090) is the long-term value play — it does not age out the moment you want a bigger model. 16GB is the middle ground. If you have the budget and intend to keep the GPU for 3+ years, 24GB makes more financial sense than upgrading from 12GB.
New or used? New cards (3060, 5060 Ti, 4090) carry warranty. Used cards (3090, A6000) carry risk but lower cost. If you are buying used, budget for repaste and assume no warranty. See best GPU for local LLM for detailed used-buying safety.

Do not optimize for “coolness factor” or raw FLOPS. Optimize for: Does my model fit? Is the price fair for this tier? Is there a clear upgrade path? If you answer yes to all three, you have picked right.

When to use the VRAM calculator

For exact sizing on your specific model, use the VRAM calculator. It accounts for quantization variants (Q4_K_M vs Q4_K_S vs Q8_0) and reports the exact byte count. The heuristic here (2 GB/1B FP16, quartered by Q4) is a planning tool; the calculator is the verification step.

Build and sizing context

This page is the VRAM tier selector — it tells you which GPU capacity you need. For the full decision tree (PSU, cooling, multi-GPU trade-offs, when to rent instead), see:

The Local-AI Hardware Buying Framework — constraint logic
Can I Run a 70B Model Locally? — big-model reality
Build Planner — PSU, cooling, and system integration

Bottom line

VRAM tier is not about “bigger is always better.” It is about fit. Buy the tier your model needs, with 1–2 GB of headroom for context and overhead. A 12GB GPU running a 7B model will not suddenly become better by swapping it for a 24GB card — you are still capped by the model size and bandwidth. But a 12GB GPU struggling to fit a 13B model becomes usable the moment you move to 16GB.

The tiers are real boundaries. Honor them, and you buy once. Cross them, and you buy again in six months. Use the table, plug in your target model, and pick your tier. The cards follow from there.