Hardware to Run a 70B Model Locally: VRAM, the 48GB Wall, and Your Real Options
A 70B model is the first size class where the hardware decision stops being “which GPU” and becomes “which architecture.” A 7B model fits almost anywhere; a 70B model does not fit in a single consumer GPU at usable quality, so you are choosing between multiple GPUs, a large unified-memory Mac, or renting. This guide gives you the VRAM math, the practical floor, and the four real paths — with honest notes on where the data is solid and where it is not.
Parent guide: The Local-AI Hardware Buying Framework — read the constraint logic first. For the smaller class, see Hardware to Run a 7B/8B Model Locally.
The VRAM math for 70B
The sizing rule is simple and it is the whole game at this size: VRAM (GB) ≈ parameters (billions) × (bits per weight ÷ 8), then add overhead. For a 70B model:
| Precision | Weights only | Practical total (weights + KV cache + overhead) |
|---|---|---|
| FP16 (full) | ~140 GB | ~150–170 GB |
| FP8 / INT8 | ~70 GB | ~80–90 GB |
| Q4_K_M (4-bit) | ~35–40 GB | ~44–52 GB |
| Q3 / Q2 (aggressive) | ~26–33 GB | ~34–44 GB (quality cost rises) |
Two things follow immediately. First, 24GB is not enough for a 70B model at usable quality: even Q4_K_M weights alone (~35–40 GB) exceed a single 3090 or 4090. Second, 48GB is the practical floor at Q4: it holds the weights plus a working KV cache and runtime overhead, with a little room at modest context lengths. Weights are only part of the bill. The KV cache grows linearly with context length and eats that headroom at long contexts, which is why the upper end of the Q4 range in the table (~52 GB) no longer fits in 48 GB. Budget the 10–30% overhead band, and more if you run 32K+ context. The full reasoning is in What Is Quantization and the GPU memory math section of the framework.
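The fit check can be sketched in a few lines of Python. This is a rough estimate under stated assumptions, not a LocalRig tool: the layer count, KV-head count, and head dimension are assumed values for a Llama-3-70B-style architecture, the 2 GB runtime overhead is a placeholder, and 4.5 bits per weight is picked only to land inside the table's Q4 range.

```python
# Sizing rule from this guide: weights (GB) = params (billions) * bits / 8.
# The model-shape defaults and the overhead figure are ASSUMPTIONS for
# illustration (a Llama-3-70B-style config), not values from this guide.

def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_b * bits_per_weight / 8


def kv_cache_gb(context_tokens: int,
                n_layers: int = 80,        # assumed
                n_kv_heads: int = 8,       # assumed (grouped-query attention)
                head_dim: int = 128,       # assumed
                bytes_per_value: int = 2   # FP16 cache entries
                ) -> float:
    """KV cache for one sequence: a key and a value per layer, head, token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 1e9


def fits(pool_gb: float, params_b: float, bits_per_weight: float,
         context_tokens: int, runtime_overhead_gb: float = 2.0):
    """Return (fits?, estimated GB needed). Overhead is a placeholder guess."""
    need = (weights_gb(params_b, bits_per_weight)
            + kv_cache_gb(context_tokens)
            + runtime_overhead_gb)
    return need <= pool_gb, need


if __name__ == "__main__":
    for pool, ctx in [(24, 8192), (48, 8192), (48, 32768)]:
        ok, need = fits(pool, 70, 4.5, ctx)
        print(f"{pool} GB pool, {ctx} tokens: need ~{need:.1f} GB, fits={ok}")
```

Under these assumptions the 48 GB pool fits at 8K context and falls short at 32K, which matches the 44–52 GB band in the table. Swap in your model's actual layer and head counts before relying on the result.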
Quantization at 70B: Q4_K_M is the sweet spot. Larger models degrade more gracefully under quantization than small ones, so a 70B at Q4 generally beats a 34B at Q8 for roughly the same memory budget (about 35 GB of weights versus about 34 GB by the sizing rule above). Go below Q4 (Q3/Q2) only when you must squeeze into a tighter pool and you have tested that the quality still serves your use case.
A note on throughput data
A 70B model decodes far more slowly than a 7B one because decode is memory-bandwidth-bound and you are now moving ~40 GB of weights per token instead of ~4 GB. Community reports put 70B Q4 generation in the low-double-digit tokens-per-second range on both dual-3090 rigs and large Apple Silicon — but those numbers vary widely with quant, context, runtime, and interconnect, and LocalRig has not yet measured 70B first-party. Treat any specific figure below as a community range, not a LocalRig benchmark; a first-party 70B run is queued, and verified numbers will land on the benchmarks page. The fit math above is the part you can plan around with confidence; throughput is the part to verify for your exact setup.
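The bandwidth-bound argument gives you a quick upper bound to sanity-check any throughput claim. The sketch below assumes each generated token reads every weight once and ignores KV-cache reads, compute, and interconnect, so real numbers will come in lower. The bandwidth figures are vendor spec-sheet peaks entered as assumptions, not LocalRig measurements.

```python
# Decode ceiling: tokens/s cannot exceed memory bandwidth / bytes of weights
# read per token. This is an upper bound, not a benchmark.

def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Theoretical tokens per second if decode were purely bandwidth-bound."""
    return bandwidth_gb_s / weights_gb


# Spec-sheet peak bandwidths in GB/s (ASSUMPTIONS, check your exact part).
PEAK_BANDWIDTH_GB_S = {
    "RTX 3090": 936.0,
    "RTX A6000": 768.0,
    "M2 Ultra": 800.0,
}

if __name__ == "__main__":
    for name, bw in PEAK_BANDWIDTH_GB_S.items():
        big = decode_ceiling_tps(bw, 40.0)   # ~40 GB: 70B at Q4
        small = decode_ceiling_tps(bw, 4.0)  # ~4 GB: 7B at Q4
        print(f"{name}: 70B ceiling {big:.1f} tok/s, 7B ceiling {small:.1f} tok/s")
```

The tenfold gap between the two ceilings comes straight from the 40 GB versus 4 GB weight sizes. Use the ceiling to flag implausible claims, then verify actual throughput on your own setup.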
The four paths
Path A — 2× used RTX 3090 (48GB total)
The community’s standard 70B rig. Two used 3090s give you 48 GB of fast GDDR6X for roughly $1,000–$1,600 on the used market, and llama.cpp, ExLlamaV3, and vLLM all split a 70B across both cards (tensor or pipeline parallel). The catch is interconnect: without NVLink, the cards talk over PCIe, so you do not get 2× the speed of one card — multi-GPU here buys capacity (the model fits), not linear throughput. Plan PSU and cooling for two ~300–350W cards.
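Why a second card adds capacity rather than speed can be shown with a simplified model of layer-split (pipeline-style) decode for a single request: each card reads only its own shard, but the cards work one after another, so the per-token time is the sum of the stages. The function name and the `hop_latency_s` parameter are hypothetical, and tensor-parallel setups behave differently because they depend on interconnect speed.

```python
# Simplified single-stream model of layer-split decode across GPUs.
# Each stage must finish before the next starts, so stage times add up.

def layer_split_token_time_s(shard_gb, bandwidth_gb_s, hop_latency_s=0.0):
    """Per-token time: sum of each card's weight-read time plus hop costs.

    shard_gb       -- GB of weights held on each card
    bandwidth_gb_s -- memory bandwidth of each card, GB/s
    hop_latency_s  -- hypothetical per-hop transfer cost between cards
    """
    read_time = sum(gb / bw for gb, bw in zip(shard_gb, bandwidth_gb_s))
    hops = max(len(shard_gb) - 1, 0)
    return read_time + hops * hop_latency_s


if __name__ == "__main__":
    # Hypothetical single card large enough to hold all ~40 GB.
    one_card = layer_split_token_time_s([40.0], [936.0])
    # Two cards, ~20 GB each, same per-card bandwidth (assumed spec peak).
    two_cards = layer_split_token_time_s([20.0, 20.0], [936.0, 936.0])
    print(f"one card:  {1 / one_card:.1f} tok/s ceiling")
    print(f"two cards: {1 / two_cards:.1f} tok/s ceiling")
```

In this model the two-card ceiling equals the one-card ceiling, and any non-zero hop cost pushes it lower. The second card is what makes the model fit at all.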
Browse used RTX 3090 on eBay → For the single-card reasoning behind this pick, see the best GPU for local LLM guide.
Path B — single used RTX A6000 (48GB)
One card, 48 GB, no multi-GPU split to manage, and far lower power than two 3090s. The A6000 (Ampere, 48GB) is the simplest way to hold a 70B Q4 in a single device, and the used/refurbished enterprise market is where it is affordable. You give up some raw bandwidth versus a 4090-class card, but you gain capacity and simplicity. Browse used A6000 on eBay →
Path C — Apple Silicon with 128GB+ unified memory
A MacBook Pro or Mac Studio with 128 GB (M3/M4 Max), or a Mac Studio with 192 GB (M2 Ultra, a desktop-only chip), holds a 70B Q4 comfortably: roughly 40 GB of weights out of a much larger shared pool, while drawing a fraction of the power of a dual-GPU rig and staying quiet. The tradeoff is bandwidth: unified memory generally feeds the GPU more slowly than dedicated GDDR6X or HBM, so the platform is capacity-rich but decode is not the fastest. This is the path if you also want a low noise floor, low power, or a laptop form factor, or if you intend to step up to even larger models later. Check Mac Studio configurations on Amazon →
Path D — cloud, when 70B is occasional
If you need a 70B only now and then, renting an 80GB A100/H100 by the hour can beat buying a 48GB rig outright — especially before you have validated that local is worth it. Run the honest break-even before committing hardware: see the Local vs Cloud cluster. (Cloud-GPU referral programs are not yet wired into LocalRig, so this is guidance, not a monetized link.)
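A break-even estimate takes one division. The sketch below is a minimal version under loud assumptions: the hourly cloud rate, power draw, and electricity price in the example are placeholders rather than quotes from any provider, and the function ignores resale value, setup time, and the rest of the build's cost.

```python
# Break-even hours between buying a local rig and renting a cloud GPU.
# All rates in the example are PLACEHOLDERS: substitute real quotes.

def break_even_hours(hardware_cost_usd: float,
                     cloud_rate_usd_per_hour: float,
                     local_power_kw: float = 0.0,
                     electricity_usd_per_kwh: float = 0.0) -> float:
    """Hours of use at which owning costs the same as renting."""
    local_hourly = local_power_kw * electricity_usd_per_kwh
    saving_per_hour = cloud_rate_usd_per_hour - local_hourly
    if saving_per_hour <= 0:
        return float("inf")  # renting never costs more per hour
    return hardware_cost_usd / saving_per_hour


if __name__ == "__main__":
    hours = break_even_hours(
        hardware_cost_usd=1300.0,       # midpoint of the guide's GPU-only range
        cloud_rate_usd_per_hour=2.0,    # hypothetical hourly rate
        local_power_kw=0.7,             # two ~350 W cards at full load
        electricity_usd_per_kwh=0.15,   # hypothetical tariff
    )
    print(f"Break-even after about {hours:.0f} hours of use")
```

If your expected usage over the hardware's useful life falls well short of that figure, renting is the cheaper path.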
Decision matrix
| Your situation | Path | Why |
|---|---|---|
| Best throughput per dollar, can manage a build | 2× used RTX 3090 (eBay) | 48 GB for ~$1,000–$1,600; the community standard |
| Want one card, lower power, less fuss | Used A6000 48GB (eBay) | Single-device 48 GB; simpler than multi-GPU |
| Quiet, efficient, or laptop; may go bigger later | 128GB+ Apple Silicon | Fits 70B with room; lowest power and noise |
| 70B only occasionally | Cloud A100/H100 (hourly) | Skip the capital outlay until local pays off |
Who This Is NOT For
- Anyone who has not done the fit math first. If you are unsure whether you even need 70B, start at the 7B/8B guide — most personal workloads are well served by an 8B or a 32B at a fraction of the cost and complexity.
- Single-GPU buyers hoping a 4090 will do it. A 70B does not fit in 24 GB at usable quality. If a single card is your hard limit, run a smaller model well rather than a 70B badly.
- People who need maximum tokens-per-second. 70B decode is bandwidth-bound and slower than smaller models on every consumer path here. If latency is the priority, a 32B-class model on a 24GB card will feel far snappier.
- Trainers and fine-tuners. This is an inference guide. Fine-tuning a 70B has very different (and much larger) memory requirements.
- Buyers who want exact throughput guarantees today. Until LocalRig publishes a first-party 70B benchmark, treat throughput as a range to verify on your own hardware, not a promise.
Sources
- LocalRig knowledge note, “GPU Memory Math for LLMs (2026)” — the VRAM ≈ params × bits/8 sizing model and the 70B size table.
- r/LocalLLaMA 70B benchmark and build threads (2024–2025) — community throughput ranges, explicitly not independently verified by LocalRig; a first-party 70B run is queued.
- NVIDIA RTX 3090 (24GB) and A6000 (48GB) product specifications, nvidia.com.
- Apple M2 Ultra (192GB) and M3/M4 Max (128GB) unified memory specifications, apple.com.