Hardware to Run a 70B Model Locally: VRAM, the 48GB Wall, and Your Real Options

A 70B model is the first size class where the hardware decision stops being “which GPU” and becomes “which architecture.” A 7B model fits almost anywhere; a 70B model does not fit in a single consumer GPU at usable quality, so you are choosing between multiple GPUs, a large unified-memory Mac, or renting. This guide gives you the VRAM math, the practical floor, and the four real paths — with honest notes on where the data is solid and where it is not.

Parent guide: The Local-AI Hardware Buying Framework — read the constraint logic first. For the smaller class, see Hardware to Run a 7B/8B Model Locally.

The VRAM math for 70B

The sizing rule is simple and it is the whole game at this size: VRAM (GB) ≈ parameters (billions) × (bits per weight ÷ 8), then add overhead. For a 70B model:

Precision	Weights only	Practical total (weights + KV cache + overhead)
FP16 (full)	~140 GB	~150–170 GB
FP8 / INT8	~70 GB	~80–90 GB
Q4_K_M (4-bit)	~35–40 GB	~44–52 GB
Q3 / Q2 (aggressive)	~26–33 GB	~34–44 GB (quality cost rises)

Two things follow immediately. First, 24GB is not enough for a 70B model at usable quality — even Q4_K_M weights alone (~35–40 GB) exceed a single 3090 or 4090. Second, 48GB is the practical floor at Q4: it holds the weights plus a working KV cache and runtime overhead with a little room. Weights are only part of the bill — the KV cache grows with context length and will eat the headroom at long contexts, so budget the 10–30% overhead band, more if you run 32K+ context. The full reasoning is in What Is Quantization and the GPU memory math section of the framework.

Quantization at 70B: Q4_K_M is the sweet spot. Larger models degrade more gracefully under quantization than small ones, so a 70B at Q4 generally beats a 34B at Q8 for the same memory budget. Go below Q4 (Q3/Q2) only when you must squeeze into a tighter pool and you have tested that the quality still serves your use case.

A note on throughput data

A 70B model decodes far more slowly than a 7B one because decode is memory-bandwidth-bound and you are now moving ~40 GB of weights per token instead of ~4 GB. Community reports put 70B Q4 generation in the low-double-digit tokens-per-second range on both dual-3090 rigs and large Apple Silicon — but those numbers vary widely with quant, context, runtime, and interconnect, and LocalRig has not yet measured 70B first-party. Treat any specific figure below as a community range, not a LocalRig benchmark; a first-party 70B run is queued, and verified numbers will land on the benchmarks page. The fit math above is the part you can plan around with confidence; throughput is the part to verify for your exact setup.

The four paths

Path A — 2× used RTX 3090 (48GB total)

The community’s standard 70B rig. Two used 3090s give you 48 GB of fast GDDR6X for roughly $1,000–$1,600 on the used market, and llama.cpp, ExLlamaV3, and vLLM all split a 70B across both cards (tensor or pipeline parallel). The catch is interconnect: without NVLink, the cards talk over PCIe, so you do not get 2× the speed of one card — multi-GPU here buys capacity (the model fits), not linear throughput. Plan PSU and cooling for two ~300–350W cards.

Browse used RTX 3090 on eBay → · for the single-card reasoning behind this pick, see the best GPU for local LLM guide.

Path B — single used RTX A6000 (48GB)

One card, 48 GB, no multi-GPU split to manage, and far lower power than two 3090s. The A6000 (Ampere, 48GB) is the simplest way to hold a 70B Q4 in a single device, and the used/refurbished enterprise market is where it is affordable. You give up some raw bandwidth versus a 4090-class card, but you gain capacity and simplicity. Browse used A6000 on eBay →

Path C — Apple Silicon with 128GB+ unified memory

A Mac Studio or MacBook Pro with 128 GB (M3/M4 Max) or 192 GB (M2 Ultra) holds a 70B Q4 comfortably — 40 GB out of a much larger shared pool — while drawing a fraction of the power of a dual-GPU rig and staying quiet. The tradeoff is bandwidth: unified memory feeds the GPU more slowly than dedicated GDDR6X or HBM, so decode is capacity-rich but not the fastest. This is the path if you also want a low noise floor, low power, or a laptop form factor, or if you intend to step up to even larger models later. Check Mac Studio configurations on Amazon →

Path D — cloud, when 70B is occasional

If you need a 70B only now and then, renting an 80GB A100/H100 by the hour can beat buying a 48GB rig outright — especially before you have validated that local is worth it. Run the honest break-even before committing hardware: see the Local vs Cloud cluster. (Cloud-GPU referral programs are not yet wired into LocalRig, so this is guidance, not a monetized link.)

Decision matrix

Your situation	Path	Why
Best throughput per dollar, can manage a build	2× used RTX 3090 (eBay)	48 GB for ~$1,000–$1,600; the community standard
Want one card, lower power, less fuss	Used A6000 48GB (eBay)	Single-device 48 GB; simpler than multi-GPU
Quiet, efficient, or laptop; may go bigger later	128GB+ Apple Silicon	Fits 70B with room; lowest power and noise
70B only occasionally	Cloud A100/H100 (hourly)	Skip the capital outlay until local pays off

Who This Is NOT For

Anyone who has not done the fit math first. If you are unsure whether you even need 70B, start at the 7B/8B guide — most personal workloads are well served by an 8B or a 32B at a fraction of the cost and complexity.
Single-GPU buyers hoping a 4090 will do it. A 70B does not fit in 24 GB at usable quality. If a single card is your hard limit, run a smaller model well rather than a 70B badly.
People who need maximum tokens-per-second. 70B decode is bandwidth-bound and slower than smaller models on every consumer path here. If latency is the priority, a 32B-class model on a 24GB card will feel far snappier.
Trainers and fine-tuners. This is an inference guide. Fine-tuning a 70B has very different (and much larger) memory requirements.
Buyers who want exact throughput guarantees today. Until LocalRig publishes a first-party 70B benchmark, treat throughput as a range to verify on your own hardware, not a promise.

Sources

LocalRig knowledge note, “GPU Memory Math for LLMs (2026)” — the VRAM ≈ params × bits/8 sizing model and the 70B size table.
r/LocalLLaMA 70B benchmark and build threads (2024–2025) — community throughput ranges, explicitly not independently verified by LocalRig; a first-party 70B run is queued.
NVIDIA RTX 3090 (24GB) and A6000 (48GB) product specifications, nvidia.com.
Apple M2 Ultra (192GB) and M3/M4 Max (128GB) unified memory specifications, apple.com.