What Can I Run?

Hardware to Run a 70B Model Locally: VRAM, the 48GB Wall, and Your Real Options

Top Pick 2× used RTX 3090 (48GB) or a 128GB+ Apple Silicon Mac $1,000–$1,600 (two used 3090s) · $3,999+ (128GB Mac)
Check Price

A 70B model is the first size class where the hardware decision stops being “which GPU” and becomes “which architecture.” A 7B model fits almost anywhere; a 70B model does not fit in a single consumer GPU at usable quality, so you are choosing between multiple GPUs, a large unified-memory Mac, or renting. This guide gives you the VRAM math, the practical floor, and the four real paths — with honest notes on where the data is solid and where it is not.

Parent guide: The Local-AI Hardware Buying Framework — read the constraint logic first. For the smaller class, see Hardware to Run a 7B/8B Model Locally.

The VRAM math for 70B

The sizing rule is simple and it is the whole game at this size: VRAM (GB) ≈ parameters (billions) × (bits per weight ÷ 8), then add overhead. For a 70B model:

PrecisionWeights onlyPractical total (weights + KV cache + overhead)
FP16 (full)~140 GB~150–170 GB
FP8 / INT8~70 GB~80–90 GB
Q4_K_M (4-bit)~35–40 GB~44–52 GB
Q3 / Q2 (aggressive)~26–33 GB~34–44 GB (quality cost rises)

Two things follow immediately. First, 24GB is not enough for a 70B model at usable quality — even Q4_K_M weights alone (~35–40 GB) exceed a single 3090 or 4090. Second, 48GB is the practical floor at Q4: it holds the weights plus a working KV cache and runtime overhead with a little room. Weights are only part of the bill — the KV cache grows with context length and will eat the headroom at long contexts, so budget the 10–30% overhead band, more if you run 32K+ context. The full reasoning is in What Is Quantization and the GPU memory math section of the framework.

Quantization at 70B: Q4_K_M is the sweet spot. Larger models degrade more gracefully under quantization than small ones, so a 70B at Q4 generally beats a 34B at Q8 for the same memory budget. Go below Q4 (Q3/Q2) only when you must squeeze into a tighter pool and you have tested that the quality still serves your use case.

A note on throughput data

A 70B model decodes far more slowly than a 7B one because decode is memory-bandwidth-bound and you are now moving ~40 GB of weights per token instead of ~4 GB. Community reports put 70B Q4 generation in the low-double-digit tokens-per-second range on both dual-3090 rigs and large Apple Silicon — but those numbers vary widely with quant, context, runtime, and interconnect, and LocalRig has not yet measured 70B first-party. Treat any specific figure below as a community range, not a LocalRig benchmark; a first-party 70B run is queued, and verified numbers will land on the benchmarks page. The fit math above is the part you can plan around with confidence; throughput is the part to verify for your exact setup.

The four paths

Path A — 2× used RTX 3090 (48GB total)

The community’s standard 70B rig. Two used 3090s give you 48 GB of fast GDDR6X for roughly $1,000–$1,600 on the used market, and llama.cpp, ExLlamaV3, and vLLM all split a 70B across both cards (tensor or pipeline parallel). The catch is interconnect: without NVLink, the cards talk over PCIe, so you do not get 2× the speed of one card — multi-GPU here buys capacity (the model fits), not linear throughput. Plan PSU and cooling for two ~300–350W cards.

Browse used RTX 3090 on eBay → · for the single-card reasoning behind this pick, see the best GPU for local LLM guide.

Path B — single used RTX A6000 (48GB)

One card, 48 GB, no multi-GPU split to manage, and far lower power than two 3090s. The A6000 (Ampere, 48GB) is the simplest way to hold a 70B Q4 in a single device, and the used/refurbished enterprise market is where it is affordable. You give up some raw bandwidth versus a 4090-class card, but you gain capacity and simplicity. Browse used A6000 on eBay →

Path C — Apple Silicon with 128GB+ unified memory

A Mac Studio or MacBook Pro with 128 GB (M3/M4 Max) or 192 GB (M2 Ultra) holds a 70B Q4 comfortably — 40 GB out of a much larger shared pool — while drawing a fraction of the power of a dual-GPU rig and staying quiet. The tradeoff is bandwidth: unified memory feeds the GPU more slowly than dedicated GDDR6X or HBM, so decode is capacity-rich but not the fastest. This is the path if you also want a low noise floor, low power, or a laptop form factor, or if you intend to step up to even larger models later. Check Mac Studio configurations on Amazon →

Path D — cloud, when 70B is occasional

If you need a 70B only now and then, renting an 80GB A100/H100 by the hour can beat buying a 48GB rig outright — especially before you have validated that local is worth it. Run the honest break-even before committing hardware: see the Local vs Cloud cluster. (Cloud-GPU referral programs are not yet wired into LocalRig, so this is guidance, not a monetized link.)

Decision matrix

Your situationPathWhy
Best throughput per dollar, can manage a build2× used RTX 3090 (eBay)48 GB for ~$1,000–$1,600; the community standard
Want one card, lower power, less fussUsed A6000 48GB (eBay)Single-device 48 GB; simpler than multi-GPU
Quiet, efficient, or laptop; may go bigger later128GB+ Apple SiliconFits 70B with room; lowest power and noise
70B only occasionallyCloud A100/H100 (hourly)Skip the capital outlay until local pays off

Who This Is NOT For

  • Anyone who has not done the fit math first. If you are unsure whether you even need 70B, start at the 7B/8B guide — most personal workloads are well served by an 8B or a 32B at a fraction of the cost and complexity.
  • Single-GPU buyers hoping a 4090 will do it. A 70B does not fit in 24 GB at usable quality. If a single card is your hard limit, run a smaller model well rather than a 70B badly.
  • People who need maximum tokens-per-second. 70B decode is bandwidth-bound and slower than smaller models on every consumer path here. If latency is the priority, a 32B-class model on a 24GB card will feel far snappier.
  • Trainers and fine-tuners. This is an inference guide. Fine-tuning a 70B has very different (and much larger) memory requirements.
  • Buyers who want exact throughput guarantees today. Until LocalRig publishes a first-party 70B benchmark, treat throughput as a range to verify on your own hardware, not a promise.

Sources

  • LocalRig knowledge note, “GPU Memory Math for LLMs (2026)” — the VRAM ≈ params × bits/8 sizing model and the 70B size table.
  • r/LocalLLaMA 70B benchmark and build threads (2024–2025) — community throughput ranges, explicitly not independently verified by LocalRig; a first-party 70B run is queued.
  • NVIDIA RTX 3090 (24GB) and A6000 (48GB) product specifications, nvidia.com.
  • Apple M2 Ultra (192GB) and M3/M4 Max (128GB) unified memory specifications, apple.com.

Sources

  • LocalRig knowledge note: GPU Memory Math for LLMs (2026) — VRAM ≈ params × bits/8 sizing model (knowledge/sources/gpu-memory-math-for-llms-2026.md)
  • r/LocalLLaMA 70B benchmark and build threads (2024–2025) — community throughput ranges, not independently verified by LocalRig
  • NVIDIA RTX 3090 / A6000 product specifications: nvidia.com (24GB GDDR6X / 48GB GDDR6)
  • Apple M2 Ultra / M3 Max unified memory specifications: apple.com (192GB / 128GB)