The Local-AI Hardware Buying Framework
Before you spend $500 on a GPU that cannot run the model you want, or $3,000 on a Mac that handles it effortlessly but nothing else — read this.
The hardware buying conversation for local AI always reduces to the same four constraints: memory capacity, memory bandwidth, power budget, and how much quality loss from quantization you can tolerate. Get those four right and you eliminate the high-regret purchases. Get them wrong and you either buy something that can’t run your workload, or you overbuy by 2x and feel it every time you look at the credit card statement.
This guide is the constraint-logic layer. It is not a product review, and it is not a spec-sheet comparison. It is the framework you apply before you look at products — so that when you arrive at the product page, you know exactly what to demand.
The Decision Actually Has Four Variables
Every “what hardware should I run?” question is hiding four sub-questions. Answer them in order; skipping ahead is why people buy wrong.
Variable 1: What model size do you actually need?
Model size is measured in parameters and published by the model authors. Common sizes:
| Model Class | Typical Parameters | Example Models |
|---|---|---|
| Small | 1B – 4B | Phi-3-mini, Qwen2.5-3B, Gemma-3-1B |
| 7B / 8B | 7B – 8B | Llama 3.1 8B, Mistral 7B, Qwen2.5-7B |
| 14B / 13B | 13B – 14B | Llama 2 13B, Qwen2.5-14B, DeepSeek-R1-Distill-Qwen-14B |
| 32B | 30B – 32B | Qwen2.5-32B, DeepSeek-R1-Distill-Qwen-32B |
| 70B / 72B | 70B – 72B | Llama 3.1 70B, Qwen2.5-72B, DeepSeek-R1-Distill-Llama-70B |
| 100B+ | 100B – 671B | DeepSeek-R1-671B (MoE), Llama-3.1-405B |
Bigger is not always better. In community evaluations, a well-tuned 7B model can match or beat a poorly configured 70B on instruction-following tasks, and it runs 5-10x faster at a fraction of the hardware cost. The right question is: “what is the smallest model that handles my actual use case adequately?” Start there.
Variable 2: How much memory do you need?
Model size in parameters maps predictably to VRAM or RAM requirements via a formula. The rule:
Required memory ≈ (Parameter count × bytes per weight) + KV cache overhead
Bytes per weight depend on quantization:
| Quantization | Bytes per Parameter (nominal) | Quality vs F16 (approximate) |
|---|---|---|
| F16 (full float16) | 2.0 bytes | Reference; no loss |
| Q8_0 | 1.0 byte | ~99% quality retention |
| Q4_K_M | 0.5 bytes | ~96-98% — the practical sweet spot |
| Q3_K_M | 0.375 bytes | ~90-94% — noticeable degradation on complex tasks |
| Q2_K | 0.25 bytes | Significant degradation; emergency fallback only |
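As a minimal sketch of the weight term in the formula above, the snippet below multiplies a parameter count by the nominal bytes-per-parameter values from the table. The function and dictionary names are illustrative, and the result covers weights only: it leaves out the KV cache and runtime overhead.

```python
# Nominal bytes per parameter, copied from the quantization table above.
BYTES_PER_PARAM = {
    "F16": 2.0,
    "Q8_0": 1.0,
    "Q4_K_M": 0.5,
    "Q3_K_M": 0.375,
    "Q2_K": 0.25,
}


def weight_memory_gb(params_billions: float, quant: str) -> float:
    """Nominal weight footprint in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    KV cache and runtime overhead are NOT included.
    """
    return params_billions * BYTES_PER_PARAM[quant]


if __name__ == "__main__":
    for quant in BYTES_PER_PARAM:
        print(f"8B at {quant}: ~{weight_memory_gb(8, quant):.1f} GB of weights")
```

Treat the output as a floor rather than a final answer. The sizing table that follows lists somewhat larger figures for the same models, so prefer those when the two disagree.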
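For practical sizing, use measured file sizes (from community GGUF size references). They run above the nominal bytes-per-parameter figures because real quantized files also store per-block scales and keep some tensors at higher precision: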
| Model | Q4_K_M VRAM | Q8_0 VRAM | F16 VRAM |
|---|---|---|---|
| 7B / 8B | ~4.5 GB | ~8 GB | ~14 GB |
| 13B / 14B | ~8 GB | ~14 GB | ~26 GB |
| 32B | ~18 GB | ~32 GB | ~64 GB |
| 70B / 72B | ~42 GB | ~72 GB | ~140 GB |
| 405B | ~240 GB | ~405 GB | ~810 GB |
Add roughly 1-2 GB overhead for the runtime (llama.cpp, MLX, vLLM, etc.) and the KV cache at your working context length. A 7B model at Q4_K_M in a 4,096-token context fits in 5-6 GB total. At a 32K context, the KV cache grows substantially — budget 8-10 GB for the same 7B model.
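To see why context length moves the budget so much, here is a rough sketch of the usual KV cache estimate: one key vector and one value vector per layer, per KV head, per token. It assumes an FP16 cache with no cache quantization, and it uses the Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) as an example input. Those values come from that model's published config, not from this article, so check the config of the model you actually plan to run.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_element: float = 2.0) -> float:
    """Estimated KV cache size in GiB for a single sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_element=2.0 assumes an FP16 cache (an assumption; some
    runtimes can quantize the cache to shrink this).
    """
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return per_token_bytes * context_tokens / 1024**3


# Example shape (assumption: Llama 3.1 8B as published in its config).
LLAMA_3_1_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for ctx in (4096, 32768):
        size = kv_cache_gib(context_tokens=ctx, **LLAMA_3_1_8B)
        print(f"{ctx:>6} tokens: {size:.2f} GiB of KV cache")
```

Under these assumptions the cache grows linearly with context, which is broadly in line with the jump from a 5-6 GB budget at 4,096 tokens to 8-10 GB at 32K described above.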
The hard constraint: the entire model must fit in fast memory. If it spills to slower storage — either system RAM when using a GPU without enough VRAM, or NVMe via memory-mapped files — throughput collapses. You will go from 80+ tokens/sec to 3-8 tokens/sec. That is not a slow model; it is a broken workflow.
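The fit check can be sketched as a small lookup: given a fast-memory budget, pick the highest-quality quantization from the sizing table that still fits. The function name and the 2 GB default overhead are illustrative assumptions (the upper end of the 1-2 GB figure above, suitable for short contexts only). Raise the overhead for long contexts.

```python
# Approximate footprints in GB, copied from the sizing table above.
MODEL_GB = {
    "8B": {"F16": 14, "Q8_0": 8, "Q4_K_M": 4.5},
    "14B": {"F16": 26, "Q8_0": 14, "Q4_K_M": 8},
    "32B": {"F16": 64, "Q8_0": 32, "Q4_K_M": 18},
    "70B": {"F16": 140, "Q8_0": 72, "Q4_K_M": 42},
}

# Highest quality first, so the first match is the best one that fits.
QUALITY_ORDER = ["F16", "Q8_0", "Q4_K_M"]


def best_quant_that_fits(model: str, fast_memory_gb: float,
                         overhead_gb: float = 2.0):
    """Return the highest-quality quant that fits, or None if none does.

    overhead_gb is a planning allowance for runtime plus KV cache
    (an assumption, not a measured value).
    """
    for quant in QUALITY_ORDER:
        if MODEL_GB[model][quant] + overhead_gb <= fast_memory_gb:
            return quant
    return None


if __name__ == "__main__":
    for model in MODEL_GB:
        print(f"{model} on a 24 GB card: {best_quant_that_fits(model, 24)}")
```

A result of `None` means the model spills out of fast memory at every listed quantization, which is the broken-workflow case described above.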
Variable 3: How fast do you need it to go?
Token throughput (tokens per second, tok/s) is the user-experience metric. Context:
- 2-5 tok/s: Readable but frustrating for interactive chat. Acceptable for batch processing if you’re not watching it.
- 10-20 tok/s: Comfortable for interactive chat. You can read slightly ahead of the output.
- 30+ tok/s: Faster than most people read; effectively instant for short responses.
- 80+ tok/s: Throughput you only notice with long document generation or multi-user serving.
Community benchmarks (aggregated from r/LocalLLaMA and llama.cpp GitHub, with source citations in the Sources section) show:
| Hardware | 7B Q4_K_M tok/s (approx.) | Notes |
|---|---|---|
| Apple M3 Max 96GB (GPU/Metal) | 50–65 tok/s | Via Ollama; community-documented range |
| Apple M2 Ultra 192GB (GPU/Metal) | 70–80 tok/s | Simon Willison benchmarks, 2024-2025 |
| RTX 3090 24GB (GPU) | 80–110 tok/s | llama.cpp community benchmarks |
| RTX 4090 24GB (GPU) | 120–160 tok/s | llama.cpp community benchmarks |
| RTX 3060 12GB (GPU) | 40–60 tok/s | llama.cpp community benchmarks |
These ranges come from the cited community sources and were measured on real hardware. The exact figure for any specific system depends on the runtime version, context length, system RAM, and thermal conditions. Treat them as planning ranges, not guarantees.
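One way to sanity-check throughput ranges like these is the common memory-bandwidth rule of thumb: generating one token reads roughly every weight once, so single-stream decode speed cannot exceed bandwidth divided by model size. The sketch below applies it to the 7B Q4_K_M footprint from the sizing table. The bandwidth figures are vendor-published peak numbers and are an assumption here (they do not come from this article), and the rule ignores compute limits, KV cache reads, and runtime overhead.

```python
# Peak memory bandwidth in GB/s from vendor spec sheets.
# Assumption: these figures are not taken from this article.
PEAK_BANDWIDTH_GB_S = {
    "RTX 3060 12GB": 360,
    "RTX 3090": 936,
    "RTX 4090": 1008,
    "M2 Ultra": 800,
}


def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Rough upper bound on single-stream decode speed in tokens/sec.

    Assumes each generated token requires reading all weights once.
    """
    return bandwidth_gb_s / model_gb


MODEL_GB = 4.5  # 7B at Q4_K_M, from the sizing table

if __name__ == "__main__":
    for name, bandwidth in PEAK_BANDWIDTH_GB_S.items():
        ceiling = decode_ceiling_tok_s(bandwidth, MODEL_GB)
        print(f"{name}: ceiling ~{ceiling:.0f} tok/s")
```

The community ranges in the table sit below these ceilings, as expected for real systems. A measured number above the ceiling for its hardware is a reason to double-check the source.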
Variable 4: What is your power and thermal budget?
Power matters in two ways: your electricity cost over the hardware lifetime, and whether your home electrical panel can support the load.
NVIDIA GPUs draw substantial power under inference load:
- RTX 3090: 350W TDP (per NVIDIA specs)
- RTX 4090: ~450W TDP (per NVIDIA specs)
- A multi-GPU setup (2x3090) can exceed 700W at peak
Apple Silicon is radically more efficient:
- M3 Max: ~30-40W under inference load (per Apple documentation)
- M2 Ultra: ~60-80W under inference load (per Apple documentation)
For continuous inference workloads — running a llama.cpp or vLLM server that handles requests all day — the Apple Silicon efficiency advantage is meaningful at both the electricity bill and the “will this trip my breaker” level. A 3090 running 24/7 uses approximately 8-10x more electricity than an equivalent M-series Mac.
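The breaker question can be sketched as simple arithmetic. The example below assumes a 120 V North American circuit and the common planning rule of keeping continuous loads at or below 80% of the breaker rating. Both are assumptions, as are the 150 W rest-of-system figure and the 600 W of other loads, so substitute your own values. It also ignores power-supply losses and transient spikes.

```python
def circuit_headroom_watts(breaker_amps: float, volts: float = 120.0,
                           existing_load_watts: float = 0.0,
                           continuous_derate: float = 0.8) -> float:
    """Watts still available on a circuit for a continuous load.

    continuous_derate=0.8 reflects the 80% planning rule (an assumption;
    check your local electrical code).
    """
    usable = breaker_amps * volts * continuous_derate
    return usable - existing_load_watts


def fits_on_circuit(system_watts: float, breaker_amps: float,
                    volts: float = 120.0,
                    existing_load_watts: float = 0.0) -> bool:
    """True if the whole system fits within the circuit's headroom."""
    headroom = circuit_headroom_watts(breaker_amps, volts, existing_load_watts)
    return system_watts <= headroom


REST_OF_SYSTEM_W = 150  # hypothetical: CPU, drives, fans

if __name__ == "__main__":
    gpu_options = [("RTX 3090", 350), ("RTX 4090", 450), ("2x RTX 3090", 700)]
    for label, gpu_watts in gpu_options:
        total = gpu_watts + REST_OF_SYSTEM_W
        ok = fits_on_circuit(total, breaker_amps=15, existing_load_watts=600)
        print(f"{label}: {total} W total, fits on shared 15 A circuit: {ok}")
```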
The Three Platform Paths
Once you have answered the four variables, three platform paths emerge. Each is legitimate; the right one depends on your constraint weighting.
Path 1: Apple Silicon (Unified Memory)
Who it is for: People running models up to 70B on a single machine, who want quiet operation, low power draw, and high memory capacity per dollar at the high end, and who are primarily doing inference (generation), not training.
Apple Silicon’s unified memory architecture eliminates the VRAM ceiling that limits discrete GPUs. An M2 Ultra with 192GB of unified memory can run a 70B model at Q4_K_M (~42 GB) with 150GB of headroom to spare. According to community benchmarks documented by Simon Willison and aggregated on r/LocalLLaMA, an M2 Ultra achieves approximately 70-80 tok/s on Llama 3.1 8B (Q4_K_M) via Ollama — competitive with a used RTX 3090 for a 7B model, and far ahead of NVIDIA options when model size exceeds 24GB.
For people running 70B models, Apple Silicon is currently one of the only consumer options where the full model fits in fast memory. The RTX alternative for 70B requires multiple GPUs, adds PCIe bandwidth overhead, draws several hundred watts, and costs more in total.
Apple Silicon planning table (based on community benchmarks and Apple spec sheets):
| Mac Configuration | Max Unified Memory | 7B Q4 tok/s (community est.) | Can Run 70B? |
|---|---|---|---|
| MacBook Air M3 (16GB) | 16GB | ~30–40 tok/s | No (model needs ~42GB) |
| Mac Mini M4 (24GB) | 24GB | ~35–50 tok/s | No |
| Mac Mini M4 Pro (48GB) | 48GB | ~45–60 tok/s | No (~42GB of weights leaves too little for KV cache and macOS) |
| MacBook Pro M3 Max (128GB) | 128GB | ~50–65 tok/s | Yes |
| Mac Studio M2 Ultra (192GB) | 192GB | ~70–80 tok/s | Yes (comfortable) |
See our detailed guide: Hardware to Run a 7B/8B Model Locally
Path 2: Discrete GPU (NVIDIA / AMD)
Who it is for: People who want maximum tok/s per dollar at 7B-32B model sizes, have an existing PC or server platform to add a GPU to, and can accept higher power draw and the 24GB VRAM ceiling of consumer cards.
For 7B and 13B models, a discrete GPU — particularly the used RTX 3090 at 24GB — delivers the best tokens-per-second per dollar of any platform. Community benchmarks consistently show the RTX 3090 achieving 80-110 tok/s on 7B Q4_K_M, and it can accommodate 13B models at Q8 quality or 14B models at Q4 with room to spare.
The hard wall is 24GB. A 32B model at Q4_K_M (~18 GB) still fits, but 32B at Q8_0 and anything in the 70B class does not. Past that wall, the options are:
- Used professional cards (A6000 at 48GB, used enterprise A100s) — higher VRAM, but power-hungry and expensive
- Multi-GPU setups — complex, power-intensive, and inference throughput scales poorly without NVLink
- Apple Silicon or a different platform
The used GPU path: The homelab community’s default is to buy used. A used RTX 3090 routinely sells for $500-$800 on eBay (prices as of mid-2026; check current listings), well below its $1,499 launch MSRP. The 24GB of GDDR6X is the value proposition that keeps this card relevant years after launch.
Path 3: CPU-Only / System RAM
Who it is for: People who want to experiment with models up to 7B on hardware they already own, without spending anything additional. Not a primary inference platform for regular use.
Modern runtimes (llama.cpp in particular) can run models from system RAM using CPU compute. The throughput is slow — typically 3-15 tok/s depending on CPU — but the barrier to entry is zero. If you have 16GB of system RAM, you can run a Q4_K_M 7B model right now with llama.cpp.
The limiting factor is that this is not interactive-grade throughput for serious workloads. It is a “start here, validate your use case” path before committing to a hardware purchase.
Who This Framework Is NOT For
This guide assumes you are buying hardware for local inference — generating text with models you pull locally. It is not the right framework for:
- Training or fine-tuning large models: Training requires GPU compute, high-bandwidth interconnects, and different memory math than inference. The consumer GPU guidance above is wrong for that use case.
- Multi-user API serving at scale: If you are running a production API that handles many simultaneous users, you need to think about batching efficiency, throughput vs. latency trade-offs, and hardware that fits a server chassis. The consumer hardware discussed here runs models for a single user.
- Enterprise or data center deployment: Hardware above $10,000 (H100, A100 clusters) is outside this site’s scope. Correct information exists at ServeTheHome and vendor whitepapers; we deliberately do not cover it.
- Image or video generation: Image generation (Stable Diffusion, Flux, etc.) uses VRAM differently — it does not require fitting the full model into VRAM in the same way, and VRAM bandwidth matters more than capacity at the top end. The framework partially applies, but that specific workload has its own constraints we cover separately.
If your use case is not on that list, this framework applies.
The Decision Sequence (in Order)
Apply these in sequence. The first hard constraint you hit blocks everything after it.
- What model size handles your use case? Err smaller; you can always upgrade. A 7B model fits on $400 of used hardware. A 70B requires a fundamentally different (and more expensive) approach.
- Does the model fit in fast memory at your preferred quantization? Calculate using the table above. Add context overhead. If it doesn’t fit, you need either more VRAM/RAM, or a lower quantization, or a smaller model.
- Is the throughput fast enough for your workflow? If you’re batch-processing documents overnight, 5 tok/s is fine. If you’re having an interactive conversation and need responses in seconds, you need 20+ tok/s minimum.
- Can your power infrastructure support it? This is rarely the blocker, but do the math before adding a 450W GPU to a 15-amp circuit that already has other loads.
- What is the total cost of ownership? Factor power draw × hours of use × years of ownership into the price comparison. An RTX 3090 at 300W running 8 hours/day at $0.15/kWh costs ~$130/year in electricity. An M3 Max at 35W running the same workload costs ~$15/year.
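The electricity arithmetic in the last step can be reproduced with a short sketch. The wattages, hours, and rate are the ones quoted above. The purchase prices in the usage example are hypothetical placeholders, so replace them with current listings before comparing.

```python
def annual_electricity_cost(watts: float, hours_per_day: float,
                            usd_per_kwh: float) -> float:
    """Yearly electricity cost in USD for a constant load."""
    kwh_per_year = watts / 1000 * hours_per_day * 365
    return kwh_per_year * usd_per_kwh


def total_cost_of_ownership(purchase_usd: float, watts: float,
                            hours_per_day: float, usd_per_kwh: float,
                            years: float) -> float:
    """Purchase price plus electricity over the ownership period."""
    yearly = annual_electricity_cost(watts, hours_per_day, usd_per_kwh)
    return purchase_usd + years * yearly


if __name__ == "__main__":
    # Wattage, hours, and rate from the text; prices are hypothetical.
    options = [("RTX 3090", 650, 300), ("M3 Max", 3000, 35)]
    for name, price_usd, watts in options:
        yearly = annual_electricity_cost(watts, 8, 0.15)
        tco = total_cost_of_ownership(price_usd, watts, 8, 0.15, years=3)
        print(f"{name}: ${yearly:.2f}/year in electricity, "
              f"${tco:.2f} over 3 years")
```

At these usage levels the electricity gap is real but small next to the purchase-price gap. It matters more for 24/7 serving, where the hours-per-day term triples.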
Quick-Reference: Which Path Wins by Constraint
| Your Primary Constraint | Recommended Path | Reasoning |
|---|---|---|
| Minimum spend to run 7B | Used RTX 3090 or 3060 12GB | Highest tok/s-per-dollar; existing PC platform |
| Run 70B on a single machine | Apple Silicon Ultra (e.g. M2 Ultra, 192GB) | Only consumer option with enough fast memory |
| Lowest power draw | Apple Silicon | 5-10x more efficient than discrete GPU |
| Maximum 7B throughput (budget: $500) | Used RTX 3090 | ~80-110 tok/s vs ~55 tok/s for a similarly priced Mac |
| Quiet, low-thermal operation | Apple Silicon | Fanless or near-silent under typical inference loads |
| Already have a gaming PC | Add a GPU | Cheapest incremental path; no new platform required |
| Multi-GPU, 30B-70B with large context | Used A6000 or 2x3090 (NVLink) | Complex setup; see the GPU Buying Guides cluster for details |
The Benchmark Moat: Why Numbers on This Site Carry Weight
The core problem with most local-AI hardware content is that the numbers are fabricated or sourced from marketing materials. Every throughput figure on LocalRig is either:
- First-party measured: Run on hardware we own, with the runtime version, model, quant, context, and batch configuration recorded and published. Look for the “Data” date badge on benchmark articles.
- Community-cited: Aggregated from named sources (researchers, community threads, public experiments) with direct URLs. The source and date are listed in the Sources section of every article.
No number is fabricated. When a number is uncertain, we say so. When community data is spread across a range, we report the range and explain why it varies.
The Hardware to Run a 7B/8B Model Locally article is the first proof of this methodology applied to the most common buyer use case. It covers the RTX 3090, Apple M3 Max, and several budget options with real numbers and direct source citations.
What Comes Next in This Cluster
The “What Can I Run?” cluster applies this framework to specific model sizes and hardware configurations:
- Hardware to Run a 7B/8B Model Locally — The most common use case, fully worked out with benchmark data across GPU and Apple Silicon options.
- Hardware to Run a 32B Model Locally — The middle ground: bigger models, the 24GB wall, Apple Silicon at 48-64GB, and the used A6000 path. (Coming soon)
- Hardware to Run a 70B Model Locally — Where Apple Silicon dominates and multi-GPU becomes necessary for the GPU path. (Coming soon)
For GPU-specific guidance, see the GPU Buying Guides cluster.
For Apple Silicon inference in depth, see the Apple Silicon Inference cluster.
Sources
Community benchmark data, runtime documentation, and hardware specifications cited above are detailed in the frontmatter sources list. All throughput figures are either community-documented ranges from named sources or first-party measured data; none are fabricated or sourced from marketing materials. Prices are estimates as of the article’s dataDate and will change; verify current pricing before purchasing.