Name: Hardware to Run a 7B/8B Model Locally: RTX 3090, Apple M3 Max, and Budget Options
Creator: LocalRig
Published: 2026-06-27T00:00:00.000Z
License: https://creativecommons.org/licenses/by/4.0/

The 7B and 8B parameter model class is the workhorse of local AI. Llama 3.1 8B, Mistral 7B, Gemma 3 8B, Qwen2.5-7B, and their kin represent the sweet spot where quality is genuinely useful, VRAM requirements are affordable, and the hardware decision is tractable. Which hardware to buy for this class is the most common question the community asks, and the one that most often leads to expensive mistakes.

This guide gives you the benchmark numbers, the VRAM floor, and the honest trade-offs across four hardware categories. It is grounded in community-documented data plus one first-party LocalRig measurement, and every figure is labeled by source.

Parent guide: The Local-AI Hardware Buying Framework — the constraint-logic layer before you read any product-specific page.

What a 7B/8B Model Actually Needs

VRAM Floor

A 7B or 8B model in Q4_K_M quantization occupies approximately 4.3–4.8 GB of model weights. Add the runtime overhead (~0.5–1 GB) and a 4,096-token KV cache (~0.5 GB), and the practical minimum is 6 GB of fast memory.
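The ~0.5 GB KV-cache figure can be checked with a short calculation. The sketch below assumes Llama 3.1 8B's published architecture (32 layers, 8 key/value heads, head dimension 128) and a 16-bit cache; the function name is illustrative and does not come from any runtime.

```python
# Back-of-envelope KV-cache size for a grouped-query-attention model.
# Assumed architecture: Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128).
# Assumed cache precision: 2 bytes per value (f16).


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to store keys and values for `context_tokens` tokens."""
    # The leading 2 accounts for storing both a key and a value per head.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


if __name__ == "__main__":
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                          context_tokens=4096)
    print(f"KV cache at 4,096 tokens: {size / 2**30:.2f} GiB")
```

Under these assumptions the cache grows linearly with context length, so doubling the context to 8,192 tokens doubles the cache.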

At Q8_0 (higher quality, roughly 99% of float16 output quality), the model weights expand to approximately 8 GB. Total runtime requirement: 9–10 GB.

At F16 (full quality, no quantization loss), the model needs approximately 14–16 GB. Runtime total: 15–17 GB.

The practical recommendation for most buyers: run Q4_K_M if your hardware has 8 GB or less of VRAM; run Q8_0 if you have 12–16 GB; run F16 only if you have 20+ GB and quality matters for your use case. The throughput cost of Q8 over Q4 is approximately 20–35% in the comparison table below; the quality gain is measurable on factual recall and coding tasks, but invisible for casual chat.
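As a rough illustration, the recommendation above can be written as a small lookup. This is a sketch, not a rule from any tool: the cut-offs come from this section, and treating the unstated gaps (8–12 GB and 16–20 GB) as belonging to the lower tier is an assumption made here.

```python
from typing import Optional

# Approximate total runtime footprint in GB (low, high), from the figures above.
RUNTIME_TOTAL_GB = {
    "Q4_K_M": (5.3, 6.3),  # 4.3-4.8 weights + 0.5-1 overhead + ~0.5 KV cache
    "Q8_0": (9.0, 10.0),
    "F16": (15.0, 17.0),
}


def recommend_quant(vram_gb: float, quality_critical: bool = False) -> Optional[str]:
    """Pick a quantization for a 7B/8B model given available fast memory.

    Assumption: memory sizes between the guide's stated bands fall back
    to the smaller quantization.
    """
    if vram_gb >= 20 and quality_critical:
        return "F16"
    if vram_gb >= 12:
        return "Q8_0"
    if vram_gb >= 6:
        return "Q4_K_M"
    return None  # below the 6 GB practical minimum


if __name__ == "__main__":
    for vram in (4, 8, 12, 24):
        print(vram, "GB ->", recommend_quant(vram))
```

Passing `quality_critical=True` on a 24 GB card selects F16; otherwise the sketch stays at Q8_0 to preserve throughput.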

What Throughput Feels Like

For interactive use — chat, code completion, document writing:

Less than 10 tok/s: The model is visibly lagging. You’ll finish thinking before it finishes the sentence. Acceptable for background batch jobs; frustrating for live interaction.
10–20 tok/s: Passable for chat. Feels slow compared to cloud APIs but usable.
20–50 tok/s: The comfortable range. Faster than most people read; responses feel responsive.
50+ tok/s: You stop noticing the generation delay. The bottleneck becomes your own thinking, not the hardware.

A 7B Q4_K_M model on the right hardware runs comfortably in the 30–120 tok/s range depending on the platform. The difference between the bottom and top of that range is real, but only you know how much it matters for your workflow.
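To make the tiers concrete, the sketch below maps a throughput figure to the bands above and estimates how long a reply takes to generate. The tier labels, the 300-token reply length, and the choice to place exactly 20 and 50 tok/s in the higher band are illustrative assumptions.

```python
def feel(tok_per_s: float) -> str:
    """Map generation throughput to the interactive-use tiers above."""
    if tok_per_s < 10:
        return "lagging"
    if tok_per_s < 20:
        return "passable"
    if tok_per_s < 50:
        return "comfortable"
    return "unnoticeable delay"


def seconds_for_reply(reply_tokens: int, tok_per_s: float) -> float:
    """Generation time only; ignores prompt processing and model load."""
    return reply_tokens / tok_per_s


if __name__ == "__main__":
    for rate in (5, 19.5, 40, 100):
        wait = seconds_for_reply(300, rate)
        print(f"{rate:6.1f} tok/s: {feel(rate):18s} ~{wait:.1f} s for 300 tokens")
```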

First-Party Benchmark: Base M4 / 16 GB (LocalRig-Measured)

Unlike the community-cited figures elsewhere in this guide, the numbers in this section are first-party — measured by LocalRig on owned hardware, with a published, reproducible methodology and pinned runtime versions.

Hardware: Apple M4 (base, 10-core), 16 GB unified memory — Mac mini (Mac16,10)
Model: Llama 3.1 8B Instruct, Q4_K_M — the same GGUF weights run by both engines
Date: 2026-06-27

Runtime	tok/s (generation)	How it was measured
llama.cpp b9820 (`llama-bench`)	18.4	tg128 generation throughput, Metal backend
Ollama 0.30.11 (flash attention on)	19.5	eval throughput on a real prompt, 4,096-token context, 3 passes averaged
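For readers who want to derive an eval-throughput number themselves, Ollama's generate API reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds) in its final response. The sketch below shows the arithmetic on three hypothetical responses; the sample values are invented for illustration, and this is not necessarily how LocalRig's own script computes its figure.

```python
from typing import Iterable


def eval_tok_per_s(response: dict) -> float:
    """Generation throughput from an Ollama generate response.

    eval_duration is reported in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def mean_tok_per_s(responses: Iterable[dict]) -> float:
    """Average throughput across several passes."""
    rates = [eval_tok_per_s(r) for r in responses]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Hypothetical sample values, not measured data.
    passes = [
        {"eval_count": 390, "eval_duration": 20_000_000_000},
        {"eval_count": 380, "eval_duration": 20_000_000_000},
        {"eval_count": 400, "eval_duration": 20_000_000_000},
    ]
    print(f"mean: {mean_tok_per_s(passes):.1f} tok/s")
```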

Two findings worth more than the raw number

1. There is no faster engine hiding behind Ollama, so run llama.cpp (or MLX) directly. Ollama is llama.cpp under the hood: it embeds the same Metal kernels, which is why the two land within ~1 tok/s of each other here. The small gap comes from how each number was measured (Ollama’s eval throughput on a real prompt versus llama-bench’s tg128, plus Ollama’s flash-attention default), not from a real engine advantage. Because the speed is effectively identical, the choice comes down to control and transparency. That is why we land on llama.cpp for portable/CPU work and MLX for Apple-native workflows: pinned builds you can reproduce, explicit quantization and flags, and a clean path to a real serving stack later. Ollama’s convenience hides those defaults behind an opaque wrapper, which is why it’s not our recommended runtime; see how to run LLMs locally for the full engine breakdown.

2. The base M4 is slower than the community ranges suggest. The widely-cited “M4 Mac Mini: 30–50 tok/s” reflects the higher-bandwidth M4 Pro or optimistic configs. The base M4 — 16 GB, ~120 GB/s bandwidth — delivers about 19 tok/s on an 8B Q4 model. That’s still genuinely usable for interactive chat, but verify the exact chip (base M4 vs M4 Pro) before you buy: the Pro’s memory-bandwidth advantage is real and shows up directly in throughput.
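A first-order way to see why bandwidth shows up so directly: if generating each token requires reading all the model weights once, throughput cannot exceed bandwidth divided by weight size. The sketch below applies that simplification (an assumption that ignores KV-cache reads, compute time, and runtime overhead) to the bandwidth and throughput figures quoted in this guide, using 4.8 GB for Q4_K_M weights.

```python
# Assumed simplification: token generation is purely memory-bandwidth-bound.
WEIGHTS_GB = 4.8  # upper end of the Q4_K_M weight range quoted in this guide

PLATFORMS = {
    # name: (memory bandwidth in GB/s, observed tok/s range from this guide)
    "Apple M4 (base)": (120, (18.4, 19.5)),
    "Apple M3 Max": (400, (50, 65)),
    "RTX 3090": (936, (80, 110)),
}


def decode_ceiling_tok_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tok/s if every token reads all weights exactly once."""
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    for name, (bandwidth, observed) in PLATFORMS.items():
        ceiling = decode_ceiling_tok_per_s(bandwidth, WEIGHTS_GB)
        share = observed[1] / ceiling
        print(f"{name:16s} ceiling ~{ceiling:5.0f} tok/s, "
              f"observed {observed[0]}-{observed[1]} ({share:.0%} of ceiling at best)")
```

Real results land below the ceiling on every platform, but the ordering follows bandwidth, which is consistent with the base M4 falling short of the range quoted for higher-bandwidth chips.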

Reproduce it: PYTHONPATH=src python3 scripts/run-bench.py --model llama3.1:8b-instruct-q4_K_M --quant Q4_K_M --runtime llama.cpp --runtime-version b9820 --context 4096 --batch 512 --hardware "Apple M4, 16 GB" (via llama-bench) — and the matching --runtime ollama run for the cross-check above. The full methodology and a machine-readable Dataset (Schema.org) are published alongside this article.

Benchmark Data: What These Options Actually Deliver

Apart from the base M4 row, which is LocalRig’s first-party measurement, all figures below are community-cited data. Source URLs and collection methodology are in the Sources section and the methodology frontmatter field. No number here is fabricated, and none is presented as first-party without a documented run.

Master Comparison Table

Hardware	VRAM / RAM	7B Q4_K_M tok/s	7B Q8_0 tok/s	Can Run 13B Q4?	Power (TDP)	Community Source
RTX 4090 24GB	24 GB GDDR6X	120–160	80–110	Yes (13B Q4 needs ~8 GB)	~450W	r/LocalLLaMA, llama.cpp benchmarks
RTX 3090 24GB	24 GB GDDR6X	80–110	60–80	Yes	~300–350W	r/LocalLLaMA, llama.cpp benchmarks
Apple M3 Max 128GB	128 GB unified	50–65	38–52	Yes (comfortably)	~35–45W	Community Ollama benchmarks, 2025
Apple M2 Ultra 192GB	192 GB unified	70–80	50–65	Yes	~60–80W	Simon Willison benchmarks, 2024–2025
RTX 3060 12GB	12 GB GDDR6	40–60	30–45	Tight (13B Q4 = ~8 GB)	~170W	llama.cpp community benchmarks
Apple M4 Pro Mac Mini 24GB	24 GB unified	30–50	22–38	Yes (13B Q4 needs ~8 GB; fits)	~20–30W	r/LocalLLaMA, Nov–Dec 2025
Apple M4 16GB (base, Mac mini)	16 GB unified	18.4 (llama.cpp) / 19.5 (Ollama)	n/a	No (16 GB)	~20–30W	LocalRig first-party, 2026-06-27

Reading this table: apart from the first-party base M4 row, tok/s values are community-documented ranges from cited sources. Your specific result will vary by runtime version, batch configuration, context length, and system thermal state. Treat these as planning estimates.
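The table also supports a rough efficiency comparison. The sketch below divides the midpoint of each 7B Q4_K_M throughput range by the midpoint of the listed power figure. Note the assumptions: TDP is a rated ceiling rather than measured draw during inference, and midpoints are a crude summary of a range, so the ratios are order-of-magnitude only.

```python
# Rows copied from the comparison table: (tok/s range, power range in watts).
TABLE_ROWS = {
    "RTX 4090 24GB": ((120, 160), (450, 450)),
    "RTX 3090 24GB": ((80, 110), (300, 350)),
    "Apple M3 Max 128GB": ((50, 65), (35, 45)),
    "RTX 3060 12GB": ((40, 60), (170, 170)),
    "Apple M4 Pro Mac Mini 24GB": ((30, 50), (20, 30)),
}


def midpoint(bounds: tuple) -> float:
    """Midpoint of a (low, high) range."""
    return (bounds[0] + bounds[1]) / 2


def tok_per_s_per_watt(tok_range: tuple, watt_range: tuple) -> float:
    """Midpoint throughput divided by midpoint rated power."""
    return midpoint(tok_range) / midpoint(watt_range)


if __name__ == "__main__":
    baseline = tok_per_s_per_watt(*TABLE_ROWS["RTX 3090 24GB"])
    for name, (tok, watts) in TABLE_ROWS.items():
        eff = tok_per_s_per_watt(tok, watts)
        print(f"{name:28s} {eff:5.2f} tok/s per W "
              f"({eff / baseline:4.1f}x the RTX 3090)")
```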

Option A — Apple M3 Max (96GB or 128GB): The Benchmark-Featured Choice

The Apple M3 Max with 128GB of unified memory is the top pick in this guide, and for reasons that go beyond marketing.

Why Apple Silicon Is Different for This Workload

Discrete GPUs have a hard split: fast memory (GDDR6X VRAM, on the graphics card) and slow memory (system DDR5 RAM, reached over PCIe). If a model exceeds VRAM, it spills to system RAM and throughput collapses, for example from 90 tok/s to 4 tok/s. The model is still running; you just can’t use it interactively.

Apple Silicon uses a unified memory architecture. There is no split. The CPU, GPU, and Neural Engine all draw from the same physical pool at the same bandwidth. An M3 Max with 128GB has 400 GB/s of memory bandwidth across that entire pool. An RTX 3090’s GDDR6X has 936 GB/s, more than twice the bandwidth, but on Apple Silicon there is no separate VRAM ceiling to hit: the only limit is the size of the unified pool.

For 7B models at Q4_K_M, this doesn’t matter much. The model fits in 6 GB; any platform above that threshold handles it. Where Apple Silicon’s architecture becomes the decisive factor is at 32B and 70B model sizes — but the throughput and efficiency characteristics at 7B are already competitive and worth documenting.

M3 Max Benchmark Data (Community-Cited)

Based on community Ollama benchmarks aggregated on r/LocalLLaMA (2025):

Llama 3.1 8B Q4_K_M via Ollama on M3 Max (96–128GB): approximately 50–65 tok/s at a 4,096-token context
Llama 3.1 8B Q8_0 via Ollama on M3 Max: approximately 38–52 tok/s

For context: the M2 Ultra at 192GB achieves approximately 70–80 tok/s on the same model per Simon Willison’s documented benchmarks (simonwillison.net, 2024–2025). The M3 Max trades 10-20 tok/s vs the Ultra tier to save approximately $2,000-$3,000 in hardware cost and substantial power.

M3 Max Configuration Notes

The M3 Max configurations covered here are 96GB and 128GB of unified memory. For 7B models, both are equivalent: a 7B Q4 model needs 6 GB, leaving you 90+ GB free for other applications. A 32B Q4 model (about 18 GB) also fits easily in either, so the difference between 96GB and 128GB only becomes meaningful with still larger models or if you run multiple models simultaneously.

M3 Max 128GB MacBook Pro pricing as of mid-2026: approximately $3,999–$4,499 for the 14-inch, $4,499–$5,199 for the 16-inch. These are high prices for a 7B workload: the used RTX 3090 path at $500–$800 delivers better throughput at a fraction of the cost. The M3 Max justifies itself when you also want to run 32B models, need the laptop form factor, or care deeply about the power and noise floor.

Option B — RTX 3090 (24GB): Best Throughput per Dollar for 7B

The RTX 3090 is the community’s most recommended discrete GPU for local LLM inference in 2025-2026. The reasons are specific and documented.

Why the 3090 Still Makes Sense

24GB GDDR6X VRAM: The single most important spec. 24GB lets you run a 7B model at Q8_0 quality (8 GB) with 16 GB to spare — enough to keep large context windows open. It also fits a 13B model at Q4_K_M (approximately 8 GB) with meaningful headroom.
936 GB/s memory bandwidth (per NVIDIA specifications): High bandwidth is what delivers tok/s on LLM inference. The 3090’s bandwidth is the reason it outperforms the RTX 4080 (which has less VRAM and lower bandwidth) for this specific workload.
Used market pricing: A used RTX 3090 sells for approximately $500–$800 on eBay as of mid-2026 (check current listings; the market fluctuates with NVIDIA’s new releases). At $600, you are paying roughly $25 per GB of GDDR6X at 936 GB/s bandwidth. There is no better deal in the GPU market for inference.
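The cost-per-GB figure is simple division, and it moves with the used price. The sketch below runs it across the $500–$800 range quoted above; the three sample prices are just points within that range, not observed listings.

```python
def dollars_per_gb(price_usd: float, memory_gb: float) -> float:
    """Purchase price divided by fast-memory capacity."""
    return price_usd / memory_gb


if __name__ == "__main__":
    vram_gb = 24  # RTX 3090
    for price in (500, 600, 800):  # sample points in the quoted used range
        print(f"${price}: ${dollars_per_gb(price, vram_gb):.2f} per GB of VRAM")
```

At the top of the range the cost per GB is about a third higher than at $600, so it is worth rerunning the arithmetic against current listings before buying.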

RTX 3090 Benchmark Data (Community-Cited)

From r/LocalLLaMA benchmark threads and llama.cpp community benchmarks (2024-2025):

Llama 3.1 8B Q4_K_M via llama.cpp (CUDA) on RTX 3090: approximately 80–110 tok/s at 4,096-token context
Llama 3.1 8B Q8_0 via llama.cpp on RTX 3090: approximately 60–80 tok/s

These figures reflect community results. The range reflects differences in CUDA version, system RAM configuration, PCIe lane availability, and whether GPU-only or GPU+CPU compute was used. The high end of the range (110 tok/s) is achievable with current llama.cpp builds, recent CUDA versions, and a PCIe 4.0 x16 slot.

What you will not get: multi-GPU scaling. Two RTX 3090s connected via PCIe (without NVLink) do not deliver 2x the throughput. LLM inference is memory-bandwidth-bound, not compute-bound, and PCIe bandwidth between GPUs is a bottleneck. If you need more than 24GB VRAM for a single model, the used RTX 3090 path is the wrong architecture — see the 70B model guide for that discussion.

RTX 3090 Buying Notes

The RTX 3090 is available used only — it launched in 2020, and NVIDIA has discontinued it. This means:

No manufacturer warranty on used cards
Check seller feedback, require photos of the heatsink and HDMI/DP ports
Budget for replacement thermal paste application on any card over two years old
Mining-used cards are available at lower prices but carry higher failure risk; verify the use history if possible

Option C — RTX 3060 12GB: The True Budget Floor

If the budget is under $300 and the use case is primarily 7B models, the RTX 3060 12GB is the minimum discrete GPU worth recommending.

RTX 3060 12GB Benchmark Data (Community-Cited)

From llama.cpp community benchmarks (2024–2025):

Llama 3.1 8B Q4_K_M via llama.cpp (CUDA) on RTX 3060 12GB: approximately 40–60 tok/s at 4,096 tokens

This is meaningful throughput — 40 tok/s is faster than most people read. However, the 12GB VRAM creates a ceiling:

7B Q4_K_M: fits comfortably (~5 GB used, 7 GB free)
7B Q8_0: fits (8 GB used, 4 GB free — but tight with larger contexts)
7B F16: does not fit (14 GB needed)
13B Q4_K_M: marginal (8 GB needed; 4 GB headroom for context)
13B Q8_0: does not fit (14 GB needed)
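The fit list above amounts to comparing each model's memory need against the card's VRAM and looking at the headroom left for context. A minimal sketch follows; the 6 GB threshold separating "tight" from "comfortable" is an assumption chosen to match the labels in the list, not a measured limit.

```python
def fit_status(needed_gb: float, vram_gb: float,
               comfortable_headroom_gb: float = 6.0) -> str:
    """Classify whether a model fits, based on leftover VRAM.

    comfortable_headroom_gb is an illustrative threshold, not a hard rule.
    """
    headroom = vram_gb - needed_gb
    if headroom < 0:
        return "does not fit"
    if headroom < comfortable_headroom_gb:
        return "tight"
    return "fits comfortably"


if __name__ == "__main__":
    # Memory needs in GB as listed above, checked against a 12 GB card.
    models = {
        "7B Q4_K_M": 5,
        "7B Q8_0": 8,
        "7B F16": 14,
        "13B Q4_K_M": 8,
        "13B Q8_0": 14,
    }
    for name, needed in models.items():
        print(f"{name:11s} {fit_status(needed, vram_gb=12)}")
```

Calling the same function with `vram_gb=24` shows why the 24 GB cards remove most of these constraints for models up to 13B.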

If there is any chance you will want to run 13B or 14B models in the next two years, the RTX 3060 12GB is a short-term purchase. The RTX 3090 at double the price delivers double the VRAM, significantly higher bandwidth, and roughly double the throughput (80–110 tok/s versus 40–60 tok/s). Buy the 3060 only if $300 is a hard ceiling.

Option D — Apple Mac Mini M4 (24GB): The Quiet Compact Option

The M4 Mac Mini launched in late 2024 and delivers a genuinely impressive amount of AI inference capability in a compact, efficient, sub-$800 package when configured with 24GB of unified memory.

M4 Mac Mini Benchmark Data (Community-Cited)

From r/LocalLLaMA and Ollama GitHub discussions (November–December 2025):

Llama 3.1 8B Q4_K_M via Ollama on M4 Pro Mac Mini (24GB): approximately 30–50 tok/s

LocalRig first-party measurement (base M4, 16 GB): 18.4 tok/s (llama.cpp b9820) and 19.5 tok/s (Ollama 0.30.11, which runs the same llama.cpp kernels underneath), measured 2026-06-27 — see the First-Party Benchmark section above. Run llama.cpp (or MLX) directly; the throughput is the same. The base M4 lands well below the 30–50 community range, which reflects the higher-bandwidth M4 Pro. If you’re buying a Mac Mini specifically for 8B inference, the M4 Pro’s memory bandwidth is the upgrade that matters — not the core count.

The gap tracks memory bandwidth: approximately 120 GB/s for the base M4 versus 400 GB/s for the M3 Max. The M4 Pro in the Mac Mini (available as a build-to-order option) provides higher bandwidth than the base chip, which is why it reaches the 30–50 tok/s community range on the same model.

M4 Mac Mini Trade-offs

The case for the M4 Mac Mini: it is the lowest-cost entry into Apple Silicon inference that still delivers interactive-grade throughput for 7B models. At $799 (24GB, M4) or approximately $1,299 (24GB, M4 Pro), it undercuts any RTX 3090 system on total build cost while consuming roughly 20–30W under load. It also fits in a desk drawer.

The case against: the 24GB ceiling is real. A 13B Q4 model needs approximately 8 GB; it fits, using about 33% of total memory. A 14B Q4 model needs approximately 8-9 GB and also fits. A 32B model at Q4 needs approximately 18 GB, which does not fit in 24GB once runtime overhead is added. If you anticipate running larger models, the M4 Mac Mini is a stepping stone, not a long-term platform.

Decision Matrix: Which Option for Your Situation?

Situation	Recommended Option	Why
Maximum throughput on 7B, budget $500-$800	Used RTX 3090 (eBay)	80-110 tok/s; best throughput/dollar
Will also run 32B or 70B models	Apple M3 Max 128GB or M2 Ultra	Only options that keep large models in fast memory
Quiet operation, low power	Mac Mini M4 24GB or Apple M3 Max	5-10x more efficient than discrete GPU
Already have a gaming PC	Add RTX 3090 or 3060	Cheapest incremental upgrade; no new platform
Hard budget under $300	RTX 3060 12GB (used or new)	40-60 tok/s; 12GB ceiling acknowledged
Need a laptop that also does local AI	MacBook Pro M3 Max	Only laptop with sufficient memory bandwidth + capacity
Experimenting, no hardware spend	CPU-only (llama.cpp on system RAM)	3-10 tok/s; validates the use case before buying

Who This Hardware Is NOT For

The options in this guide are for interactive inference of 7B and 8B models for personal or small-team use. This is not the right guide if:

You need to fine-tune or train models: Training has fundamentally different requirements — you need CUDA for most training frameworks, gradient checkpointing, and significantly more VRAM than inference. The Apple Silicon path is not the training path.
You need to serve multiple simultaneous users: A single RTX 3090 or M3 Max handles one user at a time well; under heavy multi-user load, you need batching-aware hardware and a different throughput calculus.
You want image generation: Stable Diffusion and Flux use VRAM differently. The 24GB RTX 3090 is still excellent for image generation, but the sizing logic is not the same. We cover that separately.
You plan to run models above 13B: This guide ends at 13B. For 32B and 70B models, the constraint analysis and hardware recommendations change substantially. See the 70B model guide (32B guide coming soon) in this cluster.
Your use case is bulk document processing that can run overnight: If throughput per hour matters more than latency per token, you may want a different GPU (or a cloud-GPU burst approach). The Local vs Cloud cluster covers this honestly.

Methodology

Except for the base M4 measurement described at the end of this section, all benchmark figures in this article are community-cited data aggregated from named public sources with dates. Specific methodology:

Apple Silicon throughput figures sourced from Simon Willison’s documented benchmark experiments (2024-2025) and r/LocalLLaMA community benchmark threads (2025).
NVIDIA GPU throughput figures sourced from llama.cpp GitHub benchmark issues and r/LocalLLaMA community benchmark threads (2024-2025).
Ranges are reported where multiple community sources report differing results; the spread reflects real variance in runtime version, system configuration, context length, and thermal state.
All throughput figures assume: Llama 3.1 8B Instruct GGUF Q4_K_M, 4,096-token context, single-user, no batching unless noted.
Figures are reported as of dataDate: 2026-06-27. Newer runtime versions (Ollama, llama.cpp) frequently improve throughput; the figures above may be conservative for users running up-to-date software.

This article now includes LocalRig’s first-party benchmark of the base Apple M4 (16 GB) — see the First-Party Benchmark section above — measured on owned hardware with pinned runtime versions (Ollama 0.30.11, llama.cpp b9820 via llama-bench), a published methodology, and a machine-readable Schema.org Dataset. The remaining figures are community-cited and labeled as such throughout. As LocalRig benchmarks more owned hardware, additional first-party rows will be added to the benchmark database under the same standard.

Sources

Full source citations are in the frontmatter sources: list. Key references:

Simon Willison’s Apple Silicon LLM benchmarks: simonwillison.net (2024–2025)
r/LocalLLaMA community benchmark threads: RTX 3090, RTX 4090, Mac Mini M4, M3 Max results (2024–2025)
llama.cpp GitHub benchmark issues and community PRs: github.com/ggerganov/llama.cpp
TheBloke GGUF model size reference for quantization memory estimates: huggingface.co/TheBloke
NVIDIA RTX 3090, 3060 product specifications: nvidia.com
Apple M3 Max, M4 specifications: apple.com