Hardware to Run a 7B/8B Model Locally: RTX 3090, Apple M3 Max, and Budget Options

Methodology

Benchmark data is sourced from named community experiments:

  • Simon Willison's Apple Silicon inference benchmarks (simonwillison.net, 2024–2025), measuring Llama 3.1 8B Instruct on an M2 Ultra 192GB via Ollama.
  • r/LocalLLaMA community benchmark threads aggregating RTX 3090 and RTX 4090 results via llama.cpp (CUDA backend) and Ollama.
  • llama.cpp GitHub benchmark issues and user-reported timings for the RTX 3060 12GB.
  • Mac Mini M4 community benchmarks via Ollama threads on r/LocalLLaMA, November–December 2025.

All figures are labeled 'community-cited' and include source URLs. No number in this article is fabricated or first-party-claimed without an actual run. Ranges are reported where sources disagree; the midpoint is the planning estimate.

  • Runtime: Ollama (for Apple Silicon and Mac Mini entries); llama.cpp (for NVIDIA GPU entries)
  • Model: Llama 3.1 8B Instruct GGUF Q4_K_M unless otherwise noted
  • Context: 4,096 tokens
  • Batch: 512

First-party addition (2026-06-27): the base Apple M4 (16 GB, Mac mini) was measured first-party by LocalRig at 19.5 tok/s via Ollama 0.30.11 and 18.4 tok/s via llama.cpp b9820 (llama-bench), on the same Q4_K_M weights. The run is reproducible via scripts/run-bench.py.

Top Pick: Apple M3 Max 128GB MacBook Pro · 128GB unified memory · 400 GB/s bandwidth · 50–65 tok/s on 7B Q4 · $3,999–$4,499

The 7B and 8B parameter model class is the workhorse of local AI. Llama 3.1 8B, Mistral 7B, Gemma 3 8B, Qwen2.5-7B, and their kin represent the sweet spot where quality is genuinely useful, VRAM requirements are affordable, and the hardware decision is tractable. Which hardware to buy for this class is the most common question the community asks, and the one that most often leads to expensive mistakes.

This guide gives you the benchmark numbers, the VRAM floor, and the honest trade-offs across four hardware categories. It is grounded in community-documented data, not fabricated first-party claims.

Parent guide: The Local-AI Hardware Buying Framework — the constraint-logic layer before you read any product-specific page.


What a 7B/8B Model Actually Needs

VRAM Floor

A 7B or 8B model in Q4_K_M quantization occupies approximately 4.3–4.8 GB of model weights. Add the runtime overhead (~0.5–1 GB) and a 4,096-token KV cache (~0.5 GB), and the practical minimum is 6 GB of fast memory.

At Q8_0 (higher quality, roughly 99% of float16 performance), the model weights expand to approximately 8 GB. Total runtime requirement: 9–10 GB.

At F16 (full quality, no quantization loss), the model needs approximately 14–16 GB. Runtime total: 15–17 GB.
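
The arithmetic behind these floors can be sketched in a few lines of Python. This is a rough estimate under stated assumptions, not a runtime's actual allocator: the weight sizes are the approximate figures from the text, the 0.75 GB overhead is the midpoint of the ~0.5–1 GB range above, and the KV-cache term assumes an fp16 cache with the published Llama 3.1 8B shape (32 layers, 8 key/value heads, head dimension 128). All function and constant names are illustrative.

```python
# Rough memory-floor estimate for Llama 3.1 8B. Real runtimes will differ
# by a few hundred megabytes; treat the result as a planning number.

BYTES_PER_GB = 1024 ** 3

N_LAYERS = 32      # transformer blocks in Llama 3.1 8B
N_KV_HEADS = 8     # grouped-query attention: 8 key/value heads
HEAD_DIM = 128     # dimension per attention head
KV_BYTES = 2       # assumption: fp16 KV cache (2 bytes per value)


def kv_cache_gb(context_tokens: int) -> float:
    """Size of the key/value cache for one sequence."""
    # Factor 2 covers the separate key and value tensors in every layer.
    bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * KV_BYTES
    return context_tokens * bytes_per_token / BYTES_PER_GB


def memory_floor_gb(weights_gb: float, context_tokens: int = 4096,
                    overhead_gb: float = 0.75) -> float:
    """Weights + runtime overhead + KV cache."""
    return weights_gb + overhead_gb + kv_cache_gb(context_tokens)


# Approximate weight sizes from the text (midpoints where a range is given).
WEIGHTS_GB = {"Q4_K_M": 4.55, "Q8_0": 8.0, "F16": 15.0}

for quant, weights in WEIGHTS_GB.items():
    print(f"{quant}: ~{memory_floor_gb(weights):.1f} GB at a 4,096-token context")
```

The KV-cache term grows linearly with context, so doubling the context to 8,192 tokens adds roughly another 0.5 GB under these assumptions.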

The practical recommendation for most buyers: run Q4_K_M if your hardware has 8 GB or less of VRAM; run Q8_0 if you have 12–16 GB; run F16 only if you have 20+ GB and quality matters for your use case. The throughput cost of Q8 over Q4 is roughly 20–35% on the ranges in the comparison table below; the quality gain is measurable on factual recall and coding tasks, but invisible for casual chat.
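
The recommendation above can be written as a small lookup. This is a minimal sketch, and it makes assumptions the text does not spell out: memory sizes that fall between the stated bands (9–11 GB, 17–19 GB) drop to the lower tier, and anything under the 6 GB practical minimum returns no suggestion. The function name and the quality_matters flag are hypothetical.

```python
from typing import Optional

MIN_VRAM_GB = 6.0  # practical minimum for a 7B/8B Q4_K_M model, per the text


def pick_quant(vram_gb: float, quality_matters: bool = False) -> Optional[str]:
    """Suggest a GGUF quantization for a 7B/8B model, or None below the floor."""
    if vram_gb < MIN_VRAM_GB:
        return None
    if vram_gb >= 20 and quality_matters:
        return "F16"      # only worthwhile when quality is the priority
    if vram_gb >= 12:
        return "Q8_0"
    return "Q4_K_M"       # assumption: 6-11 GB all land here


for vram in (8, 12, 16, 24):
    default = pick_quant(vram)
    best = pick_quant(vram, quality_matters=True)
    print(f"{vram} GB -> {default} (quality-first: {best})")
```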

What Throughput Feels Like

For interactive use — chat, code completion, document writing:

  • Less than 10 tok/s: The model is visibly lagging. You’ll finish thinking before it finishes the sentence. Acceptable for background batch jobs; frustrating for live interaction.
  • 10–20 tok/s: Passable for chat. Feels slow compared to cloud APIs but usable.
  • 20–50 tok/s: The comfortable range. Faster than most people read; responses feel responsive.
  • 50+ tok/s: You stop noticing the generation delay. The bottleneck becomes your own thinking, not the hardware.
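
To translate these bands into wall-clock time, the sketch below classifies a throughput figure and estimates how long a reply takes to generate. The 300-token reply length is an assumption chosen for illustration, and the band labels paraphrase the list above; exact boundary values (10, 20, 50) are assigned to the faster band.

```python
def feel(tok_per_s: float) -> str:
    """Map generation throughput to the interactive-use bands listed above."""
    if tok_per_s < 10:
        return "visibly lagging"
    if tok_per_s < 20:
        return "passable"
    if tok_per_s < 50:
        return "comfortable"
    return "delay no longer noticeable"


def seconds_for_reply(tok_per_s: float, reply_tokens: int = 300) -> float:
    """Generation time for one reply, ignoring prompt processing."""
    return reply_tokens / tok_per_s


for rate in (5, 19.5, 40, 95):
    print(f"{rate} tok/s: {feel(rate)}, "
          f"~{seconds_for_reply(rate):.0f} s for a 300-token reply")
```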

A 7B Q4_K_M model on the right hardware runs comfortably in the 30–120 tok/s range depending on the platform. The difference between the bottom and top of that range is real, but only you know how much it matters for your workflow.


First-Party Benchmark: Base M4 / 16 GB (LocalRig-Measured)

Unlike the community-cited figures elsewhere in this guide, the numbers in this section are first-party — measured by LocalRig on owned hardware, with a published, reproducible methodology and pinned runtime versions.

  • Hardware: Apple M4 (base, 10-core), 16 GB unified memory — Mac mini (Mac16,10)
  • Model: Llama 3.1 8B Instruct, Q4_K_M — the same GGUF weights run by both engines
  • Date: 2026-06-27

Runtime | tok/s (generation) | How it was measured
llama.cpp b9820 (llama-bench) | 18.4 | tg128 generation throughput, Metal backend
Ollama 0.30.11 (flash attention on) | 19.5 | eval throughput on a real prompt, 4,096-token context, 3 passes averaged

Two findings worth more than the raw number

1. There is no faster engine hiding behind Ollama — so run llama.cpp (or MLX) directly. Ollama is llama.cpp under the hood: it embeds the same Metal kernels, which is why the two land within ~1 tok/s of each other here (the small gap is measurement noise — Ollama’s eval throughput on a real prompt vs llama-bench’s tg128, plus Ollama’s flash-attention default — not a real engine advantage). Because the speed is identical, the choice comes down to control and transparency, and that’s where we land on llama.cpp for portable/CPU work and MLX for Apple-native workflows: pinned builds you can reproduce, explicit quantization and flags, and a clean path to a real serving stack later. Ollama’s convenience hides those defaults behind an opaque wrapper, which is why it’s not our recommended runtime — see how to run LLMs locally for the full engine breakdown.

2. The base M4 is slower than the community ranges suggest. The widely-cited “M4 Mac Mini: 30–50 tok/s” reflects the higher-bandwidth M4 Pro or optimistic configs. The base M4 — 16 GB, ~120 GB/s bandwidth — delivers about 19 tok/s on an 8B Q4 model. That’s still genuinely usable for interactive chat, but verify the exact chip (base M4 vs M4 Pro) before you buy: the Pro’s memory-bandwidth advantage is real and shows up directly in throughput.
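
A back-of-envelope way to see why bandwidth dominates: during generation, each new token requires reading roughly the full set of weights from memory once, so bandwidth divided by weight size gives a rough upper bound on tok/s. The sketch below applies that simplification to bandwidth figures quoted in this guide. It ignores KV-cache reads, compute limits, and runtime overhead, so real results land below the ceiling; the 4.55 GB weight size is an assumed midpoint of the 4.3–4.8 GB Q4_K_M range.

```python
WEIGHTS_GB = 4.55  # assumption: midpoint of the 4.3-4.8 GB Q4_K_M range

# Memory bandwidth in GB/s, as quoted in this guide and its sources list.
BANDWIDTH_GB_S = {
    "Apple M4 (base)": 120,
    "RTX 3060 12GB": 360,
    "Apple M3 Max": 400,
    "RTX 3090": 936,
}


def ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float = WEIGHTS_GB) -> float:
    """Rough upper bound: one full pass over the weights per generated token."""
    return bandwidth_gb_s / weights_gb


for name, bandwidth in BANDWIDTH_GB_S.items():
    print(f"{name}: ceiling ~{ceiling_tok_s(bandwidth):.0f} tok/s")

# Compare the first-party Ollama measurement against the base M4 ceiling.
measured_tok_s = 19.5
efficiency = measured_tok_s / ceiling_tok_s(BANDWIDTH_GB_S["Apple M4 (base)"])
print(f"Base M4 reaches ~{efficiency:.0%} of its bandwidth ceiling")
```

On this simplified model the base M4 measurement sits at roughly three quarters of its ceiling, which is consistent with bandwidth rather than core count being the limiting factor.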

Reproduce it (via llama-bench):

  PYTHONPATH=src python3 scripts/run-bench.py --model llama3.1:8b-instruct-q4_K_M --quant Q4_K_M --runtime llama.cpp --runtime-version b9820 --context 4096 --batch 512 --hardware "Apple M4, 16 GB"

Run the matching --runtime ollama invocation for the cross-check above. The full methodology and a machine-readable Dataset (Schema.org) are published alongside this article.
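
For readers who want to sanity-check an Ollama number by hand: the final response from Ollama's generate API includes eval_count (tokens generated) and eval_duration (nanoseconds), and dividing one by the other gives generation throughput. The sketch below averages that over several passes. It is not the code in scripts/run-bench.py, whose internals are not shown here, and the three sample responses are made-up placeholder values rather than measurements.

```python
from statistics import mean

NANOSECONDS_PER_SECOND = 1e9


def eval_tok_per_s(response: dict) -> float:
    """Generation throughput from one Ollama response's timing fields."""
    seconds = response["eval_duration"] / NANOSECONDS_PER_SECOND
    return response["eval_count"] / seconds


def average_tok_per_s(responses: list) -> float:
    """Mean throughput across passes (the text above averages 3 passes)."""
    return mean(eval_tok_per_s(r) for r in responses)


# Hypothetical placeholder responses, for illustration only.
sample_passes = [
    {"eval_count": 190, "eval_duration": 10_000_000_000},
    {"eval_count": 200, "eval_duration": 10_000_000_000},
    {"eval_count": 210, "eval_duration": 10_000_000_000},
]

print(f"{average_tok_per_s(sample_passes):.1f} tok/s averaged over "
      f"{len(sample_passes)} passes")
```

Note that this measures generation only; prompt processing is reported separately by Ollama and is excluded here, which matches the "eval throughput" wording in the table.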


Benchmark Data: What These Options Actually Deliver

All figures below are community-cited data, except the base Apple M4 row, which is LocalRig's first-party measurement from the section above. Source URLs and collection methodology are in the Sources and Methodology sections. None of these numbers are fabricated or first-party-claimed without a documented run.

Master Comparison Table

Hardware | VRAM / RAM | 7B Q4_K_M tok/s | 7B Q8_0 tok/s | Can Run 13B Q4? | Power (TDP) | Source
RTX 4090 24GB | 24 GB GDDR6X | 120–160 | 80–110 | Yes (needs ~8 GB) | ~450W | r/LocalLLaMA, llama.cpp benchmarks
RTX 3090 24GB | 24 GB GDDR6X | 80–110 | 60–80 | Yes | ~300–350W | r/LocalLLaMA, llama.cpp benchmarks
Apple M3 Max 128GB | 128 GB unified | 50–65 | 38–52 | Yes (comfortably) | ~35–45W | Community Ollama benchmarks, 2025
Apple M2 Ultra 192GB | 192 GB unified | 70–80 | 50–65 | Yes | ~60–80W | Simon Willison benchmarks, 2024–2025
RTX 3060 12GB | 12 GB GDDR6 | 40–60 | 30–45 | Tight (13B Q4 = ~8 GB) | ~170W | llama.cpp community benchmarks
Apple M4 Pro Mac Mini 24GB | 24 GB unified | 30–50 | 22–38 | Yes (13B Q4 needs ~8 GB; fits) | ~20–30W | r/LocalLLaMA, Nov–Dec 2025
Apple M4 16GB (base, Mac mini) | 16 GB unified | 18.4 (llama.cpp) / 19.5 (Ollama) | n/a | No (16 GB) | ~20–30W | LocalRig first-party, 2026-06-27

Reading this table: tok/s values are community-documented ranges from cited sources. Your specific result will vary by runtime version, batch configuration, context length, and system thermal state. Treat these as planning estimates.
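
Since the methodology treats the midpoint of each range as the planning estimate, the conversion can be done mechanically. The sketch below parses the Q4_K_M ranges from the table and prints their midpoints; the dictionary simply restates the community-cited rows, and the helper name is illustrative.

```python
# Q4_K_M ranges (tok/s) restated from the comparison table.
Q4_RANGES = {
    "RTX 4090 24GB": "120–160",
    "RTX 3090 24GB": "80–110",
    "Apple M2 Ultra 192GB": "70–80",
    "Apple M3 Max 128GB": "50–65",
    "RTX 3060 12GB": "40–60",
    "Apple M4 Pro Mac Mini 24GB": "30–50",
}


def midpoint(range_text: str) -> float:
    """Planning estimate for a 'low–high' range written with an en dash."""
    low, high = (float(part) for part in range_text.split("–"))
    return (low + high) / 2


for hardware, tok_range in Q4_RANGES.items():
    print(f"{hardware}: plan for ~{midpoint(tok_range):.1f} tok/s "
          f"(cited range {tok_range})")
```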


Option A — Apple M3 Max (128GB): Top Pick

The Apple M3 Max with 128GB of unified memory is the top pick in this guide, and for reasons that go beyond marketing.

Why Apple Silicon Is Different for This Workload

Standard GPUs have a hard split: fast memory (GDDR6X VRAM, on the graphics card) and slow memory (system DDR5 RAM, connected via PCIe). If a model exceeds VRAM, it spills to system RAM and throughput collapses — from 90 tok/s to 4 tok/s. The model is still running; you just can’t use it.

Apple Silicon uses a unified memory architecture. There is no split. The CPU, GPU, and Neural Engine all draw from the same physical pool at the same bandwidth. An M3 Max with 128GB has 400 GB/s of memory bandwidth across that entire pool. An RTX 3090’s GDDR6X has 936 GB/s, which is faster for pure GPU work, but on Apple Silicon there is no separate VRAM ceiling to hit: the only limit is total unified memory.

For 7B models at Q4_K_M, this doesn’t matter much. The model fits in 6 GB; any platform above that threshold handles it. Where Apple Silicon’s architecture becomes the decisive factor is at 32B and 70B model sizes — but the throughput and efficiency characteristics at 7B are already competitive and worth documenting.

M3 Max Benchmark Data (Community-Cited)

Based on community Ollama benchmarks aggregated on r/LocalLLaMA (2025):

  • Llama 3.1 8B Q4_K_M via Ollama on M3 Max (96–128GB): approximately 50–65 tok/s at a 4,096-token context
  • Llama 3.1 8B Q8_0 via Ollama on M3 Max: approximately 38–52 tok/s

For context: the M2 Ultra at 192GB achieves approximately 70–80 tok/s on the same model per Simon Willison’s documented benchmarks (simonwillison.net, 2024–2025). The M3 Max gives up roughly 10–20 tok/s against the Ultra tier in exchange for approximately $2,000–$3,000 lower hardware cost and substantially lower power draw.

M3 Max Configuration Notes

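The M3 Max's two largest unified memory configurations are 96GB and 128GB. For 7B models, capacity is not the constraint on either: a 7B Q4 model needs 6 GB, leaving 90+ GB free for other applications. One caveat on bandwidth: Apple pairs the 96GB option with the lower-binned M3 Max, which it specifies at 300 GB/s rather than 400 GB/s, so expect somewhat lower throughput than the 128GB figures above. The capacity difference between 96GB and 128GB becomes meaningful at 32B models (18 GB needed) or if you run multiple models simultaneously.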

M3 Max 128GB MacBook Pro pricing as of mid-2026: approximately $3,999–$4,499 for the 14-inch, $4,499–$5,199 for the 16-inch. These are high prices for a 7B workload: a used RTX 3090 at $500–$800 delivers better throughput at a fraction of the cost. The M3 Max justifies itself when you also want to run 32B models, need the laptop form factor, or care deeply about power draw and noise.



Option B — RTX 3090 (24GB): Best Throughput per Dollar for 7B

The RTX 3090 is the community’s most recommended discrete GPU for local LLM inference in 2025-2026. The reasons are specific and documented.

Why the 3090 Still Makes Sense

  • 24GB GDDR6X VRAM: The single most important spec. 24GB lets you run a 7B model at Q8_0 quality (8 GB) with 16 GB to spare — enough to keep large context windows open. It also fits a 13B model at Q4_K_M (approximately 8 GB) with meaningful headroom.
  • 936 GB/s memory bandwidth (per NVIDIA specifications): High bandwidth is what delivers tok/s on LLM inference. The 3090’s bandwidth is the reason it outperforms the RTX 4080 (which has less VRAM and lower bandwidth) for this specific workload.
  • Used market pricing: A used RTX 3090 sells for approximately $500–$800 on eBay as of mid-2026 (check current listings; the market fluctuates with NVIDIA’s new releases). At $600, you are paying roughly $25 per GB of GDDR6X at 936 GB/s bandwidth. There is no better deal in the GPU market for inference.
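
The cost-per-GB figure above is simple division, and the same approach extends to cost per unit of throughput. The sketch below uses prices and midpoint throughput figures quoted elsewhere in this guide. Two caveats apply: the GPU prices cover the card only and assume you already own a PC to put it in, while the MacBook Pro price is for a complete system, so the rows are not strictly like for like.

```python
# Price (USD), fast memory (GB), and midpoint 7B Q4_K_M tok/s from this guide.
OPTIONS = {
    "Used RTX 3090": {"price": 600, "memory_gb": 24, "tok_s": 95.0},
    "RTX 3060 12GB": {"price": 300, "memory_gb": 12, "tok_s": 50.0},
    "M3 Max 128GB MacBook Pro": {"price": 3999, "memory_gb": 128, "tok_s": 57.5},
}


def dollars_per_gb(price: float, memory_gb: float) -> float:
    """Purchase price per GB of fast memory."""
    return price / memory_gb


def dollars_per_tok_s(price: float, tok_s: float) -> float:
    """Purchase price per token-per-second of 7B Q4 throughput."""
    return price / tok_s


for name, spec in OPTIONS.items():
    per_gb = dollars_per_gb(spec["price"], spec["memory_gb"])
    per_tok = dollars_per_tok_s(spec["price"], spec["tok_s"])
    print(f"{name}: ${per_gb:.2f}/GB, ${per_tok:.2f} per tok/s")
```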

RTX 3090 Benchmark Data (Community-Cited)

From r/LocalLLaMA benchmark threads and llama.cpp community benchmarks (2024-2025):

  • Llama 3.1 8B Q4_K_M via llama.cpp (CUDA) on RTX 3090: approximately 80–110 tok/s at 4,096-token context
  • Llama 3.1 8B Q8_0 via llama.cpp on RTX 3090: approximately 60–80 tok/s

These figures reflect community results. The range reflects differences in CUDA version, system RAM configuration, PCIe lane availability, and whether GPU-only or GPU+CPU compute was used. The high end of the range (110 tok/s) is achievable with current llama.cpp builds, recent CUDA versions, and a PCIe 4.0 x16 slot.

What you will not get: multi-GPU scaling. Two RTX 3090s connected via PCIe (without NVLink) do not deliver 2x the throughput. LLM inference is memory-bandwidth-bound, not compute-bound, and PCIe bandwidth between GPUs is a bottleneck. If you need more than 24GB VRAM for a single model, the used RTX 3090 path is the wrong architecture — see the 70B model guide for that discussion.

RTX 3090 Buying Notes

The RTX 3090 is available used only — it launched in 2020, and NVIDIA has discontinued it. This means:

  • No manufacturer warranty on used cards
  • Check seller feedback, require photos of the heatsink and HDMI/DP ports
  • Budget for replacement thermal paste application on any card over two years old
  • Mining-used cards are available at lower prices but carry higher failure risk; verify the use history if possible



Option C — RTX 3060 12GB: The True Budget Floor

If the budget is under $300 and the use case is primarily 7B models, the RTX 3060 12GB is the minimum discrete GPU worth recommending.

RTX 3060 12GB Benchmark Data (Community-Cited)

From llama.cpp community benchmarks (2024–2025):

  • Llama 3.1 8B Q4_K_M via llama.cpp (CUDA) on RTX 3060 12GB: approximately 40–60 tok/s at 4,096 tokens

This is meaningful throughput — 40 tok/s is faster than most people read. However, the 12GB VRAM creates a ceiling:

  • 7B Q4_K_M: fits comfortably (~5 GB used, 7 GB free)
  • 7B Q8_0: fits (8 GB used, 4 GB free — but tight with larger contexts)
  • 7B F16: does not fit (14 GB needed)
  • 13B Q4_K_M: marginal (8 GB needed; 4 GB headroom for context)
  • 13B Q8_0: does not fit (14 GB needed)
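
The fit list above follows a simple rule: subtract the model's memory need from the card's VRAM and look at what is left for context. The sketch below encodes that rule with the approximate sizes from the list. The 4 GB threshold for calling a fit "tight" is an assumption inferred from how the list labels its entries, not a hard limit.

```python
# Approximate memory needs in GB, restated from the list above.
NEEDS_GB = {
    "7B Q4_K_M": 5,
    "7B Q8_0": 8,
    "7B F16": 14,
    "13B Q4_K_M": 8,
    "13B Q8_0": 14,
}


def fit(vram_gb: float, need_gb: float, tight_headroom_gb: float = 4.0) -> str:
    """Classify a model against a VRAM budget by remaining headroom."""
    headroom = vram_gb - need_gb
    if headroom < 0:
        return "does not fit"
    if headroom <= tight_headroom_gb:
        return "fits, tight"      # little room left for larger contexts
    return "fits comfortably"


for vram in (12, 24):
    print(f"--- {vram} GB card ---")
    for model, need in NEEDS_GB.items():
        print(f"{model}: {fit(vram, need)}")
```

Running the same table against 24 GB shows why the larger card removes every ceiling in this list.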

If there is any chance you will want to run 13B or 14B models in the next two years, the RTX 3060 12GB is a short-term purchase. The RTX 3090 at roughly double the price delivers double the VRAM, about 2.6x the memory bandwidth (936 GB/s vs 360 GB/s), and roughly 80–100% more throughput on the ranges above. Buy the 3060 only if $300 is a hard ceiling.



Option D — Apple Mac Mini M4 (24GB): The Quiet Compact Option

The M4 Mac Mini launched in late 2024 and delivers a genuinely impressive amount of AI inference capability in a compact, efficient, sub-$800 package when configured with 24GB of unified memory.

M4 Mac Mini Benchmark Data (Community-Cited)

From r/LocalLLaMA and Ollama GitHub discussions (November–December 2025):

  • Llama 3.1 8B Q4_K_M via Ollama on M4 Pro Mac Mini (24GB): approximately 30–50 tok/s

LocalRig first-party measurement (base M4, 16 GB): 18.4 tok/s (llama.cpp b9820) and 19.5 tok/s (Ollama 0.30.11, which runs the same llama.cpp kernels underneath), measured 2026-06-27 — see the First-Party Benchmark section above. Run llama.cpp (or MLX) directly; the throughput is the same. The base M4 lands well below the 30–50 community range, which reflects the higher-bandwidth M4 Pro. If you’re buying a Mac Mini specifically for 8B inference, the M4 Pro’s memory bandwidth is the upgrade that matters — not the core count.

The gap comes down to memory bandwidth: approximately 120 GB/s for the base M4 versus 400 GB/s for the M3 Max. The M4 Pro in the Mac Mini (available as a build-to-order option) provides higher bandwidth than the base chip, which is why it lands in the 30–50 tok/s range on the same model rather than near 19 tok/s.

M4 Mac Mini Trade-offs

The case for the M4 Mac Mini: it is the lowest-cost entry into Apple Silicon inference that still delivers interactive-grade throughput for 7B models. At $799 (24GB, M4) or approximately $1,299 (24GB, M4 Pro), it undercuts any RTX 3090 system on total build cost while consuming 15-25W under load. It also fits in a desk drawer.

The case against: the 24GB ceiling is real. A 13B Q4 model needs approximately 8 GB; it fits, using about 33% of total memory. A 14B Q4 model needs approximately 8-9 GB; it also fits. A 32B model at Q4 needs approximately 18 GB, which does not fit in 24GB once runtime overhead is included. If you anticipate running larger models, the M4 Mac Mini is a stepping stone, not a long-term platform.



Decision Matrix: Which Option for Your Situation?

Situation | Recommended Option | Why
Maximum throughput on 7B, budget $500-$800 | Used RTX 3090 (eBay) | 80-110 tok/s; best throughput/dollar
Will also run 32B or 70B models | Apple M3 Max 128GB or M2 Ultra | Only options that keep large models in fast memory
Quiet operation, low power | Mac Mini M4 24GB or Apple M3 Max | 5-10x more efficient than discrete GPU
Already have a gaming PC | Add RTX 3090 or 3060 | Cheapest incremental upgrade; no new platform
Hard budget under $300 | RTX 3060 12GB (used or new) | 40-60 tok/s; 12GB ceiling acknowledged
Need a laptop that also does local AI | MacBook Pro M3 Max | Only laptop with sufficient memory bandwidth + capacity
Experimenting, no hardware spend | CPU-only (llama.cpp on system RAM) | 3-10 tok/s; validates the use case before buying
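
The efficiency claim in the matrix can be checked against the comparison table by dividing throughput by power. The sketch below uses range midpoints for both tok/s and TDP. Treat it as indicative only: TDP is a rated ceiling rather than measured draw during inference, and the GPU figures exclude the rest of the PC.

```python
# (midpoint 7B Q4_K_M tok/s, midpoint TDP in watts) from the comparison table.
ROWS = {
    "RTX 3090 24GB": (95.0, 325.0),
    "RTX 3060 12GB": (50.0, 170.0),
    "Apple M3 Max 128GB": (57.5, 40.0),
    "Apple M4 Pro Mac Mini 24GB": (40.0, 25.0),
}


def tok_per_watt(tok_s: float, watts: float) -> float:
    """Tokens per second delivered per watt of rated power."""
    return tok_s / watts


def efficiency_vs(baseline: str, rows: dict = ROWS) -> dict:
    """Each row's tok/s-per-watt as a multiple of the baseline row's."""
    base = tok_per_watt(*rows[baseline])
    return {name: tok_per_watt(*values) / base for name, values in rows.items()}


for name, ratio in efficiency_vs("RTX 3090 24GB").items():
    print(f"{name}: {ratio:.1f}x the tok/s per watt of an RTX 3090")
```

On these midpoints the two Apple options come out at roughly 5x the RTX 3090, which sits at the low end of the 5-10x range quoted in the matrix.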

Who This Hardware Is NOT For

The options in this guide are for interactive inference of 7B and 8B models for personal or small-team use. This is not the right guide if:

  • You need to fine-tune or train models: Training has fundamentally different requirements — you need CUDA for most training frameworks, gradient checkpointing, and significantly more VRAM than inference. The Apple Silicon path is not the training path.
  • You need to serve multiple simultaneous users: A single RTX 3090 or M3 Max handles one user at a time well; under heavy multi-user load, you need batching-aware hardware and a different throughput calculus.
  • You want image generation: Stable Diffusion and Flux use VRAM differently. The 24GB RTX 3090 is still excellent for image generation, but the sizing logic is not the same. We cover that separately.
  • You plan to run models above 13B: This guide ends at 13B. For 32B and 70B models, the constraint analysis and hardware recommendations change substantially. See the 70B model guide (32B guide coming soon) in this cluster.
  • Your use case is bulk document processing that can run overnight: If throughput per hour matters more than latency per token, you may want a different GPU (or a cloud-GPU burst approach). The Local vs Cloud cluster covers this honestly.

Methodology

Apart from the base Apple M4 (16 GB) result described below, all benchmark figures in this article are community-cited data aggregated from named public sources with dates; they are not first-party measurements taken by LocalRig. Specific methodology:

  • Apple Silicon throughput figures sourced from Simon Willison’s documented benchmark experiments (2024-2025) and r/LocalLLaMA community benchmark threads (2025).
  • NVIDIA GPU throughput figures sourced from llama.cpp GitHub benchmark issues and r/LocalLLaMA community benchmark threads (2024-2025).
  • Ranges are reported where multiple community sources report differing results; the spread reflects real variance in runtime version, system configuration, context length, and thermal state.
  • All throughput figures assume: Llama 3.1 8B Instruct GGUF Q4_K_M, 4,096-token context, single-user, no multi-request batching unless noted.
  • Figures are reported as of 2026-06-27. Newer runtime versions (Ollama, llama.cpp) frequently improve throughput; the figures above may be conservative for users running up-to-date software.

This article now includes LocalRig’s first-party benchmark of the base Apple M4 (16 GB) — see the First-Party Benchmark section above — measured on owned hardware with pinned runtime versions (Ollama 0.30.11, llama.cpp b9820 via llama-bench), a published methodology, and a machine-readable Schema.org Dataset. The remaining figures are community-cited and labeled as such throughout. As LocalRig benchmarks more owned hardware, additional first-party rows will be added to the benchmark database under the same standard.

Sources


  • Simon Willison's LLM benchmarks on M2 Ultra — simonwillison.net/2024/Mar/8/gpt4/ and linked experiments (2024–2025)
  • r/LocalLLaMA: 'RTX 3090 llama.cpp benchmark thread' — reddit.com/r/LocalLLaMA (multiple posts, 2024–2025)
  • llama.cpp GitHub issue #3 and benchmark PRs: github.com/ggerganov/llama.cpp
  • Mac Mini M4 Ollama community benchmarks — r/LocalLLaMA and Ollama GitHub discussions (Nov–Dec 2025)
  • TheBloke GGUF model size reference: huggingface.co/TheBloke (Llama-2-7B-GGUF, Llama-3.1-8B-GGUF)
  • NVIDIA RTX 3090 product specifications: nvidia.com (24GB GDDR6X, 936 GB/s bandwidth)
  • NVIDIA RTX 4090 product specifications: nvidia.com (24GB GDDR6X, 1,008 GB/s bandwidth)
  • NVIDIA RTX 3060 product specifications: nvidia.com (12GB GDDR6, 360 GB/s bandwidth)
  • Apple M3 Max unified memory specifications: apple.com (96GB/128GB option, 400 GB/s bandwidth)
  • Apple M4 Mac Mini specifications: apple.com (16GB/24GB/32GB options)
  • Ollama model library: ollama.com/library