Best GPU for Local LLM Inference (2026): VRAM-per-Dollar Guide

Most “best GPU for local LLM” lists rank cards by gaming benchmarks or raw FLOPS. For running a language model on your own machine, that is the wrong scoreboard. Two specs decide almost everything: VRAM — whether the model fits at all — and memory bandwidth — how fast it generates tokens once it fits. Compute (FLOPS) matters far less than the marketing implies, because token generation is dominated by reading the model weights out of memory, not by arithmetic.

This guide is for someone choosing a GPU specifically to run local LLMs for personal or small-team inference: chat, coding assistance, document work, local agents. It is the product-decision layer of the GPU cluster. Before you read it, the sizing math lives in two places worth a click: What Is Quantization for how many gigabytes a model actually needs, and The Local-AI Hardware Buying Framework for the constraint logic. This page turns that logic into specific cards.

The cost of buying wrong is real and asymmetric. Buy too little VRAM and the model you want will not load — or it spills into system RAM and decode collapses from usable to unusable. Buy too much of the wrong thing (a high-FLOPS card with mediocre bandwidth, or a brand-new card at double the used price) and you overpay for throughput you will not notice. The goal here is to buy once, at the right tier, and not relitigate it in six months.

The core principle: VRAM = fit, bandwidth = speed

Hold these two sentences in your head and most of the confusion disappears:

VRAM determines what you can run. A model’s weights, plus runtime overhead, plus the KV cache for your context window, all have to fit in the GPU’s memory. If they do not fit, the model either fails to load or offloads layers to system RAM over PCIe — and PCIe is so much slower than VRAM that throughput falls off a cliff.
Memory bandwidth determines how fast it runs. Generating each token requires re-reading the model’s weights from memory. A card with more GB/s of bandwidth reads those weights faster and produces more tokens per second. This is why a card’s bandwidth predicts its decode speed far better than its FLOPS number.
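
The bandwidth rule can be turned into a rough ceiling: if every generated token needs one full pass over the weights, tokens per second cannot exceed memory bandwidth divided by the bytes read. The sketch below is a minimal illustration, not a benchmark. The function name is made up for this example, the bandwidth figures are spec-sheet values assumed here (they are not from this guide's sources), and 4.5 GB is taken as the midpoint of the 4–5 GB range for a 7B model at Q4_K_M.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode speed.

    Assumes each generated token requires reading all weights once,
    and ignores KV-cache reads, kernel overhead, and compute time.
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Assumed spec-sheet memory bandwidths in GB/s (not measured by this guide).
CARDS = {"RTX 4090": 1008.0, "RTX 3090": 936.0, "RTX 3060 12GB": 360.0}
WEIGHTS_GB = 4.5  # assumed: midpoint of the 4-5 GB range for 7B at Q4_K_M

for name, bandwidth in CARDS.items():
    ceiling = decode_ceiling_tok_s(bandwidth, WEIGHTS_GB)
    print(f"{name}: at most ~{ceiling:.0f} tok/s")
```

Real results land well below this ceiling because of the overheads the model ignores. Treat it as a bound for sanity checks and rough ranking, not as a prediction.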

For the per-quantization gigabyte math — why a 7B model is roughly 4–5 GB at Q4_K_M, ~8 GB at Q8_0, and ~14–16 GB at F16 — see What Is Quantization. The short version: the quantization level you choose sets the VRAM you need, and the VRAM you need sets the card.
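
As a quick check on those gigabyte figures, weight size is roughly parameter count times bits per weight, divided by eight. The sketch below assumes approximate bits-per-weight averages for three GGUF formats. The Q4_K_M value in particular is an approximation that varies slightly by model, and the helper name is illustrative.

```python
# Approximate bits per weight for common GGUF formats (assumed averages;
# Q4_K_M mixes block types, so its figure varies slightly by model).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5, "F16": 16.0}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimate weight size in GB (10^9 bytes) for a given quantization."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * bits / 8.0


for quant in BITS_PER_WEIGHT:
    print(f"7B at {quant}: ~{weights_gb(7.0, quant):.1f} GB")
```

For a 7B model this gives about 4.2 GB, 7.4 GB, and 14.0 GB for weights alone. Runtime overhead and the KV cache come on top, which is why the ranges quoted above run a little higher.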

24GB is the meaningful consumer tier. It is the point where you stop fighting the hardware. With 24GB you can run a 7B model at full Q8_0 quality with room for a large context window, fit a 13B model at Q4_K_M with headroom, and keep multiple models or long contexts resident. Below 24GB you are managing trade-offs constantly; at 24GB the common local workloads simply fit. That is why the two cards the community actually buys — the used RTX 3090 and the RTX 4090 — both carry 24GB, and why the budget tier is framed honestly as a compromise rather than a recommendation.
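
To see why 24GB removes most of the juggling, it helps to add up the three consumers of VRAM: weights, KV cache, and runtime overhead. The sketch below is a rough estimate under stated assumptions: an older 7B layout without grouped-query attention (32 layers, 32 KV heads, head dimension 128), a 16-bit KV cache, and a flat 1 GB of runtime overhead. All function names and the overhead figure are illustrative.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """KV cache size: one key and one value vector per layer per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1e9


def fits(vram_gb, weights_gb, kv_gb, overhead_gb=1.0):
    """True if weights + KV cache + assumed runtime overhead fit in VRAM."""
    return weights_gb + kv_gb + overhead_gb <= vram_gb


# Assumed 7B shape without grouped-query attention, 16,384-token context.
kv = kv_cache_gb(n_layers=32, n_kv_heads=32, head_dim=128, context_tokens=16_384)
weights_q8 = 7.4  # approximate 7B weights at Q8_0, in GB

print(f"KV cache: ~{kv:.1f} GB")
print("Fits in 24 GB:", fits(24, weights_q8, kv))
print("Fits in 12 GB:", fits(12, weights_q8, kv))
```

Under these assumptions the total comes to roughly 17 GB, which fits a 24GB card and does not fit a 12GB one. Models that use grouped-query attention need considerably less cache for the same context, so this is a conservative case.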

Master comparison table

All tok/s figures below are community-cited (r/LocalLLaMA, llama.cpp benchmark threads, 2024–2025), not independently verified by LocalRig. They assume a 7B model at Q4_K_M quantization, roughly 4,096-token context, single user. Treat them as planning ranges — your result varies with runtime version, CUDA build, PCIe lanes, and thermal state. These same figures are documented in the 7B/8B hardware guide, which is the source for the numbers here.

GPU	VRAM	~7B Q4_K_M tok/s	Power (TDP)	New / Used	Price range
RTX 4090	24 GB GDDR6X	~120–160	~450W	New	new retail
RTX 3090	24 GB GDDR6X	~80–110	~300–350W	Used	~$500–$800 (eBay)
RTX 3060	12 GB GDDR6	~40–60	~170W	New / used	budget floor (<~$300)
Apple M3 Max 128GB (unified)	128 GB unified	~50–65	~35–45W	New	high (Mac tier)

The Apple M3 Max row is included for one reason: it is the path when your model exceeds 24GB of VRAM. It is not a GPU you slot into a PC. More on that in the “beyond 24GB” pick below.
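
If power draw matters to you, the table can be reduced to a rough tokens-per-watt figure. The sketch below uses only the ranges from the table, taking midpoints, and treats the power column as if it were actual draw during inference, which is an assumption: rated power and real inference draw can differ. Helper names are illustrative.

```python
# (low tok/s, high tok/s, low watts, high watts), copied from the table above.
TABLE = {
    "RTX 4090": (120, 160, 450, 450),
    "RTX 3090": (80, 110, 300, 350),
    "RTX 3060": (40, 60, 170, 170),
    "Apple M3 Max 128GB": (50, 65, 35, 45),
}


def midpoint(low, high):
    """Middle of a planning range."""
    return (low + high) / 2


def tok_s_per_watt(row):
    """Midpoint decode speed divided by midpoint rated power."""
    tok_lo, tok_hi, w_lo, w_hi = row
    return midpoint(tok_lo, tok_hi) / midpoint(w_lo, w_hi)


ranked = sorted(TABLE.items(), key=lambda kv: tok_s_per_watt(kv[1]), reverse=True)
for name, row in ranked:
    print(f"{name}: ~{tok_s_per_watt(row):.2f} tok/s per watt")
```

On these numbers the three NVIDIA cards cluster around 0.3 tok/s per watt and the M3 Max sits well above them. That is an efficiency ranking only; it says nothing about which card is fastest or cheapest.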

The picks, by buyer constraint

There is no single winner here, and these are not ranked by what pays the most. They are ranked by which constraint you are optimizing.

Best throughput-per-dollar: used RTX 3090 24GB

This is the community standard for local inference, and the math is why. At roughly $500–$800 used, you get 24GB of GDDR6X and the high memory bandwidth that drives decode speed. At ~80–110 tok/s on a 7B Q4 model, it is fast enough that the bottleneck becomes your reading speed, not the card. Nothing else on the used market delivers 24GB at this bandwidth for this price. If you are buying one card to run local LLMs, start here.
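
The value claim is easy to put into numbers. The sketch below computes dollars per gigabyte of VRAM and dollars per token-per-second for the used RTX 3090, using only the price and throughput ranges quoted in this guide. The function names are illustrative, and the same two helpers can be pointed at any listing you are considering.

```python
def dollars_per_gb(price_usd: float, vram_gb: float) -> float:
    """Purchase price divided by VRAM capacity."""
    return price_usd / vram_gb


def dollars_per_tok_s(price_usd: float, tok_s: float) -> float:
    """Purchase price divided by decode speed."""
    return price_usd / tok_s


# Used RTX 3090 figures from this guide: $500-$800, 24 GB, ~80-110 tok/s.
best = (dollars_per_gb(500, 24), dollars_per_tok_s(500, 110))
worst = (dollars_per_gb(800, 24), dollars_per_tok_s(800, 80))

print(f"$/GB of VRAM: {best[0]:.2f} to {worst[0]:.2f}")
print(f"$/(tok/s):    {best[1]:.2f} to {worst[1]:.2f}")
```

That works out to roughly $21–$33 per GB of VRAM. Run the same calculation on any other card's current price before deciding it is the better deal.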

Fastest single consumer card: RTX 4090 24GB (new)

The RTX 4090 is the fastest single consumer GPU for this workload, at ~120–160 tok/s on a 7B Q4 model. It carries the same 24GB of VRAM as the 3090, so it does not let you run larger models — it runs the same models faster. It also draws ~450W and sells new at a substantial premium over a used 3090. Buy it if you want maximum single-card decode speed and the wattage and price are acceptable; otherwise the 3090 captures most of the practical value for far less money.

Budget floor: RTX 3060 12GB

If the ceiling is hard at under ~$300, the RTX 3060 12GB is the entry point worth recommending. At ~40–60 tok/s on a 7B Q4 model it is genuinely usable — faster than most people read — and it sips ~170W. The honest caveat is the 12GB ceiling: a 7B model fits comfortably and a 13B Q4 model is tight, but you have little room for larger contexts, Q8 quality, or any model in the 14B+ range. If there is a real chance you will want bigger models within a couple of years, the 3060 is a short-term buy and the used 3090 is the better long-term value. Buy the 3060 only when the budget genuinely will not stretch further.
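
"Tight" can be made concrete by asking how much context is left once the weights are loaded. The sketch below is an estimate under stated assumptions: a 13B layout without grouped-query attention (40 layers, 40 KV heads, head dimension 128), about 4.85 bits per weight for Q4_K_M, a 16-bit KV cache, and a flat 1 GB of runtime overhead. The function name and the overhead figure are illustrative.

```python
def max_context_tokens(vram_gb, weights_gb, n_layers, n_kv_heads, head_dim,
                       overhead_gb=1.0, bytes_per_value=2):
    """Largest context whose KV cache fits in the VRAM left after weights."""
    free_bytes = (vram_gb - weights_gb - overhead_gb) * 1e9
    # One key and one value vector per layer for every token in context.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return max(0, int(free_bytes // per_token_bytes))


WEIGHTS_13B_Q4_GB = 13 * 4.85 / 8  # assumed ~4.85 bits per weight

for vram in (12, 24):
    tokens = max_context_tokens(vram, WEIGHTS_13B_Q4_GB,
                                n_layers=40, n_kv_heads=40, head_dim=128)
    print(f"{vram} GB card: room for ~{tokens} tokens of context")
```

Under these assumptions a 12GB card leaves room for about 3,800 tokens of context with a 13B Q4 model, against roughly 18,000 on a 24GB card. The model loads on the 3060, but long documents and long chats will not.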

Beyond 24GB / big models: 2× used 3090, or Apple Silicon

When the model you want exceeds 24GB — large 32B or 70B-class models at usable quantization — you have two honest paths, and neither is “buy a faster single card.”

2× used RTX 3090 (48GB total). Two 24GB cards give you the capacity to hold a bigger model. What they do not give you is double the speed. Inference is memory-bandwidth-bound and the cards talk over PCIe; without NVLink there is no linear throughput scaling. You are buying fit, not 2× tok/s. This is the community’s standard route to 48GB of local VRAM, and it is the right one when capacity is the constraint — just go in with correct expectations.
Apple Silicon unified memory. A Mac with large unified memory (e.g. M3 Max 128GB at ~50–65 tok/s on a 7B Q4) sidesteps the VRAM ceiling entirely: the CPU and GPU share one memory pool, so a model that cannot fit on any 24GB GPU can simply load. The trade-off is lower bandwidth than dedicated GDDR6X, so when a model also fits on an NVIDIA card, the NVIDIA card decodes faster. For the full Apple Silicon breakdown and the first-party benchmarks, see the 7B/8B hardware guide.

A note on low-bandwidth unified memory, so the Apple path is not oversold: not every Mac is an M3 Max. LocalRig’s first-party measurement of a base Apple M4 (16 GB) landed at 18.4 tok/s (llama.cpp b9820) and 19.5 tok/s (Ollama 0.30.11) on the same Llama 3.1 8B Q4_K_M weights (2026-06-27). That is usable for interactive chat, but it is well below the M3 Max — because the base M4’s memory bandwidth is much lower. Unified memory removes the fit ceiling; it does not guarantee speed. Bandwidth still rules decode.
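
The base M4 measurements line up with the bandwidth argument. The sketch below compares the two first-party numbers against a bandwidth-bound ceiling. It assumes 120 GB/s of memory bandwidth for the base M4 (Apple's published figure, not something measured here) and about 4.9 GB for the Llama 3.1 8B Q4_K_M weights; both values and the function name are assumptions for illustration.

```python
def ceiling_fraction(measured_tok_s, bandwidth_gb_s, weights_gb):
    """Measured decode speed as a fraction of the bandwidth-bound ceiling."""
    ceiling = bandwidth_gb_s / weights_gb  # one full weight read per token
    return measured_tok_s / ceiling


M4_BANDWIDTH_GB_S = 120.0  # assumed: published figure for the base M4
WEIGHTS_GB = 4.9           # assumed: approximate Llama 3.1 8B Q4_K_M size

for runtime, measured in {"llama.cpp": 18.4, "Ollama": 19.5}.items():
    fraction = ceiling_fraction(measured, M4_BANDWIDTH_GB_S, WEIGHTS_GB)
    print(f"{runtime}: {measured} tok/s is ~{fraction:.0%} of the ceiling")
```

Both runtimes come out at roughly three quarters to four fifths of the ceiling, which suggests the machine is already close to what its memory bandwidth allows. A faster runtime would not close the gap to an M3 Max; more bandwidth would.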

Multi-GPU reality: capacity, not linear speed

This is the most common expensive misunderstanding, so it gets its own section. Putting two GPUs in a box does not double your tokens per second for single-stream inference.

The reason traces straight back to the core principle. Decode is memory-bandwidth-bound: each token requires reading the weights. When those weights are split by layer across two cards, each card reads only its own half, but for a single stream the cards work in turn rather than at the same time, so the total time per token is about the same as on one card, plus a small hand-off over PCIe. Splitting each layer across both cards (tensor parallelism) lets them work simultaneously, but then they must synchronize over PCIe many times per token, and PCIe bandwidth is a fraction of on-card VRAM bandwidth, so the link becomes the bottleneck. Consumer 3090s without NVLink see no meaningful tensor-parallel speedup for this workload. What the second card buys you is VRAM capacity: the ability to load a model that would not fit on one card. That is valuable, and it is the right reason to go dual-GPU. Buying a second card hoping for 2× throughput is not. If serving many concurrent users is the goal (a different problem than single-stream speed), the engine and batching strategy matter more than the card count; see how to run LLMs locally for that.
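
A simplified model makes the point. The sketch below assumes the layers are divided between cards that process a single stream one after another, so the per-token read times add up. It ignores PCIe transfer time, which would only lower the result, and it uses an assumed spec-sheet bandwidth of 936 GB/s for the RTX 3090. The function name is illustrative.

```python
def layer_split_tok_s(weights_gb, card_bandwidths_gb_s, shares):
    """Decode ceiling when layers are split across cards that run in turn.

    Each card reads only its share of the weights, but the cards work
    one after another for a single stream, so per-token times add up.
    Ignores PCIe transfer time, which only lowers the result further.
    """
    if abs(sum(shares) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    seconds_per_token = sum(
        weights_gb * share / bandwidth
        for share, bandwidth in zip(shares, card_bandwidths_gb_s)
    )
    return 1.0 / seconds_per_token


BW_3090 = 936.0  # assumed spec-sheet bandwidth in GB/s

one_card = layer_split_tok_s(4.5, [BW_3090], [1.0])
two_cards = layer_split_tok_s(4.5, [BW_3090, BW_3090], [0.5, 0.5])
print(f"one card:  ~{one_card:.0f} tok/s ceiling")
print(f"two cards: ~{two_cards:.0f} tok/s ceiling")
```

Both cases give the same ceiling, because the same total number of bytes is read at the same per-card bandwidth. The second card changes what fits, not how fast a single stream decodes.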

Buying used: notes that protect your money

The used market is where the value lives for this niche — the RTX 3090 is discontinued and only sold secondhand — but used GPUs carry risk that a new-in-box card does not. A few habits keep you out of trouble:

No manufacturer warranty. Assume the card is sold as-is. Factor that into the price you are willing to pay, and check the seller’s feedback and return policy before bidding.
Demand photos of the actual card — the real heatsink and the HDMI/DisplayPort outputs, not a stock image. Missing brackets, bent fins, or a generic photo are reasons to walk.
Budget for fresh thermal paste on any card more than a couple of years old. Dried-out paste causes thermal throttling, which quietly costs you tok/s. A repaste is cheap insurance.
Mining-card risk is real but manageable. Ex-mining cards sell cheaper and many run fine, but they ran hot for long hours. Ask about use history; if the discount is steep and the history is vague, factor in the higher failure odds. A working card at a fair price beats a “deal” that fails with no warranty to fall back on.
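
Once a used card arrives, one way to check for throttling is to watch temperature and clock speed while a model is generating. The sketch below is one possible approach, assuming an NVIDIA card with `nvidia-smi` on the PATH. It queries three standard fields and parses the CSV output. The function names are illustrative, and the sample line in the usage note is invented for demonstration, not a measurement.

```python
import subprocess

# Standard nvidia-smi query fields: temperature (C), SM clock (MHz), power (W).
QUERY = "temperature.gpu,clocks.sm,power.draw"


def read_gpu_sample():
    """Run nvidia-smi once and return its raw CSV output (one line per GPU)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_sample(csv_text):
    """Parse 'temp, sm_clock, power' lines into one dict per GPU."""
    samples = []
    for line in csv_text.strip().splitlines():
        temp, clock, power = (field.strip() for field in line.split(","))
        samples.append(
            {"temp_c": float(temp), "sm_mhz": float(clock), "power_w": float(power)}
        )
    return samples
```

For example, `parse_sample("83, 1395, 341.20")` returns a single reading of 83 °C, 1395 MHz, and 341.2 W. Calling `parse_sample(read_gpu_sample())` every few seconds during a long generation shows the trend: an SM clock that sags as temperature climbs is the pattern a repaste is meant to fix. Some cards report a field as not available, in which case the `float` conversion here will raise an error.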

For where these cards sit in a full build — PSU sizing, cooling, and the rest of the rig — the buying framework and the upcoming homelab cluster carry that detail; this page is about the GPU choice itself.

Who This Is NOT For

This guide optimizes for single-stream local LLM inference on consumer hardware. It is the wrong guide if:

You are training or fine-tuning models. Training has different requirements — far more VRAM, CUDA-bound frameworks, and a throughput calculus this page does not cover. A used 3090 can do light fine-tuning, but the sizing logic here is built for inference, not training runs.
You are serving many concurrent users in production. A single 24GB card serves one stream well; under real multi-user load you need batching-aware serving (vLLM/SGLang) and a different hardware budget. Start with how to run LLMs locally and treat this as the on-ramp.
Your model already exceeds 24GB and you expect linear multi-GPU speed. It does not work that way — multi-GPU buys capacity, not 2× throughput. If you need big models fast, price datacenter hardware or cloud against the honest break-even before buying consumer cards in pairs.
You want the absolute cheapest path and have not done the capacity math. If you have not sized your model yet, the card is the wrong first question. Read What Is Quantization and the buying framework first.
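
For the break-even comparison mentioned above, the arithmetic is simple enough to sketch. The example below is a rough model, not a pricing tool. It compares a purchase price plus electricity against an hourly rental rate. The cloud rate, electricity price, and wattage in the example are hypothetical placeholders; only the $800 per-card figure comes from this guide's used RTX 3090 range.

```python
def break_even_hours(hardware_usd, cloud_usd_per_hour, watts, usd_per_kwh):
    """Hours of use after which owning costs less than renting.

    Owning costs the purchase price plus electricity; renting costs the
    hourly rate. Returns None if renting is never more expensive per hour.
    """
    power_usd_per_hour = watts / 1000.0 * usd_per_kwh
    saving_per_hour = cloud_usd_per_hour - power_usd_per_hour
    if saving_per_hour <= 0:
        return None
    return hardware_usd / saving_per_hour


# Two used RTX 3090s at $800 each; all other inputs are hypothetical.
hours = break_even_hours(
    hardware_usd=1600,
    cloud_usd_per_hour=1.00,  # hypothetical rental rate
    watts=700,                # hypothetical combined draw under load
    usd_per_kwh=0.15,         # hypothetical electricity price
)
print(f"Break-even after ~{hours:.0f} hours of use")
```

With these placeholder inputs the pair pays for itself after roughly 1,800 hours of use. Substitute your own rates; the model leaves out resale value, the rest of the rig, and your time.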

Sources

All throughput figures in this guide are community-cited (r/LocalLLaMA, llama.cpp benchmark threads, 2024–2025) and not independently verified by LocalRig, except the base Apple M4 numbers, which are first-party. Full citations are in the frontmatter sources: list. Key references:

r/LocalLLaMA community benchmark threads — RTX 3090, RTX 4090, RTX 3060 results via llama.cpp (CUDA) and Ollama (2024–2025).
llama.cpp GitHub benchmark issues and community PRs: github.com/ggml-org/llama.cpp.
LocalRig first-party benchmark: base Apple M4, 16 GB — 18.4 tok/s (llama.cpp b9820) and 19.5 tok/s (Ollama 0.30.11), Llama 3.1 8B Q4_K_M, 2026-06-27. Methodology in the 7B/8B hardware guide.
NVIDIA RTX 3090 / RTX 4090 / RTX 3060 specifications and Apple M3 Max unified memory specifications (nvidia.com, apple.com).

Prices and availability are as of 2026-06-28; the used GPU market moves with each new NVIDIA release, so verify current listings before buying.