Software & Runtimes

How to Run LLMs Locally: Which Inference Engine for Your Rig (2026)

Top Pick llama.cpp (portable default) · MLX (Apple Silicon) · vLLM (CUDA serving)
Check Price

Most people pick a local inference engine first and then try to make their hardware fit it. That is backwards, and it is the most common reason a local setup ends up slow, out of memory, or stuck on the wrong tool. The better order is: decide what hardware you actually have, what your workload looks like, and how many people (or agents) need to hit the model at once. The engine falls out of those answers.

This page is the decision layer for the Software & Runtimes cluster. It maps a rig to an engine, explains the one or two facts that drive the choice, and points you to the hardware guides when the answer is “you need a different box.” It is for someone who can already run a model and now wants to run it well — not a step-by-step install (those live on the per-engine how-to pages).

The one decision that drives everything: prefill vs decode

Local inference has two phases, and they stress different parts of your machine.

  • Prefill reads your prompt and builds the initial KV cache. It is compute-bound. Long prompts (RAG, big system prompts, long code files) live here.
  • Decode generates one token at a time, re-reading the model weights and the KV cache on every token. It is memory-bandwidth-bound. Short prompts with long answers live here.

That single distinction explains most engine behavior. If your work is short-prompt, long-answer chat, decode dominates and memory bandwidth and batching matter most. If your work is long-prompt, short-answer (summarize this 80K-token document), prefill dominates and attention kernels and chunked prefill matter. If many users hit the model, the scheduler matters more than raw speed.

It also explains the trap behind “will it fit?” Fit is not speed. Whether a model loads is set by memory capacity; how fast it decodes is set by memory bandwidth. A model can fit comfortably in a large unified-memory Mac and still decode slower than a smaller GPU with much higher bandwidth. For the capacity side of that math — how many GB a model actually needs at each quantization — see What Is Quantization and the sizing logic in the buying framework.

The one-page decision guide

Your situationEngineWhy
Laptop, CPU-only, edge, or unusual hardwarellama.cppRuns almost anywhere; CPU+GPU hybrid offload; GGUF
Apple Silicon (Mac mini / Studio / MacBook)MLX / MLX-LMNative to unified memory and Metal
Single consumer NVIDIA GPU, local usellama.cpp (or ExLlamaV2)Simple, portable; ExLlamaV2 for tuned EXL2 quant speed
2–4 consumer NVIDIA GPUs / local MoEExLlamaV3Tensor/expert parallelism on consumer cards
Serving multiple users / productionvLLMPagedAttention, continuous batching, real scheduler
Long-context / MoE / routing under loadSGLangPrefill-decode disaggregation, prefix caching
Datacenter NVIDIA, maximum performanceTensorRT-LLM (+ Dynamo)This is the cloud boundary — see below

The rest of this guide is why.

llama.cpp — the portable default

If your hardware is a laptop, a CPU-only box, a single consumer GPU, or anything that is not a tidy NVIDIA server, start with llama.cpp. It runs on Apple Silicon (Metal), x86 and ARM CPUs, NVIDIA (CUDA), AMD (HIP/Vulkan), and more, and it does CPU+GPU hybrid offload so you can run a model that does not fully fit in VRAM. Its server speaks OpenAI- and Anthropic-compatible APIs, so most app code talks to it unchanged.

What it is not built for is fleet-scale, multi-node production serving — its multi-node path is explicitly experimental, and it is not the tool for many concurrent users on multiple GPUs. For a single rig, that does not matter. For a public service, it does.

This is our default recommendation for portability and for getting a model running on hardware you already own. If you are buying a GPU to pair with it, the used market is where the value is — see the 7B/8B hardware guide for the full breakdown, and browse used RTX 3090 listings on eBay for the best throughput-per-dollar card for local inference.

MLX — the Apple Silicon native

On a Mac, MLX (and MLX-LM) is the native path. Apple Silicon’s unified memory lets the CPU and GPU share one pool, which changes the question from “does it fit in VRAM?” to “does it fit in memory, and can the memory system feed the GPU fast enough?” Large quantized models that are impossible on a 24 GB consumer GPU can load on a Mac with enough unified memory — at the cost of lower bandwidth than dedicated HBM, so decode is slower than a high-end NVIDIA card when the model also fits there.

Use MLX for Mac-native development and local inference. llama.cpp remains a fine portable fallback on the same machine if you need GGUF weights or a specific build. If you are choosing a Mac for this, the unified-memory capacity is the spec that matters most for larger models — check current Mac Studio and Mac mini configurations on Amazon.

vLLM — the production serving default

The moment the question becomes “serve this to multiple users” rather than “run this for me,” reach for vLLM. It is the default open-source production server: PagedAttention for KV-cache memory management, continuous batching, chunked prefill, prefix caching, broad quantization support, and tensor/pipeline/expert parallelism. It runs on NVIDIA and AMD and exposes OpenAI- and Anthropic-compatible APIs.

vLLM does not remove the need to think about your system. You still tune batching, context length, GPU memory utilization, and parallelism layout, and on hardware without NVLink, pipeline parallelism can beat tensor parallelism. But it is the right starting point for serving open models in production, and the right tool to simulate production behavior locally before you rent datacenter hardware.

Consumer multi-GPU — ExLlamaV2 and ExLlamaV3

If you run quantized models on one modern NVIDIA card and want it to punch above its weight, ExLlamaV2 is the enthusiast’s local CUDA engine (EXL2 format, paged attention, speculative decoding). ExLlamaV3 extends that to 2–4 consumer GPUs and local mixture-of-experts models with tensor and expert parallelism and an OpenAI-compatible server. Expect rougher edges than vLLM in exchange for getting more out of consumer cards. If you are building a dual-GPU box, browse used RTX 3090 cards on eBay — two used 24 GB cards is the community’s standard path to 48 GB of local VRAM.

The datacenter boundary — where local stops

TensorRT-LLM is NVIDIA’s maximum-performance stack (FP8/FP4, custom kernels), SGLang is built for hostile serving traffic (long context, MoE, prefill-decode disaggregation), and NVIDIA Dynamo orchestrates fleets across many nodes. These are real and excellent — and they are not local-rig tools. They assume H100/H200/B200-class hardware and NVIDIA-only datacenters. If your honest answer is “I need that,” the cost calculus changes and you should compare against renting: see the Local vs Cloud cluster before buying anything in this tier. Naming the boundary is part of the decision; pretending a single desktop competes with a GPU cluster is not.

Why we do not recommend Ollama

Ollama is convenient, and it works. It is also a wrapper around llama.cpp — it embeds the same kernels — so it is not a faster engine hiding behind a nicer command. LocalRig measured this directly on a base Apple M4 (16 GB): llama.cpp at 18.4 tok/s and Ollama at 19.5 tok/s on the same Llama 3.1 8B Q4_K_M weights (first-party, 2026-06-27). The roughly one-token-per-second gap is measurement noise from slightly different methods, not an engine advantage.

Because the speed is the same, the choice comes down to control and transparency, and that is where we land on running the real thing. llama.cpp and MLX give you pinned builds you can reproduce, explicit control over quantization and flags, and a clean path to a production serving stack later. Ollama’s convenience comes from hiding those defaults — model versioning and quantization choices happen behind a wrapper. For a project you intend to keep, run llama.cpp (or MLX on a Mac) directly. The full numbers are in the 7B/8B hardware guide.

Hardware recipes (engine + where to buy)

  • CPU-only box: llama.cpp on system RAM. Slow (single-digit tok/s) but free to try before buying a GPU.
  • Mac mini / Mac Studio: MLX native; llama.cpp for GGUF portability. Capacity is the spec — Mac options on Amazon.
  • Single used RTX 3090 (24 GB): llama.cpp or ExLlamaV2; vLLM if you serve others. Best throughput-per-dollar — used 3090 on eBay.
  • Dual/quad consumer NVIDIA: ExLlamaV3 for quantized multi-GPU or local MoE; vLLM/SGLang if serving behavior matters.
  • 8×H100 and up: vLLM or SGLang first, TensorRT-LLM if NVIDIA-only and the tuning pays off — but price it against cloud first.

Who This Is NOT For

  • People who just want the single easiest button. If you never intend to tune anything, control quantization, or move to a serving stack, a convenience wrapper will feel simpler day one. This guide optimizes for a setup you can reproduce and grow, not the absolute shortest path to a first token.
  • Production teams serving customers at scale. A single-rig engine choice does not cover routing, autoscaling, observability, and SLA behavior. Start at vLLM or SGLang and benchmark against TensorRT-LLM; treat this page as the on-ramp, not the architecture.
  • Trainers and fine-tuners. This is an inference guide. Training has different requirements (CUDA for most frameworks, far more VRAM), and the Apple Silicon path in particular is not the training path.
  • Anyone whose model does not fit yet. If you have not done the capacity math, the engine is the wrong first question. Start with What Is Quantization and the buying framework, then come back here.

Sources

  • Ahmad Osman, “Inference Engines for LLMs & Local AI Hardware (2026 Edition)” and “GPU Memory Math for LLMs (2026 Edition),” x.com/TheAhmadOsman — the prefill/decode and engine-family framing (accessed 2026-06-28).
  • LocalRig first-party benchmark: base Apple M4, 16 GB — llama.cpp b9820 (18.4 tok/s) and Ollama 0.30.11 (19.5 tok/s), Llama 3.1 8B Q4_K_M, 2026-06-27. See the 7B/8B hardware guide for methodology.
  • Project documentation: llama.cpp (github.com/ggml-org/llama.cpp), vLLM (docs.vllm.ai), Apple MLX (github.com/ml-explore/mlx), accessed 2026-06-28.

Sources

  • Ahmad Osman, 'Inference Engines for LLMs & Local AI Hardware (2026 Edition)' — x.com/TheAhmadOsman (accessed 2026-06-28)
  • Ahmad Osman, 'GPU Memory Math for LLMs (2026 Edition)' — x.com/TheAhmadOsman (accessed 2026-06-28)
  • LocalRig first-party benchmark: base Apple M4, 16 GB — llama.cpp b9820 and Ollama 0.30.11, 2026-06-27
  • llama.cpp project documentation and supported backends: github.com/ggml-org/llama.cpp (accessed 2026-06-28)
  • vLLM documentation — parallelism and serving: docs.vllm.ai (accessed 2026-06-28)
  • Apple MLX / MLX-LM documentation: github.com/ml-explore/mlx (accessed 2026-06-28)