
Ollama vs llama.cpp vs vLLM in 2026: Which Runtime Fits Your Hardware and Workload

Every runtime comparison written since early 2026 says roughly the same thing: Ollama is easy, llama.cpp is fast, vLLM is for servers. That framing is not wrong, but it is not useful either — it tells you nothing about your box and your workload. The right way to pick is constraint-first: how many people hit this model at once, how much control you need over the build, and whether your hardware is even supported by all three.

This guide is the runtimes companion to our guide on how to run LLMs locally, which covers installation mechanics. This page is the decision layer: which of the three you should actually run, and why the “just use X” advice you’re reading elsewhere skips the parts that matter.

What does each runtime actually do differently?

Ollama, llama.cpp, and vLLM are not three implementations of the same idea. They are three different bets on what matters most.

llama.cpp is a C/C++ inference engine built directly on the ggml tensor library, targeting GGUF-quantized models. It runs on almost anything — CPU-only, CUDA, ROCm, Metal, Vulkan, even niche and older hardware the other two don’t bother supporting — because the project prioritizes portability and low-level control over ergonomics. You compile it, set flags, and run a binary against a model file.

Ollama is a packaging and serving layer that wraps model management, a REST API, and a CLI around a GGUF-based inference core. Its pitch is that a single command, "ollama run llama3", has you generating tokens, with a model registry that handles downloads and versioning for you. It historically ran on top of llama.cpp; more on that below.
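To make the "REST API" part concrete, here is a minimal sketch of a non-streaming call to Ollama's generate endpoint using only the Python standard library. It assumes a local Ollama server on its default address (http://localhost:11434) and a model you have already pulled; the helper names are hypothetical, and the model name is whatever you pass in.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for the generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With a server running, generate("llama3", "Why is the sky blue?") should return the completion as a string. Nothing here is specific to one model; the registry handling happens on the server side.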

vLLM is a serving engine built for throughput under concurrent load, using continuous batching and PagedAttention to keep a GPU busy across many simultaneous requests instead of one at a time. It targets full-precision or GPTQ/AWQ-quantized models more than GGUF, and it assumes you’re standing up an API endpoint for multiple users, not chatting with yourself. For the quantization-format tradeoffs that decide which of GGUF, GPTQ, or AWQ fits your runtime, see GGUF vs GPTQ vs AWQ.
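Why continuous batching matters is easier to see with a toy model. The sketch below is not vLLM's scheduler: it ignores prefill cost and KV-cache memory (the part PagedAttention addresses) and simply counts decode steps, assuming every running request produces one token per step. The function names and the example workload are made up for illustration.

```python
from collections import deque
from typing import Sequence


def static_batching_steps(lengths: Sequence[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes, then the next starts."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths: Sequence[int], batch_size: int) -> int:
    """A finished request frees its slot, which is refilled on the next step."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every running request emits one token; drop the ones that finished.
        running = [left - 1 for left in running if left > 1]
    return steps


# Hypothetical workload: two long answers mixed with six short ones.
workload = [8, 2, 2, 2, 8, 2, 2, 2]
print(static_batching_steps(workload, batch_size=4))      # prints 16
print(continuous_batching_steps(workload, batch_size=4))  # prints 10
```

In the static case the short requests sit idle in finished slots while the long one completes; in the continuous case those slots are reused immediately. The more uneven the response lengths and the more requests queued, the bigger that difference gets.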

The one-line version: llama.cpp is the engine, Ollama is the friendly car built around a version of that engine, and vLLM is the commercial truck built for a loading dock, not a driveway.

Did Ollama really move away from llama.cpp?

Community reporting (not independently verified by LocalRig) indicates Ollama shifted toward a more direct integration with the underlying ggml library in mid-2025, rather than continuing to build straight on top of upstream llama.cpp releases. Treat this as secondhand: LocalRig has not audited Ollama’s source history commit-by-commit to confirm the scope or timing precisely, and Ollama’s own release notes are the authoritative source if you need certainty.

What this means practically: Ollama is no longer simply “llama.cpp with a nicer CLI.” It has its own release cadence, its own model format handling, and its own performance characteristics that can diverge from vanilla llama.cpp — sometimes faster, sometimes slower, depending on the model and build. That divergence is part of why blanket “Ollama is just llama.cpp so use llama.cpp instead” arguments are dated advice from before the split. It’s also part of why the attribution controversy below has legs — a project that increasingly stands on its own technical footing still built its initial reputation on someone else’s engine.

What is the “Don’t Use Ollama” controversy, and is it fair?

A wave of “Don’t Use Ollama” posts spread on r/LocalLLaMA in May and June 2026 (community-cited, not independently verified by LocalRig). Strip out the noise and there are two separate arguments, and they deserve separate answers.

The open-source-ethics critique is about attribution and licensing conduct: that Ollama built its early product and reputation on llama.cpp’s engine without, in critics’ view, crediting the upstream project proportionally to its contribution. This is a legitimate governance conversation about how derivative open-source products should credit and compensate the projects they’re built on — it is not a technical performance complaint, and reasonable people land on different sides of it depending on how much weight they give to license compliance versus community norms versus commercial reality. LocalRig isn’t the venue to adjudicate it; if it matters to you, read the actual GitHub threads and Ollama’s own responses rather than the summary posts.

The technical critique is that Ollama’s abstraction layer — its own model format, its defaults, its API shape — trades away control and, in some configurations, throughput, versus running llama.cpp directly. This has real substance too: power users who tune context size, batch size, and quantization by hand routinely get more out of llama.cpp than out of Ollama’s defaults. It is not evidence that Ollama is broken; it’s evidence that Ollama optimizes for a different user than the one filing the complaint.

Both threads are worth taking seriously. Neither is a reason to avoid Ollama if what you actually want is the fastest path from “I have a GPU” to “I am chatting with a local model.” They are reasons to know what you’re trading away when you pick ergonomics over control — and to not pretend the choice is free.

How much does the runtime actually cost you in speed?

Less than the internet implies, at least for single-user chat. LocalRig’s first-party test on a base Apple M4 (16GB unified memory) running Llama 3.1 8B at Q4_K_M found llama.cpp b9820 at 18.4 tok/s and Ollama 0.30.11 at 19.5 tok/s (measured 2026-06-27) — Ollama was marginally faster on this run, not slower. That’s one model, one machine, one snapshot in time, and it will vary with build, quantization, and context length. But it’s a useful corrective to the reflexive “Ollama is slower because it’s an abstraction layer” line repeated in this spring’s crop of near-identical comparison posts. On a single GPU or Apple Silicon box serving one user, the runtime choice is not where your tokens per second go missing.
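For scale, the arithmetic behind "marginally faster" can be sketched as below. The first helper assumes you are reading the eval_count and eval_duration (nanoseconds) fields that Ollama includes in a non-streaming response; the function names are hypothetical, and the only measured figures used are the two tok/s values quoted above.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Decode speed from a token count and a duration given in nanoseconds."""
    return eval_count / (eval_duration_ns / 1e9)


def relative_gap_pct(baseline: float, other: float) -> float:
    """How much faster (positive) or slower (negative) `other` is, in percent."""
    return (other - baseline) / baseline * 100


# LocalRig's figures: llama.cpp 18.4 tok/s, Ollama 19.5 tok/s.
print(f"{relative_gap_pct(18.4, 19.5):.1f}%")  # prints 6.0%
```

A gap of about 6% on a single run is well inside what a different build, quantization, or context length could reverse.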

Where the gap becomes real is concurrent load. A widely repeated figure — vLLM at roughly 793 tok/s aggregate versus Ollama at roughly 41 tok/s under concurrent requests — circulates across benchmark posts and comparison articles this spring. LocalRig could not identify a primary source or reproducible methodology behind that number, so treat it as directional only: it illustrates that vLLM’s continuous batching gives it a large structural advantage once you have multiple simultaneous requests hitting the same GPU, not a number you should plan capacity around. If you need real concurrent-serving numbers, benchmark your own model and hardware with vLLM’s own tooling before committing budget.
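If you want your own concurrent-load figure rather than a borrowed one, a rough harness looks something like the sketch below. It is not vLLM's benchmarking tooling. It assumes an OpenAI-compatible completions endpoint (which vLLM can serve) that reports usage.completion_tokens in its response; the function names, base URL, and model name are placeholders you would supply.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, Tuple


def completion_tokens(base_url: str, model: str, prompt: str, max_tokens: int = 128) -> int:
    """POST one prompt to <base_url>/v1/completions and return tokens generated."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})
    request = urllib.request.Request(
        base_url + "/v1/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())["usage"]["completion_tokens"]


def aggregate_throughput(
    send: Callable[[str], int], prompts: Sequence[str], concurrency: int
) -> Tuple[int, float]:
    """Run `send` over all prompts with N workers; return (tokens, tokens per second)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        total_tokens = sum(pool.map(send, prompts))
    elapsed = time.perf_counter() - start
    return total_tokens, total_tokens / elapsed
```

To use it, wrap completion_tokens in a lambda that fixes your base URL and model, then call aggregate_throughput at concurrency 1, 4, 16, and so on with the same prompt set. The shape of that curve on your hardware is what matters, not any single headline number.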

Comparison table: Ollama vs llama.cpp vs vLLM

| Criterion | Ollama | llama.cpp | vLLM |
|---|---|---|---|
| Best for | Getting started, single-user chat | Control, exotic/unsupported hardware, custom builds | Concurrent multi-user serving |
| Setup effort | Lowest: one command | Moderate: compile flags, manual config | Higher: Python env, GPU-specific setup |
| Model format | GGUF (own registry/format handling) | GGUF | Safetensors, GPTQ/AWQ; limited GGUF |
| Single-user tok/s (~7-8B model) | Comparable to llama.cpp (LocalRig: 19.5 tok/s, M4 16GB) | Comparable to Ollama (LocalRig: 18.4 tok/s, M4 16GB) | Not optimized for single-stream; overhead for one user |
| Concurrent-request throughput | Weaker: no continuous batching | Weaker: single-stream focus | Strongest: continuous batching, PagedAttention |
| Hardware breadth | CUDA, ROCm, Metal, CPU | Broadest: CUDA, ROCm, Metal, Vulkan, CPU, niche/old hardware | CUDA-first; narrower GPU support |
| Multi-GPU serving | Basic | Basic/manual | Purpose-built (see vLLM multi-GPU setup) |
| Governance note | ggml-direct backend since ~mid-2025 (secondhand); attribution debate ongoing | Upstream engine; broad community trust | Backed by active project + commercial ecosystem |

Sources for the table figures are listed under Sources at the end of this page; the concurrent-throughput row is qualitative, not a specific number, for the reasons above.

Which runtime fits your workload?

You’re getting started or running one model for yourself: Ollama

If the job is “run a model on my own machine and talk to it,” Ollama’s model registry and one-command setup remove real friction, and LocalRig’s own numbers show you are not paying a meaningful speed tax for that convenience on a typical single-GPU or Apple Silicon box. This is also the runtime most beginner guides assume — see LM Studio vs Ollama if you want a GUI-first alternative in the same ergonomics tier. Weigh the attribution debate above on your own terms; it does not change the technical fitness for this use case.

Hardware to run it on: any 24GB card handles most 7B-13B models comfortably — see the best GPU for local LLM inference guide, with the used RTX 3090 as the community value pick.
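A back-of-the-envelope check on that sizing claim: weight memory is roughly parameter count times bits per weight, divided by eight. The sketch below assumes about 4.8 bits per weight as a ballpark for a 4-bit K-quant (an assumed figure, not a measured one) and deliberately ignores KV cache, activations, and runtime overhead, all of which add to the total.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: ~4.8 bits/weight for a 4-bit K-quant; real files vary by model.
for name, params in [("7B", 7e9), ("8B", 8e9), ("13B", 13e9)]:
    size = weight_memory_gb(params, 4.8)
    print(f"{name}: ~{size:.1f} GB of weights")
```

Even the 13B case lands well under 24GB for the weights alone, which is why there is room left for long contexts or a higher-precision quantization on a card of that size.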

You want control, custom quantization, or unsupported hardware: llama.cpp direct

If you’re tuning build flags, running on hardware Ollama doesn’t officially support, or want the reference implementation without a management layer on top, llama.cpp directly is the honest choice. It’s also the right pick if you distrust an abstraction layer between you and the model weights, for either performance or governance reasons. The tradeoff is more manual setup — no model registry, no auto-download, you manage GGUF files and build flags yourself. See running llama.cpp on an RTX 3090 for a concrete single-GPU walkthrough.
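As an illustration of what "managing flags yourself" looks like, the sketch below assembles a llama-server command line with the knobs power users typically touch: context size, batch size, and GPU layer offload. The helper name, the default values, and the model path are placeholders rather than recommendations; check the flags against the build you compiled, since options change between releases.

```python
import shlex
from typing import List


def llama_server_command(
    model_path: str,
    ctx_size: int = 4096,
    batch_size: int = 512,
    gpu_layers: int = 99,
    port: int = 8080,
) -> List[str]:
    """Build an argv list for llama-server; nothing is executed here."""
    return [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        "--batch-size", str(batch_size),
        "--n-gpu-layers", str(gpu_layers),
        "--port", str(port),
    ]


# Hypothetical model path; point this at a GGUF file you actually have.
print(shlex.join(llama_server_command("models/example-8b-q4_k_m.gguf")))
```

Keeping the command in a small function like this makes it easy to sweep one parameter at a time and record which combination you benchmarked, which is most of what tuning by hand amounts to.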

You’re serving multiple concurrent users: vLLM

If more than one person or process is hitting the same model at the same time (a team tool, an API endpoint, a small internal product), vLLM’s continuous batching is built for exactly that problem, and neither Ollama nor llama.cpp is built around that kind of scheduler. This is also where the hardware calculus changes: you’re now sizing for aggregate throughput under load, not one person’s tok/s, and multi-GPU setups start making sense at a much lower workload threshold than for single-user inference. See vLLM multi-GPU setup guide for the configuration details.

If you’d rather not buy the hardware to test this at concurrent scale, renting time on a cloud GPU to validate your batching setup before committing capital is a reasonable middle step. RunPod (runpod.io) is a common option for temporarily spinning up a vLLM-serving instance; the URL is plain, not a referral link.

Bottom line

None of these three runtimes is “the winner”; they answer different questions. Ollama answers “how do I get a model running with the least friction,” llama.cpp answers “how do I get exact control over the inference engine,” and vLLM answers “how do I serve many people at once without the GPU choking.” LocalRig’s own single-machine test found Ollama and llama.cpp about one token per second apart (19.5 versus 18.4 tok/s, with Ollama ahead), so don’t let speed anxiety drive that choice; let workload shape drive it. The “Don’t Use Ollama” wave raised a real governance question worth reading yourself, and a real ergonomics-vs-control tradeoff worth weighing, but it settled nothing about which runtime is technically correct for a single user on one GPU. Pick based on how many people are asking the model questions at once, not based on which post was loudest this spring.
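The constraint-first logic of this guide can be condensed into a few lines. This is a deliberately crude sketch with hypothetical parameter names: it encodes only the three cases discussed above and does not check whether vLLM supports your GPU, which you would need to confirm separately.

```python
def pick_runtime(
    concurrent_users: int,
    needs_build_control: bool = False,
    hardware_supported_by_ollama: bool = True,
) -> str:
    """Map the three workload questions from this guide onto a runtime name."""
    if concurrent_users > 1:
        # Aggregate throughput under load is the deciding constraint.
        return "vLLM"
    if needs_build_control or not hardware_supported_by_ollama:
        # Custom builds, hand tuning, or hardware outside Ollama's support list.
        return "llama.cpp"
    # One user, supported hardware, no special build needs.
    return "Ollama"


print(pick_runtime(concurrent_users=1))                            # prints Ollama
print(pick_runtime(concurrent_users=1, needs_build_control=True))  # prints llama.cpp
print(pick_runtime(concurrent_users=12))                           # prints vLLM
```

Concurrency is checked first because it is the one constraint where the runtimes differ structurally rather than by a few percent.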

Sources

  • Ollama GitHub repository and release notes: github.com/ollama/ollama (2025-2026)
  • llama.cpp GitHub repository: github.com/ggml-org/llama.cpp (2025-2026)
  • vLLM project documentation and GitHub: github.com/vllm-project/vllm (2025-2026)
  • r/LocalLLaMA 'Don't Use Ollama' discussion threads (community-cited, May-June 2026)
  • LocalRig first-party benchmark: base Apple M4, 16 GB — llama.cpp b9820 (18.4 tok/s) and Ollama 0.30.11 (19.5 tok/s), Llama 3.1 8B Q4_K_M, 2026-06-27