Software & Runtimes

ExLlamaV3 vs GGUF in 2026: Speed, Quality, and Which Stack Deserves Your VRAM

The Staleness Problem

The ExLlamaV3-vs-GGUF debate is real, but almost all circulating data is from 2024, comparing EXL2 (not EXL3) to GGUF at an earlier llama.cpp version. Community benchmarks commonly cite ~14,000 tok/s prefill for EXL2 on an RTX 4090 versus ~7,500 tok/s for GGUF — a 2:1 gap that drove the excitement. But that data is now two years old, EXL3 arrived mid-2025 with claimed improvements, and llama.cpp has shipped multiple generations of GGUF optimization.

The honest disclosure: 2026 EXL3-vs-GGUF benchmarks do not yet exist. This comparison maps what is known, flags what is unknown, and names the benchmark gap rather than recycling 2024 numbers as current. If you are considering EXL3, you are on the edge of the community, and the choice is more about your constraint (GPU-only, needs prefill speed) than a settled speed win.

The Core Difference: Architecture, Not Just Quantization

ExLlamaV3 and GGUF are not quantization formats — they are inference engines that happen to use quantized weights. The difference is design:

ExLlamaV3 (maintained by turboderiv, integrated into TabbyAPI and other servers) is built for CUDA-only, maximum-speed inference. It assumes your weights live on an NVIDIA GPU, optimizes every operation for that GPU’s architecture, and offloads memory management to CUDA. The payoff is speed, especially for the prefill phase (processing your input prompt). The cost is portability: it does not run on CPU, does not run on Mac, does not run on AMD cards.

GGUF (the format; llama.cpp is the canonical engine) is built for portability and flexibility. GGUF weights can run on GPU, CPU, or a mix (CPU offload to GPU, CPU fallback). The inference loop is not GPU-specific; it adapts to whatever device you have. The throughput ceiling is lower than CUDA-optimized code, but the range of hardware is vastly wider. If you ever want to run a model on a laptop, a phone, or an older Mac, GGUF is the only option.

These are fundamentally different trade-offs, not a “faster vs slower” ranking.

Speed: Prefill vs Decode

The speed gap cited in 2024 benchmarks (~2:1 for EXL2) is almost entirely in prefill: processing your input tokens, the one-time cost of parsing your prompt. Prefill is compute-bound and benefits heavily from GPU parallelism. EXL3 is tuned for prefill speed.

Decode — generating the next token, one at a time, over and over — is a different beast. It is memory-bandwidth-bound: you read the entire model weights once per generated token, so GPU parallelism is less leveraged. At equal quantization (same bits per weight), the decode speed gap between EXL3 and GGUF I-quants is much smaller than the prefill gap. Community testing suggests parity or near-parity for decode at equal bpw.

The size of that prefill advantage in practice depends on your workload:

  • Long-context, many tokens to process upfront: Prefill dominates the wall-clock time. EXL3’s 2–3× prefill speed is a real win.
  • Short prompts, focus on output generation: Prefill is a tiny fraction of the total time. The speed difference is barely noticeable.
  • Batching multiple requests: The economics change entirely (vLLM / SGLang territory); neither consumer-grade tool is built for that.
ScenarioPrefill ImpactEXL3 AdvantageRecommendation
Chat (short prompt, interactive)~5–10% of timeMarginal (<1 sec)Either; GGUF preferred for flexibility
Document processing (2K+ tokens)~30–50% of timeSignificant (2–5 sec)EXL3, if GPU-only is acceptable
Batch inference (many users)N/AN/ANeither; use vLLM/SGLang + GPU cluster
Interactive coding (medium context)~15–25% of timeNoticeable (0.5–2 sec)GGUF for CPU offload; EXL3 for speed

Quality: Roughly Equivalent at Equal Bits-Per-Weight

The quantization formats differ in how they pack weights, but at the same bits-per-weight (bpw), GGUF I-quants and EXL3 quants achieve similar accuracy. Community MMLU benchmarks (not independently verified by LocalRig) report:

  • GGUF I4, Q4_K_M: ~73–75 MMLU on Llama 3.1 8B
  • EXL3 4-bit (INT4): ~73–74 MMLU on Llama 3.1 8B

The differences are within noise. If you are worried about quality loss, the quantization level (4-bit vs 5-bit vs 8-bit) matters far more than the format. Pick the bpw you need, then choose the engine based on speed and portability.

One caveat: EXL3 3-bit and 2-bit quants claim better quality at ultra-low precision than GGUF equivalents (fewer community benchmarks here). If you are pushing the limits of quantization for VRAM, EXL3 may squeeze more usable quality into fewer bits. This is a power-user edge case, not a general recommendation.

The GPU-Only Ceiling: A Real Constraint

ExLlamaV3’s speed comes from being purpose-built for CUDA. That design choice excludes:

  • CPU offload: GGUF lets you offload layers to CPU when VRAM runs short; EXL3 does not. On a card with insufficient VRAM for a full model at your chosen quantization, GGUF degrades gracefully; EXL3 fails.
  • Mac users: GGUF runs on Mac via llama.cpp or Ollama. EXL3 does not. If you ever want to test on a Mac or work on an Apple Silicon machine, GGUF is the only path.
  • AMD cards: GGUF can run on AMD via ROCm (though llama.cpp support is newer). EXL3 is CUDA-only.
  • Older or budget NVIDIA cards with low VRAM: EXL3 without CPU offload is a harder ceiling. A 12GB card running EXL3 must fit the full model in 12GB; a 12GB card running GGUF can offload layers to system RAM and still work (slower, but not a wall).

If any of these constraints apply to you, GGUF is not the second-best choice — it is the only choice. The speed comparison is irrelevant when EXL3 simply does not run on your hardware.

Portability and Ecosystem

GGUF: Runs via llama.cpp (C++, very portable), Ollama (user-friendly wrapper), LM Studio (UI), Kobold.cpp, and others. New servers and integrations arrive regularly. The ecosystem is broad and fragmented — you have many tools to choose from, and almost any setup you can dream up has a GGUF path.

EXL3: Primarily runs via TabbyAPI (the dedicated server), TextGen WebUI (with EXL3 backend), and a few specialized integrations. Smaller ecosystem, but well-maintained by active developers. If TabbyAPI works for your use case, the setup is simple; if you have a specific integration in mind, check first.

The Benchmark Gap: What We Actually Need in 2026

Here is what we know:

  • EXL2 was ~2× faster at prefill than 2024-era GGUF (community-cited, not independently verified by LocalRig).
  • EXL3 claims improvements over EXL2.
  • Modern GGUF (2025–2026) has also improved.

Here is what we don’t know:

  • EXL3 vs modern GGUF I-quants on the same hardware (RTX 4090, RTX 3090, M4 Pro, etc.), measured in 2026.
  • Decode speed parity or gap at equal bpw for EXL3 vs current GGUF.
  • The size of EXL3’s advantage on real workloads (not synthetic benchmarks).
  • Whether EXL3’s claimed 3-bit quality win is real or marketing.

This article should have included fresh benchmarks. It does not, because LocalRig’s current benchmark infrastructure is first-party (base Apple M4) and does not cover CUDA hardware for this comparison. EXL3-vs-GGUF is a candidate for future first-party work; if you see it benchmarked properly in 2026, that work was not yet public when this article shipped.

Until then, the decision is not data-driven. It is constraint-driven.

Who Should Pick EXL3, Who Should Pick GGUF

Choose ExLlamaV3 if:

  • You have a dedicated NVIDIA GPU (RTX 3090, RTX 4090, etc.) and will not move it between machines.
  • You regularly process long contexts (2K+ tokens) where prefill dominates wall-clock time.
  • You are willing to trade portability for maximum decode speed.
  • You want to squeeze the last bits-per-weight out of quantization (sub-4-bit, quality-critical workloads).

Choose GGUF if:

  • You might run on CPU, Mac, or move between devices.
  • You want the largest ecosystem of tools and integrations.
  • You have an older GPU, a budget card, or not enough VRAM for the full model unquantized.
  • You value simplicity and “it just works” over maximum speed.
  • You are unsure and want the safe default.

How to Actually Use These Tools

TabbyAPI + ExLlamaV3: Install via the TabbyAPI repo, point it at an EXL3-format model (download from HuggingFace), and start the server. It exposes an OpenAI-compatible API, so integration is straightforward. Performance is excellent on RTX 4090 (reflex-level speed for short workloads).

llama.cpp + GGUF: ./llama-cli -m model.gguf -p "your prompt" or run Ollama, or LM Studio. GGUF models are ubiquitous on HuggingFace. Setup is faster and the barrier to entry is lower.

For help choosing the right quantization format before you even reach this choice, see GGUF vs GPTQ vs AWQ and Which Quant Should I Download.

For GPU hardware context

This comparison assumes you have a GPU (the one place EXL3 makes sense). If you are shopping for that GPU, the used RTX 3090 buying guide covers why a 24GB card is the right tier, and run llama.cpp on RTX 3090 walks the practical setup for the more portable choice.

Bottom Line

ExLlamaV3 is genuinely faster at prefill on NVIDIA GPUs, but the advantage is smallest for the workloads most people actually do (chat, short prompts, document work). GGUF’s flexibility and ecosystem are no longer narrow tradeoffs — they are real advantages for anyone not 100% committed to a single GPU. The 2024 data that drove EXL3 excitement is two years stale, and 2026 benchmarks do not yet exist. Choose based on your hardware constraints and portability needs, not on speed claims you cannot verify. If GGUF works for you, it is the safer default; if you need maximum prefill speed and can lock in a GPU, EXL3 is the specialized win.

Sources

  • matt-c1 (r/LocalLLaMA): EXL2 prefill ~14k tok/s vs GGUF ~7.5k tok/s on RTX 4090 (2024) — no independent LocalRig verification, EXL2-era data, not EXL3
  • GGUF I-quantization research (vllm, llama.cpp): equal bits-per-weight, MMLU parity with EXL formats (community-cited, 2024–2025)
  • ExLlamaV3 (turboderiv) GitHub repo and TabbyAPI server documentation (2025–2026)
  • LocalRig first-party benchmark: base Apple M4, 16 GB — 18.4 tok/s (llama.cpp b9820) and 19.5 tok/s (Ollama 0.30.11), Llama 3.1 8B Q4_K_M, 2026-06-27