GGUF vs GPTQ vs AWQ: Which Model Format for Your Setup?

Model format confusion is really runtime confusion. You see a .gguf file, a .gptq.safetensors file, and an exl2 model, and they look like the same thing with different names. They are not. Each format is optimized for a specific runtime and hardware path. Download the wrong one for your setup and either the model will not load, or it will load but run slowly.

This glossary maps format to the runtime it is designed for, and then to the hardware sweet spot. If you choose the runtime first — based on whether you are on a Mac, using a single NVIDIA GPU, serving multiple users, or pushing throughput — the format decision becomes obvious.

The principle: format follows runtime, not the reverse

Here is the mental model: you have a rig (a Mac, a single RTX 4090, two RTX 3090s, or a small cluster). You choose a runtime (llama.cpp, Ollama, vLLM, ExLlamaV3). The runtime expects a specific format. You download the model in that format. The format itself is not inherently “faster” or “slower” — the runtime is. The format is just the encoding the runtime uses to unpack the weights.
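
The runtime-first mental model can be sketched as a small lookup. This is only an illustration: the mapping is transcribed from this article's table and decision tree, and the helper names (`guess_format`, `can_load`) and the filename heuristics are assumptions, not part of any runtime's API.

```python
from typing import Optional

# Runtime -> formats it is designed to load, as described in this article.
RUNTIME_FORMATS = {
    "llama.cpp": {"GGUF"},
    "ollama": {"GGUF"},
    "vllm": {"AWQ", "GPTQ"},
    "autogptq": {"GPTQ"},
    "autoawq": {"AWQ"},
    "exllamav2": {"GPTQ"},
    "exllamav3": {"EXL3"},
}


def guess_format(name: str) -> Optional[str]:
    """Guess the format from a file or repository name (heuristic only)."""
    lowered = name.lower()
    if lowered.endswith(".gguf"):
        return "GGUF"
    for tag in ("gptq", "awq", "exl3"):
        if tag in lowered:
            return tag.upper()
    return None


def can_load(runtime: str, name: str) -> bool:
    """True if the guessed format is one the runtime is listed as accepting."""
    fmt = guess_format(name)
    return fmt is not None and fmt in RUNTIME_FORMATS.get(runtime.lower(), set())


print(can_load("Ollama", "llama-3.1-8b.Q4_K_M.gguf"))  # GGUF on a GGUF runtime
print(can_load("vLLM", "llama-3.1-8b.Q4_K_M.gguf"))    # GGUF is not in the vLLM row above
```

Publishers do not always put the format in the name, so treat the guess as a first check and read the model card before downloading.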

Format | Native runtime | VRAM | Hardware fit | Trade-off
GGUF | llama.cpp, Ollama | Flexible (Q2 → F32) | 1× consumer GPU, CPU offload, Mac | Great CPU offload; slower than vLLM on GPU-only inference
GPTQ | vLLM, AutoGPTQ, ExLlamaV2 | Fixed at quantization time | NVIDIA GPU (preferred), multi-GPU | Fast on NVIDIA; GPU-only; less flexible than GGUF
AWQ | vLLM, AutoAWQ | Fixed at quantization time | NVIDIA GPU (preferred), multi-GPU | Newer than GPTQ; better quality at same bpw; otherwise similar
EXL3 | ExLlamaV3 | Ultra-compact (2–4 bpw) | Consumer NVIDIA GPU (throughput-focused) | Peak throughput; inflexible format; specialized runtime

GGUF: The portable format

GGUF (GPT-Generated Unified Format) powers llama.cpp and Ollama. It is the format you download if:

  • You are running on a Mac (any Apple Silicon) — llama.cpp and Ollama both use GGUF and support GPU offload (Metal API) and CPU inference.
  • You want CPU offload on an NVIDIA GPU: keep as many layers as fit in VRAM on the GPU and run the remaining layers on the CPU from system RAM. This is slower than full GPU inference but lets you run models larger than your VRAM.
  • You want simplicity and flexibility over peak throughput — a single GGUF file, one command, works across runtimes (llama.cpp, Ollama, LM Studio, Jan, Hugging Face local).
  • You are on a single GPU and want CPU fallback as a safety net.
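
As a rough sketch of the partial-offload idea, the helper below estimates how many layers fit in VRAM. It assumes all layers are the same size and that a fixed reserve covers the KV cache and runtime buffers; both are simplifications, and `gpu_layers_that_fit` and the 1.5 GB default are hypothetical. The 4.3 GB figure is this article's Q4_K_M size for a 7B model, and 32 layers is assumed as typical of Llama-style 7B models.

```python
def gpu_layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                        reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit on the GPU.

    Assumes equal-sized layers and a flat VRAM reserve (hypothetical default)
    for the KV cache and scratch buffers.
    """
    if model_gb <= 0 or n_layers <= 0:
        raise ValueError("model_gb and n_layers must be positive")
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb / per_layer_gb))


# 7B Q4_K_M (about 4.3 GB), 32 layers assumed, on a 4 GB card.
print(gpu_layers_that_fit(4.3, 32, vram_gb=4.0))
# The same model on an 8 GB card fits entirely.
print(gpu_layers_that_fit(4.3, 32, vram_gb=8.0))
```

The result is the kind of value you would pass to llama.cpp's `-ngl` / `--n-gpu-layers` option or to the `n_gpu_layers` argument of llama-cpp-python's `Llama` class. Start from the estimate and lower it if the model fails to load.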

GGUF trade-offs:

  • The quantization level is baked into the file and named in the filename (Q4_K_M, Q5_K_M, Q2_K, F16, etc.), so you choose VRAM vs. quality when you download, not at runtime.
  • Per community benchmarks (2024–2025; not independently verified by LocalRig), peak throughput on NVIDIA GPU-only inference lags vLLM by ~10–20% for the same quantization, because vLLM’s batching and kernel optimization are runtime-level, not format-level.
  • CPU offload is wonderful for Macs; on NVIDIA it works but trades speed for flexibility.
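
Because the level is part of the filename, a script can read it off before downloading. The sketch below is a heuristic under that naming convention only; the regular expression and the `quant_tag` helper are assumptions, and the example filenames are made up for illustration.

```python
import re
from typing import Optional

# Matches tags such as Q4_K_M, Q8_0, IQ3_XS, F16, BF16, F32 when they are
# delimited by '.', '-' or '_' (a naming convention, not a format guarantee).
_QUANT_TAG = re.compile(
    r"(?:^|[.\-_])((?:I?Q\d(?:_[A-Z0-9]+)*)|BF16|F16|F32)(?=[.\-_]|$)",
    re.IGNORECASE,
)


def quant_tag(filename: str) -> Optional[str]:
    """Return the quantization tag embedded in a GGUF filename, if any."""
    if not filename.lower().endswith(".gguf"):
        return None
    match = _QUANT_TAG.search(filename)
    return match.group(1).upper() if match else None


for name in ("llama-3.1-8b-instruct.Q4_K_M.gguf", "mistral-7b-f16.gguf"):
    print(name, "->", quant_tag(name))
```

The authoritative record of the quantization is the metadata inside the file; the filename is a convenience that publishers usually, but not always, follow.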

Size reference (7B model): Q2_K ≈ 2.8 GB | Q3_K ≈ 3.5 GB | Q4_K_M ≈ 4.3 GB | Q5_K_M ≈ 5.3 GB | Q6_K ≈ 6.5 GB | F16 ≈ 14 GB.
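
These sizes follow from simple arithmetic: bytes ≈ parameters × bits per weight ÷ 8. The sketch below applies that formula in both directions, using only the sizes quoted above and treating "7B" as exactly 7e9 parameters and 1 GB as 1e9 bytes (both simplifying assumptions). The function names are illustrative.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def implied_bits_per_weight(size_gb: float, n_params: float) -> float:
    """Invert the formula: average bits stored per parameter."""
    return size_gb * 1e9 * 8 / n_params


# Sizes quoted in this article for a 7B model.
SIZES_7B_GB = {"Q2_K": 2.8, "Q3_K": 3.5, "Q4_K_M": 4.3,
               "Q5_K_M": 5.3, "Q6_K": 6.5, "F16": 14.0}

for tag, size_gb in SIZES_7B_GB.items():
    bpw = implied_bits_per_weight(size_gb, 7e9)
    print(f"{tag:7s} {size_gb:5.1f} GB  ~{bpw:.2f} bits/weight")
```

The implied figure for each quantized level comes out above its nominal bit count because the file also stores per-block scales and keeps some tensors at higher precision. Budget additional VRAM for the KV cache and runtime buffers on top of the weight size.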

For the quantization-level math and when to choose each, see What is Quantization.

GPTQ: The vLLM standard

GPTQ (Generative Pre-trained Transformer Quantization) is a long-established, widely published quantization format for vLLM deployments. Download GPTQ if:

  • You are running vLLM on one or more NVIDIA GPUs.
  • You want good quality at low VRAM and solid support across vLLM, AutoGPTQ, and ExLlamaV2.
  • You are serving multiple concurrent users (vLLM’s batching and scheduling work with GPTQ and AWQ checkpoints).
  • You need multi-GPU throughput — vLLM’s tensor parallelism is production-tested on GPTQ.

GPTQ trade-offs:

  • The quantization level is baked into the model (e.g., “model-4bit-128g” means 4-bit weights with a group size of 128). As with GGUF, you cannot change the quantization level at load time; you download the exact variant you need.
  • GPTQ is NVIDIA-focused. It can run on CPU via AutoGPTQ, but it is slower and less mature than GGUF on CPU. llama.cpp and llama-cpp-python load GGUF, not GPTQ.
  • Perplexity and quality are solid at 4-bit, but 3-bit GPTQ shows more degradation than 3-bit GGUF.
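
To make the “4bit-128g” naming concrete, the sketch below parses that pattern and estimates the storage cost per weight once per-group metadata is counted. It assumes one 16-bit scale and one zero-point of the same width as the weights per group, which is a simplified model of a GPTQ checkpoint; real files differ in detail, and both helper names are hypothetical.

```python
import re
from typing import Optional, Tuple


def parse_gptq_name(name: str) -> Optional[Tuple[int, int]]:
    """Extract (bits, group_size) from names like 'model-4bit-128g'."""
    match = re.search(r"(\d+)bit-(\d+)g", name.lower())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def effective_bits(bits: int, group_size: int, scale_bits: int = 16) -> float:
    """Bits per weight including per-group scale and zero-point.

    Assumption: each group stores one scale (scale_bits) and one zero-point
    (same width as the weights).
    """
    return bits + (scale_bits + bits) / group_size


bits, group = parse_gptq_name("model-4bit-128g")
print(bits, group, round(effective_bits(bits, group), 3))
```

Smaller groups track the weights more closely but cost more metadata per weight, which is why the group size appears in the name alongside the bit count.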

Size reference (7B model): 4-bit ≈ 4–4.5 GB | 3-bit ≈ 3–3.5 GB.

For comparisons of GPTQ vs. AWQ vs. other quants at various bitwidths, see Which Quantization Should I Download.

AWQ: The newer alternative to GPTQ

AWQ (Activation-Aware Weight Quantization) is a newer quantization method that vLLM supports alongside GPTQ. It was introduced in 2023 and has been steadily displacing GPTQ for new models on Hugging Face. Download AWQ if:

  • You are using vLLM (2025+) and the model is published in AWQ.
  • You want 4-bit quality slightly better than GPTQ at the same bitwidth and group size — AWQ’s activation-aware scheme preserves more signal in the critical weights.
  • You are multi-GPU — vLLM’s tensor parallelism is equally optimized for AWQ as GPTQ.
  • You want a path that will still be actively developed in 2027.
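
A minimal sketch of how the vLLM path might look in Python. `LLM`, `quantization`, and `tensor_parallel_size` are part of vLLM's Python API; the helper functions and the model ID in the usage note are placeholders, and vLLM is imported lazily so the argument-building part runs without a GPU.

```python
from typing import Any, Dict


def vllm_engine_kwargs(model: str, fmt: str, n_gpus: int = 1) -> Dict[str, Any]:
    """Build keyword arguments for vllm.LLM for an AWQ or GPTQ checkpoint."""
    fmt = fmt.lower()
    if fmt not in {"awq", "gptq"}:
        raise ValueError(f"this sketch only covers AWQ and GPTQ, got {fmt!r}")
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return {
        "model": model,                  # Hugging Face repo ID or local path
        "quantization": fmt,             # tells vLLM which kernels to use
        "tensor_parallel_size": n_gpus,  # shard the model across this many GPUs
    }


def load_engine(model: str, fmt: str, n_gpus: int = 1):
    """Create the engine; requires vLLM and a supported GPU."""
    from vllm import LLM  # lazy import: only needed when actually loading

    return LLM(**vllm_engine_kwargs(model, fmt, n_gpus))
```

For example, `load_engine("your-org/your-model-AWQ", "awq", n_gpus=2)` would attempt a two-GPU tensor-parallel load, where the repository name stands in for whatever AWQ checkpoint the publisher provides.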

AWQ trade-offs:

  • Same baking-in of quantization levels as GPTQ — you choose 4-bit or 3-bit upfront.
  • Not yet as universally supported as GPTQ outside of vLLM; AutoAWQ is mature, but standalone AWQ support in other runtimes is less common.
  • Comparisons of GPTQ and AWQ from early 2024 are outdated; 2025 comparisons favor AWQ at equal bitwidth, but the gap is small.

Size reference (7B model): 4-bit ≈ 4–4.5 GB | 3-bit ≈ 3–3.5 GB (nearly identical to GPTQ).

EXL3: The throughput edge

EXL3 is ExLlamaV3’s native format — designed for maximum decode speed on consumer NVIDIA GPUs. Download EXL3 if:

  • You are using ExLlamaV3 and want the absolute highest tokens/second on a single card.
  • You are willing to use a specialized runtime (ExLlamaV3) rather than the broader ecosystem (vLLM / Ollama).
  • You want to squeeze ultra-low bitwidths (2–4 bits per weight) without the quality loss that older quantizations suffered.

EXL3 trade-offs:

  • Format inflexibility — EXL3 is ExLlamaV3-only. If you want to switch to vLLM or llama.cpp, you need to re-download the model in GPTQ or GGUF.
  • Community-developed format with a smaller ecosystem. Factor its bleeding-edge status and the possibility of format changes into your planning.
  • Earlier claims of EXL3 speed advantage over vLLM/GPTQ (e.g., community comparisons in 2024) used older vLLM versions; as of 2025, vLLM’s kernel optimization has narrowed the gap. EXL3 is still excellent, but not a free 20% win.

Size reference (7B model): 3-bit ≈ 2.8 GB | 3.5-bit ≈ 3.5 GB | 4-bit ≈ 4.2 GB.

Choosing your format: the decision tree

  1. What runtime are you using?

    • llama.cpp → GGUF only. Done.
    • Ollama → GGUF only. Done.
    • vLLM → AWQ or GPTQ (AWQ preferred for new models; GPTQ for older ones). Choose based on what the model publisher provides.
    • ExLlamaV3 → EXL3 if available, otherwise GPTQ via ExLlamaV2 fallback.
    • AutoGPTQ or ExLlamaV2 → GPTQ.
  2. What if the model is not published in your format?

    • Download the next-best format your runtime supports. vLLM supports both AWQ and GPTQ; pick whichever exists.
    • Do not attempt to convert one quantized format into another; it is generally unsupported, and requantizing already-quantized weights is lossy.
    • If the model is not published for your runtime at all (e.g., a new model only in GPTQ, but you use llama.cpp), wait for a community quantizer to publish a GGUF version, or fund one yourself.
  3. Still unsure? Pick by hardware.

    • Mac → llama.cpp + GGUF.
    • Single NVIDIA GPU, single user → llama.cpp or Ollama + GGUF for simplicity, or vLLM + AWQ for peak speed.
    • Multi-GPU, or serving many concurrent users → vLLM + AWQ (or GPTQ if model not in AWQ).
    • Throughput-obsessed, consumer GPU → ExLlamaV3 + EXL3 (if available).
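
The hardware fallback in step 3 can be written as a short function. This is a sketch of the article's recommendations, not a benchmark-driven selector: the parameter names, the `priority` values, and the reading of "many concurrent users" as more than one are all assumptions made for illustration.

```python
from typing import Tuple


def recommend(platform: str, n_gpus: int = 1, concurrent_users: int = 1,
              priority: str = "simplicity") -> Tuple[str, str]:
    """Return (runtime, format) following the hardware-based fallback.

    platform: "mac" or "nvidia"
    priority: "simplicity", "speed", or "throughput" (hypothetical labels)
    """
    platform = platform.lower()
    if platform == "mac":
        return ("llama.cpp", "GGUF")
    if platform != "nvidia":
        raise ValueError(f"no recommendation for platform {platform!r}")
    # Multi-GPU or multi-user serving goes to vLLM (GPTQ if no AWQ build exists).
    if n_gpus > 1 or concurrent_users > 1:
        return ("vLLM", "AWQ")
    if priority == "throughput":
        return ("ExLlamaV3", "EXL3")  # only if the model is published in EXL3
    if priority == "speed":
        return ("vLLM", "AWQ")
    return ("llama.cpp", "GGUF")      # Ollama + GGUF is the equivalent choice


print(recommend("mac"))
print(recommend("nvidia", n_gpus=2))
print(recommend("nvidia", priority="throughput"))
```

The function returns a first choice only; if the model is not published in that format, fall back as described in step 2.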

For runtime comparisons and trade-offs, see Ollama vs llama.cpp vs vLLM.

When format choice does NOT matter

  • Inference accuracy at comparable bitwidth: A GGUF Q4_K_M and a 4-bit AWQ model land at similar effective precision, and differences between them come from the quantization scheme and its parameters (bits per weight, group size), not from the file format. The format does not make the quantization better or worse; the quantization scheme and the weights do.
  • Model capability: The underlying model (Llama 3.1 8B, Mixtral 8x7B, etc.) is the same regardless of format. Format is encoding, not capability.

Bottom line

Format is not a choice — it is a consequence of choosing a runtime. Pick your runtime (based on your hardware and needs), look up what formats it accepts, download the model in that format, and move on.

If you are just starting out and do not have strong opinions about runtime:

  • Mac → Ollama + GGUF. One-click, works, CPU offload included.
  • One NVIDIA GPU → llama.cpp + GGUF (easiest) or vLLM + AWQ (fastest).
  • Multi-GPU or production → vLLM + AWQ.
  • Throughput-first, single consumer GPU → ExLlamaV3 + EXL3 (if the model is published in it).

For deeper runtime trade-offs and benchmarks, see ExLlamaV3 vs GGUF and how to run LLMs locally.

Sources

  • llama.cpp format and runtime documentation: github.com/ggml-org/llama.cpp (2024–2025)
  • vLLM AWQ / GPTQ quantization support: github.com/vllm-project/vllm (2025)
  • ExLlamaV3 format specifications: github.com/turboderp/exllamav3 (2024–2025)
  • Hugging Face quantization format registry: huggingface.co quantization docs (2024–2025)
  • r/LocalLLaMA community deployment threads (2024–2025)