Which Quant Should You Download? Q4_K_M vs Q8_0 vs F16, Decided by Your Hardware
You’ve picked a model. Now Hugging Face or Ollama shows you a wall of filenames (Q2_K, Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16, sometimes an IQ3_XS or a UD-Q4_K_XL from Unsloth) and no explanation of which one to click. This is the decision the download page forces on every beginner, and it’s answerable with two questions: what fits, and what you can tolerate losing.
If you haven’t read the explainer on what quantization actually does to a model’s weights, start there: What Is Quantization. This article assumes you know the concept and answers the practical question — which specific file to download today.
What’s the difference between Q4_K_M, Q8_0, and F16?
They’re the same model at different levels of numerical precision, and precision trades directly against file size and memory use. F16 is the unquantized reference release (16-bit floats, 2 bytes per weight). Q8_0 stores each weight as an 8-bit integer plus a small shared scale per block, which works out to roughly 8.5 bits per weight. Q4_K_M averages a little under 5 bits per weight, using a “K-quant” scheme that keeps some of the more sensitive tensors at higher precision to protect quality. Lower bit width means a smaller file, less RAM/VRAM needed, and usually faster decode, at the cost of more rounding error in the model’s outputs.
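To make the rounding concrete, here is a minimal Python sketch of block-wise quantization: each block of weights shares one scale, and every weight is rounded to a signed integer of the chosen width. It is a simplified illustration, not llama.cpp’s implementation. The function names, the block size of 32, and the Gaussian toy weights are assumptions made for the example, and real K-quants use more elaborate per-block layouts.

```python
import random

def quantize_block(weights, bits):
    """Round a block of floats to signed `bits`-bit integers sharing one scale."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0
    # Clamp guards against floating-point overshoot at the block maximum.
    quants = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, quants

def dequantize_block(scale, quants):
    """Recover approximate floats from the stored integers and scale."""
    return [scale * q for q in quants]

def rms_error(weights, bits, block_size=32):
    """Root-mean-square rounding error after a quantize/dequantize round trip."""
    total = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale, quants = quantize_block(block, bits)
        restored = dequantize_block(scale, quants)
        total += sum((w - r) ** 2 for w, r in zip(block, restored))
    return (total / len(weights)) ** 0.5

random.seed(0)
toy_weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # made-up weights
err8 = rms_error(toy_weights, bits=8)
err4 = rms_error(toy_weights, bits=4)
print(f"8-bit RMS error: {err8:.6f}")
print(f"4-bit RMS error: {err4:.6f}")
```

The 4-bit error comes out larger than the 8-bit error because each block has only 15 representable levels instead of 255. That gap is the rounding error the quality column in the table below is describing.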
The rough sizing math for a 7B-class model, consistent with the quantization explainer:
| Quant | Approx. size (7B model) | Relative quality | Typical use case |
|---|---|---|---|
| F16 | ~14–16 GB | Reference/full precision | Research, fine-tuning source, max-fidelity serving |
| Q8_0 | ~7–8 GB | Very close to F16 | Precision-sensitive tasks (code, structured extraction) with VRAM to spare |
| Q6_K | ~5.5–6 GB | Small step down from Q8_0 | Middle ground when Q8_0 doesn’t fit but Q4 feels risky |
| Q5_K_M | ~4.5–5 GB | Minor step down | Slightly tighter fit, modest quality trade |
| Q4_K_M | ~4–4.5 GB | Community default — “good enough” for most tasks | General chat, coding assistants, most local use |
| Q3_K_M | ~3–3.5 GB | Noticeable degradation begins | Tight VRAM budgets only |
| Q2_K | ~2.5–3 GB | Significant quality loss | Emergency fit, not recommended if avoidable |
These numbers scale with model size — a 70B model at Q4_K_M runs roughly 38–42 GB rather than 4 GB — but the ratios between quant levels hold. For the exact math on your model and your hardware, run it through the VRAM calculator rather than eyeballing gigabyte estimates.
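As a back-of-the-envelope check on the table, file size is roughly parameter count times average bits per weight, divided by eight. The sketch below assumes round bits-per-weight figures for four of the quants; these are illustrative approximations, not values taken from any specific model file, and real GGUF files vary with architecture and per-tensor quant mix. The function name is hypothetical.

```python
# Approximate average bits per weight (assumed round figures for illustration).
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q6_K": 6.56, "Q4_K_M": 4.85}

def estimate_file_gb(params_billions, quant):
    """Estimate model file size in decimal gigabytes."""
    total_bits = params_billions * 1e9 * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"7B  {quant:7s} ~{estimate_file_gb(7, quant):.1f} GB")
print(f"70B Q4_K_M  ~{estimate_file_gb(70, 'Q4_K_M'):.1f} GB")
```

This covers the weights only. The KV cache and runtime buffers need memory on top, which is why the selector below asks for headroom rather than an exact fit.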
Which quant should I actually download?
For almost everyone, the answer is Q4_K_M. It’s the community-settled default for a reason: it fits comfortably on mainstream consumer hardware (12–24 GB GPUs, 16 GB+ unified memory Macs) and the quality loss versus F16 is small enough that most people can’t reliably detect it in normal chat, summarization, or coding-assistant use. This is why Ollama’s default pull and most GGUF repo “recommended” tags point at Q4_K_M or something very close to it.
The selector logic, in order:
1. Does Q4_K_M fit in your available memory with headroom for context? If yes, and your task isn’t precision-critical, download it and stop here. This resolves the vast majority of cases.
2. Do you have VRAM to spare (24 GB+ GPU, or 32 GB+ unified memory) and a precision-sensitive task, such as code generation, structured JSON extraction, math, or anything where a wrong token in the wrong place breaks the output? Step up to Q6_K or Q8_0. The quality gain is real but incremental; Q8_0 costs roughly double the memory of Q4_K_M for a smaller improvement than the jump from Q2 to Q4 delivered.
3. Are you fine-tuning, doing research, or in need of an unquantized reference to compare quantized outputs against? Only then do you need F16. It is not a “better chat experience” tier for most people; it’s a tooling requirement.
4. Does even Q4_K_M not fit? Before dropping to Q3_K_M or Q2_K, check whether a smaller parameter-count model at a higher quant beats a bigger model crushed down. A 7B at Q6_K frequently outperforms a 13B forced to Q2_K. This is also the point to check what actually runs on your hardware before assuming you need the biggest model on the page.
If step 4 is where you land — model won’t fit even at Q4_K_M — that’s a hardware ceiling, not a quant problem, and worth naming honestly rather than chasing degradation with ever-smaller quants. A used RTX 3090 24GB (~$500–$800 used, observed 2026-06-29) or a Mac with more unified memory changes which quant tier you’re choosing from. See the used RTX 3090 buying guide or best Mac for local LLM if the honest answer is “upgrade.”
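The four checks above can be sketched as a small Python function. This is one possible encoding of the selector, not a tool LocalRig ships: the function name, the 2 GB headroom default, and the example sizes (midpoints of the 7B table) are assumptions for illustration.

```python
def pick_quant(sizes_gb, available_gb, headroom_gb=2.0,
               precision_sensitive=False, needs_reference=False):
    """Return a quant name, or None when even Q4_K_M does not fit.

    sizes_gb: dict of quant name -> file size in GB for the chosen model.
    headroom_gb: memory reserved for context and runtime (assumed default).
    """
    def fits(quant):
        return quant in sizes_gb and sizes_gb[quant] + headroom_gb <= available_gb

    # Fine-tuning or reference comparisons: F16, but only if it fits.
    if needs_reference and fits("F16"):
        return "F16"
    # Precision-sensitive work with memory to spare: largest of Q8_0 / Q6_K.
    if precision_sensitive:
        for quant in ("Q8_0", "Q6_K"):
            if fits(quant):
                return quant
    # Default case: Q4_K_M when it fits with headroom.
    if fits("Q4_K_M"):
        return "Q4_K_M"
    # Nothing fits: consider a smaller model or more memory.
    return None

# Midpoints of the 7B table, used as example inputs.
sizes_7b = {"F16": 15.0, "Q8_0": 7.5, "Q6_K": 5.75, "Q4_K_M": 4.25}
print(pick_quant(sizes_7b, available_gb=12))
print(pick_quant(sizes_7b, available_gb=24, precision_sensitive=True))
print(pick_quant(sizes_7b, available_gb=5))
```

A `None` result corresponds to step 4. The function deliberately does not fall through to Q3_K_M or Q2_K, since the advice there is to reconsider the model size first.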
What quality do I actually lose going from F16 to Q4_K_M?
For general chat and assistant tasks, most people cannot reliably tell the difference in casual use — the gap is well inside what a normal conversation surfaces. The loss becomes visible in specific edge cases: long multi-step reasoning chains where small errors compound, exact-format outputs (strict JSON, code that must compile), and low-resource languages where the model’s confidence was already thin at full precision. Quantization error doesn’t announce itself; it shows up as occasional subtle wrongness, not obvious breakage.
This is also where the file format matters, not just the bit count — GGUF’s K-quants, GPTQ, and AWQ all handle the precision/size trade differently and aren’t directly comparable bit-for-bit. If you’re choosing between formats rather than just quant levels within GGUF, see GGUF vs GPTQ vs AWQ for that layer of the decision.
Are Unsloth Dynamic 2.0 quants worth seeking out?
If you’re relying on a model daily and want more quality per gigabyte than a standard quant delivers, yes: they’re worth the extra search effort. Unsloth’s Dynamic 2.0 approach selectively keeps quality-sensitive layers at higher precision while compressing the rest harder, rather than applying one bit-width uniformly across the whole model. Unsloth’s own benchmarks (unsloth.ai, Feb 2026; vendor benchmark, attributed, not independently verified by LocalRig) claim their Dynamic 2.0 quants achieve lower perplexity and lower KL-divergence from the full-precision model than standard quants at matching bit widths.
Two honest caveats on that claim. First, it’s a vendor benchmarking its own product: a reasonable methodology on paper, but not something LocalRig has reproduced. Second, the advantage matters most at aggressive compression (Q2/Q3, where standard quants degrade sharply) and matters less at Q4_K_M and above, where standard quants are already close to F16. If you’re downloading Q4_K_M anyway, a standard quant is fine; if you’re forced down to Q2/Q3 by hardware limits, seeking out an Unsloth Dynamic 2.0 build (often labeled UD-Q2_K_XL or similar) is a better move than accepting a standard low-bit quant’s degradation.
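For readers unfamiliar with the two metrics, here is a minimal Python sketch of how each is computed. The next-token distributions are made-up toy numbers over a four-token vocabulary, not measurements from any model, and the function names are chosen for the example. Lower is better for both metrics.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over a sequence."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions (made-up numbers for illustration).
reference = [0.70, 0.20, 0.05, 0.05]   # stands in for the F16 model
quantized = [0.60, 0.25, 0.10, 0.05]   # stands in for a quantized model

print(f"KL(reference || quantized) = {kl_divergence(reference, quantized):.4f} nats")

# Perplexity of a model that gives every correct token probability 0.25.
print(f"perplexity = {perplexity([math.log(0.25)] * 4):.2f}")
```

Perplexity measures how well a model predicts held-out text. KL-divergence measures how far the quantized model’s output distribution has drifted from the reference model’s, which makes it the more direct measure of damage done by quantization itself.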
Does a sub-4-bit quant actually run slower than Q4_K_M on Apple Silicon?
This is a community claim, not a confirmed fact, and it deserves to be labeled that way rather than repeated as settled wisdom. The claim circulating on r/LocalLLaMA (2025–2026) is that IQ3 and IQ2 quants can run slower than Q4_K_M on Apple Silicon, despite moving fewer bits — attributed to higher dequantization overhead on the CPU/GPU path when unpacking the more complex IQ-quant bit-packing scheme.
LocalRig has not independently verified this. It’s plausible in principle — IQ-quants use more intricate packing to squeeze extra quality out of very low bit widths, and unpacking that scheme costs compute cycles that a simpler K-quant doesn’t — but “plausible” is not “measured.” Our own first-party Apple Silicon data point is narrower than this question: base Apple M4 (16 GB) ran Llama 3.1 8B at Q4_K_M at 18.4 tok/s (llama.cpp b9820) and 19.5 tok/s (Ollama 0.30.11), measured 2026-06-27. We have not run the equivalent IQ3/IQ2 comparison on the same hardware, so we can’t confirm or debunk the folklore here — it’s flagged as a candidate for a future first-party benchmark, not a reason to change your download today.
The practical takeaway either way: if your Mac’s unified memory comfortably fits Q4_K_M, there’s no reason to reach for an IQ3/IQ2 squeeze in the first place, so the question is mostly moot for people sized correctly. It only matters if you’re deliberately trading quality for capacity on a memory-constrained Mac — in which case, test both on your own hardware before committing, since the community reports are mixed and unverified.
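If you do want to test both quants yourself, a small timing harness is enough. The Python sketch below assumes a hypothetical `generate(prompt, n_tokens)` callable that wraps whatever runtime you use and returns the number of tokens it produced; that callable, the function names, and the defaults are all assumptions for illustration, not part of any library.

```python
import time

def tokens_per_second(generate, prompt, n_tokens, clock=time.perf_counter):
    """Time one call to `generate` and return end-to-end tokens per second.

    `generate` is a hypothetical callable returning the tokens it produced.
    The timing includes prompt processing, not just decode.
    """
    start = clock()
    produced = generate(prompt, n_tokens)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("run too short to time; generate more tokens")
    return produced / elapsed

def compare(candidates, prompt, n_tokens=128, runs=3, clock=time.perf_counter):
    """candidates: dict of label -> generate callable. Best tok/s per label."""
    results = {}
    for label, generate in candidates.items():
        rates = [tokens_per_second(generate, prompt, n_tokens, clock)
                 for _ in range(runs)]
        results[label] = max(rates)   # best of several runs reduces noise
    return results
```

Use the same prompt, token count, and context settings for both files, and run each several times, since a single run on a laptop is noisy. llama.cpp also ships its own `llama-bench` tool, which serves the same purpose without custom code.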
Bottom line
Download Q4_K_M unless you have a specific reason not to. It fits mainstream hardware, it’s the quant level most tooling defaults to and tests against, and the quality loss versus F16 is smaller than most beginners fear. Step up to Q6_K or Q8_0 only when you have spare memory and a precision-sensitive task; step down only when hardware forces it, and prefer a smaller model at a higher quant over a bigger model crushed to Q2. If you’re relying on a model daily at an aggressive quant level, an Unsloth Dynamic 2.0 build is worth the extra search. And treat the Apple sub-4-bit slowdown claim as unconfirmed community folklore: worth knowing about, not worth basing a purchase or a quant choice on until someone actually measures it.