Quantization: What It Means for Local AI and Why It Matters

When a GPU listing says “8 GB VRAM” and a model card says “requires 16 GB,” the practical question is: can quantization bridge that gap? Usually yes — and understanding which quantization format to use is the most important technical decision in running local AI. The right choice doubles what your hardware can run; the wrong one means a model either won’t load or will run unusably slowly.


Definition

Quantization — the process of reducing the numerical precision of a neural network’s weight values from high-precision formats (FP32: 32-bit, or FP16: 16-bit) to lower-precision representations (INT8: 8-bit, INT4: 4-bit, or mixed-precision). Per the Hugging Face quantization documentation, this reduces model size and memory bandwidth requirements at the cost of small, measurable reductions in output quality.

At a conceptual level: a language model is a large collection of numbers (weights) that transform input tokens into output tokens. Those numbers are normally stored in 16-bit floating point format. Quantization replaces each weight with an approximation that takes fewer bits to store. A 7B-parameter model at FP16 requires approximately 14 GB of memory (7 billion × 2 bytes). The same model at 4-bit quantization requires approximately 4–5 GB — fitting comfortably on consumer hardware that would otherwise be excluded.
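
The arithmetic above can be sketched in a few lines. This is a rough estimate of weight storage only. The function name is hypothetical, and the figure of about 4.85 effective bits per weight for Q4_K_M is an assumption for illustration: formats of this kind store scale data alongside the 4-bit values, so the effective rate sits somewhat above 4 bits.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# 7B parameters at FP16 (16 bits per weight)
fp16_gb = weight_memory_gb(7e9, 16)

# Assumed effective rate for Q4_K_M; real files vary by model architecture
q4_gb = weight_memory_gb(7e9, 4.85)

print(f"FP16:   {fp16_gb:.1f} GB")
print(f"Q4_K_M: {q4_gb:.1f} GB")
```

The estimate excludes the KV cache and runtime buffers, so actual VRAM use will be higher.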


Why Quantization Matters for Local AI

Quantization is the lever that makes most consumer hardware viable for local inference. Without it, only GPUs with 16 GB or more of VRAM could run 7B models. With Q4_K_M quantization, an 8 GB card runs a 7B model at interactive throughput (typically 30–80 tok/s, depending on hardware bandwidth).

The trade-off is quality, not safety. Quantized models produce slightly less precise outputs than their FP16 counterparts, but the practical difference at Q4_K_M is small enough that most users cannot tell quantized output from full-precision output in blind comparisons. Community evaluations using llama.cpp's perplexity benchmarks put Q4_K_M at approximately 95–97% of FP16 quality on standard language modeling tasks (perplexity rises by a few percent, and lower perplexity is better).

Two separate effects make quantization useful beyond raw memory savings:

Memory bandwidth improvement. Single-user LLM inference is typically memory-bandwidth-bound, not compute-bound: generating each token requires reading the model's weights from GPU memory. Smaller weights take less time to move, so quantization improves speed as well as memory footprint. A Q4 model does not just use roughly half the VRAM of a Q8 model; it usually generates tokens faster as well, because fewer bytes cross the memory bus per token. Reported speedups vary widely with hardware and runtime: gains of 40–70% are cited, while the RTX 3090 figures in the table below show smaller ones.
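
A back-of-the-envelope sketch of the bandwidth argument: if every weight must be read once per generated token, memory bandwidth divided by model size gives an upper bound on tokens per second. The function name and the 1000 GB/s bandwidth are hypothetical values chosen for illustration; the model sizes are the approximate 7B figures used elsewhere in this article.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/s if every weight is read once per token."""
    return bandwidth_gb_s / model_gb

BANDWIDTH_GB_S = 1000.0  # hypothetical GPU, chosen as a round number

q8 = decode_ceiling_tok_s(BANDWIDTH_GB_S, 7.5)   # ~7.5 GB for 7B at Q8_0
q4 = decode_ceiling_tok_s(BANDWIDTH_GB_S, 4.5)   # ~4.5 GB for 7B at Q4_K_M

print(f"Q8_0 ceiling:   {q8:.0f} tok/s")
print(f"Q4_K_M ceiling: {q4:.0f} tok/s")
print(f"Ceiling ratio:  {q4 / q8:.2f}x")
```

Measured throughput lands below this ceiling because KV cache reads, dequantization work, and kernel overhead also take time, which is one reason real-world speedups are smaller than the size ratio.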

Larger models become accessible. A 34B model at Q4_K_M requires approximately 20 GB. That fits on a used RTX 3090 (24 GB), which sells for approximately $600–$800 on eBay. The same model at FP16 requires 68 GB — a hardware requirement that puts it in data center territory.

The Practical Rule

  • Q4_K_M is the baseline for most hardware. It represents the best quality-per-byte trade-off in the llama.cpp GGUF ecosystem. Start here unless you have specific reasons to go higher or lower.
  • Q8_0 is the next step up. Use it when you have spare VRAM headroom and want marginally better quality on tasks where precision matters (code generation, structured output, long-context reasoning).
  • Q2/Q3 should be avoided for regular use. Quality degradation is measurable and noticeable. These formats exist for extreme memory constraints only.

Quantization | Memory (7B model) | Typical tok/s on RTX 3090 | Quality vs FP16
FP16         | ~14 GB            | ~80 tok/s                 | Baseline
Q8_0         | ~7.5 GB           | ~95 tok/s                 | ~99% (community-cited)
Q4_K_M       | ~4.5 GB           | ~110 tok/s                | ~95–97% (community-cited)
Q3_K_M       | ~3.5 GB           | ~120 tok/s                | ~90–93% (community-cited)
Q2_K         | ~2.8 GB           | ~130 tok/s                | ~80–85% (community-cited)

Community-cited ranges from llama.cpp GitHub benchmark threads and r/LocalLLaMA, 2024–2025. Your results will vary by runtime version, context length, and hardware generation.
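
The practical rule can be expressed as a small selection helper. This is a sketch under stated assumptions, not a sizing tool: the function name is hypothetical, sizes are the approximate 7B figures from the table, footprint is assumed to scale linearly with parameter count, and 1 GB is reserved for runtime overhead.

```python
# Approximate weight sizes for a 7B model, in GB (community-cited figures)
SIZES_7B_GB = {"Q8_0": 7.5, "Q4_K_M": 4.5, "Q3_K_M": 3.5, "Q2_K": 2.8}

# Preference order from the practical rule: Q8_0 if it fits, else Q4_K_M,
# with Q3/Q2 only as a last resort.
PREFERENCE = ["Q8_0", "Q4_K_M", "Q3_K_M", "Q2_K"]

def pick_quant(vram_gb: float, params_billion: float = 7.0, overhead_gb: float = 1.0):
    """Return the first preferred format whose estimated footprint fits, or None."""
    scale = params_billion / 7.0  # assumption: size grows linearly with parameters
    for name in PREFERENCE:
        if SIZES_7B_GB[name] * scale + overhead_gb <= vram_gb:
            return name
    return None

for vram in (6, 8, 12, 24):
    print(f"{vram} GB ->", pick_quant(vram))
```

A result of Q3_K_M or Q2_K is better read as a signal to choose a smaller model than as a recommendation.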


Quantization vs Model Size

These are independent dimensions that are frequently confused. A 70B Q4_K_M model uses approximately 40 GB — more than a 7B FP16 model at 14 GB. Quantization reduces a specific model’s footprint; it does not change the parameter count, which determines the ceiling of the model’s capability.

The practical implication: quantization does not let you run a 70B model on an 8 GB GPU. It lets you run a 70B model on hardware with 40+ GB of memory, such as a multi-GPU RTX 3090 workstation (two cards provide 48 GB combined, three provide 72 GB) or an Apple Mac Studio with 96 GB of unified memory.
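
One way to see why is to invert the size arithmetic and ask how many bits per weight a given memory budget allows. The sketch below uses a hypothetical helper name and ignores KV cache and runtime overhead, so real budgets are tighter than shown.

```python
def max_bits_per_weight(memory_gb: float, num_params: float) -> float:
    """Bits available per weight if the whole memory budget held only weights."""
    return memory_gb * 1e9 * 8 / num_params

for memory_gb in (8, 24, 48, 96):
    bits = max_bits_per_weight(memory_gb, 70e9)
    print(f"{memory_gb:>2} GB -> {bits:.2f} bits per weight for a 70B model")
```

An 8 GB budget leaves less than one bit per weight for a 70B model, below any format discussed here, while 48 GB leaves room for a 4-bit format.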

Dimension                | Quantization                    | Model Size (Parameters)
What it is               | Numerical precision of weights  | Count of weight values in the model
Why it matters           | Determines memory per parameter | Determines model capability ceiling
When it’s the bottleneck | Memory capacity (VRAM full)     | Quality ceiling (model isn’t capable enough)

Common Misconceptions

“Lower quantization is always slower”

Not true for LLM inference. Because inference is memory-bandwidth-bound, lower quantization (smaller weights) often produces faster throughput. Q4_K_M is typically faster than Q8_0 on the same hardware because the GPU can move the smaller weight values through its memory bus more quickly. The speed advantage of lower quantization only disappears if compute — not bandwidth — becomes the bottleneck, which rarely happens in standard single-user inference scenarios.

“Q4 quantization means 4× worse quality”

The “4” in Q4 refers to bits, not a quality fraction. A 4-bit quantized weight uses 4 bits of precision instead of 16 (FP16) or 32 (FP32) bits. The actual quality reduction from FP16 to Q4_K_M is approximately 3–5% on standard perplexity benchmarks — not 75%. The K_M suffix in Q4_K_M indicates that this is a mixed quantization format where more important weights receive higher precision, which is why the quality loss is so small despite the aggressive bit reduction.
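
To make "4 bits of precision" concrete, here is a deliberately simplified sketch of 4-bit quantization for one block of weights: each value is stored as a small signed integer plus one shared scale per block. This is not the scheme llama.cpp uses for Q4_K_M, which is considerably more elaborate; the function names and sample weights are made up for illustration.

```python
def quantize_block(weights, bits=4):
    """Symmetric quantization of one block: integers plus a shared scale."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit signed values
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize_block(quantized, scale):
    """Recover approximate weights from the integers and the block scale."""
    return [q * scale for q in quantized]

block = [0.12, -0.53, 0.98, -0.07, 0.31, -0.88, 0.44, 0.02]  # made-up weights
quantized, scale = quantize_block(block)
restored = dequantize_block(quantized, scale)

worst = max(abs(a - b) for a, b in zip(block, restored))
print("integers:", quantized)
print(f"scale: {scale:.4f}, worst absolute error: {worst:.4f}")
```

Each restored weight lands within half a scale step of the original, which is why the error per weight stays small even though only 16 distinct levels are available.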


How Quantization Appears in Hardware Listings

Quantization levels appear in model file names, not GPU specs. A GPU spec sheet will list VRAM capacity in GB and memory bandwidth in GB/s. The quantization format determines how much of that VRAM capacity the model occupies.

In practice, you encounter quantization through:

  • GGUF file names on Hugging Face: Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf — the suffix after the model name is the quantization format.
  • Ollama model tags: llama3.1:8b-instruct-q4_K_M — same convention in the Ollama model library.
  • llama.cpp's quantize tool: when you quantize a model yourself with llama.cpp, the target format (for example Q4_K_M) is passed as an argument to the llama-quantize program (named quantize in older builds).
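
The naming conventions above are regular enough to parse. The sketch below pulls the quantization label out of a GGUF file name or an Ollama-style tag. The pattern is an illustrative guess at common naming, not an official specification, and the function name is hypothetical.

```python
import re

# Matches a trailing quantization label such as Q4_K_M, Q8_0, IQ3_XS, or F16.
# Assumption: the label is the last component, preceded by "-", ":" or ".".
_QUANT_RE = re.compile(r"[-:.]((?:iq|q)\d[a-z0-9_]*|bf16|f16|f32)$", re.IGNORECASE)

def extract_quant(name: str):
    """Return the upper-cased quantization label from a file name or tag, or None."""
    stem = name[:-5] if name.lower().endswith(".gguf") else name
    match = _QUANT_RE.search(stem)
    return match.group(1).upper() if match else None

print(extract_quant("Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"))
print(extract_quant("llama3.1:8b-instruct-q4_K_M"))
print(extract_quant("llama3.1:8b"))
```

A tag with no explicit label returns None here; what a runtime loads in that case depends on its own defaults.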

What to watch for when reading VRAM requirements: requirements cited in model cards are typically for FP16 inference. If a model card says “requires 16 GB VRAM,” that means FP16. The Q4_K_M version of the same model will require roughly 30% of that — approximately 5 GB for a 7B model.


Frequently Asked Questions

What is quantization?

Quantization is the process of reducing the numerical precision of a language model’s weights from high-precision floating point (FP16 or FP32) to lower-precision formats such as 8-bit integers (INT8) or 4-bit integers (INT4). This reduces the amount of VRAM the model requires to run, enabling consumer hardware with limited memory to run models that would otherwise not fit. The trade-off is a small reduction in output quality, which is generally negligible at Q4_K_M for most use cases.

How much quantization do I need to run a 7B model?

A 7B model at Q4_K_M requires approximately 4.5–5 GB of VRAM plus runtime overhead (typically 0.5–1 GB for the KV cache), for a total of 5–6 GB. An 8 GB GPU is the practical minimum — it works but leaves little headroom for longer context windows. A 12 GB or 16 GB card (such as an RTX 4070 or RTX 4080) provides comfortable headroom and allows running Q8_0 for better quality. See the Hardware to Run a 7B Model Locally guide for hardware-tier recommendations.
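
The KV cache portion of that overhead grows with context length and can be estimated from the model's attention layout. The sketch below assumes 32 layers, 8 key/value heads, a head dimension of 128, and 16-bit cache values, which matches published Llama 3.1 8B configurations; other models differ, and the function name is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Keys and values stored for every layer, KV head, and token position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

# Assumed configuration: 32 layers, 8 KV heads, head dimension 128.
# Models without grouped-query attention have more KV heads and need more memory.
for context_len in (2048, 4096, 8192):
    gib = kv_cache_bytes(32, 8, 128, context_len) / 2**30
    print(f"{context_len:>5} tokens -> {gib:.2f} GiB")
```

Under these assumptions the cache doubles each time the context doubles, which is why an 8 GB card leaves little headroom for long context windows.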

Does more quantization always mean better performance?

More aggressive quantization (lower bit depth) generally means faster tok/s on memory-bandwidth-limited hardware, but quality degrades below Q4_K_M in ways that affect output reliability. The sweet spot for most buyers is Q4_K_M: it fits on common consumer hardware, runs fast, and retains enough quality for coding, writing, and analysis tasks. Going lower than Q4 trades quality for speed in ways that most users find unacceptable for regular use. Going to Q8_0 or FP16 trades speed for quality — useful when you have VRAM to spare and quality matters more than throughput.


See Hardware to Run a 7B Model Locally for hardware recommendations based on the VRAM constraint that quantization determines.


Sources

  • Hugging Face documentation — Quantization, https://huggingface.co/docs/transformers/quantization (accessed 2026-06-27)
  • llama.cpp GitHub — GGUF quantization formats and sizes, https://github.com/ggerganov/llama.cpp (accessed 2026-06-27)
  • TheBloke GGUF model cards — per-quantization size and quality notes, https://huggingface.co/TheBloke (accessed 2026-06-27)