TL;DR — VRAM Requirements 2026 Cheat Sheet
Want to know what your hardware fits? Run 9bench (15s, browser-only) — it detects your GPU and shows VRAM-feasible models.

Every week on r/LocalLLaMA the same thread appears: "I have an X GB GPU, what can I run?" Most answers are guesses, opinions, or links to outdated wikis. This article gives you the calibrated answer — for every popular local AI model in 2026, exactly how much VRAM you need and what tokens/second you'll get on which GPU class.

Numbers come from public llama.cpp / ComfyUI benchmarks, the 9bench GPU-class lookup table (50+ calibrated entries), and r/LocalLLaMA community submissions. Where workloads vary (fp16 vs NF4 vs Q4 quantization), we list each path separately.

📐 What "VRAM needs X GB" actually means
Every number below is peak VRAM during inference, including model weights, KV cache, attention buffers, and CUDA / Metal kernel overhead. Quoting the weights size alone is misleading: the real bottleneck is peak allocation, not the idle footprint. We err on the side of "comfortable" rather than "barely fits".

The big picture: VRAM tiers in 2026

| VRAM Tier | Example GPUs | What runs comfortably | What's painful | What's impossible |
|---|---|---|---|---|
| 4 GB | GTX 1650, RTX 3050 4GB, Iris Xe (shared) | Phi-3-mini, embeddings, Whisper Tiny | Anything 7B+ | SDXL, Flux, all video models |
| 6 GB | RTX 2060, GTX 1660 Ti, RTX 3050 6GB | Llama 7B Q4 (tight), SD 1.5, Whisper Small | Llama 13B, SDXL with refiner | Flux fp16, Llama 30B+, video gen |
| 8 GB | RTX 3060 8GB, RTX 4060, RX 6600/7600 | Llama 7B Q4, Qwen2-VL 7B, Flux.1-schnell, SDXL (--medvram, no refiner) | Llama 13B Q4, Flux.1-dev (--lowvram) | Llama 30B+, HunyuanVideo |
| 12 GB | RTX 3060 12GB, RTX 4070, RX 7700 XT, M3 Pro | Llama 13B Q4, SDXL + refiner, Flux.1-dev NF4, LTX-Video, HunyuanVideo Q4 | Llama 30B Q4, Flux.1-dev fp16, multiple models loaded | Llama 70B |
| 16 GB | RTX 4060 Ti 16GB, RTX 4080, RX 7800 XT, M3 Pro 18GB | Llama 13B Q4 + LoRA, Flux.1-dev fp8, HunyuanVideo Q4 + ControlNet | Llama 30B Q4 (no LoRA), Flux training | Llama 70B |
| 24 GB | RTX 3090, RTX 4090, RTX 5090 (32GB), RX 7900 XTX, M3 Max 36GB | Llama 30B Q4, Qwen2.5-Coder 32B Q4, Flux training, HunyuanVideo fp16 | Llama 70B Q4 (only with 2× 24 GB GPUs or offload) | Llama 70B fp16 |
| 48 GB+ | 2× RTX 3090/4090, M3 Max 64GB, RTX 6000 Ada | Llama 70B Q4, Qwen 72B Q4, full fine-tuning of 7B-13B | — | Frontier-model fine-tuning |
| 96+ GB unified | M3 Max 96GB, M3 Ultra 192GB, M2 Ultra 128GB | Llama 70B+ comfortably, 120B quantized models, multi-model serving | — | Frontier-research-grade fine-tuning |

The 12 GB tier is the 2026 sweet spot. Above it you're paying a premium for a short list of additional capabilities (Llama 30B, Flux fp16). Below it you're constantly fighting OOM errors with --lowvram flags and disabled features.

Per-model VRAM requirements (sorted by what you'd actually run)

Text LLMs

| Model | Q4 weights | Peak VRAM | Min GPU | RTX 4090 t/s |
|---|---|---|---|---|
| Phi-3-mini (3.8B) | 2.2 GB | 3-4 GB | any 4 GB+ | ~150-220 |
| Phi-4-mini (4B) | 2.5 GB | 3-4 GB | any 4 GB+ | ~140-200 |
| Qwen 2.5 7B | 4.4 GB | 5-6 GB | RTX 2060 / 3050 6GB | ~95-150 |
| Llama 3.1 8B | 4.7 GB | 5-6 GB | RTX 2060 / 3050 6GB | ~90-145 |
| Mistral 7B | 4.4 GB | 5-6 GB | RTX 2060 / 3050 6GB | ~95-150 |
| Gemma 2 9B | 5.4 GB | 7-8 GB | RTX 4060 / 3060 8GB | ~75-115 |
| Llama 13B | 7.4 GB | 9-10 GB | RTX 3060 12GB / RTX 4070 | ~50-75 |
| Mistral Small 22B | 13 GB | 15-16 GB | RTX 4060 Ti 16GB / 4080 | ~38-55 |
| Codestral 22B | 13 GB | 15-16 GB | RTX 4060 Ti 16GB / 4080 | ~38-55 |
| Qwen2.5-Coder 32B | 19 GB | 22-23 GB | RTX 3090 / 4090 / 5090 | ~35-45 |
| Mistral Large 2 (123B, Q3) | 50 GB | 56 GB | 2× RTX 4090 / M3 Ultra | ~12-18 |
| Llama 3.3 70B | 40 GB | 46-48 GB | 2× RTX 3090/4090, M3 Max 64GB | ~8-15 (Mac) / 50-80 (2× 4090) |
| Qwen 2.5 72B | 43 GB | 48-50 GB | 2× RTX 3090/4090, M3 Max 64GB+ | ~8-14 |

Vision / multimodal

| Model | Q4 weights | Peak VRAM | Min GPU | RTX 4090 t/s |
|---|---|---|---|---|
| Qwen2-VL 2B | 1.4 GB | 3-4 GB | any 4 GB+ | ~120-180 |
| Qwen2-VL 7B | 4.5 GB | 6-7 GB | RTX 2060 / 3060 / 4060 | ~50-70 |
| Llama 3.2 Vision 11B | 6.5 GB | 9-10 GB | RTX 3060 12GB / RTX 4070 | ~40-55 |
| Pixtral 12B | 7.5 GB | 10-12 GB | RTX 4070 / RX 7700 XT | ~35-50 |
| Llama 3.2 Vision 90B | 52 GB | 56-60 GB | 2× RTX 4090, M3 Ultra | ~6-10 |

Image generation

| Model | Quant path | Peak VRAM | Min GPU | RTX 4090 sec/image |
|---|---|---|---|---|
| SD 1.5 (512²) | fp16 | 3-4 GB | any 4 GB+ | ~1-2 s |
| SDXL (1024², no refiner) | fp16 | 7-8 GB | RTX 4060 / 3060 8GB | ~3-5 s |
| SDXL + refiner (1024²) | fp16 | 11-12 GB | RTX 3060 12GB / 4070 | ~5-8 s |
| Flux.1-schnell (1024², 4 steps) | fp8 / NF4 | 7-8 GB | RTX 4060 / 3060 8GB | ~2-3 s |
| Flux.1-dev (1024², 20 steps) | NF4 / GGUF Q4 | 10-12 GB | RTX 3060 12GB / 4070 | ~12-18 s |
| Flux.1-dev (1024², 20 steps) | fp16 | 22-24 GB | RTX 3090 / 4090 | ~10-15 s |
| SD3.5 Medium | fp16 | 10-11 GB | RTX 3060 12GB / 4070 | ~8-12 s |
| SD3.5 Large | fp8 | 14-16 GB | RTX 4060 Ti 16GB / 4080 | ~12-18 s |

Video generation (the new frontier in 2026)

| Model | Quant path | Peak VRAM | Min GPU | RTX 4090 time |
|---|---|---|---|---|
| LTX-Video (5 s, 768×512) | fp16 / fp8 | 7-8 GB | RTX 4060 / 3060 8GB | ~20-40 s |
| HunyuanVideo (5 s, 720p) | GGUF Q4 | 11-12 GB | RTX 3060 12GB / 4070 | ~4-6 min |
| HunyuanVideo (5 s, 720p) | bf16 | 22-24 GB | RTX 3090 / 4090 | ~3-5 min |
| Mochi-1 Preview (5 s, 480p) | fp8 | 22-24 GB | RTX 3090 / 4090 | ~5-8 min |
| CogVideoX 5B | fp16 | 10-11 GB | RTX 3060 12GB / 4070 | ~8-15 min |

Audio / TTS

| Model | Peak VRAM | Min GPU | Speed (RTX 4090) |
|---|---|---|---|
| Whisper Tiny (39M) | 0.5 GB | any (CPU OK) | ~50× realtime |
| Whisper Small (244M) | 1 GB | any 2 GB+ | ~10-20× realtime |
| Whisper Large-v3 (1.5B) | 3-4 GB | RTX 3050 / 1660 Ti+ | ~5-8× realtime |
| F5-TTS | 5-6 GB | RTX 3060 / 2060 / 4060 | ~15-25× realtime |
| XTTSv2 / Coqui | 4-5 GB | RTX 3060 / 2060 | ~10-20× realtime |

Quantization 101 — why "Q4" matters more than "GB"

All the numbers above assume Q4 quantization (also written Q4_K_M or NF4 depending on framework). Q4 means each weight is stored as a 4-bit integer with a scaling factor instead of fp16 (16 bits). That's a 4× memory reduction with ~95% quality retention for most models.
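To make the arithmetic concrete, here is a rough sketch; the effective bits-per-weight figures are typical llama.cpp values and the parameter count is rounded, so treat the output as ballpark only:

```python
def weight_size_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone (no KV cache, no overhead)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

# A 7B model at common precisions:
for label, bpw in [("fp16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.85), ("Q3_K_M", 3.9)]:
    print(f"{label:>7}: ~{weight_size_gib(7, bpw):.1f} GiB")
# fp16 ≈ 13.0, Q8_0 ≈ 6.9, Q4_K_M ≈ 4.0, Q3_K_M ≈ 3.2 GiB (weights only)
```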

Other quantization paths to know:

- Q8_0: near-lossless, roughly twice the size of Q4; worth it only if you have VRAM to spare.
- Q5_K_M / Q6_K: the middle ground between Q4 and Q8.
- Q3 / Q2: for squeezing very large models into small VRAM; quality loss becomes noticeable.
- AWQ / GPTQ: GPU-native 4-bit formats popular with vLLM and ExLlama-style backends.
- fp8 / NF4: the usual paths for image and video models rather than LLMs.

Practical rule of thumb: Q4_K_M is what 95% of local LLM users run. Don't worry about quant choices unless you're squeezing a 70B into 24 GB (try Q3) or optimizing throughput on a 24 GB+ workstation (try AWQ).

The KV cache trap

"But the weights are only 4.5 GB, why does my 7B model fill 8 GB VRAM?"

Answer: the KV cache. For every token in the context, the model caches attention keys and values so that later tokens can attend to them without recomputing everything (this is what keeps generation fast). The KV cache grows linearly with context length; a rough estimate is sketched below.
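A back-of-envelope sketch, assuming an fp16 cache and grouped-query attention; the layer count, KV-head count, and head dimension are Llama 3.1 8B's published config, used purely for illustration:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_elem: int = 2) -> float:
    """K + V cache for a single sequence, in GiB (fp16 = 2 bytes per element)."""
    # 2 tensors (K and V) per layer, each of shape [n_kv_heads, ctx_len, head_dim]
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128
for ctx in (4096, 8192, 32768, 131072):
    print(f"{ctx:>6} tokens -> {kv_cache_gib(32, 8, 128, ctx):.2f} GiB")
# ~0.5 GiB at 4K, ~1 GiB at 8K, ~4 GiB at 32K, ~16 GiB at the full 128K window
```

That ~16 GiB at full context for a single sequence is why long-context work blows past the table figures above.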

For most chat use cases, 4K context is plenty and the KV cache stays small. But if you're doing long-document summarization, codebase RAG, or 32K-token "agentic loops", VRAM requirements can double. Modern frameworks support KV cache quantization, which shrinks the cache to half (Q8) or a quarter (Q4) of its fp16 size with minimal quality loss.

⚠️ Why some "VRAM calculator" sites are wrong
Tools that compute "VRAM = weights × 1.2" miss KV cache, attention buffers, and CUDA overhead. Real-world peak VRAM is typically 1.4-1.6× the weight size. Always allow 30-40% headroom over the bare model size. The numbers in our tables above include this.
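If you prefer the same idea as a formula, here is a minimal sketch that sums the pieces naive calculators skip; the overhead constants are assumptions for illustration, not measured values:

```python
def peak_vram_gb(weights_gb: float, kv_cache_gb: float,
                 cuda_overhead_gb: float = 0.6,     # CUDA context + cuBLAS workspaces (assumed)
                 scratch_buffers_gb: float = 0.5) -> float:  # attention/activation scratch (assumed)
    """Peak VRAM ≈ weights + KV cache + runtime overhead, not weights × 1.2."""
    return weights_gb + kv_cache_gb + cuda_overhead_gb + scratch_buffers_gb

# Llama 3.1 8B Q4 (4.7 GB weights) with ~1 GB of KV cache at 8K context:
print(f"{peak_vram_gb(4.7, 1.0):.1f} GB")  # ≈ 6.8 GB: comfortable on 8 GB, tight on 6 GB
```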

Apple Silicon: unified memory works differently

On Apple Silicon (M1/M2/M3/M4 series), there's no separate "VRAM": the GPU shares system RAM through the unified memory architecture. By default, macOS caps GPU-wired memory at roughly 75% of total RAM. So a 16 GB Mac exposes about 12 GB to the GPU, a 32 GB Mac about 24 GB, and a 64 GB Mac about 48 GB.

You can manually raise the GPU memory limit with sudo sysctl iogpu.wired_limit_mb=<value>. Most Mac LLM tools (LM Studio, Ollama, MLX) document this. It's the single most-skipped setup step on Mac.
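A small sketch of that arithmetic; the 75% default and the sysctl knob come from the paragraph above, while the "leave 8 GB for macOS" margin is an assumption you should adjust to your own workload:

```python
def suggested_wired_limit_mb(total_ram_gb: int, reserve_for_macos_gb: int = 8) -> int:
    """GPU-wired memory limit: total RAM minus a margin for macOS and other apps."""
    return (total_ram_gb - reserve_for_macos_gb) * 1024

for ram_gb in (32, 64, 96):
    default_mb = int(ram_gb * 0.75 * 1024)          # macOS default: ~75% of RAM
    raised_mb = suggested_wired_limit_mb(ram_gb)
    print(f"{ram_gb} GB Mac: default ≈ {default_mb} MB -> "
          f"sudo sysctl iogpu.wired_limit_mb={raised_mb}")
```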

What if you're VRAM-poor? The workarounds that actually work

1. Quantization is your friend

Q4 instead of fp16 = 4× less VRAM. AWQ / GPTQ instead of Q4_K_M = sometimes better quality at same VRAM. Don't run fp16 unless you have 24 GB+ to spare.

2. CPU offload (last resort, not first)

llama.cpp / Ollama / LM Studio can offload some layers to CPU RAM if VRAM is full. Speed cost: 3-10× slowdown. Use only if a model truly doesn't fit.
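For instance, with the llama-cpp-python bindings (a minimal sketch; the model path is a placeholder and the 24-layer split is something you'd tune per card):

```python
from llama_cpp import Llama

# Keep 24 of the model's ~32 layers on the GPU and let the rest run from CPU RAM.
# Expect a real slowdown versus full GPU offload; raise n_gpu_layers until you OOM, then back off.
llm = Llama(
    model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=24,   # -1 would put everything on the GPU
    n_ctx=4096,        # a smaller context also shrinks the KV cache (see tip 3)
)
out = llm("Q: Name the largest planet in the solar system. A:", max_tokens=16)
print(out["choices"][0]["text"])
```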

3. Reduce context length

Dropping max context from 8K to 4K halves the KV cache and saves significant VRAM. Most chats don't need more. Code agents and document QA do, so there's no free lunch there.

4. KV cache quantization

Modern llama.cpp supports quantized KV cache: Q8 roughly halves the cache size with negligible quality impact, and Q4 quarters it with a small additional loss. Set --cache-type-k q4_0 --cache-type-v q4_0.

5. Skip the refiner / fp16 image gen

SDXL refiner doubles peak VRAM. Disable it if you're on 8 GB. Use Flux.1-schnell (4 steps) instead of Flux.1-dev (20 steps). Quality drop is real but small for many use cases.

6. Use cloud for the rare big-model jobs

Renting an H100 hour costs $2-4 in 2026. If you only need Llama 70B 5 hours/month, cloud is cheaper than buying a 48 GB GPU. Local makes sense for daily use; cloud makes sense for spikes.

How 9bench tells you what fits your VRAM

Run 9bench.com in your browser. The result page detects your GPU via WEBGL_debug_renderer_info, looks up its known VRAM (50+ calibrated GPU classes), and shows a feasibility table covering text LLMs, vision models, image generation, and video generation.

Each row shows YES / MAYBE / NO with calibrated tokens-per-second or seconds-per-image for your specific GPU class. No download. No signup. 15 seconds.

Common questions

"Does the GPU brand matter?" For VRAM calculations, no — GB is GB. For speed, yes: NVIDIA + CUDA gets the most ecosystem support; AMD ROCm works on Linux, is flaky on Windows; Apple Silicon excels at memory capacity but pays a 2-4× speed penalty vs same-VRAM NVIDIA on most workloads.

"Can I add more VRAM to my GPU?" No, not on consumer cards. VRAM is soldered. (There are mod communities for the RTX 3070→16 GB but it's specialty work.) Buy a card with the VRAM you need from the start.

"Is unified memory always 'good enough' as VRAM?" For inference: yes. For training: no. Apple Silicon LLM inference is competitive, but research papers ship CUDA code, and MLX still lacks some kernels. If you do research, NVIDIA. If you do inference + content creation, Apple is fine.

"What about Strix Halo / AMD Ryzen AI Max+ 395?" The new AMD APU with up to 96 GB unified memory (LPDDR5X 8000) is genuinely interesting for local AI in 2026. Performance is closer to M3 Pro than M3 Max in current Linux benchmarks, but VRAM capacity is competitive with M3 Max. Watch this space — by mid-2026 there should be enough public benchmarks to call it definitively.

"How much RAM do I need (separate from VRAM)?" 16 GB system RAM for casual local LLM, 32 GB for serious work, 64 GB if you do offloading. Apple unified memory makes this simpler — one number for everything.

Find out what your GPU can run — 15 seconds, no install

9bench detects your GPU and VRAM, shows feasibility for every popular 2026 local AI workload (LLMs, Flux, video, vision). Calibrated against public benchmarks. No bias.

Test my GPU's VRAM →