- 4 GB: Phi-3-mini, DistilGPT-2, embeddings · nothing serious
- 6-8 GB: Llama 7B Q4, SD 1.5, Whisper · entry-level local AI
- 12 GB: Llama 13B Q4, SDXL, Flux.1-dev NF4, Qwen2-VL, LTX-Video · sweet spot
- 16 GB: Flux.1-dev fp16, HunyuanVideo Q4, multiple models loaded · enthusiast
- 24 GB: Llama 30B Q4, Qwen2.5-Coder 32B Q4, comfortable Flux fp16 · serious
- 48 GB+: Llama 70B Q4, fine-tuning · workstation tier
- 96+ GB unified (M3 Max/Ultra): Llama 70B+, multi-model serving
Every week on r/LocalLLaMA the same thread appears: "I have an X GB GPU, what can I run?" Most answers are guesses, opinions, or links to outdated wikis. This article gives you the calibrated answer — for every popular local AI model in 2026, exactly how much VRAM you need and what tokens/second you'll get on which GPU class.
Numbers come from public llama.cpp / ComfyUI benchmarks, the 9bench GPU-class lookup table (50+ calibrated entries), and r/LocalLLaMA community submissions. Where workloads vary (fp16 vs NF4 vs Q4 quantization), we list each path separately.
The big picture: VRAM tiers in 2026
| VRAM Tier | Example GPUs | What runs comfortably | What's painful | What's impossible |
|---|---|---|---|---|
| 4 GB | GTX 1650, RTX 3050 4GB, Iris Xe (shared) | Phi-3-mini, embeddings, Whisper Tiny | Anything 7B+ | SDXL, Flux, all video models |
| 6 GB | RTX 2060, GTX 1660 Ti, RTX 3050 6GB | Llama 7B Q4 (tight), SD 1.5, Whisper Small | Llama 13B, SDXL with refiner | Flux fp16, Llama 30B+, video gen |
| 8 GB | RTX 3060 8GB, RTX 4060, RX 6600/7600 | Llama 7B Q4, Qwen2-VL 7B, Flux.1-schnell, SDXL (--medvram) | Llama 13B Q4 (no refiner), Flux.1-dev (--lowvram) | Llama 30B+, HunyuanVideo |
| 12 GB ⭐ | RTX 3060 12GB, RTX 4070, RX 7700 XT, M3 Pro | Llama 13B Q4, SDXL+refiner, Flux.1-dev NF4, LTX-Video, HunyuanVideo Q4 | Llama 30B Q4, Flux.1-dev fp16, multiple models loaded | Llama 70B |
| 16 GB | RTX 4060 Ti 16GB, RTX 4080, RX 7800 XT, M3 Pro 18GB | Llama 13B Q4 + LoRA, Flux.1-dev fp16, HunyuanVideo Q4 + ControlNet | Llama 30B Q4 (no LoRA), Flux training | Llama 70B |
| 24 GB | RTX 3090, RTX 4090, RTX 5090 (32GB), RX 7900 XTX, M3 Max 36GB | Llama 30B Q4, Qwen2.5-Coder 32B Q4, Flux training, HunyuanVideo fp16 | Llama 70B Q4 (only with 2× GPUs or offload) | Llama 70B fp16 (need 2× 24GB) |
| 48 GB+ | 2× RTX 3090/4090, M3 Max 64GB, RTX 6000 Ada | Llama 70B Q4, Qwen 72B Q4, full fine-tuning of 7B-13B | Frontier-model fine-tuning | — |
| 96+ GB unified | M3 Max 96/128GB, M2 Ultra 192GB, M3 Ultra 256GB | Llama 70B+ comfortably, 120B quantized models, multi-model serving | Frontier-research-grade fine-tuning | — |
The 12 GB tier is the 2026 sweet spot. Above it you're paying a premium for a short list of additional capabilities (Llama 30B, Flux fp16). Below it you're constantly fighting OOM errors, --lowvram flags, and disabled features.
Per-model VRAM requirements (sorted by what you'd actually run)
Text LLMs
| Model | Q4 weights | Peak VRAM | Min GPU | RTX 4090 t/s |
|---|---|---|---|---|
| Phi-3-mini (3.8B) | 2.2 GB | 3-4 GB | any 4 GB+ | ~150-220 |
| Phi-4-mini (4B) | 2.5 GB | 3-4 GB | any 4 GB+ | ~140-200 |
| Qwen 2.5 7B | 4.4 GB | 5-6 GB | RTX 2060 / 3050 6GB | ~95-150 |
| Llama 3.1 8B | 4.7 GB | 5-6 GB | RTX 2060 / 3050 6GB | ~90-145 |
| Mistral 7B | 4.4 GB | 5-6 GB | RTX 2060 / 3050 6GB | ~95-150 |
| Gemma 2 9B | 5.4 GB | 7-8 GB | RTX 4060 / 3060 8GB | ~75-115 |
| Llama 13B | 7.4 GB | 9-10 GB | RTX 3060 12GB / RTX 4070 | ~50-75 |
| Mistral Small 22B | 13 GB | 15-16 GB | RTX 4060 Ti 16GB / 4080 | ~38-55 |
| Codestral 22B | 13 GB | 15-16 GB | RTX 4060 Ti 16GB / 4080 | ~38-55 |
| Qwen2.5-Coder 32B | 19 GB | 22-23 GB | RTX 3090 / 4090 / 5090 | ~35-45 |
| Mistral Large 2 (123B Q3) | 50 GB | 56 GB | 2× RTX 5090 / M3 Ultra | ~12-18 |
| Llama 3.3 70B | 40 GB | 46-48 GB | 2× RTX 3090/4090, M3 Max 64GB | ~8-15 (Mac) / 50-80 (2×4090) |
| Qwen 2.5 72B | 43 GB | 48-50 GB | 2× RTX 3090/4090, M3 Max 64GB+ | ~8-14 |
Vision / multimodal
| Model | Q4 weights | Peak VRAM | Min GPU | RTX 4090 t/s |
|---|---|---|---|---|
| Qwen2-VL 2B | 1.4 GB | 3-4 GB | any 4 GB+ | ~120-180 |
| Qwen2-VL 7B | 4.5 GB | 6-7 GB | RTX 2060 / 3060 / 4060 | ~50-70 |
| Llama 3.2 Vision 11B | 6.5 GB | 9-10 GB | RTX 3060 12GB / RTX 4070 | ~40-55 |
| Pixtral 12B | 7.5 GB | 10-12 GB | RTX 4070 / RX 7700 XT | ~35-50 |
| Llama 3.2 Vision 90B | 52 GB | 56-60 GB | 2× RTX 5090, M3 Ultra | ~6-10 |
Image generation
| Model | Quant Path | Peak VRAM | Min GPU | RTX 4090 sec/image |
|---|---|---|---|---|
| SD 1.5 (512²) | fp16 | 3-4 GB | any 4 GB+ | ~1-2s |
| SDXL (1024²) | fp16 | 7-8 GB (no refiner) | RTX 4060 / 3060 8GB | ~3-5s |
| SDXL + refiner (1024²) | fp16 | 11-12 GB | RTX 3060 12GB / 4070 | ~5-8s |
| Flux.1-schnell (1024², 4 steps) | fp8 / NF4 | 7-8 GB | RTX 4060 / 3060 8GB | ~2-3s |
| Flux.1-dev (1024², 20 steps) | NF4 / GGUF Q4 | 10-12 GB | RTX 3060 12GB / 4070 | ~12-18s |
| Flux.1-dev (1024², 20 steps) | fp16 | 22-24 GB | RTX 3090 / 4090 | ~10-15s |
| SD3.5 Medium | fp16 | 10-11 GB | RTX 3060 12GB / 4070 | ~8-12s |
| SD3.5 Large | fp8 | 14-16 GB | RTX 4060 Ti 16GB / 4080 | ~12-18s |
Video generation (the new frontier in 2026)
| Model | Quant Path | Peak VRAM | Min GPU | RTX 4090 time |
|---|---|---|---|---|
| LTX-Video (5s 768×512) | fp16 / fp8 | 7-8 GB | RTX 4060 / 3060 8GB | ~20-40s |
| HunyuanVideo (5s 720p, Q4) | GGUF Q4 | 11-12 GB | RTX 3060 12GB / 4070 | ~4-6 min |
| HunyuanVideo (5s 720p) | bf16 | 22-24 GB | RTX 3090 / 4090 | ~3-5 min |
| Mochi-1 Preview (5s 480p) | fp8 | 22-24 GB | RTX 3090 / 4090 | ~5-8 min |
| CogVideoX 5B | fp16 | 10-11 GB | RTX 3060 12GB / 4070 | ~8-15 min |
Audio / TTS
| Model | Peak VRAM | Min GPU | Speed (RTX 4090) |
|---|---|---|---|
| Whisper Tiny (39M) | 0.5 GB | any (CPU OK) | ~50× realtime |
| Whisper Small (244M) | 1 GB | any 2 GB+ | ~10-20× realtime |
| Whisper Large-v3 (1.5B) | 3-4 GB | RTX 3050 / 1660 Ti+ | ~5-8× realtime |
| F5-TTS | 5-6 GB | RTX 3060 / 2060 / 4060 | ~15-25× realtime |
| XTTSv2 / Coqui | 4-5 GB | RTX 3060 / 2060 | ~10-20× realtime |
Quantization 101 — why "Q4" matters more than "GB"
All the numbers above assume Q4 quantization (also written Q4_K_M or NF4 depending on framework). Q4 means each weight is stored as a 4-bit integer with a scaling factor instead of fp16 (16 bits). That's a 4× memory reduction with ~95% quality retention for most models.
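As a sanity check, you can reproduce the weight sizes in the tables above from parameter count and bits per weight alone. A minimal sketch, assuming ~4.8 effective bits for Q4_K_M (the effective rate varies a little between quant variants, which is why the table figures differ by a few percent); it ignores KV cache and runtime overhead:

```python
def weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size of the weights alone (no KV cache, no runtime overhead)."""
    # billions of params × bits per weight / 8 bits per byte = GB
    return params_billion * bits_per_weight / 8

print(weight_vram_gb(8, 4.8))    # ~4.8 GB  -> Llama 8B Q4
print(weight_vram_gb(32, 4.8))   # ~19 GB   -> Qwen2.5-Coder 32B Q4
print(weight_vram_gb(70, 4.8))   # ~42 GB   -> Llama 70B Q4
print(weight_vram_gb(12, 16))    # ~24 GB   -> Flux.1-dev fp16 (12B)
```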
Other quantization paths to know:
- fp16 / bf16: Full quality. 2× the VRAM of Q4. Use only if VRAM allows.
- fp8: 8-bit floating point. ~98% of fp16 quality at roughly 1.5-2× the VRAM of Q4. Modern Hopper / Ada GPUs accelerate this directly.
- Q4_K_M: Most popular Q4 variant (llama.cpp). Balanced quality.
- NF4: Used by Bitsandbytes / Forge / ComfyUI for Flux. Similar to Q4_K_M.
- GGUF Q4: Same Q4 quality, packaged in llama.cpp's GGUF format. Most local LLM tools use this.
- AWQ / GPTQ: Activation-aware quants that beat Q4_K_M on quality benchmarks but require pre-computed quants per model. Usable in vLLM, ExLlamaV2.
- Q3 / Q2: 3-bit / 2-bit. Q3 is borderline usable for 70B+ models when 24 GB VRAM is your only option. Q2 is usually too lossy for serious work.
Practical rule of thumb: Q4_K_M is what 95% of local LLM users run. Don't worry about quant choices unless you're squeezing a 70B into 24 GB (try Q3) or optimizing throughput on a 24 GB+ workstation (try AWQ).
The KV cache trap
"But the weights are only 4.5 GB, why does my 7B model fill 8 GB VRAM?"
Answer: the KV cache. To generate each new token, attention looks back at the keys and values of every prior token, so those have to stay resident in VRAM. The KV cache grows linearly with context length:
- Llama 7B at 2K context: ~0.5 GB KV cache
- Llama 7B at 8K context: ~2 GB KV cache
- Llama 7B at 32K context: ~8 GB KV cache (yes, more than the weights!)
- Llama 70B at 8K context: ~5 GB KV cache
For most chat use cases, 4K context is plenty and the KV cache stays small. But if you're doing long-document summarization, code-base RAG, or 32K-token "agentic loops", VRAM requirements can double. Modern frameworks support KV cache quantization, which roughly halves the cache at Q8 and cuts it to about a quarter at Q4, with minimal quality loss.
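If you want the back-of-envelope number for your own setup, the formula is simple. A minimal sketch; exact figures depend on whether the model uses grouped-query attention (GQA) and on the cache dtype, which is why published numbers for "7B-class" models vary:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_value: float = 2.0) -> float:
    """Keys + values, every layer, every cached token (fp16 cache by default)."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # K and V
    return per_token_bytes * context_len / 1e9

# Llama 3.1 8B: 32 layers, GQA with 8 KV heads, head_dim 128
print(kv_cache_gb(32, 8, 128, 8_192))    # ~1.1 GB at 8K context
print(kv_cache_gb(32, 8, 128, 131_072))  # ~17 GB at full 128K context
# Older MHA models (e.g. Llama 2 7B) keep 32 KV heads, i.e. roughly 4× these numbers.
```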
Apple Silicon: unified memory works differently
On Apple Silicon (M1/M2/M3/M4 series), there's no separate "VRAM" — the GPU shares system RAM via unified memory architecture. By default, macOS reserves about 75% of total RAM for GPU use. So:
- M3 Max 36 GB: ~27 GB usable as VRAM
- M3 Max 64 GB: ~48 GB usable as VRAM (runs Llama 70B!)
- M3 Max 96 GB: ~72 GB usable
- M3 Max 128 GB: ~96 GB usable
- M2 Ultra 192 GB: ~144 GB usable (runs 120B+ models)
You can manually raise the GPU memory limit with sudo sysctl iogpu.wired_limit_mb=<value>. Most Mac LLM tools (LM Studio, Ollama, MLX) document this; it's the single most-skipped setup step on Mac.
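If you want to raise that limit, here's a minimal sketch that computes a value to pass to the sysctl above; the 85% target is an assumption, not an Apple recommendation, so leave headroom for macOS and your other apps (the setting also resets on reboot):

```python
total_ram_gb = 64        # example machine
target_fraction = 0.85   # assumed; the macOS default is roughly 0.75

limit_mb = int(total_ram_gb * 1024 * target_fraction)
# Prints the command instead of running it, since it needs sudo.
print(f"sudo sysctl iogpu.wired_limit_mb={limit_mb}")
```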
What if you're VRAM-poor? The workarounds that actually work
1. Quantization is your friend
Q4 instead of fp16 = 4× less VRAM. AWQ / GPTQ instead of Q4_K_M = sometimes better quality at same VRAM. Don't run fp16 unless you have 24 GB+ to spare.
2. CPU offload (last resort, not first)
llama.cpp / Ollama / LM Studio can offload some layers to CPU RAM if VRAM is full. Speed cost: 3-10× slowdown. Use only if a model truly doesn't fit.
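If you drive llama.cpp from Python via the llama-cpp-python bindings, partial offload is a single parameter. A minimal sketch; the model path and layer count are placeholders you'd tune to your card:

```python
from llama_cpp import Llama

# n_gpu_layers controls how many transformer layers live in VRAM.
# -1 = all layers on GPU; smaller values spill the rest to CPU RAM
# (expect the 3-10× slowdown mentioned above as more layers fall off the GPU).
llm = Llama(
    model_path="./models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,   # example: leaves VRAM free for KV cache on an 8 GB card
    n_ctx=4096,
)

out = llm("Explain the KV cache in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```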
3. Reduce context length
Setting max context from 8K to 4K saves significant VRAM. Most chats don't need more. Code agents and document QA do — there's no free lunch there.
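In Ollama the equivalent knob is num_ctx. A minimal sketch against its local REST API; the model tag is an example:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",    # example model tag
        "prompt": "Summarize the KV cache trap in two sentences.",
        "stream": False,
        "options": {"num_ctx": 4096},  # cap the context window to save VRAM
    },
)
print(resp.json()["response"])
```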
4. KV cache quantization
Modern llama.cpp supports quantized KV cache: Q8 roughly halves it, Q4 cuts it to about a quarter, with negligible quality impact for most workloads. Set --cache-type-k q4_0 --cache-type-v q4_0.
5. Skip the refiner / fp16 image gen
SDXL refiner doubles peak VRAM. Disable it if you're on 8 GB. Use Flux.1-schnell (4 steps) instead of Flux.1-dev (20 steps). Quality drop is real but small for many use cases.
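If you generate through diffusers instead of a UI, skipping the refiner is simply a matter of never loading it. A minimal sketch with the standard SDXL base checkpoint; enable_model_cpu_offload is roughly the diffusers analogue of --medvram:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Base model only -- no refiner -- which is what keeps peak VRAM near 8 GB.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
pipe.enable_model_cpu_offload()  # move idle submodules to CPU RAM between steps

image = pipe("a lighthouse at dusk, photorealistic", num_inference_steps=30).images[0]
image.save("lighthouse.png")
```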
6. Use cloud for the rare big-model jobs
Renting an H100 hour costs $2-4 in 2026. If you only need Llama 70B 5 hours/month, cloud is cheaper than buying a 48 GB GPU. Local makes sense for daily use; cloud makes sense for spikes.
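The break-even arithmetic is worth running with your own numbers. A minimal sketch; the GPU price is an assumption you'd swap for the card you're actually eyeing:

```python
gpu_price_usd = 3500         # assumed price for a 48 GB-class local setup
cloud_rate_usd_per_hr = 3.0  # midpoint of the $2-4/hr H100 figure above
hours_per_month = 5

monthly_cloud = cloud_rate_usd_per_hr * hours_per_month
print(f"Cloud: ${monthly_cloud:.0f}/month")
print(f"Break-even: {gpu_price_usd / monthly_cloud:.0f} months")  # ~233 months at this usage
```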
How 9bench tells you what fits your VRAM
Open 9bench.com in your browser. The results page detects your GPU via WEBGL_debug_renderer_info, looks up its known VRAM (50+ calibrated GPU classes), and shows you a feasibility table with:
- Llama 7B / 13B / 70B / Qwen2.5-Coder 32B / Qwen2-VL 7B (text + multimodal LLMs)
- SDXL, SD 1.5, Flux.1-dev, Flux.1-schnell (image gen)
- HunyuanVideo, LTX-Video (video gen)
- Whisper (audio)
Each row shows YES / MAYBE / NO with calibrated tokens-per-second or seconds-per-image for your specific GPU class. No download. No signup. 15 seconds.
Common questions
"Does the GPU brand matter?" For VRAM calculations, no — GB is GB. For speed, yes: NVIDIA + CUDA gets the most ecosystem support; AMD ROCm works on Linux, is flaky on Windows; Apple Silicon excels at memory capacity but pays a 2-4× speed penalty vs same-VRAM NVIDIA on most workloads.
"Can I add more VRAM to my GPU?" No, not on consumer cards. VRAM is soldered. (There are mod communities for the RTX 3070→16 GB but it's specialty work.) Buy a card with the VRAM you need from the start.
"Is unified memory always 'good enough' as VRAM?" For inference: yes. For training: no. Apple Silicon LLM inference is competitive, but research papers ship CUDA code, and MLX still lacks some kernels. If you do research, NVIDIA. If you do inference + content creation, Apple is fine.
"What about Strix Halo / AMD Ryzen AI Max+ 395?" The new AMD APU with up to 96 GB unified memory (LPDDR5X 8000) is genuinely interesting for local AI in 2026. Performance is closer to M3 Pro than M3 Max in current Linux benchmarks, but VRAM capacity is competitive with M3 Max. Watch this space — by mid-2026 there should be enough public benchmarks to call it definitively.
"How much RAM do I need (separate from VRAM)?" 16 GB system RAM for casual local LLM, 32 GB for serious work, 64 GB if you do offloading. Apple unified memory makes this simpler — one number for everything.
Find out what your GPU can run — 15 seconds, no install
9bench detects your GPU and VRAM, shows feasibility for every popular 2026 local AI workload (LLMs, Flux, video, vision). Calibrated against public benchmarks. No bias.
Test my GPU's VRAM →