- 4 GB: Phi-3-mini, DistilGPT-2, embeddings · nothing serious
- 6-8 GB: Llama 7B Q4, SD 1.5, Whisper · entry-level local AI
- 12 GB: Llama 13B Q4, SDXL, Flux.1-dev NF4, Qwen2-VL, LTX-Video · sweet spot
- 16 GB: Flux.1-dev fp16, HunyuanVideo Q4, multiple models loaded · enthusiast
- 24 GB: Llama 30B Q4, Qwen2.5-Coder 32B Q4, comfortable Flux fp16 · serious
- 48 GB+: Llama 70B Q4, fine-tuning · workstation tier
- 96+ GB unified (M3 Max/Ultra): Llama 70B+, multi-model serving
Every week on r/LocalLLaMA the same thread appears: "I have an X GB GPU, what can I run?" Most answers are guesses, opinions, or links to outdated wikis. This article gives you the calibrated answer — for every popular local AI model in 2026, exactly how much VRAM you need and what tokens/second you'll get on which GPU class.
Numbers come from public llama.cpp / ComfyUI benchmarks, the 9bench GPU-class lookup table (50+ calibrated entries), and r/LocalLLaMA community submissions. Where workloads vary (fp16 vs NF4 vs Q4 quantization), we list each path separately.
The big picture: VRAM tiers in 2026
| VRAM Tier | Example GPUs | What runs comfortably | What's painful | What's impossible |
|---|---|---|---|---|
| 4 GB | GTX 1650, RTX 3050 4GB, Iris Xe (shared) | Phi-3-mini, embeddings, Whisper Tiny | Anything 7B+ | SDXL, Flux, all video models |
| 6 GB | RTX 2060, GTX 1660 Ti, RTX 3050 6GB | Llama 7B Q4 (tight), SD 1.5, Whisper Small | Llama 13B, SDXL with refiner | Flux fp16, Llama 30B+, video gen |
| 8 GB | RTX 3060 8GB, RTX 4060, RX 6600/7600 | Llama 7B Q4, Qwen2-VL 7B, Flux.1-schnell, SDXL (--medvram) | Llama 13B Q4 (no refiner), Flux.1-dev (--lowvram) | Llama 30B+, HunyuanVideo |
| 12 GB ⭐ | RTX 3060 12GB, RTX 4070, RX 7700 XT, M3 Pro | Llama 13B Q4, SDXL+refiner, Flux.1-dev NF4, LTX-Video, HunyuanVideo Q4 | Llama 30B Q4, Flux.1-dev fp16, multiple models loaded | Llama 70B |
| 16 GB | RTX 4060 Ti 16GB, RTX 4080, RX 7800 XT, M3 Pro 18GB | Llama 13B Q4 + LoRA, Flux.1-dev fp16, HunyuanVideo Q4 + ControlNet | Llama 30B Q4 (no LoRA), Flux training | Llama 70B |
| 24 GB | RTX 3090, RTX 4090, RTX 5090 (32GB), RX 7900 XTX, M3 Max 36GB | Llama 30B Q4, Qwen2.5-Coder 32B Q4, Flux training, HunyuanVideo fp16 | Llama 70B Q4 (only with 2× GPUs or offload) | Llama 70B fp16 (need 2× 24GB) |
| 48 GB+ | 2× RTX 3090/4090, M3 Max 64GB, RTX 6000 Ada | Llama 70B Q4, Qwen 72B Q4, full fine-tuning of 7B-13B | Frontier-model fine-tuning | — |
| 96+ GB unified | M3 Max 96/128GB, M2 Ultra 192GB, M3 Ultra 256GB | Llama 70B+ comfortably, 120B quantized models, multi-model serving | Frontier-research-grade fine-tuning | — |
The 12 GB tier is the 2026 sweet spot. Above it you're paying a premium for a short list of additional capabilities (Llama 30B, Flux fp16). Below it you're constantly fighting OOM errors, --lowvram flags, and disabled features.
Per-model VRAM requirements (sorted by what you'd actually run)
Text LLMs
| Model | Q4 weights | Peak VRAM | Min GPU | RTX 4090 t/s |
|---|---|---|---|---|
| Phi-3-mini (3.8B) | 2.2 GB | 3-4 GB | any 4 GB+ | ~150-220 |
| Phi-4-mini (4B) | 2.5 GB | 3-4 GB | any 4 GB+ | ~140-200 |
| Qwen 2.5 7B | 4.4 GB | 5-6 GB | RTX 2060 / 3050 6GB | ~95-150 |
| Llama 3.1 8B | 4.7 GB | 5-6 GB | RTX 2060 / 3050 6GB | ~90-145 |
| Mistral 7B | 4.4 GB | 5-6 GB | RTX 2060 / 3050 6GB | ~95-150 |
| Gemma 2 9B | 5.4 GB | 7-8 GB | RTX 4060 / 3060 8GB | ~75-115 |
| Llama 13B | 7.4 GB | 9-10 GB | RTX 3060 12GB / RTX 4070 | ~50-75 |
| Mistral Small 22B | 13 GB | 15-16 GB | RTX 4060 Ti 16GB / 4080 | ~38-55 |
| Codestral 22B | 13 GB | 15-16 GB | RTX 4060 Ti 16GB / 4080 | ~38-55 |
| Qwen2.5-Coder 32B | 19 GB | 22-23 GB | RTX 3090 / 4090 / 5090 | ~35-45 |
| Mistral Large 2 (123B Q3) | 50 GB | 56 GB | 2× RTX 5090 / M3 Ultra | ~12-18 |
| Llama 3.3 70B | 40 GB | 46-48 GB | 2× RTX 3090/4090, M3 Max 64GB | ~8-15 (Mac) / 50-80 (2×4090) |
| Qwen 2.5 72B | 43 GB | 48-50 GB | 2× RTX 3090/4090, M3 Max 64GB+ | ~8-14 |
Vision / multimodal
| Model | Q4 weights | Peak VRAM | Min GPU | RTX 4090 t/s |
|---|---|---|---|---|
| Qwen2-VL 2B | 1.4 GB | 3-4 GB | any 4 GB+ | ~120-180 |
| Qwen2-VL 7B | 4.5 GB | 6-7 GB | RTX 2060 / 3060 / 4060 | ~50-70 |
| Llama 3.2 Vision 11B | 6.5 GB | 9-10 GB | RTX 3060 12GB / RTX 4070 | ~40-55 |
| Pixtral 12B | 7.5 GB | 10-12 GB | RTX 4070 / RX 7700 XT | ~35-50 |
| Llama 3.2 Vision 90B | 52 GB | 56-60 GB | 2× RTX 5090, M3 Ultra | ~6-10 |
Image generation
| Model | Quant Path | Peak VRAM | Min GPU | RTX 4090 sec/image |
|---|---|---|---|---|
| SD 1.5 (512²) | fp16 | 3-4 GB | any 4 GB+ | ~1-2s |
| SDXL (1024²) | fp16 | 7-8 GB (no refiner) | RTX 4060 / 3060 8GB | ~3-5s |
| SDXL + refiner (1024²) | fp16 | 11-12 GB | RTX 3060 12GB / 4070 | ~5-8s |
| Flux.1-schnell (1024², 4 steps) | fp8 / NF4 | 7-8 GB | RTX 4060 / 3060 8GB | ~2-3s |
| Flux.1-dev (1024², 20 steps) | NF4 / GGUF Q4 | 10-12 GB | RTX 3060 12GB / 4070 | ~12-18s |
| Flux.1-dev (1024², 20 steps) | fp16 | 22-24 GB | RTX 3090 / 4090 | ~10-15s |
| SD3.5 Medium | fp16 | 10-11 GB | RTX 3060 12GB / 4070 | ~8-12s |
| SD3.5 Large | fp8 | 14-16 GB | RTX 4060 Ti 16GB / 4080 | ~12-18s |
Video generation (the new frontier in 2026)
| Model | Quant Path | Peak VRAM | Min GPU | RTX 4090 time |
|---|---|---|---|---|
| LTX-Video (5s 768×512) | fp16 / fp8 | 7-8 GB | RTX 4060 / 3060 8GB | ~20-40s |
| HunyuanVideo (5s 720p, Q4) | GGUF Q4 | 11-12 GB | RTX 3060 12GB / 4070 | ~4-6 min |
| HunyuanVideo (5s 720p) | bf16 | 22-24 GB | RTX 3090 / 4090 | ~3-5 min |
| Mochi-1 Preview (5s 480p) | fp8 | 22-24 GB | RTX 3090 / 4090 | ~5-8 min |
| CogVideoX 5B | fp16 | 10-11 GB | RTX 3060 12GB / 4070 | ~8-15 min |
Audio / TTS
| Model | Peak VRAM | Min GPU | Speed (RTX 4090) |
|---|---|---|---|
| Whisper Tiny (39M) | 0.5 GB | any (CPU OK) | ~50× realtime |
| Whisper Small (244M) | 1 GB | any 2 GB+ | ~10-20× realtime |
| Whisper Large-v3 (1.5B) | 3-4 GB | RTX 3050 / 1660 Ti+ | ~5-8× realtime |
| F5-TTS | 5-6 GB | RTX 3060 / 2060 / 4060 | ~15-25× realtime |
| XTTSv2 / Coqui | 4-5 GB | RTX 3060 / 2060 | ~10-20× realtime |
Quantization 101 — why "Q4" matters more than "GB"
All the numbers above assume Q4 quantization (also written Q4_K_M or NF4 depending on framework). Q4 means each weight is stored as a 4-bit integer with a scaling factor instead of fp16 (16 bits). That's a 4× memory reduction with ~95% quality retention for most models.
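As a sanity check, you can reproduce the weight sizes in the tables above from parameter count and bits per weight alone. A minimal sketch, assuming ~4.8 effective bits for Q4_K_M (the effective rate varies a little between quant variants, which is why the table figures differ by a few percent); it ignores KV cache and runtime overhead:

```python
def weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size of the weights alone (no KV cache, no runtime overhead)."""
    # billions of params × bits per weight / 8 bits per byte = GB
    return params_billion * bits_per_weight / 8

print(weight_vram_gb(8, 4.8))    # ~4.8 GB  -> Llama 8B Q4
print(weight_vram_gb(32, 4.8))   # ~19 GB   -> Qwen2.5-Coder 32B Q4
print(weight_vram_gb(70, 4.8))   # ~42 GB   -> Llama 70B Q4
print(weight_vram_gb(12, 16))    # ~24 GB   -> Flux.1-dev fp16 (12B)
```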
Other quantization paths to know:
- fp16 / bf16: Full quality. 2× the VRAM of Q4. Use only if VRAM allows.
- fp8: 8-bit floating point. ~98% of fp16 quality at roughly 1.5-2× the VRAM of Q4. Modern Hopper / Ada GPUs accelerate this directly.
- Q4_K_M: Most popular Q4 variant (llama.cpp). Balanced quality.
- NF4: Used by Bitsandbytes / Forge / ComfyUI for Flux. Similar to Q4_K_M.
- GGUF Q4: Same Q4 quality, packaged in llama.cpp's GGUF format. Most local LLM tools use this.
- AWQ / GPTQ: Activation-aware quants that beat Q4_K_M on quality benchmarks but require pre-computed quants per model. Usable in vLLM, ExLlamaV2.
- Q3 / Q2: 3-bit / 2-bit. Q3 is borderline usable for 70B+ models when 24 GB VRAM is your only option. Q2 is usually too lossy for serious work.
Practical rule of thumb: Q4_K_M is what 95% of local LLM users run. Don't worry about quant choices unless you're squeezing a 70B into 24 GB (try Q3) or optimizing throughput on a 24 GB+ workstation (try AWQ).
The KV cache trap
"But the weights are only 4.5 GB, why does my 7B model fill 8 GB VRAM?"
Answer: the KV cache. To generate each new token, attention looks back at the keys and values of every prior token, so those have to stay resident in VRAM. The KV cache grows linearly with context length:
- Llama 7B at 2K context: ~0.5 GB KV cache
- Llama 7B at 8K context: ~2 GB KV cache
- Llama 7B at 32K context: ~8 GB KV cache (yes, more than the weights!)
- Llama 70B at 8K context: ~5 GB KV cache
For most chat use cases, 4K context is plenty and the KV cache stays small. But if you're doing long-document summarization, code-base RAG, or 32K-token "agentic loops", VRAM requirements can double. Modern frameworks support KV cache quantization, which roughly halves the cache at Q8 and cuts it to about a quarter at Q4, with minimal quality loss.
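If you want the back-of-envelope number for your own setup, the formula is simple. A minimal sketch; exact figures depend on whether the model uses grouped-query attention (GQA) and on the cache dtype, which is why published numbers for "7B-class" models vary:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_value: float = 2.0) -> float:
    """Keys + values, every layer, every cached token (fp16 cache by default)."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # K and V
    return per_token_bytes * context_len / 1e9

# Llama 3.1 8B: 32 layers, GQA with 8 KV heads, head_dim 128
print(kv_cache_gb(32, 8, 128, 8_192))    # ~1.1 GB at 8K context
print(kv_cache_gb(32, 8, 128, 131_072))  # ~17 GB at full 128K context
# Older MHA models (e.g. Llama 2 7B) keep 32 KV heads, i.e. roughly 4× these numbers.
```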
Apple Silicon: unified memory works differently
On Apple Silicon (M1/M2/M3/M4 series), there's no separate "VRAM" — the GPU shares system RAM via unified memory architecture. By default, macOS reserves about 75% of total RAM for GPU use. So:
- M3 Max 36 GB: ~27 GB usable as VRAM
- M3 Max 64 GB: ~48 GB usable as VRAM (runs Llama 70B!)
- M3 Max 96 GB: ~72 GB usable
- M3 Max 128 GB: ~96 GB usable
- M2 Ultra 192 GB: ~144 GB usable (runs 120B+ models)
You can manually raise the GPU memory limit with sudo sysctl iogpu.wired_limit_mb=<value>. Most Mac LLM tools (LM Studio, Ollama, MLX) document this; it's the single most-skipped setup step on Mac.
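If you want to raise that limit, here's a minimal sketch that computes a value to pass to the sysctl above; the 85% target is an assumption, not an Apple recommendation, so leave headroom for macOS and your other apps (the setting also resets on reboot):

```python
total_ram_gb = 64        # example machine
target_fraction = 0.85   # assumed; the macOS default is roughly 0.75

limit_mb = int(total_ram_gb * 1024 * target_fraction)
# Prints the command instead of running it, since it needs sudo.
print(f"sudo sysctl iogpu.wired_limit_mb={limit_mb}")
```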
What if you're VRAM-poor? The workarounds that actually work
1. Quantization is your friend
Q4 instead of fp16 = 4× less VRAM. AWQ / GPTQ instead of Q4_K_M = sometimes better quality at same VRAM. Don't run fp16 unless you have 24 GB+ to spare.
2. CPU offload (last resort, not first)
llama.cpp / Ollama / LM Studio can offload some layers to CPU RAM if VRAM is full. Speed cost: 3-10× slowdown. Use only if a model truly doesn't fit.
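If you drive llama.cpp from Python via the llama-cpp-python bindings, partial offload is a single parameter. A minimal sketch; the model path and layer count are placeholders you'd tune to your card:

```python
from llama_cpp import Llama

# n_gpu_layers controls how many transformer layers live in VRAM.
# -1 = all layers on GPU; smaller values spill the rest to CPU RAM
# (expect the 3-10× slowdown mentioned above as more layers fall off the GPU).
llm = Llama(
    model_path="./models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,   # example: leaves VRAM free for KV cache on an 8 GB card
    n_ctx=4096,
)

out = llm("Explain the KV cache in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```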
3. Reduce context length
Setting max context from 8K to 4K saves significant VRAM. Most chats don't need more. Code agents and document QA do — there's no free lunch there.
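In Ollama the equivalent knob is num_ctx. A minimal sketch against its local REST API; the model tag is an example:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",    # example model tag
        "prompt": "Summarize the KV cache trap in two sentences.",
        "stream": False,
        "options": {"num_ctx": 4096},  # cap the context window to save VRAM
    },
)
print(resp.json()["response"])
```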
4. KV cache quantization
Modern llama.cpp supports quantized KV cache: Q8 roughly halves it, Q4 cuts it to about a quarter, with negligible quality impact for most workloads. Set --cache-type-k q4_0 --cache-type-v q4_0.
5. Skip the refiner / fp16 image gen
SDXL refiner doubles peak VRAM. Disable it if you're on 8 GB. Use Flux.1-schnell (4 steps) instead of Flux.1-dev (20 steps). Quality drop is real but small for many use cases.
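If you generate through diffusers instead of a UI, skipping the refiner is simply a matter of never loading it. A minimal sketch with the standard SDXL base checkpoint; enable_model_cpu_offload is roughly the diffusers analogue of --medvram:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Base model only -- no refiner -- which is what keeps peak VRAM near 8 GB.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
pipe.enable_model_cpu_offload()  # move idle submodules to CPU RAM between steps

image = pipe("a lighthouse at dusk, photorealistic", num_inference_steps=30).images[0]
image.save("lighthouse.png")
```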
6. Use cloud for the rare big-model jobs
Renting an H100 hour costs $2-4 in 2026. If you only need Llama 70B 5 hours/month, cloud is cheaper than buying a 48 GB GPU. Local makes sense for daily use; cloud makes sense for spikes.
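The break-even arithmetic is worth running with your own numbers. A minimal sketch; the GPU price is an assumption you'd swap for the card you're actually eyeing:

```python
gpu_price_usd = 3500         # assumed price for a 48 GB-class local setup
cloud_rate_usd_per_hr = 3.0  # midpoint of the $2-4/hr H100 figure above
hours_per_month = 5

monthly_cloud = cloud_rate_usd_per_hr * hours_per_month
print(f"Cloud: ${monthly_cloud:.0f}/month")
print(f"Break-even: {gpu_price_usd / monthly_cloud:.0f} months")  # ~233 months at this usage
```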
How 9bench tells you what fits your VRAM
Open 9bench.com in your browser. The results page detects your GPU via WEBGL_debug_renderer_info, looks up its known VRAM (50+ calibrated GPU classes), and shows you a feasibility table with:
- Llama 7B / 13B / 70B / Qwen2.5-Coder 32B / Qwen2-VL 7B (text + multimodal LLMs)
- SDXL, SD 1.5, Flux.1-dev, Flux.1-schnell (image gen)
- HunyuanVideo, LTX-Video (video gen)
- Whisper (audio)
Each row shows YES / MAYBE / NO with calibrated tokens-per-second or seconds-per-image for your specific GPU class. No download. No signup. 15 seconds.
Common questions
"Does the GPU brand matter?" For VRAM calculations, no — GB is GB. For speed, yes: NVIDIA + CUDA gets the most ecosystem support; AMD ROCm works on Linux, is flaky on Windows; Apple Silicon excels at memory capacity but pays a 2-4× speed penalty vs same-VRAM NVIDIA on most workloads.
"Can I add more VRAM to my GPU?" No, not on consumer cards. VRAM is soldered. (There are mod communities for the RTX 3070→16 GB but it's specialty work.) Buy a card with the VRAM you need from the start.
"Is unified memory always 'good enough' as VRAM?" For inference: yes. For training: no. Apple Silicon LLM inference is competitive, but research papers ship CUDA code, and MLX still lacks some kernels. If you do research, NVIDIA. If you do inference + content creation, Apple is fine.
"What about Strix Halo / AMD Ryzen AI Max+ 395?" The new AMD APU with up to 96 GB unified memory (LPDDR5X 8000) is genuinely interesting for local AI in 2026. Performance is closer to M3 Pro than M3 Max in current Linux benchmarks, but VRAM capacity is competitive with M3 Max. Watch this space — by mid-2026 there should be enough public benchmarks to call it definitively.
"How much RAM do I need (separate from VRAM)?" 16 GB system RAM for casual local LLM, 32 GB for serious work, 64 GB if you do offloading. Apple unified memory makes this simpler — one number for everything.
Find out what your GPU can run — 15 seconds, no install
9bench detects your GPU and VRAM, shows feasibility for every popular 2026 local AI workload (LLMs, Flux, video, vision). Calibrated against public benchmarks. No bias.
Test my GPU's VRAM →