TL;DR — What 8 GB VRAM actually buys you in 2026
8 GB VRAM is the 2026 floor for usable local AI. Not the comfort zone — but you can do a surprising amount with it. Test your specific GPU →

"I have an RTX 4060 / 3060 / 3070, what can I run?" is the most-asked question on r/LocalLLaMA in 2026. The honest answer involves a lot of "yes, but" qualifications. This article gives you the unvarnished version — what 8 GB VRAM actually delivers, where it falls apart, and which workarounds are worth using.

All numbers are calibrated against public llama.cpp / ComfyUI benchmarks plus the 9bench GPU-class lookup table for RTX 4060, RTX 3060 8GB, RTX 3070, and AMD RX 7600.

📐 What "8 GB VRAM" cards we're testing
Throughout this article, "8 GB" means the popular consumer cards: RTX 4060 (8 GB GDDR6), RTX 3060 8GB, RTX 4060 Laptop, RTX 3070, RX 7600, RX 6600, plus Apple M2/M3 base models with 8 GB unified memory. Speed varies between them, but the VRAM ceiling is the same.

What 8 GB runs comfortably (the good news)

1. 7B / 8B class LLMs — the sweet spot

Modern 7-9B models with Q4 quantization fit in 5-6 GB peak VRAM, leaving 2-3 GB headroom for KV cache and OS overhead. Performance on RTX 4060:

For chat / RAG / code-completion / summarization use cases, 7B-class models running on 8 GB VRAM are genuinely competitive with cloud GPT-3.5-class APIs. The era of "you need a 4090 to do anything" is over.
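
If you want this from Python rather than a GUI, here is a minimal sketch using llama-cpp-python with full GPU offload. The GGUF file name and prompt are illustrative placeholders, not recommendations; any 7B-class Q4_K_M file works the same way.

```python
# Minimal llama-cpp-python sketch: a 7B-class Q4_K_M GGUF runs entirely on an 8 GB GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-7b-instruct-q4_k_m.gguf",  # illustrative; ~4.7 GB on disk
    n_gpu_layers=-1,   # offload every layer; 7B Q4 leaves room for the KV cache
    n_ctx=4096,        # modest context keeps the KV cache small
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize RAG in two sentences."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```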

2. SDXL without refiner — fine

SDXL base model alone fits in 7-8 GB VRAM. An RTX 4060 takes 14-22 seconds per 1024×1024 image. Slower than an RTX 4090's 3-5s, but very usable.

Skip the refiner — it doubles peak VRAM and would force --medvram. The base model alone produces solid images. If you absolutely need refiner quality, ComfyUI with sequential CPU offload works at 25-40s per image.
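
For reference, a hedged diffusers sketch of the base-only setup; model CPU offload keeps peak VRAM under the 8 GB ceiling at a small speed cost. The prompt and step count are placeholders.

```python
# SDXL base only (no refiner) via diffusers, sized for an 8 GB card.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
pipe.enable_model_cpu_offload()  # keeps only the active sub-model on the GPU

image = pipe("a lighthouse at dusk, volumetric light", num_inference_steps=30).images[0]
image.save("lighthouse.png")
```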

3. Flux.1-schnell — yes, on 8 GB!

Flux.1-schnell is the speed-optimized, 4-step distilled variant of Flux.1-dev, and a practical fit for 8 GB cards. Performance:

Quality is slightly below Flux.1-dev but dramatically better than SDXL. For 8 GB owners who want the modern image-gen experience, Flux.1-schnell is the answer.
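
One way to try it from Python is the diffusers FluxPipeline with sequential CPU offload, sketched below; quantized (GGUF/fp8) checkpoints in ComfyUI are usually the faster route on 8 GB, so treat this as a convenience path rather than the optimal one.

```python
# Flux.1-schnell via diffusers; sequential offload trades speed for low peak VRAM.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_sequential_cpu_offload()  # streams weights to the GPU piece by piece

image = pipe(
    "a watercolor fox in a snowy forest",
    num_inference_steps=4,   # schnell is distilled for ~4 steps
    guidance_scale=0.0,      # schnell runs without classifier-free guidance
    height=1024, width=1024,
).images[0]
image.save("fox.png")
```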

4. Whisper transcription — easy

Whisper Large-v3 (1.5B params) uses 3-4 GB VRAM. Whisper Small uses ~1 GB. Both run comfortably with massive headroom on 8 GB. RTX 4060 transcribes 1 hour of audio in ~5-7 minutes. Fine for podcast/meeting transcription.
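
A minimal sketch with faster-whisper (a CTranslate2-based Whisper runtime, one common choice for keeping VRAM low); the audio file name is a placeholder.

```python
# Whisper Large-v3 transcription on an 8 GB card via faster-whisper.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")  # ~3-4 GB VRAM
segments, info = model.transcribe("podcast.mp3", vad_filter=True)

for seg in segments:
    print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text}")
```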

5. Qwen2-VL 7B — multimodal works

Qwen2-VL 7B (vision-language model) needs ~6-7 GB VRAM. Fits comfortably on 8 GB. RTX 4060 runs OCR / image-description / screenshot-reading at 30-50 tokens/sec. For local agentic workflows that need to "see" the screen, this is the model that actually works on consumer hardware in 2026.
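
A hedged transformers sketch, quantizing the model to 4-bit on load with bitsandbytes so it lands in the ~6-7 GB range quoted above; the screenshot path and question are illustrative, and the message plumbing follows the pattern from Qwen's model card (qwen-vl-utils).

```python
# Qwen2-VL 7B on 8 GB: 4-bit quantization via bitsandbytes + the qwen-vl-utils helper.
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
    ),
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "screenshot.png"},  # local path, URL, or PIL image
    {"type": "text", "text": "What error message is shown in this screenshot?"},
]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)
```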

6. LTX-Video — yes, video generation on 8 GB!

The most surprising 2026 capability: local AI video generation on 8 GB cards. LTX-Video (Lightricks 2B) generates 5-second 768×512 clips in:

Quality isn't HunyuanVideo level, but for fast iteration / drafts / b-roll, an 8 GB card is enough to prototype video AI workflows.
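
A hedged sketch with diffusers' LTXPipeline; CPU offload keeps the large T5 text encoder out of the 8 GB budget, and the frame count / fps are illustrative (this pipeline expects num_frames of the form 8k+1).

```python
# LTX-Video on 8 GB: a ~5-second 768×512 clip with model CPU offload.
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

frames = pipe(
    prompt="drone shot over a foggy pine forest at sunrise",
    width=768, height=512,
    num_frames=121,            # ~5 seconds at 24 fps
    num_inference_steps=40,
).frames[0]
export_to_video(frames, "forest.mp4", fps=24)
```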

What's tight on 8 GB (the "yes but" zone)

1. Llama 13B Q4 — works, with partial CPU offload

13B Q4 weights are 7.4 GB. Add KV cache and you peak at 9-10 GB — over the 8 GB ceiling. Workarounds:

Honest verdict: 13B Q4 on 8 GB is "it works, sort of". If you really want 13B to run comfortably, the path is to upgrade to 12 GB+, not to fight offload workarounds.
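
If you do want to try it anyway, the workaround is partial offload: keep as many layers on the GPU as fit and spill the rest to system RAM. A minimal llama-cpp-python sketch, where the file name and layer split are illustrative starting points rather than calibrated numbers:

```python
# Partial offload for 13B Q4 on 8 GB: most layers on the GPU, the rest on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-13b-chat.Q4_K_M.gguf",  # illustrative file name
    n_gpu_layers=32,   # 13B has 40 layers; lower this until you stop hitting OOM
    n_ctx=2048,        # smaller context keeps the KV cache under control
)
print(llm("Q: What is partial offload?\nA:", max_tokens=64)["choices"][0]["text"])
```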

2. Flux.1-dev — possible but painful

NF4-quantized Flux.1-dev peaks at ~10 GB during inference. On 8 GB you need:

Better path: use Flux.1-schnell for daily work, switch to cloud (Replicate, Hugging Face Spaces ~$0.01/image) for the rare Flux.1-dev quality job. Don't burn time on workarounds.

3. SDXL with refiner

Refiner adds ~3-4 GB peak VRAM. With base + refiner + ControlNet + LoRA, you're at 11-12 GB peak, which doesn't fit. Workaround: --medvram in Forge handles this at 25-40 sec per image instead of the native 5-8s on a 12 GB card. Usable but slow.

4. HunyuanVideo Q4 — works, very slow

HunyuanVideo Q4 peaks at ~11-12 GB. On 8 GB you'd need aggressive offload — generation times balloon to 15-30 minutes per 5-second clip. Realistically, skip HunyuanVideo on 8 GB cards. Use LTX-Video (designed for 8 GB) instead.

What's painful on 8 GB (don't bother)

1. Llama 30B / Mistral Small 22B / Codestral 22B

Mid-tier models (22B-32B) are the awkward zone. Q4 weights are 13-19 GB, way over 8 GB. Heavy CPU offload makes them generate at 1-3 t/s — slower than human reading speed. Solution: stick to 7B-13B class on 8 GB, save 30B+ for when you upgrade.

2. Qwen2.5-Coder 32B (the local-coding-agent flagship)

Q4 weights alone are 19 GB. There's no realistic 8 GB path. If you want local coding agents, you need 24 GB VRAM (RTX 3090 used $700-900, RTX 4090, M3 Max 36GB+). Don't try to force Qwen2.5-Coder onto 8 GB — DeepSeek-Coder-V2-Lite (16B MoE, ~12 GB) or smaller models like Qwen2.5-Coder 7B are the realistic alternatives.

3. Full fine-tuning of any model

Full fine-tuning needs roughly 4-6× the fp16 model footprint in VRAM (weights + gradients + optimizer states). Even 7B full fine-tuning needs 30+ GB. Use cloud GPUs for this — RunPod / Together rents A100 40GB at $1-2/hour.
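
A rough back-of-envelope (ignoring activations and memory-saving tricks like gradient checkpointing or 8-bit optimizers) shows why:

```python
# Why full fine-tuning blows past 8 GB: fp16 weights + fp16 grads + fp32 Adam moments.
params = 7e9                      # a 7B model
weights_gb   = params * 2 / 1e9   # fp16 weights
grads_gb     = params * 2 / 1e9   # fp16 gradients
optimizer_gb = params * 8 / 1e9   # Adam keeps two fp32 moments per parameter

print(f"~{weights_gb + grads_gb + optimizer_gb:.0f} GB before activations")  # ~84 GB
```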

What's impossible on 8 GB (no amount of workaround helps)

The 8 GB workflow that actually works

Stop fighting the VRAM ceiling. Pick the right tools for 8 GB:

  1. Default LLM: Qwen2.5 7B Q4_K_M for chat / RAG / code, Phi-4-mini Q4 for autocomplete / fast tasks. Both via Ollama or LM Studio (a minimal Ollama sketch follows this list).
  2. Default image gen: Flux.1-schnell for new images, SDXL base (no refiner) for variety. ComfyUI is the easiest UI; Forge is fine too.
  3. Default video gen: LTX-Video. Don't even try HunyuanVideo on 8 GB.
  4. Default vision: Qwen2-VL 7B Q4. For OCR / screenshot agents / image descriptions.
  5. Default audio: Whisper Large-v3 for transcription, F5-TTS for voice cloning.
  6. For 13B+ or Flux fp16: rent a cloud GPU for the rare job. Don't waste hours on offload or --lowvram workarounds for 5 generations.
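
For step 1, a minimal sketch with the ollama Python client (pip install ollama), assuming the Ollama server is running locally and the model has already been pulled:

```python
# Default-LLM setup from step 1: Qwen2.5 7B through the local Ollama server.
import ollama

resp = ollama.chat(
    model="qwen2.5:7b",  # pulled beforehand with `ollama pull qwen2.5:7b`
    messages=[{"role": "user", "content": "Write a one-line docstring for a retry decorator."}],
)
print(resp["message"]["content"])
```
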
⚠️ The shared-RAM trap (laptops)
Many laptops with "8 GB VRAM" are actually using shared system RAM (Intel Iris Xe, low-tier AMD APUs). These have nothing close to 8 GB usable VRAM in practice — typically 4-6 GB max after driver overhead. If your laptop has only one RAM number listed (e.g., "16 GB unified"), you're in shared-memory territory. Treat it as 4 GB VRAM, not 8.
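
A quick way to see what the driver actually exposes is to ask PyTorch (a hedged sketch; it needs a CUDA build of PyTorch, and shared-memory iGPUs typically show no CUDA device at all):

```python
# Check the dedicated VRAM your GPU actually reports.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB dedicated VRAM")
else:
    print("No CUDA device visible -- likely shared/integrated graphics.")
```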

Should you upgrade from 8 GB?

Cost-benefit framework for upgrading specifically from 8 GB to 12 / 16 / 24 GB:

8 GB → 12 GB (e.g., RTX 4060 → used RTX 3060 12GB)

Cost: ~$200 used 3060 12GB. Benefit: unlocks comfortable Llama 13B Q4, SDXL + refiner, Flux.1-dev NF4 (12-25s vs 60-180s per image), and HunyuanVideo Q4. Massive value-per-dollar. If AI is a primary use case and your card is 8 GB, this is the highest-ROI upgrade in consumer hardware.

8 GB → 16 GB (RTX 4060 Ti 16GB / RX 7800 XT)

Cost: ~$450-550 new. Benefit: comfortable Flux.1-dev fp16, multiple LoRAs, multi-model pipelines. Worth it if you do image gen primarily; less obviously valuable for pure LLM (jump straight to 24 GB if LLM is the goal).

8 GB → 24 GB (used RTX 3090, RTX 4090)

Cost: $700-900 used 3090 / $1700-2000 used 4090. Benefit: Llama 30B Q4, Qwen2.5-Coder 32B, Llama 70B with offload, Flux training, HunyuanVideo fp16. The "I'm serious about local AI" tier. Used RTX 3090 is the best price/perf jump from 8 GB if you can find a clean one.

Don't upgrade if

  1. Your AI use is occasional (5-10 prompts/week) — cloud is cheaper.
  2. Your card is brand new and the 8 GB-friendly stack (Qwen2.5 7B, Flux.1-schnell, LTX-Video) already covers your needs.
  3. You're waiting for the RTX 5060 Ti 16GB (rumored Q3 2026), which would change the value calculus.

What 9bench tells you about your specific 8 GB GPU

Run 9bench.com, scroll to AI Capabilities. The result page shows a workload-feasibility table calibrated for your specific GPU class:

For each model (Llama 7B, Qwen2.5-Coder, Flux.1-dev, HunyuanVideo, etc.) you see YES / MAYBE / NO with concrete tokens-per-second or seconds-per-image. No marketing fluff. Free, no install, 15 seconds.

Common questions

"What about the GTX 1080 Ti 11 GB — it's almost 8 GB?" The 1080 Ti is actually a great-value used pickup at $150-200. 11 GB unlocks Llama 13B Q4 and Flux.1-dev NF4 that 8 GB cards struggle with. No tensor cores so SDXL is ~3× slower than RTX 4060, but for pure LLM the bandwidth (484 GB/s) is closer to 4060 (272 GB/s) than people think.

"My RTX 3070 has 8 GB but should it be faster than 3060 12GB?" For workloads that fit in 8 GB, yes — the 3070 is ~30% faster on raw compute. For workloads that need 12 GB (Llama 13B comfortable, Flux.1-dev), the 3060 12GB wins by being able to actually run them.

"Can I add RTX 4060 + RTX 4060 = 16 GB total?" No — VRAM is per-card, not pooled in consumer-tier setups. You can split a model across two GPUs (each holds half the layers), but each card still needs to fit its half. Two 8 GB cards ≠ one 16 GB card for memory ceiling.

"Will RTX 5060 Ti 16GB make 8 GB cards obsolete?" Rumored Q3 2026 launch at ~$450. If specs hold up (similar to 4060 Ti 16GB but newer arch), it would be the new sweet spot for 16 GB AI workloads. Until then, 8 GB cards remain capable for the workloads listed above.

Find out exactly what your 8 GB card can run — 15 seconds

9bench detects your GPU model, looks up calibrated benchmarks, and shows feasibility for every popular 2026 local AI workload. No install, no signup, no bias.

Test my 8 GB GPU →