TL;DR — What 8 GB VRAM actually buys you in 2026
8 GB VRAM is the 2026 floor for usable local AI. Not the comfort zone — but you can do a surprising amount with it. Test your specific GPU →

"I have an RTX 4060 / 3060 / 3070, what can I run?" is the most-asked question on r/LocalLLaMA in 2026. The honest answer involves a lot of "yes, but" qualifications. This article gives you the unvarnished version — what 8 GB VRAM actually delivers, where it falls apart, and which workarounds are worth using.

All numbers are calibrated against public llama.cpp / ComfyUI benchmarks plus the 9bench GPU-class lookup table for RTX 4060, RTX 3060 8GB, RTX 3070, and AMD RX 7600.

📐 What "8 GB VRAM" cards we're testing
Throughout this article, "8 GB" means the popular consumer cards: RTX 4060 (8 GB GDDR6), RTX 3060 8GB, RTX 4060 Laptop, RTX 3070, RX 7600, RX 6600, plus Apple M2/M3 base models with 8 GB unified memory. Speed varies between them, but the VRAM ceiling is the same.

What 8 GB runs comfortably (the good news)

1. 7B / 8B class LLMs — the sweet spot

Modern 7-9B models with Q4 quantization fit in 5-6 GB peak VRAM, leaving 2-3 GB headroom for KV cache and OS overhead. Performance on RTX 4060:

For chat / RAG / code-completion / summarization use cases, 7B-class models running on 8 GB VRAM are genuinely competitive with cloud GPT-3.5-class APIs. The era of "you need a 4090 to do anything" is over.
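
If you want this from Python rather than a GUI, here is a minimal sketch using llama-cpp-python with full GPU offload. The GGUF file name and prompt are illustrative placeholders, not recommendations; any 7B-class Q4_K_M file works the same way.

```python
# Minimal llama-cpp-python sketch: a 7B-class Q4_K_M GGUF runs entirely on an 8 GB GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-7b-instruct-q4_k_m.gguf",  # illustrative; ~4.7 GB on disk
    n_gpu_layers=-1,   # offload every layer; 7B Q4 leaves room for the KV cache
    n_ctx=4096,        # modest context keeps the KV cache small
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize RAG in two sentences."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```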

2. SDXL without refiner — fine

SDXL base model alone fits in 7-8 GB VRAM. An RTX 4060 takes 14-22 seconds per 1024×1024 image. Slower than an RTX 4090's 3-5s, but very usable.

Skip the refiner — it doubles peak VRAM and would force --medvram. The base model alone produces solid images. If you absolutely need refiner quality, ComfyUI with sequential CPU offload works at 25-40s per image.
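
For reference, a hedged diffusers sketch of the base-only setup; model CPU offload keeps peak VRAM under the 8 GB ceiling at a small speed cost. The prompt and step count are placeholders.

```python
# SDXL base only (no refiner) via diffusers, sized for an 8 GB card.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
pipe.enable_model_cpu_offload()  # keeps only the active sub-model on the GPU

image = pipe("a lighthouse at dusk, volumetric light", num_inference_steps=30).images[0]
image.save("lighthouse.png")
```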

3. Flux.1-schnell — yes, on 8 GB!

Flux.1-schnell is the speed-optimized, 4-step distilled variant of Flux.1-dev, and a practical fit for 8 GB cards. Performance:

Quality is slightly below Flux.1-dev but dramatically better than SDXL. For 8 GB owners who want the modern image-gen experience, Flux.1-schnell is the answer.
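
One way to try it from Python is the diffusers FluxPipeline with sequential CPU offload, sketched below; quantized (GGUF/fp8) checkpoints in ComfyUI are usually the faster route on 8 GB, so treat this as a convenience path rather than the optimal one.

```python
# Flux.1-schnell via diffusers; sequential offload trades speed for low peak VRAM.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_sequential_cpu_offload()  # streams weights to the GPU piece by piece

image = pipe(
    "a watercolor fox in a snowy forest",
    num_inference_steps=4,   # schnell is distilled for ~4 steps
    guidance_scale=0.0,      # schnell runs without classifier-free guidance
    height=1024, width=1024,
).images[0]
image.save("fox.png")
```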

4. Whisper transcription — easy

Whisper Large-v3 (1.5B params) uses 3-4 GB VRAM. Whisper Small uses ~1 GB. Both run comfortably with massive headroom on 8 GB. RTX 4060 transcribes 1 hour of audio in ~5-7 minutes. Fine for podcast/meeting transcription.
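
A minimal sketch with faster-whisper (a CTranslate2-based Whisper runtime, one common choice for keeping VRAM low); the audio file name is a placeholder.

```python
# Whisper Large-v3 transcription on an 8 GB card via faster-whisper.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")  # ~3-4 GB VRAM
segments, info = model.transcribe("podcast.mp3", vad_filter=True)

for seg in segments:
    print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text}")
```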

5. Qwen2-VL 7B — multimodal works

Qwen2-VL 7B (vision-language model) needs ~6-7 GB VRAM. Fits comfortably on 8 GB. RTX 4060 runs OCR / image-description / screenshot-reading at 30-50 tokens/sec. For local agentic workflows that need to "see" the screen, this is the model that actually works on consumer hardware in 2026.
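
A hedged transformers sketch, quantizing the model to 4-bit on load with bitsandbytes so it lands in the ~6-7 GB range quoted above; the screenshot path and question are illustrative, and the message plumbing follows the pattern from Qwen's model card (qwen-vl-utils).

```python
# Qwen2-VL 7B on 8 GB: 4-bit quantization via bitsandbytes + the qwen-vl-utils helper.
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
    ),
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "screenshot.png"},  # local path, URL, or PIL image
    {"type": "text", "text": "What error message is shown in this screenshot?"},
]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)
```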

6. LTX-Video — yes, video generation on 8 GB!

The most surprising 2026 capability: local AI video generation on 8 GB cards. LTX-Video (Lightricks 2B) generates 5-second 768×512 clips in:

Quality isn't HunyuanVideo level, but for fast iteration / drafts / b-roll, an 8 GB card is enough to prototype video AI workflows.
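
A hedged sketch with diffusers' LTXPipeline; CPU offload keeps the large T5 text encoder out of the 8 GB budget, and the frame count / fps are illustrative (this pipeline expects num_frames of the form 8k+1).

```python
# LTX-Video on 8 GB: a ~5-second 768×512 clip with model CPU offload.
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

frames = pipe(
    prompt="drone shot over a foggy pine forest at sunrise",
    width=768, height=512,
    num_frames=121,            # ~5 seconds at 24 fps
    num_inference_steps=40,
).frames[0]
export_to_video(frames, "forest.mp4", fps=24)
```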

What's tight on 8 GB (the "yes but" zone)

1. Llama 13B Q4 — works, with partial CPU offload

13B Q4 weights are 7.4 GB. Add KV cache and you peak at 9-10 GB — over the 8 GB ceiling. Workarounds:

Honest verdict: 13B Q4 on 8 GB is "it works, sort of". If you really want 13B to run comfortably, the path is to upgrade to 12 GB+, not to fight offload workarounds.
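
If you do want to try it anyway, the workaround is partial offload: keep as many layers on the GPU as fit and spill the rest to system RAM. A minimal llama-cpp-python sketch, where the file name and layer split are illustrative starting points rather than calibrated numbers:

```python
# Partial offload for 13B Q4 on 8 GB: most layers on the GPU, the rest on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-13b-chat.Q4_K_M.gguf",  # illustrative file name
    n_gpu_layers=32,   # 13B has 40 layers; lower this until you stop hitting OOM
    n_ctx=2048,        # smaller context keeps the KV cache under control
)
print(llm("Q: What is partial offload?\nA:", max_tokens=64)["choices"][0]["text"])
```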

2. Flux.1-dev — possible but painful

NF4-quantized Flux.1-dev peaks at ~10 GB during inference. On 8 GB you need:

Better path: use Flux.1-schnell for daily work, switch to cloud (Replicate, Hugging Face Spaces ~$0.01/image) for the rare Flux.1-dev quality job. Don't burn time on workarounds.

3. SDXL with refiner

Refiner adds ~3-4 GB peak VRAM. With base + refiner + ControlNet + LoRA, you're at 11-12 GB peak, which doesn't fit. Workaround: --medvram in Forge handles this at 25-40 sec per image instead of the native 5-8s on a 12 GB card. Usable but slow.

4. HunyuanVideo Q4 — works, very slow

HunyuanVideo Q4 peaks at ~11-12 GB. On 8 GB you'd need aggressive offload — generation times balloon to 15-30 minutes per 5-second clip. Realistically, skip HunyuanVideo on 8 GB cards. Use LTX-Video (designed for 8 GB) instead.

What's painful on 8 GB (don't bother)

1. Llama 30B / Mistral Small 22B / Codestral 22B

Mid-tier models (22B-32B) are the awkward zone. Q4 weights are 13-19 GB, way over 8 GB. Heavy CPU offload makes them generate at 1-3 t/s — slower than human reading speed. Solution: stick to 7B-13B class on 8 GB, save 30B+ for when you upgrade.

2. Qwen2.5-Coder 32B (the local-coding-agent flagship)

Q4 weights alone are 19 GB. There's no realistic 8 GB path. If you want local coding agents, you need 24 GB VRAM (RTX 3090 used $700-900, RTX 4090, M3 Max 36GB+). Don't try to force Qwen2.5-Coder onto 8 GB — DeepSeek-Coder-V2-Lite (16B MoE, ~12 GB) or smaller models like Qwen2.5-Coder 7B are the realistic alternatives.

3. Full fine-tuning of any model

Full fine-tuning needs roughly 4-6× the fp16 model footprint in VRAM (weights + gradients + optimizer states). Even 7B full fine-tuning needs 30+ GB. Use cloud GPUs for this — RunPod / Together rents A100 40GB at $1-2/hour.
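
A rough back-of-envelope (ignoring activations and memory-saving tricks like gradient checkpointing or 8-bit optimizers) shows why:

```python
# Why full fine-tuning blows past 8 GB: fp16 weights + fp16 grads + fp32 Adam moments.
params = 7e9                      # a 7B model
weights_gb   = params * 2 / 1e9   # fp16 weights
grads_gb     = params * 2 / 1e9   # fp16 gradients
optimizer_gb = params * 8 / 1e9   # Adam keeps two fp32 moments per parameter

print(f"~{weights_gb + grads_gb + optimizer_gb:.0f} GB before activations")  # ~84 GB
```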

What's impossible on 8 GB (no amount of workaround helps)

The 8 GB workflow that actually works

Stop fighting the VRAM ceiling. Pick the right tools for 8 GB:

  1. Default LLM: Qwen2.5 7B Q4_K_M for chat / RAG / code, Phi-4-mini Q4 for autocomplete / fast tasks. Both via Ollama or LM Studio (a minimal Ollama sketch follows this list).
  2. Default image gen: Flux.1-schnell for new images, SDXL base (no refiner) for variety. ComfyUI is the easiest UI; Forge is fine too.
  3. Default video gen: LTX-Video. Don't even try HunyuanVideo on 8 GB.
  4. Default vision: Qwen2-VL 7B Q4. For OCR / screenshot agents / image descriptions.
  5. Default audio: Whisper Large-v3 for transcription, F5-TTS for voice cloning.
  6. For 13B+ or Flux fp16: rent a cloud GPU for the rare job. Don't waste hours on offload or --lowvram workarounds for 5 generations.
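
For step 1, a minimal sketch with the ollama Python client (pip install ollama), assuming the Ollama server is running locally and the model has already been pulled:

```python
# Default-LLM setup from step 1: Qwen2.5 7B through the local Ollama server.
import ollama

resp = ollama.chat(
    model="qwen2.5:7b",  # pulled beforehand with `ollama pull qwen2.5:7b`
    messages=[{"role": "user", "content": "Write a one-line docstring for a retry decorator."}],
)
print(resp["message"]["content"])
```
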
⚠️ The shared-RAM trap (laptops)
Many laptops with "8 GB VRAM" are actually using shared system RAM (Intel Iris Xe, low-tier AMD APUs). These have nothing close to 8 GB usable VRAM in practice — typically 4-6 GB max after driver overhead. If your laptop has only one RAM number listed (e.g., "16 GB unified"), you're in shared-memory territory. Treat it as 4 GB VRAM, not 8.
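
A quick way to see what the driver actually exposes is to ask PyTorch (a hedged sketch; it needs a CUDA build of PyTorch, and shared-memory iGPUs typically show no CUDA device at all):

```python
# Check the dedicated VRAM your GPU actually reports.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB dedicated VRAM")
else:
    print("No CUDA device visible -- likely shared/integrated graphics.")
```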

Should you upgrade from 8 GB?

Cost-benefit framework for upgrading specifically from 8 GB to 12 / 16 / 24 GB:

8 GB → 12 GB (e.g., RTX 4060 → used RTX 3060 12GB)

Cost: ~$200 used 3060 12GB. Benefit: unlocks comfortable Llama 13B Q4, SDXL + refiner, Flux.1-dev NF4 (12-25s vs 60-180s per image), and HunyuanVideo Q4. Massive value-per-dollar. If AI is a primary use case and your card is 8 GB, this is the highest-ROI upgrade in consumer hardware.

8 GB → 16 GB (RTX 4060 Ti 16GB / RX 7800 XT)

Cost: ~$450-550 new. Benefit: comfortable Flux.1-dev fp16, multiple LoRAs, multi-model pipelines. Worth it if you do image gen primarily; less obviously valuable for pure LLM (jump straight to 24 GB if LLM is the goal).

8 GB → 24 GB (used RTX 3090, RTX 4090)

Cost: $700-900 used 3090 / $1700-2000 used 4090. Benefit: Llama 30B Q4, Qwen2.5-Coder 32B, Llama 70B with offload, Flux training, HunyuanVideo fp16. The "I'm serious about local AI" tier. Used RTX 3090 is the best price/perf jump from 8 GB if you can find a clean one.

Don't upgrade if

  1. Your AI use is occasional (5-10 prompts/week) — cloud is cheaper.
  2. Your card is brand new and the 8 GB-friendly stack (Qwen2.5 7B, Flux.1-schnell, LTX-Video) already covers your needs.
  3. You're waiting for the RTX 5060 Ti 16GB (rumored Q3 2026), which would change the value calculus.

What 9bench tells you about your specific 8 GB GPU

Run 9bench.com, scroll to AI Capabilities. The result page shows a workload-feasibility table calibrated for your specific GPU class:

For each model (Llama 7B, Qwen2.5-Coder, Flux.1-dev, HunyuanVideo, etc.) you see YES / MAYBE / NO with concrete tokens-per-second or seconds-per-image. No marketing fluff. Free, no install, 15 seconds.

Common questions

"What about the GTX 1080 Ti 11 GB — it's almost 8 GB?" The 1080 Ti is actually a great-value used pickup at $150-200. 11 GB unlocks Llama 13B Q4 and Flux.1-dev NF4 that 8 GB cards struggle with. No tensor cores so SDXL is ~3× slower than RTX 4060, but for pure LLM the bandwidth (484 GB/s) is closer to 4060 (272 GB/s) than people think.

"My RTX 3070 has 8 GB but should it be faster than 3060 12GB?" For workloads that fit in 8 GB, yes — the 3070 is ~30% faster on raw compute. For workloads that need 12 GB (Llama 13B comfortable, Flux.1-dev), the 3060 12GB wins by being able to actually run them.

"Can I add RTX 4060 + RTX 4060 = 16 GB total?" No — VRAM is per-card, not pooled in consumer-tier setups. You can split a model across two GPUs (each holds half the layers), but each card still needs to fit its half. Two 8 GB cards ≠ one 16 GB card for memory ceiling.

"Will RTX 5060 Ti 16GB make 8 GB cards obsolete?" Rumored Q3 2026 launch at ~$450. If specs hold up (similar to 4060 Ti 16GB but newer arch), it would be the new sweet spot for 16 GB AI workloads. Until then, 8 GB cards remain capable for the workloads listed above.

Find out exactly what your 8 GB card can run — 15 seconds

9bench detects your GPU model, looks up calibrated benchmarks, and shows feasibility for every popular 2026 local AI workload. No install, no signup, no bias.

Test my 8 GB GPU →