9bench includes a "live LLM test" that runs a small open-weight LLM in your browser (Phi-3-mini on machines with ~3.5 GB+ of browser memory headroom; Qwen 0.5B or DistilGPT-2 on memory-constrained devices) and reports real tokens/second. The first thing users notice: the number is much lower than what they see in Ollama or LM Studio on the same machine. RTX 4090 owners report 15-25 tokens/s in the 9bench browser test versus 100-160 t/s in llama.cpp. That gap is real, and it is not a 9bench bug.
This article explains why browser inference is slower, by how much, when the browser is fast enough anyway, and what will change over the next 12-24 months.
The headline numbers
Measured on identical hardware, same model (Llama 7B Q4_K_M), same prompt, default settings:
| Hardware | Native (llama.cpp) | Browser (transformers.js) | Slowdown |
|---|---|---|---|
| RTX 4090 (24 GB) | 100-160 t/s | 15-25 t/s | ~6-7× |
| RTX 4070 Super (12 GB) | 50-75 t/s | 10-18 t/s | ~5× |
| RTX 4060 Laptop (8 GB) | 25-45 t/s | 5-12 t/s | ~5× |
| Apple M3 Max | 50-80 t/s (MLX) | 8-15 t/s | ~6-8× |
| Apple M2 (base) | 20-35 t/s (MLX) | 3-7 t/s | ~5-6× |
| Ryzen 7840U iGPU (Radeon 780M) | 8-18 t/s (Vulkan) | 2-4 t/s | ~4-5× |
| Intel Iris Xe iGPU | 5-12 t/s (CPU+iGPU) | 1.5-3 t/s | ~3-4× |
Two things stand out. First, the slowdown ratio is fairly consistent at 4-8× across all hardware classes, so it's not a quirk of one chip. Second, weaker hardware suffers a slightly smaller penalty (3-4× on iGPUs vs 6-7× on top-end discrete GPUs): the browser's biggest handicap is losing tensor cores and vendor-tuned kernels, and iGPUs give native code little of that advantage to exploit in the first place.
Why is the browser slower? Five concrete reasons
1. WebAssembly overhead (1.5-2× slower than native code)
WebAssembly (Wasm) is a portable bytecode that browsers JIT-compile to machine code. Wasm SIMD adds vectorised ops which help, but Wasm code still runs slower than equivalent native C++ for several reasons:
- Bounds checking on memory access (security requirement)
- Limited visibility into register pressure, so the JIT can't always allocate registers optimally
- No vendor-specific intrinsics (AVX-512, NEON-Dotprod, AMX) that llama.cpp uses heavily
- No tail-call optimisation in many engines
Benchmarks comparing Wasm SIMD against native AVX2 consistently show a 1.5-2× slowdown. For the parts of LLM inference that run on the CPU (tokenisation, sampling, KV-cache management), this is the dominant cost.
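Whether a given machine even stays inside that 1.5-2× band depends on the engine exposing SIMD and threads at all. A minimal sketch of the check, using the wasm-feature-detect npm package (a common approach in browser ML code; the package and its exports are assumptions you can swap for your own probes):

```ts
// Minimal sketch: detect which Wasm extensions this engine supports.
// Engines without SIMD or threads fall back to scalar, single-threaded
// kernels, which widens the gap well beyond 1.5-2x.
import { simd, threads } from "wasm-feature-detect";

export async function wasmCapabilities(): Promise<void> {
  const hasSimd = await simd();       // 128-bit vector ops (v128)
  const hasThreads = await threads(); // SharedArrayBuffer + atomics
  console.log(`Wasm SIMD: ${hasSimd}, threads: ${hasThreads}`);
}
```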
2. WebGPU has stricter constraints than CUDA / Metal / Vulkan
WebGPU is the new standard, and genuinely impressive, but it deliberately restricts certain operations for security and portability. The constraints that matter for LLM inference:
- Workgroup memory size limited to 16 KB on most GPUs (CUDA allows 48-100 KB depending on chip)
- No persistent kernel programs — every dispatch reinitialises state
- FP16 (shader-f16) is optional and missing on many laptops/older GPUs. Without it, matmul runs in FP32 → 2× memory bandwidth, 2× compute
- No int8 or int4 dot products at the WebGPU level (yet — coming in WGSL extensions)
- No tensor cores directly accessible — you get general-purpose shaders only
These add up. A native llama.cpp build for an RTX 4090 uses CUDA tensor cores in FP16 with int4 weight de-quantisation kernels. The WebGPU equivalent runs general-purpose FP16 (if available) without tensor-core acceleration. That alone explains 2-3× of the gap.
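You can inspect these constraints on your own machine. The sketch below uses the standard WebGPU API; the thresholds in the comments come from the list above, and `maxComputeWorkgroupStorageSize` defaults to 16 KB per the spec:

```ts
// Minimal sketch: report the WebGPU features and limits that matter most
// for LLM kernels on this browser/GPU combination.
export async function probeWebGPU(): Promise<void> {
  const adapter = await navigator.gpu?.requestAdapter();
  if (!adapter) {
    console.log("No WebGPU adapter: inference falls back to Wasm-only");
    return;
  }

  // Optional feature: without shader-f16, matmuls run in FP32 at roughly
  // twice the memory traffic and twice the compute.
  const hasF16 = adapter.features.has("shader-f16");

  // Workgroup (shared) memory cap: 16 KB by default, vs 48-100 KB in CUDA.
  const wgMem = adapter.limits.maxComputeWorkgroupStorageSize;

  // Largest single buffer a tab can allocate: bounds usable model size.
  const maxBuf = adapter.limits.maxBufferSize;

  console.log(`shader-f16: ${hasF16}`);
  console.log(`workgroup memory: ${(wgMem / 1024).toFixed(0)} KB`);
  console.log(`max buffer: ${(maxBuf / 2 ** 30).toFixed(1)} GiB`);

  // Request the device with f16 enabled when the adapter offers it.
  await adapter.requestDevice({
    requiredFeatures: hasF16 ? ["shader-f16"] : [],
  });
}
```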
3. Memory bandwidth bottleneck (browser caps + sandbox cost)
LLM inference is bandwidth-bound. Each generated token requires reading every weight from VRAM. The faster you can read, the faster tokens come out.
In the browser:
- WebGPU has to validate every buffer transfer (security)
- Browser tabs are typically capped at 4-8 GB of GPU memory by the browser and OS
- WebGPU buffer reads may be coalesced through CPU-side staging in some implementations
- The browser process and the GPU process are separate, so every call crosses an IPC boundary
Native llama.cpp has direct, unlimited GPU memory access. It can map the entire model into pinned VRAM and read straight from compute kernels. The browser must marshal through layers. On a 4090, the effective bandwidth available to a browser tab is closer to 60-70% of what native gets.
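A back-of-envelope calculation shows why that bandwidth haircut matters. Decoding is bandwidth-bound, so an upper bound on tokens/s is effective bandwidth divided by model size. The figures below (4090 peak bandwidth, Q4 model size) are approximations for illustration:

```ts
// Bandwidth-bound decoding ceiling: each generated token reads roughly all
// model weights once, so tokens/s <= effective_bandwidth / model_bytes.
function tokenCeiling(bandwidthGBs: number, modelGB: number): number {
  return bandwidthGBs / modelGB;
}

const modelGB = 4.0;                 // Llama 7B Q4_K_M weights, approx.
const nativeBw = 1008;               // RTX 4090 peak VRAM bandwidth, GB/s
const browserBw = nativeBw * 0.65;   // ~60-70% effective in a browser tab

console.log(tokenCeiling(nativeBw, modelGB).toFixed(0));  // ~252 t/s ceiling
console.log(tokenCeiling(browserBw, modelGB).toFixed(0)); // ~164 t/s ceiling
// Real throughput sits far below both ceilings (compute, KV cache, and
// sampling all take time), but the bandwidth loss alone costs ~35% before
// any of the other browser penalties apply.
```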
4. No KV-cache optimisations browser-side (yet)
Modern LLM serving uses sophisticated KV-cache strategies: paged attention, sliding-window cache, Flash-Attention 2/3 for long contexts. These are critical when generating many tokens or handling long prompts.
transformers.js, as of early 2026, implements a basic KV cache but not paged attention or Flash-Attention. WebLLM has slightly better KV handling. Native llama.cpp has all of them.
For short responses (200 tokens), this matters less. For long generations (2000+ tokens) or long-context conversations, the gap widens.
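The stakes are easy to quantify. Here's a rough KV-cache footprint for a Llama-7B-class model; the shapes (32 layers, 32 KV heads, head dim 128) are the common configuration, used purely for illustration:

```ts
// KV-cache bytes per token of context:
//   2 (K and V) * layers * kv_heads * head_dim * bytes_per_value
function kvBytesPerToken(
  layers: number, kvHeads: number, headDim: number, bytes: number,
): number {
  return 2 * layers * kvHeads * headDim * bytes;
}

const perToken = kvBytesPerToken(32, 32, 128, 2);  // FP16 cache
console.log(perToken / 1024);                      // 512 KB per token
console.log((perToken * 2048) / 2 ** 30);          // 1 GiB at 2K context
// Paged attention exists to manage exactly this growth; without it, long
// contexts eat straight into the tab's already-capped GPU memory.
```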
5. Initial load time + warm-up cost
Native LLM tools download the model once and load it into memory. Subsequent runs reuse the loaded weights.
Browser LLM:
- First visit: download model from Hugging Face CDN (1.7 GB for Phi-3-mini, 4.5 GB for Llama 7B Q4)
- Subsequent visits: weights are cached in the browser via IndexedDB (good!), but JIT compilation of WebAssembly and WGSL shaders happens on every page load (5-15 seconds)
- The first token after a page load is much slower than subsequent tokens (warm-up cost)
For "open the page, ask one question, leave" use cases, this initial cost dominates the actual inference cost. tokens/s during steady-state generation is the right thing to compare, which is what 9bench measures.
When the browser is fast enough anyway
Slower than native isn't the same as too slow. For many use cases, browser LLM is plenty.
Use case: interactive writing assistance (10+ t/s is enough)
A blog editor that suggests sentence completions doesn't need 100 t/s. It needs 5-15 t/s consistently, with low latency to the first token. Browser LLM hits this on virtually any GPU released since 2020.
Notable examples shipping in 2025-2026: Notion AI's optional client-side mode, GitHub Copilot's experimental "private browser inference", several form-autofill services.
Use case: privacy-sensitive client-side AI (no upload allowed)
Healthcare apps, legal documents, financial planning tools — anywhere the user data must not leave the device. Browser LLM is the only option that keeps data on the client without requiring the user to install software.
The privacy win is total: model weights download to the browser, inference happens locally, nothing about the user's input ever touches a server. No "we promise we won't train on your data" trust required.
Use case: cross-platform deployment (mobile + desktop + Chromebook)
Native LLM tools require platform-specific builds. Mobile is especially hard (iOS forbids JIT, complicating Wasm; Android has fragmented GPU support). Browser LLM runs anywhere with a modern Chromium.
9bench itself is the demo: same browser code runs on Windows desktop, Mac, Linux, Chromebook, Android tablet, iPad. We didn't ship 6 builds. We shipped one URL.
Use case: educational / demo / "is local AI possible on my machine?"
For users who don't know what an LLM is, asking them to install Ollama is a non-starter. Loading a webpage and clicking a button isn't. Browser LLM at 5-15 t/s is dramatically more accessible than native LLM at 50-100 t/s — because most users will never get to the 50-100.
When the browser is NOT enough
Use native (llama.cpp / Ollama / LM Studio / MLX) when:
- Coding agents / autonomous loops — these often need 50+ t/s for usable iteration speed. Browser caps you at ~20 t/s on top hardware.
- Long-context tasks (32K+ tokens) — browser KV-cache handling degrades sharply above ~4K context.
- Batch processing — running 100 queries through a model. Native batches efficiently; browser does not.
- 70B+ models — browser memory caps prevent loading them. You're effectively capped around 7B-13B even on a 4090.
- Fine-tuning / training — basically not viable in browser at this point. Use native.
Rule of thumb: if you'd hit "stop generation" and edit the model output mid-stream, browser is fine. If you'd kick off a 30-minute run and walk away, use native.
How 9bench reports browser-vs-native — and why
We built two pieces specifically to address this confusion:
- Browser score (measured): What 9bench actually observes in your browser. This is the honest "what you'd get from a Phi-3-mini chat in this tab right now" number.
- Native estimate (calibrated): Looked up from a curated GPU-class table (50+ entries) based on your detected GPU. This is what llama.cpp / Ollama would generate on the same hardware.
Both numbers appear side-by-side on the result page, with the gap explained. Goal: nobody walks away thinking "9bench says my 4090 only does 20 t/s, the marketing was lying" when the real story is "browser does 20, native does 150, and both numbers are correct".
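Mechanically, the native estimate is a table lookup keyed off the detected GPU. A minimal sketch of the idea follows; the table entries and matching rule are illustrative, not 9bench's actual calibration data:

```ts
type NativeEstimate = { match: RegExp; nativeTps: [number, number] };

// Illustrative entries; the real table has 50+ GPU classes.
const NATIVE_TABLE: NativeEstimate[] = [
  { match: /rtx 4090/i, nativeTps: [100, 160] },
  { match: /rtx 4070/i, nativeTps: [50, 75] },
  { match: /apple m3/i, nativeTps: [50, 80] },
];

export async function nativeEstimate(): Promise<[number, number] | null> {
  const adapter = await navigator.gpu?.requestAdapter();
  if (!adapter) return null;
  // GPUAdapterInfo exposes vendor/architecture/description strings; how
  // much detail you get varies by browser for fingerprinting reasons.
  const { vendor, architecture, description } = adapter.info;
  const id = `${vendor} ${architecture} ${description}`;
  return NATIVE_TABLE.find((e) => e.match.test(id))?.nativeTps ?? null;
}
```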
What changes in 2026-2027
The browser-vs-native gap is shrinking. Three forces are pushing it down:
WebGPU 2026 features (FP16 baseline, subgroup ops, FP8)
The WebGPU spec is adding subgroup operations (warp-level primitives equivalent to CUDA's __shfl) and broader FP8 support. These were the missing primitives needed for tensor-core-class kernels in the browser. Chrome 132+ has experimental support; full rollout is expected late 2026. Impact: 1.5-2× speedup on top-end GPUs.
WebGPU Tensor extension (proposal)
Direct tensor-core access via a WebGPU extension is in early proposal stage. If it lands (2027 realistic), browser LLM on RTX/H100/M-series would close most of the remaining gap. "If" is doing real work in that sentence — extensions like this often take 2-3 years.
Better browser-native ML libraries
transformers.js v3 (early 2026) added Wasm SIMD threadpool optimisations. WebLLM is shipping paged attention. Microsoft's ONNX Runtime Web is integrating Flash-Attention. Each of these is a 1.2-1.5× compounding speedup.
Stacked: by mid-2027, expect browser LLM at roughly 3× slower than native (vs 4-8× today). That's the realistic ceiling. Native will always keep some edge because of unrestricted memory access and vendor extensions.
Practical recommendations
If you're a user: use native (Ollama, LM Studio) for serious work. Use browser LLM (9bench, demos on websites) for triage and "does this work on my machine" checks.
If you're building a product:
- Privacy-critical or wide-distribution → browser LLM. Accept the slowdown; it's worth it for the deployment story.
- Power-user tooling → native. Ship as Tauri/Electron app or CLI. Don't apologise for "users have to install".
- Hybrid → browser as fallback, native as accelerator. Best UX: works on any device, 10× faster on power-user setups.
If you're benchmarking: always cite which mode you measured. "RTX 4090 does X tokens/s" is not a complete sentence. "RTX 4090 does X t/s on browser via 9bench" or "Y t/s on llama.cpp native" is.
Common questions
"Can I run Llama 70B in the browser?" Technically yes via WebLLM with model sharding. Practically: no. Memory caps prevent it on most browsers. Even on Chrome with flags enabled, 70B Q4 (~40 GB) exceeds the per-tab limit. Stick to 7B-13B in browser.
"Why does the first run take so long?" First run = downloading model weights (1-5 GB) + JIT-compiling Wasm/WGSL kernels. Subsequent runs in the same browser reuse cached weights and warm shaders. Typical speedup: 5-10× on second run. 9bench shows this in the load-time breakdown.
"Does WebGPU work on iPhone Safari?" WebGPU is enabled by default in Safari 17.4+ (iOS 17.4+). Most iPhone 14+ devices can run small LLMs via transformers.js. iPad Pro M-series chips run them comfortably. Older iPhones and iOS 16- fall back to Wasm-only, much slower.
"Is WebLLM faster than transformers.js?" Slightly, in some cases. WebLLM (from MLC) implements paged attention and uses TVM-compiled kernels. transformers.js is more general-purpose and supports more model architectures. For Llama-class models, WebLLM is ~20-40% faster. For general experimentation, transformers.js has wider model support.
Run a live LLM test in your browser — 60 seconds
9bench downloads a small open-weight LLM (Phi-3-mini on most machines, smaller fallbacks for memory-constrained browsers) and measures real tokens/second on your hardware. Compare against the native estimate on your result page. No install, no account, model cached after first run for instant subsequent tests.
Test browser LLM speed →