Browser-Based LLM Inference with TEA WASM#
The Edge Agent Team
Open Source Project
Abstract#
This article presents TEA WASM LLM, a batteries-included WebAssembly package
for running large language models directly in the browser. We describe the
architecture that bundles the wllama engine internally, eliminating external
npm dependencies. The package provides a simple API (initLlm, chat, embed)
and leverages IndexedDB for model caching, enabling offline-capable LLM inference
with a single import. A live demo illustrates practical browser-based AI
applications running entirely without server-side components.
Keywords: WebAssembly, LLM, Browser, wllama, Offline AI, TEA, Edge Computing
1. Introduction#
The deployment of Large Language Models (LLMs) traditionally requires significant server-side infrastructure, including GPUs, API endpoints, and network connectivity. This creates barriers for privacy-sensitive applications, offline use cases, and edge computing scenarios where network latency is unacceptable.
WebAssembly (WASM) offers an alternative: compile the model runtime to WASM and execute inference directly in the user’s browser. This approach provides:
- Privacy: Data never leaves the user’s device
- Offline capability: No network required after initial model download
- Zero infrastructure: No servers to maintain
- Instant deployment: Ship via static hosting (e.g., GitHub Pages)
TEA WASM LLM builds on this foundation with a “batteries-included” philosophy: bundle the wllama engine (llama.cpp for WASM) internally, eliminating the need for users to install or configure external dependencies.
2. Architecture#
2.1 Batteries-Included Design#
The traditional approach to browser-based LLMs requires multiple moving parts:
```mermaid
flowchart LR
  subgraph Browser["Browser/Host Page"]
    direction LR
    APP["Your App"] -->|"callback"| BRIDGE["JS Bridge"]
    BRIDGE -->|"API calls"| WLLAMA["wllama\n(npm pkg)\nEXTERNAL"]
  end
  NPM["npm install\n@wllama/wllama"] -.->|"User must install"| WLLAMA
```
TEA WASM LLM simplifies this to a single import:
```mermaid
flowchart TB
  subgraph Package["tea-wasm-llm (~5-10MB)"]
    WASM["TEA WASM Core"]
    WLLAMA["wllama.wasm (embedded)"]
    LOADER["Model Loader"]
    CACHE["IndexedDB Cache"]
  end
  subgraph Browser["Browser"]
    APP["Your App"]
    IDB[(IndexedDB)]
  end
  MODEL["Gemma-3-1b.gguf (~1.3GB)"]
  APP -->|"import"| Package
  WASM --> WLLAMA
  LOADER --> CACHE
  CACHE --> IDB
  MODEL -.->|"lazy load"| LOADER
```
2.2 Component Overview#
| Component | Purpose | Size |
|---|---|---|
| TEA WASM Core | YAML execution, state management | ~500KB |
| wllama.wasm | llama.cpp WASM runtime | ~3-4MB |
| Model Loader | HTTP fetch with progress | ~10KB |
| IndexedDB Cache | Persistent model storage | ~10KB |
| Total Package | All dependencies bundled | ~5-8MB |
2.3 Multi-Threading Support#
Modern browsers support multi-threaded WASM via SharedArrayBuffer, but this requires specific HTTP headers:
```
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
```
For hosting platforms like GitHub Pages that don’t support custom headers, TEA WASM LLM includes a service worker (coi-serviceworker.js) that intercepts requests and adds these headers.
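In practice this just means loading the bundled worker script before any module code. A minimal sketch, assuming coi-serviceworker.js sits next to index.html as in the demo layout (Section 5.2); the register-and-reload behavior described in the comment is the upstream coi-serviceworker convention, not something defined by this package:

```html
<!-- Load the COOP/COEP service worker first. On the first visit it registers
     itself and reloads the page so SharedArrayBuffer becomes available
     (assumed upstream coi-serviceworker behavior). -->
<script src="coi-serviceworker.js"></script>
<script type="module" src="app.js"></script>
```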
The package automatically detects threading capability:
```js
import { hasCoopCoep } from 'tea-wasm-llm';

if (hasCoopCoep()) {
  console.log('Multi-threaded WASM available');
} else {
  console.log('Falling back to single-threaded mode');
}
```
3. Quick Start#
3.1 Installation#
<script type="module">
import { initLlm, chat, chatStream, embed } from './pkg/index.js';
// Initialize with model URL
await initLlm({
modelUrl: 'https://huggingface.co/unsloth/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q8_0.gguf',
onProgress: (loaded, total) => {
console.log(`Loading: ${Math.round(loaded / total * 100)}%`);
},
});
// Simple chat
const response = await chat("What is the capital of France?");
console.log(response.content);
</script>
3.2 Streaming Responses#
For real-time token-by-token output:
```js
await chatStream("Tell me a story about a brave knight", (token) => {
  document.getElementById('output').textContent += token;
}, {
  maxTokens: 200,
  temperature: 0.8,
});
```
3.3 Embeddings#
Generate vector embeddings for semantic search:
```js
const embedding = await embed("Hello world");
console.log(embedding.vector);    // Float32Array
console.log(embedding.dimension); // e.g., 384
```
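Because the vector comes back as a Float32Array, a small amount of vector math is enough for a basic semantic search. A minimal sketch using the embed call above; the candidate strings and ranking loop are illustrative:

```js
// Hedged sketch: rank candidate texts against a query by cosine similarity
// over embed() vectors. The candidate strings are illustrative.
import { embed } from './pkg/index.js';

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

const query = await embed("How is the model cached for offline use?");
const candidates = [
  "IndexedDB stores the model after the first download.",
  "Temperature controls sampling randomness.",
];

const ranked = [];
for (const text of candidates) {
  const { vector } = await embed(text);
  ranked.push({ text, score: cosine(query.vector, vector) });
}
ranked.sort((a, b) => b.score - a.score);
console.log(ranked[0].text); // best match for the query
```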
4. API Reference#
4.1 Core Functions#
| Function | Description | Returns |
|---|---|---|
| `initLlm` | Initialize engine and load model | Promise (resolves once the model is loaded) |
| `chat` | Generate completion | Promise of a response object with `content` |
| `chatStream` | Stream tokens | Promise (resolves when streaming completes) |
| `embed` | Generate embeddings | Promise of `{ vector, dimension }` |
| | Check if initialized | |
| | Free resources | |
4.2 Configuration Options#
```ts
interface InitLlmConfig {
  modelUrl?: string;       // URL to GGUF model file
  modelBasePath?: string;  // Base path for manifest-based loading
  onProgress?: (loaded: number, total: number) => void;
  onReady?: () => void;
  useCache?: boolean;      // Default: true
  singleThread?: boolean;  // Force single-thread mode
  nCtx?: number;           // Context size (default: 2048)
  verbose?: boolean;       // Enable logging
}
```
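As an illustration, the sketch below combines several of these options, including wiring the threading check from Section 2.3 into singleThread. The nCtx and useCache values are illustrative, and importing hasCoopCoep from ./pkg/index.js assumes it is exported from the same entry point as the other functions:

```js
// Hedged sketch: initLlm with the documented options. Values are illustrative.
import { initLlm, hasCoopCoep } from './pkg/index.js'; // entry point as in the Quick Start

await initLlm({
  modelUrl: 'https://huggingface.co/unsloth/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q8_0.gguf',
  useCache: true,                 // persist the model in IndexedDB (the default)
  singleThread: !hasCoopCoep(),   // fall back when COOP/COEP headers are missing
  nCtx: 4096,                     // larger context than the 2048 default
  verbose: false,
  onProgress: (loaded, total) =>
    console.log(`Loading: ${Math.round((loaded / total) * 100)}%`),
  onReady: () => console.log('Model ready'),
});
```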
4.3 Chat Options#
```ts
interface ChatOptions {
  maxTokens?: number;   // Default: 100
  temperature?: number; // Default: 0.7
  topP?: number;        // Default: 0.9
  topK?: number;        // Default: 0 (disabled)
  stop?: string[];      // Stop sequences
  system?: string;      // System prompt
}
```
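As a usage illustration, the sketch below passes a system prompt, sampling settings, and a stop sequence to chat. It assumes chat accepts an options object as its second argument, mirroring chatStream; the prompt and values are illustrative:

```js
// Hedged sketch: chat with ChatOptions. Assumes chat(prompt, options), mirroring
// chatStream(prompt, callback, options); the exact signature is not shown above.
import { chat } from './pkg/index.js';

const response = await chat("List three uses of sentence embeddings.", {
  system: "You are a concise technical assistant.",
  maxTokens: 150,
  temperature: 0.3,   // lower temperature for more deterministic answers
  topP: 0.9,
  stop: ["\n\n"],     // stop at the first blank line
});
console.log(response.content);
```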
5. Demo Application#
A live demo is available at: TEA WASM LLM Demo
5.1 Features#
- Chat Interface: Interactive conversation with Gemma 3 1B
- Model Loading Progress: Real-time download progress indicator
- Cache Status: Shows cached model size and status
- YAML Workflow: Execute TEA YAML workflows with LLM actions
- Multi-threading Indicator: Shows threading mode
5.2 Demo Architecture#
```
docs/wasm-demo/
├── index.html            # Main demo page with chat UI
├── app.js                # Demo application logic
├── style.css             # Responsive styling
├── coi-serviceworker.js  # COOP/COEP headers for GitHub Pages
└── README.md             # Local development instructions
```
6. Browser Compatibility#
| Browser | Single-thread | Multi-thread | Notes |
|---|---|---|---|
| Chrome 90+ | Yes | Yes | Full support |
| Firefox 90+ | Yes | Yes | Full support |
| Safari 15+ | Yes | No | No SharedArrayBuffer |
| Edge 90+ | Yes | Yes | Full support |
6.1 Memory Requirements#
| Mode | Minimum RAM | Recommended RAM |
|---|---|---|
| Loading | 2 GB | 4 GB |
| Inference | 3 GB | 6 GB |
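These figures can be checked against the coarse hint some browsers expose before starting the ~1.3GB download. A minimal sketch; navigator.deviceMemory is a standard but Chromium-only API that may be undefined elsewhere, and the 4 GB threshold simply mirrors the recommended row above:

```js
// Hedged sketch: pre-flight memory check before loading the model.
// navigator.deviceMemory is undefined outside Chromium-based browsers.
import { initLlm } from './pkg/index.js';

const MODEL_URL =
  'https://huggingface.co/unsloth/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q8_0.gguf';

const reportedGb = navigator.deviceMemory; // e.g., 8; capped and quantized by the browser
if (reportedGb !== undefined && reportedGb < 4) {
  console.warn(`Only ~${reportedGb}GB RAM reported; loading and inference may struggle.`);
} else {
  await initLlm({ modelUrl: MODEL_URL });
}
```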
7. Comparison with Alternatives#
7.1 vs. Traditional Callback Bridge#
| Aspect | Callback Bridge | Batteries Included |
|---|---|---|
| Dependencies | @wllama/wllama npm | None |
| Setup code | ~20 lines | ~3 lines |
| Failure points | Multiple | Single |
| Bundle size | Smaller (but split) | Larger (but unified) |
7.2 vs. Server-Based APIs#
| Aspect | Server API | Browser WASM |
|---|---|---|
| Privacy | Data leaves device | Data stays local |
| Latency | Network dependent | Zero network |
| Cost | API charges | Free after download |
| Offline | No | Yes |
| Setup | API keys required | Just import |
8. Performance Considerations#
8.1 Initial Load#
- First visit: Download the ~1.3GB model (cached afterward)
- Subsequent visits: Load from the IndexedDB cache (~5-10 seconds); see the storage check sketched below
- WASM compilation: ~2-3 seconds
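Whether the cached model actually survives between visits depends on the storage quota the browser grants the origin. A minimal sketch using the standard StorageManager API (a browser API, not part of tea-wasm-llm); the 1.5 GB margin is an assumption based on the ~1.3GB model size:

```js
// Hedged sketch: inspect the origin's storage usage and quota before relying
// on the IndexedDB cache. Uses the standard navigator.storage.estimate() API.
if (navigator.storage && navigator.storage.estimate) {
  const { usage, quota } = await navigator.storage.estimate();
  console.log(`Using ${(usage / 1e9).toFixed(2)}GB of a ${(quota / 1e9).toFixed(2)}GB quota`);
  if (quota - usage < 1.5e9) {
    console.warn('Remaining quota may be too small to keep the model cached.');
  }
}
```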
8.2 Inference Speed#
Approximate token generation speeds on consumer hardware:
| CPU | Single-thread | Multi-thread |
|---|---|---|
| Intel i7 (8th gen) | ~5 tok/s | ~15 tok/s |
| Apple M1 | ~8 tok/s | ~25 tok/s |
| AMD Ryzen 5 | ~6 tok/s | ~18 tok/s |
Speeds vary based on model size, context length, and browser.
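Throughput on a particular machine can be estimated directly from the chatStream callback. A minimal sketch; counting callback invocations as tokens is an approximation, and the prompt is illustrative:

```js
// Hedged sketch: rough tokens-per-second measurement using chatStream.
import { chatStream } from './pkg/index.js';

let tokens = 0;
const start = performance.now();
await chatStream("Explain WebAssembly in one paragraph.", () => { tokens++; }, {
  maxTokens: 100,
});
const seconds = (performance.now() - start) / 1000;
console.log(`${tokens} tokens in ${seconds.toFixed(1)}s (~${(tokens / seconds).toFixed(1)} tok/s)`);
```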
9. Conclusion#
TEA WASM LLM provides a simple, batteries-included solution for browser-based LLM inference. The bundled architecture eliminates dependency management while IndexedDB caching enables efficient repeated use. The combination of privacy-preserving local inference and offline capability opens new possibilities for edge AI applications.
9.1 Future Work#
- Smaller quantized models for faster initial load
- WebGPU acceleration when broadly available
- Persistent conversation memory
- Model fine-tuning in browser