Browser-Based LLM Inference with TEA WASM

The Edge Agent Team

Open Source Project

fabceolin/the_edge_agent


Abstract

This article presents TEA WASM LLM, a batteries-included WebAssembly package for running large language models directly in the browser. We describe the architecture that bundles the wllama engine internally, eliminating external npm dependencies. The package provides a simple API (initLlm, chat, embed) and leverages IndexedDB for model caching, enabling offline-capable LLM inference with a single import. A live demo shows practical browser-based AI applications running entirely without server-side components.

Keywords: WebAssembly, LLM, Browser, wllama, Offline AI, TEA, Edge Computing


1. Introduction

The deployment of Large Language Models (LLMs) traditionally requires significant server-side infrastructure, including GPUs, API endpoints, and network connectivity. This creates barriers for privacy-sensitive applications, offline use cases, and edge computing scenarios where network latency is unacceptable.

WebAssembly (WASM) offers an alternative: compile the model runtime to WASM and execute inference directly in the user’s browser. This approach provides:

  • Privacy: Data never leaves the user’s device

  • Offline capability: No network required after initial model download

  • Zero infrastructure: No servers to maintain

  • Instant deployment: Ship via static hosting (e.g., GitHub Pages)

TEA WASM LLM builds on this foundation with a “batteries-included” philosophy: bundle the wllama engine (llama.cpp for WASM) internally, eliminating the need for users to install or configure external dependencies.

2. Architecture

2.1 Batteries-Included Design

The traditional approach to browser-based LLMs requires multiple moving parts:

flowchart LR
    subgraph Browser["Browser/Host Page"]
        direction LR
        APP["Your App"] -->|"callback"| BRIDGE["JS Bridge"]
        BRIDGE -->|"API calls"| WLLAMA["wllama\n(npm pkg)\nEXTERNAL"]
    end

    NPM["npm install\n@wllama/wllama"] -.->|"User must install"| WLLAMA

TEA WASM LLM simplifies this to a single import:

flowchart TB
    subgraph Package["tea-wasm-llm (~5-10MB)"]
        WASM["TEA WASM Core"]
        WLLAMA["wllama.wasm (embedded)"]
        LOADER["Model Loader"]
        CACHE["IndexedDB Cache"]
    end

    subgraph Browser["Browser"]
        APP["Your App"]
        IDB[(IndexedDB)]
    end

    MODEL["Gemma-3-1b.gguf (~1.3GB)"]

    APP -->|"import"| Package
    WASM --> WLLAMA
    LOADER --> CACHE
    CACHE --> IDB
    MODEL -.->|"lazy load"| LOADER

2.2 Component Overview

| Component       | Purpose                          | Size   |
|-----------------|----------------------------------|--------|
| TEA WASM Core   | YAML execution, state management | ~500KB |
| wllama.wasm     | llama.cpp WASM runtime           | ~3-4MB |
| Model Loader    | HTTP fetch with progress         | ~10KB  |
| IndexedDB Cache | Persistent model storage         | ~10KB  |
| Total Package   | All dependencies bundled         | ~5-8MB |

2.3 Multi-Threading Support

Modern browsers support multi-threaded WASM via SharedArrayBuffer, but this requires specific HTTP headers:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

For hosting platforms like GitHub Pages that don’t support custom headers, TEA WASM LLM includes a service worker (coi-serviceworker.js) that intercepts requests and adds these headers.
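
A minimal sketch of enabling this, assuming the bundled worker follows the upstream coi-serviceworker convention of registering itself when included as a classic script and reloading the page once on first visit:

<script src="coi-serviceworker.js"></script>
<!-- After the one-time reload, responses carry the COOP/COEP headers above. -->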

The package automatically detects threading capability:

import { hasCoopCoep } from 'tea-wasm-llm';

if (hasCoopCoep()) {
  console.log('Multi-threaded WASM available');
} else {
  console.log('Falling back to single-threaded mode');
}

3. Quick Start

3.1 Installation

<script type="module">
import { initLlm, chat, chatStream, embed } from './pkg/index.js';

// Initialize with model URL
await initLlm({
  modelUrl: 'https://huggingface.co/unsloth/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q8_0.gguf',
  onProgress: (loaded, total) => {
    console.log(`Loading: ${Math.round(loaded / total * 100)}%`);
  },
});

// Simple chat
const response = await chat("What is the capital of France?");
console.log(response.content);
</script>

3.2 Streaming Responses

For real-time token-by-token output:

await chatStream("Tell me a story about a brave knight", (token) => {
  document.getElementById('output').textContent += token;
}, {
  maxTokens: 200,
  temperature: 0.8,
});

3.3 Embeddings

Generate vector embeddings for semantic search:

const embedding = await embed("Hello world");
console.log(embedding.vector);  // Float32Array
console.log(embedding.dimension);  // e.g., 384
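
To illustrate the semantic-search use case, the sketch below compares a document against a query by cosine similarity; cosineSimilarity is a local helper, not part of the package API:

// Cosine similarity between two Float32Array vectors (local helper, not package API).
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const query = await embed("a friendly greeting");
const doc = await embed("Hello world");
console.log(cosineSimilarity(query.vector, doc.vector));  // closer to 1 = more similar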

4. API Reference

4.1 Core Functions

| Function                              | Description                     | Returns                |
|---------------------------------------|---------------------------------|------------------------|
| initLlm(config)                       | Initialize engine and load model | Promise<void>          |
| chat(prompt, options?)                | Generate completion             | Promise<ChatResponse>  |
| chatStream(prompt, onToken, options?) | Stream tokens                   | Promise<ChatResponse>  |
| embed(text)                           | Generate embeddings             | Promise<EmbedResponse> |
| isLlmReady()                          | Check if initialized            | boolean                |
| disposeLlm()                          | Free resources                  | Promise<void>          |
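
A sketch of the intended lifecycle, combining the functions above (the guard-then-dispose pattern is a usage suggestion, and it assumes isLlmReady() is safe to call before initialization):

import { initLlm, chat, isLlmReady, disposeLlm } from './pkg/index.js';

if (!isLlmReady()) {
  await initLlm({
    modelUrl: 'https://huggingface.co/unsloth/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q8_0.gguf',
  });
}

const answer = await chat("Summarize WebAssembly in one sentence.");
console.log(answer.content);

// Release WASM memory once the model is no longer needed.
await disposeLlm();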

4.2 Configuration Options

interface InitLlmConfig {
  modelUrl?: string;           // URL to GGUF model file
  modelBasePath?: string;      // Base path for manifest-based loading
  onProgress?: (loaded: number, total: number) => void;
  onReady?: () => void;
  useCache?: boolean;          // Default: true
  singleThread?: boolean;      // Force single-thread mode
  nCtx?: number;               // Context size (default: 2048)
  verbose?: boolean;           // Enable logging
}
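
For example, a configuration that forces single-threaded mode with a larger context window (values are illustrative):

await initLlm({
  modelUrl: 'https://huggingface.co/unsloth/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q8_0.gguf',
  nCtx: 4096,          // larger context window than the 2048 default
  singleThread: true,  // skip multi-threading even when COOP/COEP are present
  useCache: false,     // always re-download instead of reading IndexedDB
  verbose: true,
  onReady: () => console.log('Model ready'),
});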

4.3 Chat Options

interface ChatOptions {
  maxTokens?: number;    // Default: 100
  temperature?: number;  // Default: 0.7
  topP?: number;         // Default: 0.9
  topK?: number;         // Default: 0 (disabled)
  stop?: string[];       // Stop sequences
  system?: string;       // System prompt
}
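
For example (values are illustrative):

const response = await chat("List three uses of WebAssembly", {
  system: "You are a concise technical assistant.",  // system prompt
  maxTokens: 150,
  temperature: 0.3,   // lower = more deterministic output
  stop: ["\n\n"],     // end generation at the first blank line
});
console.log(response.content);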

5. Demo Application

A live demo is available at: TEA WASM LLM Demo

5.1 Features

  • Chat Interface: Interactive conversation with Gemma 3 1B

  • Model Loading Progress: Real-time download progress indicator

  • Cache Status: Shows cached model size and status

  • YAML Workflow: Execute TEA YAML workflows with LLM actions

  • Multi-threading Indicator: Shows threading mode

5.2 Demo Architecture

docs/wasm-demo/
├── index.html          # Main demo page with chat UI
├── app.js              # Demo application logic
├── style.css           # Responsive styling
├── coi-serviceworker.js # COOP/COEP headers for GitHub Pages
└── README.md           # Local development instructions

6. Browser Compatibility

| Browser     | Single-thread | Multi-thread | Notes                 |
|-------------|---------------|--------------|-----------------------|
| Chrome 90+  | Yes           | Yes          | Full support          |
| Firefox 90+ | Yes           | Yes          | Full support          |
| Safari 15+  | Yes           | No           | No SharedArrayBuffer  |
| Edge 90+    | Yes           | Yes          | Full support          |
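
Beyond the package's hasCoopCoep() helper, a page can inspect the standard globals directly; a minimal sketch:

// crossOriginIsolated is true only when COOP/COEP are in effect,
// which is what gates SharedArrayBuffer (and thus multi-threading).
const isolated = typeof crossOriginIsolated !== 'undefined' && crossOriginIsolated;
const hasSab = typeof SharedArrayBuffer !== 'undefined';
console.log(`cross-origin isolated: ${isolated}, SharedArrayBuffer: ${hasSab}`);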

6.1 Memory Requirements

| Mode      | Minimum RAM | Recommended RAM |
|-----------|-------------|-----------------|
| Loading   | 2 GB        | 4 GB            |
| Inference | 3 GB        | 6 GB            |

7. Comparison with Alternatives

7.1 vs. Traditional Callback Bridge

| Aspect         | Callback Bridge    | Batteries Included  |
|----------------|--------------------|---------------------|
| Dependencies   | @wllama/wllama npm | None                |
| Setup code     | ~20 lines          | ~3 lines            |
| Failure points | Multiple           | Single              |
| Bundle size    | Smaller (but split) | Larger (but unified) |
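
The “~3 lines” above refers to the minimal batteries-included path from Section 3.1:

import { initLlm, chat } from './pkg/index.js';
await initLlm({ modelUrl: 'https://huggingface.co/unsloth/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q8_0.gguf' });
console.log((await chat("Hello!")).content);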

7.2 vs. Server-Based APIs

| Aspect  | Server API        | Browser WASM        |
|---------|-------------------|---------------------|
| Privacy | Data leaves device | Data stays local    |
| Latency | Network dependent | Zero network        |
| Cost    | API charges       | Free after download |
| Offline | No                | Yes                 |
| Setup   | API keys required | Just import         |

8. Performance Considerations

8.1 Initial Load

  • First visit: Download the ~1.3GB model (cached afterward; see the storage check after this list)

  • Subsequent visits: Load from IndexedDB (~5-10 seconds)

  • WASM compilation: ~2-3 seconds
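
Because the first visit pulls the full model, a page may want to verify available quota before starting the download; a sketch using the standard Storage API (the 2 GB threshold is illustrative):

// Check storage quota before triggering the ~1.3GB model download.
if (navigator.storage && navigator.storage.estimate) {
  const { usage, quota } = await navigator.storage.estimate();
  const freeGb = (quota - usage) / 1e9;
  if (freeGb < 2) {
    console.warn(`Only ~${freeGb.toFixed(1)}GB of quota free; the model may not fit in cache.`);
  }
}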

8.2 Inference Speed

Approximate token generation speeds on consumer hardware:

| CPU                | Single-thread | Multi-thread |
|--------------------|---------------|--------------|
| Intel i7 (8th gen) | ~5 tok/s      | ~15 tok/s    |
| Apple M1           | ~8 tok/s      | ~25 tok/s    |
| AMD Ryzen 5        | ~6 tok/s      | ~18 tok/s    |

Speeds vary based on model size, context length, and browser.

9. Conclusion

TEA WASM LLM provides a simple, batteries-included solution for browser-based LLM inference. The bundled architecture eliminates dependency management, while IndexedDB caching enables efficient repeated use. The combination of privacy-preserving local inference and offline capability opens new possibilities for edge AI applications.

9.1 Future Work

  • Smaller quantized models for faster initial load

  • WebGPU acceleration when broadly available

  • Persistent conversation memory

  • Model fine-tuning in browser
