---
sidebar_label: Transformers.js
description: Run local LLM inference with Transformers.js for embeddings and text generation without external APIs
---

# Transformers.js

The Transformers.js provider enables fully local inference using [Transformers.js v4](https://huggingface.co/docs/transformers.js), running ONNX-optimized models directly in Node.js without external APIs or GPU setup. v4 features a new WebGPU backend, broader model support (8B+ parameter models), and improved performance.

## Installation

Transformers.js is an optional dependency (~200MB for the ONNX runtime):

```bash
npm install @huggingface/transformers
```

## Quick Start

### Embeddings

```yaml
providers:
  - transformers:feature-extraction:Xenova/all-MiniLM-L6-v2
```

Popular models: `Xenova/all-MiniLM-L6-v2` (384d), `onnx-community/all-MiniLM-L6-v2-ONNX` (384d), `Xenova/bge-small-en-v1.5` (384d), `nomic-ai/nomic-embed-text-v1.5` (768d)

### Text Generation

```yaml
providers:
  - transformers:text-generation:Xenova/gpt2
```

Popular models: `Xenova/gpt2`, `onnx-community/Qwen3-0.6B-ONNX`, `onnx-community/Llama-3.2-1B-Instruct-ONNX`

:::note
Text generation runs on CPU and is best for testing. For production, consider [Ollama](/docs/providers/ollama) or cloud APIs.
:::

## Configuration

### Common Options

These options apply to both embedding and text generation providers:

| Option           | Description                                                                                | Default        |
| ---------------- | ------------------------------------------------------------------------------------------ | -------------- |
| `device`         | `'auto'`, `'cpu'`, `'gpu'`, `'wasm'`, `'webgpu'`, `'cuda'`, `'dml'`, `'coreml'`, `'webnn'` | `'auto'`       |
| `dtype`          | Quantization: `'fp32'`, `'fp16'`, `'q8'`, `'q4'`, `'q4f16'`                                | `'auto'`       |
| `cacheDir`       | Override model cache directory                                                             | System default |
| `localFilesOnly` | Skip downloads, use cached models only                                                     | `false`        |
| `revision`       | Model version/branch                                                                       | `'main'`       |

### Embedding Options

```yaml
providers:
  - id: transformers:feature-extraction:Xenova/bge-small-en-v1.5
    config:
      prefix: 'query: ' # Required for BGE, E5 models
      pooling: mean # 'mean', 'cls', 'first_token', 'eos', 'last_token', 'none'
      normalize: true # L2 normalize embeddings
      dtype: q8
```

**Model prefixes:** BGE and E5 models require `prefix: 'query: '` for queries or `prefix: 'passage: '` for documents. MiniLM models need no prefix.

:::tip
`transformers:embeddings:` is an alias for `transformers:feature-extraction:`.
:::
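Under the hood, these config options map onto the `@huggingface/transformers` pipeline API. Here is a minimal sketch of the equivalent direct library call, assuming the standard `pipeline()` interface (the model name and input text are illustrative, and this is not promptfoo's exact internal code):

```js
import { pipeline } from '@huggingface/transformers';

// Load the embedding pipeline; `dtype` here mirrors the provider's `dtype` option.
const extractor = await pipeline('feature-extraction', 'Xenova/bge-small-en-v1.5', {
  dtype: 'q8',
});

// BGE/E5 models expect a task prefix; this is what the `prefix` option prepends.
const text = 'query: What is photosynthesis?';

// `pooling` and `normalize` correspond to the provider options of the same names.
const output = await extractor(text, { pooling: 'mean', normalize: true });

console.log(output.dims); // [1, 384] for bge-small-en-v1.5
const embedding = Array.from(output.data); // plain number[] vector
```

With `normalize: true`, the dot product of two such vectors equals their cosine similarity, which is what `similar` assertions score.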
### Text Generation Options

```yaml
providers:
  - id: transformers:text-generation:onnx-community/Qwen3-0.6B-ONNX
    config:
      maxNewTokens: 256
      temperature: 0.7
      topK: 50
      topP: 0.9
      doSample: true
      repetitionPenalty: 1.1
      noRepeatNgramSize: 3
      numBeams: 1
      returnFullText: false
      dtype: q4
```

## Using for Similarity Assertions

Use local embeddings as a grading provider for `similar` assertions:

```yaml
defaultTest:
  options:
    provider:
      embedding:
        id: transformers:feature-extraction:Xenova/all-MiniLM-L6-v2

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      question: 'What is photosynthesis?'
    assert:
      - type: similar
        value: 'Photosynthesis converts light to chemical energy in plants'
        threshold: 0.8
```

Or override per-assertion:

```yaml
assert:
  - type: similar
    value: 'Expected output'
    threshold: 0.75
    provider: transformers:feature-extraction:Xenova/all-MiniLM-L6-v2
```

## Performance

- **Caching:** Pipelines are cached after first load. The initial model download may take time, but subsequent runs are fast.
- **Quantization:** Use `dtype: q4` or `dtype: q8` for faster inference and lower memory use. Use `dtype: q4f16` for WebGPU-optimized quantization.
- **WebGPU:** v4 includes a new WebGPU runtime written in C++ with significantly improved performance. Use `device: webgpu` on supported systems.
- **Concurrency:** For limited RAM, use `promptfoo eval -j 1` to run serially.

## Troubleshooting

| Problem                  | Solution                                                                                                                                                    |
| ------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Dependency not installed | Run `npm install @huggingface/transformers`                                                                                                                 |
| Model not found          | Verify the model exists at [HuggingFace](https://huggingface.co/models?library=transformers.js) with ONNX weights. Try `Xenova` or `onnx-community` models. |
| Out of memory            | Use `dtype: q4`, run with `-j 1`, or try smaller models                                                                                                     |
| Slow first run           | Models download on first use. Pre-download with `await pipeline('feature-extraction', 'model-name')` (see the sketch at the end of this page)               |

## Supported Models

Browse compatible models at [huggingface.co/models?library=transformers.js](https://huggingface.co/models?library=transformers.js). Key organizations: **onnx-community** (optimized ONNX exports, recommended for v4), **Xenova** (legacy ONNX models, still compatible).
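The "slow first run" fix from the Troubleshooting table can be scripted: instantiating a pipeline once downloads the weights into the local cache, so later `promptfoo eval` runs start fast. A minimal warm-up sketch (the file name and model are illustrative):

```js
// warm-cache.mjs: run once, e.g. in CI, before `promptfoo eval`
import { pipeline } from '@huggingface/transformers';

// Creating the pipeline downloads and caches the model files.
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Optional smoke test to confirm the cached model actually runs.
const out = await extractor('hello world', { pooling: 'mean', normalize: true });
console.log('cached OK, dims =', out.dims);
```

Run it with `node warm-cache.mjs`. If your provider config overrides `cacheDir`, point the script at the same path (Transformers.js exposes this as `env.cacheDir`) so the two caches match.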