# tokenizers.js
A lightweight tokenizer for the Web

Run today's most used tokenizers directly in your browser or Node.js application. No heavy dependencies, no server required. Just fast, client-side tokenization compatible with thousands of models on the Hugging Face Hub. These tokenizers are also used in [🤗 Transformers.js](https://github.com/huggingface/transformers.js).

## Features

- Lightweight (~8.3 kB gzipped)
- Zero dependencies
- Works in browsers and Node.js

## Installation

```bash
npm install @huggingface/tokenizers
```

Alternatively, you can use it via a CDN.

## Usage

```javascript
import { Tokenizer } from "@huggingface/tokenizers";

// Load files from the Hugging Face Hub
const modelId = "HuggingFaceTB/SmolLM3-3B";
const tokenizerJson = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer.json`).then((res) => res.json());
const tokenizerConfig = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`).then((res) => res.json());

// Create tokenizer
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);

// Tokenize text
const tokens = tokenizer.tokenize("Hello World"); // ['Hello', 'ĠWorld']
const encoded = tokenizer.encode("Hello World"); // { ids: [9906, 4435], tokens: ['Hello', 'ĠWorld'], attention_mask: [1, 1] }
const decoded = tokenizer.decode(encoded.ids); // 'Hello World'
```

## Requirements

This library expects two files from Hugging Face models:

- `tokenizer.json` - Contains the tokenizer configuration
- `tokenizer_config.json` - Contains additional metadata

## Components

Tokenizers.js supports the following [Hugging Face tokenizer components](https://huggingface.co/docs/tokenizers/components):

### Normalizers

- NFD
- NFKC
- NFC
- NFKD
- Lowercase
- Strip
- StripAccents
- Replace
- BERT Normalizer
- Precompiled
- Sequence

### Pre-tokenizers

- BERT
- ByteLevel
- Whitespace
- WhitespaceSplit
- Metaspace
- CharDelimiterSplit
- Split
- Punctuation
- Digits

### Models

- BPE (Byte-Pair Encoding)
- WordPiece
- Unigram
- Legacy

### Post-processors

- ByteLevel
- TemplateProcessing
- RobertaProcessing
- BertProcessing
- Sequence

### Decoders

- ByteLevel
- WordPiece
- Metaspace
- BPE
- CTC
- Replace
- Fuse
- Strip
- ByteFallback
- Sequence
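
## Browser example

For browser use without a build step, the sketch below combines the CDN installation route with the usage shown above. It assumes jsDelivr's standard npm path for the `@huggingface/tokenizers` package; any CDN that serves npm packages as ES modules should work the same way.

```html
<script type="module">
  // Assumption: jsDelivr's standard npm CDN path for this package.
  import { Tokenizer } from "https://cdn.jsdelivr.net/npm/@huggingface/tokenizers";

  // Fetch both required files from the Hugging Face Hub in parallel
  const modelId = "HuggingFaceTB/SmolLM3-3B";
  const [tokenizerJson, tokenizerConfig] = await Promise.all([
    fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer.json`).then((res) => res.json()),
    fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`).then((res) => res.json()),
  ]);

  // Create the tokenizer and encode some text
  const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);
  const encoded = tokenizer.encode("Hello World");
  console.log(encoded.tokens, encoded.ids); // same output shape as the Node.js example above
</script>
```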