# validateutf8-experiments

[![CI Tests](https://github.com/lemire/validateutf8-experiments/actions/workflows/ci.yml/badge.svg)](https://github.com/lemire/validateutf8-experiments/actions/workflows/ci.yml)

Reproducible experiments on UTF-8 validation using SIMD instructions.

This project contains benchmarks regarding fast UTF-8 validation. It is for research purposes only: not for production use. If you are not doing research, this repository is not for you!

The algorithm we designed is called lookup. We experimented with several variants (lookup2, lookup3, lookup4) that have similar performance. The lookup approach is one of the fastest ways to validate UTF-8 strings. Please see the file `src/generic/utf8_lookup4_algorithm.h` for details.

The algorithm of this repository has been included in production-ready libraries:

- [simdutf](https://github.com/simdutf/simdutf) is a C++ library that is part of important systems such as Bun, Node.js, WebKit/Safari, etc.
- [SimdUnicode](https://github.com/simdutf/SimdUnicode) is a C# port of the validation algorithm, adapted for the .NET runtime.

## Code organization

The repository is organized around three main concerns: SIMD validation algorithms, benchmarking, and testing.

- `src/`: core UTF-8 validation code.
  - `src/generic/`: generic algorithm definitions (`utf8_lookup2_algorithm.h`, `utf8_lookup3_algorithm.h`, `utf8_lookup4_algorithm.h`, etc.).
  - `src/avx2/`, `src/sse/`, `src/neon/`: architecture-specific SIMD wrappers and implementations.
- `benchmarks/`: benchmark drivers and supporting utilities.
  - `benchmarks/benchmark.cpp`: runs algorithm comparisons on synthetic and real data.
  - `benchmarks/benchstream.cpp`: streaming-oriented benchmark.
  - `benchmarks/random_utf8.*`: randomized UTF-8 data generation.
- `tests/`: unit tests (`tests/unit.cpp`) validating correctness across implementations.
- `examples/`: sample real-world UTF-8 inputs used by benchmarks.
- `dependencies/`: third-party code used by benchmarks.
- `.github/workflows/`: CI configuration for build and test automation.

Typical workflow:

1. Build with CMake.
2. Run `./build/unit` to validate correctness.
3. Run `./build/benchmark` and `./build/benchstream` to compare performance.

## Hardware requirements

The code supports multiple processor architectures, including x64 with AVX2 and SSE, as well as ARM64 with NEON.

## Reproducible experiments

To ensure that experiments are reproducible, we rely on a Docker image. We recommend that you install Docker under Linux.

## Testing

Run the unit tests locally with:

```bash
cmake -B build
cmake --build build
ctest --test-dir build
```

To build and run the benchmarks, starting in a bash shell, do:

```
git clone https://github.com/lemire/validateutf8-experiments.git
cd validateutf8-experiments
cmake -B build
cmake --build build
./build/benchmark
./build/benchstream 1000
```

In some cases, you may need to run the benchmarks in privileged mode (sudo) to get performance counters.

## Example

```
./build/benchmark
Running UTF8 validation benchmark. The speed is normalized by the number of input bytes.
== file: examples/hongkong.html (1807 KB) ==
name               | ins/byte | br.miss/KB | GHz    | GB/s    | margin% | ins/cyc
-------------------+----------+------------+--------+---------+---------+--------
memcpy             |    0.000 |      0.001 |  3.497 |  17.862 |    1.69 |   0.001
fushia             |    6.641 |      8.187 |  3.493 |   2.612 |   26.77 |   4.966
fushia_ascii       |    2.738 |      7.617 |  3.493 |   4.236 |    4.04 |   3.320
fushia_ascii2      |    2.594 |      7.440 |  3.493 |   5.327 |    4.96 |   3.955
fushia_ascii4      |    2.707 |      8.602 |  3.493 |   4.363 |    5.53 |   3.381
utf8lib            |    6.724 |      8.057 |  3.493 |   2.570 |    2.80 |   4.947
dfa                |    9.000 |      0.003 |  3.489 |   0.581 |    1.15 |   1.500
fdfa               |    9.000 |      0.003 |  3.490 |   0.496 |    1.09 |   1.278
bdfa               |    8.000 |      0.002 |  3.489 |   0.581 |    1.18 |   1.333
dfa2               |    7.500 |      0.002 |  3.490 |   1.163 |    2.63 |   2.499
dfa3               |    7.000 |      0.003 |  3.489 |   1.722 |    1.44 |   3.456
dfa4               |    9.000 |      0.004 |  3.489 |   0.581 |    1.17 |   1.499
zwegneravx         |    0.440 |      2.777 |  3.496 |  19.748 |    3.40 |   2.487
lookup2            |    0.447 |      1.743 |  3.495 |  18.458 |    5.26 |   2.363
lookup3            |    0.417 |      1.711 |  3.495 |  19.325 |    2.79 |   2.305
lookup4            |    0.413 |      1.728 |  3.495 |  20.934 |    3.99 |   2.476
basic              |    0.609 |      1.645 |  3.495 |  15.483 |    2.19 |   2.698
range              |    0.567 |      1.600 |  3.494 |  16.024 |    3.42 |   2.599

== file: examples/twitter.json (631 KB) ==
name               | ins/byte | br.miss/KB | GHz    | GB/s    | margin% | ins/cyc
-------------------+----------+------------+--------+---------+---------+--------
memcpy             |    0.000 |      0.005 |  3.509 |  44.150 |    3.35 |   0.005
fushia             |    7.158 |      3.431 |  3.493 |   2.811 |   21.55 |   5.760
fushia_ascii       |    3.641 |      2.114 |  3.470 |   4.544 |    3.74 |   4.767
fushia_ascii2      |    3.469 |      1.639 |  3.495 |   6.142 |    3.20 |   6.096
fushia_ascii4      |    3.487 |      1.823 |  3.494 |   5.484 |    4.52 |   5.472
utf8lib            |    7.310 |      3.894 |  3.493 |   2.577 |    0.54 |   5.392
dfa                |    9.000 |      0.006 |  3.490 |   0.582 |    0.17 |   1.500
fdfa               |    9.000 |      0.006 |  3.490 |   0.496 |    0.93 |   1.278
bdfa               |    8.000 |      0.006 |  3.490 |   0.582 |    1.12 |   1.333
dfa2               |    7.500 |      0.006 |  3.493 |   1.164 |    0.99 |   2.499
dfa3               |    7.000 |      0.008 |  3.493 |   1.730 |    1.27 |   3.467
dfa4               |    9.000 |      0.010 |  3.490 |   0.582 |    1.15 |   1.500
zwegneravx         |    0.465 |      0.048 |  3.510 |  41.029 |    1.96 |   5.434
lookup2            |    0.432 |      0.057 |  3.506 |  30.675 |    2.18 |   3.776
lookup3            |    0.403 |      0.048 |  3.507 |  33.500 |    1.54 |   3.846
lookup4            |    0.400 |      0.052 |  3.508 |  37.000 |    2.05 |   4.220
basic              |    0.586 |      0.062 |  3.500 |  22.538 |    2.22 |   3.772
range              |    0.546 |      0.081 |  3.301 |  22.668 |    2.03 |   3.750
```

## Reference

- John Keiser, Daniel Lemire, [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice and Experience 51 (5), 2021

## Citation

If you use this repository, please cite it as follows:

```bibtex
@misc{lemire2021validateutf8experiments,
  author = {Lemire, Daniel},
  title = {validateutf8-experiments: Fast {UTF-8} Validation Benchmarks},
  year = {2021},
  url = {https://github.com/lemire/validateutf8-experiments},
}
```

## Credit

A lot of the hard work is due to Keiser. Some of the code is based on code by Muła. The first SIMD UTF-8 validator was based on work by Willets. Some of our improvements were motivated by work by Zwegner, who produced some of the finest SIMD-based UTF-8 validators.
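
## Scalar reference sketch

Every validator benchmarked in this repository answers the same question: is a byte sequence well-formed UTF-8? For readers who want a concrete picture of that check, here is a minimal scalar sketch in C++. It is not the lookup algorithm and it is not code taken from this repository; the function name `validate_utf8_scalar` is hypothetical and chosen only for illustration. It rejects stray continuation bytes, truncated sequences, overlong encodings, surrogates (U+D800 to U+DFFF), and code points above U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of this repository: a byte-at-a-time
// UTF-8 validator meant only as a readable reference point.
bool validate_utf8_scalar(const std::string &input) {
  const std::size_t len = input.size();
  std::size_t pos = 0;
  while (pos < len) {
    const unsigned char lead = static_cast<unsigned char>(input[pos]);
    if (lead < 0x80) { // ASCII: a single byte
      pos++;
      continue;
    }
    std::size_t need = 0;       // number of continuation bytes expected
    std::uint32_t cp = 0;       // decoded code point
    std::uint32_t smallest = 0; // smallest code point allowed at this length
    if ((lead & 0xE0) == 0xC0) {
      need = 1; cp = lead & 0x1F; smallest = 0x80;
    } else if ((lead & 0xF0) == 0xE0) {
      need = 2; cp = lead & 0x0F; smallest = 0x800;
    } else if ((lead & 0xF8) == 0xF0) {
      need = 3; cp = lead & 0x07; smallest = 0x10000;
    } else {
      return false; // stray continuation byte or invalid leading byte
    }
    if (len - pos <= need) {
      return false; // sequence is truncated at the end of the input
    }
    for (std::size_t i = 1; i <= need; i++) {
      const unsigned char c = static_cast<unsigned char>(input[pos + i]);
      if ((c & 0xC0) != 0x80) {
        return false; // expected a continuation byte (10xxxxxx)
      }
      cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < smallest) return false;                 // overlong encoding
    if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate code point
    pos += need + 1;
  }
  return true;
}
```

For instance, `validate_utf8_scalar("\xC3\xA9")` (the two-byte encoding of U+00E9) returns true, while `validate_utf8_scalar("\xC0\x80")` (an overlong encoding of U+0000) returns false.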