# jieba-rust-api

**Repository Path**: wen-open/jieba-rust-api

## Basic Information

- **Project Name**: jieba-rust-api
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-18
- **Last Updated**: 2026-04-18

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# jieba-rust-api

中文说明见下方;英文对照见文末 **[English](#english)** 一节。

基于 jieba 的中文分词 REST API 服务。

A high-performance Chinese word-segmentation HTTP API built on jieba-rs.

---

## 简体中文

## 项目简介

jieba-rust-api 是一个使用 Rust 语言实现的高性能中文分词 REST API,基于业界广泛使用的中文分词库 jieba 实现。该项目提供 HTTP API 接口,支持中文文本分词、关键词提取等功能,适用于中文自然语言处理场景。

## 功能特性

- **中文分词**:支持 `mix`(默认,与常见结巴默认行为一致,可配 HMM)、`precise`(`cut`,默认关闭 HMM)、`full`(全模式 `cut_all`)、`search`(搜索引擎模式 `cut_for_search`);`mix` / `precise` / `search` 可通过可选字段 `hmm` 控制是否启用 HMM
- **用户词典**:支持加载自定义用户词典,提升分词准确度
- **热重载**:支持在线重载用户词典,无需重启服务
- **健康检查**:提供服务健康状态检查接口
- **高性能**:基于 Rust + Axum 构建高性能异步服务
- **Docker 支持**:提供 Docker 镜像,方便部署

## 技术栈

- Rust 1.85
- Axum (Web 框架)
- Tokio (异步运行时)
- jieba-rs (中文分词库)
- Tower (中间件)
- Serde (序列化/反序列化)

## 快速开始

### 使用 Docker 运行

```bash
# 使用 docker-compose
docker-compose up -d

# 或者手动运行
docker run -d -p 8080:8080 \
  -e LISTEN_ADDR=0.0.0.0:8080 \
  -v /path/to/user_dict.txt:/data/user_dict/user_dict.txt \
  jieba-rust-api
```

### 本地开发运行

```bash
# 编译项目
cargo build --release

# 运行服务
./target/release/jieba-rust-api
```

## API 接口

| 方法 | 路径 | 说明 |
|------|------|------|
| `POST` | `/v1/cut` | 中文分词 |
| `POST` | `/v1/reload-dict` | 从磁盘重新加载用户词典 |
| `GET` | `/health` | 健康检查 |

服务根地址示例:`http://<主机>:8080`(本地默认 `http://127.0.0.1:8080`)。完整 URL 为根地址 + 上表路径,例如 `http://127.0.0.1:8080/v1/cut`。

### curl 测试示例(可复制单行)

下列示例均使用 `http://127.0.0.1:8080`。JSON 使用双引号并转义,便于在 **bash、CMD、PowerShell** 下直接粘贴(PowerShell 若 `curl` 被别名占用,请改用 `curl.exe`)。

**健康检查 | Health**

```bash
curl -s http://127.0.0.1:8080/health
```

**分词 `mix`(默认开 HMM;显式关闭 HMM 见第二条)| Cut `mix` (HMM on by default)**

```bash
curl -s -X POST http://127.0.0.1:8080/v1/cut -H "Content-Type: application/json" -d "{\"text\":\"我来到北京清华大学\",\"mode\":\"mix\"}"
```

```bash
curl -s -X POST http://127.0.0.1:8080/v1/cut -H "Content-Type: application/json" -d "{\"text\":\"我来到北京清华大学\",\"mode\":\"mix\",\"hmm\":false}"
```

**分词 `precise`(默认关 HMM;显式开 HMM 见第二条)| Cut `precise` (HMM off by default)**

```bash
curl -s -X POST http://127.0.0.1:8080/v1/cut -H "Content-Type: application/json" -d "{\"text\":\"我来到北京清华大学\",\"mode\":\"precise\"}"
```

```bash
curl -s -X POST http://127.0.0.1:8080/v1/cut -H "Content-Type: application/json" -d "{\"text\":\"我来到北京清华大学\",\"mode\":\"precise\",\"hmm\":true}"
```

**分词 `full`(全模式,`hmm` 无效)| Cut `full` (all mode; `hmm` ignored)**

```bash
curl -s -X POST http://127.0.0.1:8080/v1/cut -H "Content-Type: application/json" -d "{\"text\":\"我来到北京清华大学\",\"mode\":\"full\"}"
```

**分词 `search`(默认开 HMM;显式关闭 HMM 见第二条)| Cut `search` (HMM on by default)**

```bash
curl -s -X POST http://127.0.0.1:8080/v1/cut -H "Content-Type: application/json" -d "{\"text\":\"我来到北京清华大学\",\"mode\":\"search\"}"
```

```bash
curl -s -X POST http://127.0.0.1:8080/v1/cut -H "Content-Type: application/json" -d "{\"text\":\"我来到北京清华大学\",\"mode\":\"search\",\"hmm\":false}"
```

**重载用户词典 | Reload user dictionary**

```bash
curl -s -X POST http://127.0.0.1:8080/v1/reload-dict
```

### 分词接口

**路径**:`POST /v1/cut`

**请求**

```http
POST /v1/cut
Content-Type: application/json

{
  "text": "待分词的中文文本",
  "mode": "mix",
  "hmm": true
}
```

- `mode`(可选,默认 `mix`):`mix`(默认结巴行为,开 HMM)、`precise`(`cut`,默认关 HMM)、`full`(全模式)、`search`(搜索引擎模式)。
- `hmm`(可选):仅对 `mix` / `precise` / `search` 有效;不传时 `mix`/`search` 默认 `true`,`precise` 默认 `false`。

**响应**

```json
{
  "words": ["分词", "结果"],
  "mode": "mix",
  "hmm": true
}
```

### 重载词典

**路径**:`POST /v1/reload-dict`

**请求**

```http
POST /v1/reload-dict
```

**响应**

```json
{
  "ok": true,
  "loaded_words": 100
}
```

### 健康检查

**路径**:`GET /health`

**请求**

```http
GET /health
```

**响应**

```json
{
  "status": "ok"
}
```

## 环境变量

| 变量名 | 说明 | 默认值 |
|--------|------|--------|
| `LISTEN_ADDR` | 服务监听地址 | `0.0.0.0:8080` |
| `JIBA_USER_DICT_PATH` | 用户词典路径 | - |
| `JIBA_LOG_DIR` | 日志目录 | `/var/log/jieba-api` |
| `RUST_LOG` | 日志级别 | `info` |

## 配置说明

### 用户词典格式

用户词典文件每行一个词条,格式如下:

```
词语1 词频1 词性1
词语2 词频2 词性2
```

例如:

```
云计算 5 n
深度学习 3 n
```

## 许可证

本项目基于 MIT 许可证开源。

---

## English

## Project overview

jieba-rust-api is a high-performance Chinese word-segmentation REST API implemented in Rust, built on the widely used jieba segmentation library. It exposes HTTP endpoints for Chinese text segmentation, keyword extraction, and related scenarios in Chinese NLP.

## Features

- **Chinese segmentation**: supports `mix` (default, matches common jieba default behavior, HMM configurable), `precise` (`cut`, HMM off by default), `full` (full mode `cut_all`), `search` (search mode `cut_for_search`); optional `hmm` for `mix` / `precise` / `search`
- **User dictionary**: load a custom user dictionary to improve accuracy
- **Hot reload**: reload the user dictionary online without restarting the service
- **Health check**: health endpoint for service status
- **Performance**: high-throughput async service built with Rust and Axum
- **Docker**: Docker image provided for easy deployment

## Tech stack

- Rust 1.85
- Axum (web framework)
- Tokio (async runtime)
- jieba-rs (Chinese segmentation)
- Tower (middleware)
- Serde (serialization)

## Quick start

### Run with Docker

```bash
# docker-compose
docker-compose up -d

# or manual run
docker run -d -p 8080:8080 \
  -e LISTEN_ADDR=0.0.0.0:8080 \
  -v /path/to/user_dict.txt:/data/user_dict/user_dict.txt \
  jieba-rust-api
```

### Local development

```bash
# compile project
cargo build --release

# run service
./target/release/jieba-rust-api
```

## API reference

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/v1/cut` | Chinese word segmentation |
| `POST` | `/v1/reload-dict` | Reload user dictionary from disk |
| `GET` | `/health` | Health check |

Example base URL: `http://<host>:8080` (local default `http://127.0.0.1:8080`). Full URLs are base URL + path from the table above, e.g. `http://127.0.0.1:8080/v1/cut`.

### Curl test examples (single-line, copy-paste)

All examples use `http://127.0.0.1:8080`. JSON uses escaped double quotes so the same lines work in **bash, CMD, and PowerShell** (on PowerShell, prefer `curl.exe` if `curl` is aliased).

**健康检查 | Health**

```bash
curl -s http://127.0.0.1:8080/health
```

**分词 `mix`(默认开 HMM;显式关闭 HMM 见第二条)| Cut `mix` (HMM on by default)**

```bash
curl -s -X POST http://127.0.0.1:8080/v1/cut -H "Content-Type: application/json" -d "{\"text\":\"我来到北京清华大学\",\"mode\":\"mix\"}"
```

```bash
curl -s -X POST http://127.0.0.1:8080/v1/cut -H "Content-Type: application/json" -d "{\"text\":\"我来到北京清华大学\",\"mode\":\"mix\",\"hmm\":false}"
```

**分词 `precise`(默认关 HMM;显式开 HMM 见第二条)| Cut `precise` (HMM off by default)**

```bash
curl -s -X POST http://127.0.0.1:8080/v1/cut -H "Content-Type: application/json" -d "{\"text\":\"我来到北京清华大学\",\"mode\":\"precise\"}"
```

```bash
curl -s -X POST http://127.0.0.1:8080/v1/cut -H "Content-Type: application/json" -d "{\"text\":\"我来到北京清华大学\",\"mode\":\"precise\",\"hmm\":true}"
```

**分词 `full`(全模式,`hmm` 无效)| Cut `full` (all mode; `hmm` ignored)**

```bash
curl -s -X POST http://127.0.0.1:8080/v1/cut -H "Content-Type: application/json" -d "{\"text\":\"我来到北京清华大学\",\"mode\":\"full\"}"
```

**分词 `search`(默认开 HMM;显式关闭 HMM 见第二条)| Cut `search` (HMM on by default)**

```bash
curl -s -X POST http://127.0.0.1:8080/v1/cut -H "Content-Type: application/json" -d "{\"text\":\"我来到北京清华大学\",\"mode\":\"search\"}"
```

```bash
curl -s -X POST http://127.0.0.1:8080/v1/cut -H "Content-Type: application/json" -d "{\"text\":\"我来到北京清华大学\",\"mode\":\"search\",\"hmm\":false}"
```

**重载用户词典 | Reload user dictionary**

```bash
curl -s -X POST http://127.0.0.1:8080/v1/reload-dict
```

### Segmentation endpoint

**Path**: `POST /v1/cut`

**Request**

```http
POST /v1/cut
Content-Type: application/json

{
  "text": "待分词的中文文本",
  "mode": "mix",
  "hmm": true
}
```

- `mode` (optional, default `mix`): `mix` (default jieba behavior, HMM on), `precise` (`cut`, HMM off by default), `full` (full mode), `search` (search mode).
- `hmm` (optional): only applies to `mix` / `precise` / `search`; when omitted, `mix` and `search` default to `true`, `precise` defaults to `false`.

**Response**

```json
{
  "words": ["分词", "结果"],
  "mode": "mix",
  "hmm": true
}
```

### Reload dictionary

**Path**: `POST /v1/reload-dict`

**Request**

```http
POST /v1/reload-dict
```

**Response**

```json
{
  "ok": true,
  "loaded_words": 100
}
```

### Health check

**Path**: `GET /health`

**Request**

```http
GET /health
```

**Response**

```json
{
  "status": "ok"
}
```

## Environment variables

| Name | Description | Default |
|------|-------------|---------|
| `LISTEN_ADDR` | Listen address | `0.0.0.0:8080` |
| `JIBA_USER_DICT_PATH` | User dictionary path | - |
| `JIBA_LOG_DIR` | Log directory | `/var/log/jieba-api` |
| `RUST_LOG` | Log level | `info` |

## Configuration

### User dictionary format

Each line of the user dictionary file is one entry, in this form:

```
词语1 词频1 词性1
词语2 词频2 词性2
```

For example:

```
云计算 5 n
深度学习 3 n
```

## License

This project is open source under the MIT License.
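
## Appendix: sketch of the documented rules

The mode/HMM defaults and the user-dictionary line format documented above can be sketched in Rust. This is illustrative only: `effective_hmm` and `parse_dict_line` are hypothetical helper names that do not exist in this project's source, and the parser assumes the documented `word freq tag` layout with whitespace separators.

```rust
/// Resolve the effective HMM setting per the API docs above:
/// `mix`/`search` default to true, `precise` defaults to false,
/// and `full` ignores the flag entirely (returns None).
fn effective_hmm(mode: &str, hmm: Option<bool>) -> Option<bool> {
    match mode {
        "mix" | "search" => Some(hmm.unwrap_or(true)),
        "precise" => Some(hmm.unwrap_or(false)),
        _ => None, // "full" (and unknown modes): hmm has no effect
    }
}

/// Parse one user-dictionary line of the form `word freq tag`.
/// Frequency and tag are optional trailing fields; a line consisting
/// only of a word is also accepted.
fn parse_dict_line(line: &str) -> Option<(String, Option<u64>, Option<String>)> {
    let mut parts = line.split_whitespace();
    let word = parts.next()?.to_string();
    let freq = parts.next().and_then(|s| s.parse().ok());
    let tag = parts.next().map(|s| s.to_string());
    Some((word, freq, tag))
}

fn main() {
    // Defaults match the request-field table in the API sections above.
    assert_eq!(effective_hmm("mix", None), Some(true));
    assert_eq!(effective_hmm("precise", None), Some(false));
    assert_eq!(effective_hmm("search", Some(false)), Some(false));
    assert_eq!(effective_hmm("full", Some(true)), None);

    // Dictionary line from the configuration example above.
    let (word, freq, tag) = parse_dict_line("云计算 5 n").unwrap();
    assert_eq!(word, "云计算");
    assert_eq!(freq, Some(5));
    assert_eq!(tag.as_deref(), Some("n"));
}
```

A real implementation would apply `effective_hmm` when dispatching to the segmenter and would feed each parsed dictionary entry to jieba's user-dictionary loading; blank lines and malformed entries would typically be skipped rather than rejected.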