# elasticsearch-jieba-plugin
**Repository Path**: guorumin/elasticsearch-jieba-plugin
## Basic Information
- **Project Name**: elasticsearch-jieba-plugin
- **License**: MIT
- **Default Branch**: master
## Statistics
- **Stars**: 0
- **Forks**: 1
- **Created**: 2020-03-07
- **Last Updated**: 2021-11-03
## README
Jieba analysis plugin for Elasticsearch. Supported versions: ***7.4.2***, ***7.3.0***, ***7.0.0***, ***6.4.0***, ***6.0.0***, ***5.4.0***, ***5.3.0***, ***5.2.2***, ***5.2.1***, ***5.2.0***, ***5.1.2***, ***5.1.1***
### Using jieba_index and jieba_search
[See here](about_jieba_index_jieba_search.md)
### New tokenizer support
- [THULAC tokenizer plugin for ES](https://github.com/microbun/elasticsearch-thulac-plugin), [THULAC official site](http://thulac.thunlp.org/)
### If you are on ES 6.4.0, use the latest code from the 6.4.0 branch or the master branch, or download the 6.4.1 release. Upgrading is strongly recommended!
#### The 6.4.1 release fixes the PositionIncrement problem. For details, see [ES tokenizer PositionIncrement explained](https://github.com/sing1ee/kotlin-road/blob/master/ES-analysis-positionincrement.md)
### Version mapping
| Branch | Tag | Elasticsearch version | Release link |
| --- | --- | --- | --- |
| 7.4.2 | tag v7.4.2 | v7.4.2 | Download: [v7.4.2](https://github.com/sing1ee/elasticsearch-jieba-plugin/releases/tag/v7.4.2) |
| 7.3.0 | tag v7.3.0 | v7.3.0 | Download: [v7.3.0](https://github.com/sing1ee/elasticsearch-jieba-plugin/releases/tag/v7.3.0) |
| 7.0.0 | tag v7.0.0 | v7.0.0 | Download: [v7.0.0](https://github.com/sing1ee/elasticsearch-jieba-plugin/releases/tag/v7.0.0) |
| 6.4.0 | tag v6.4.1 | v6.4.0 | Download: [v6.4.1](https://github.com/sing1ee/elasticsearch-jieba-plugin/releases/tag/v6.4.1) |
| 6.4.0 | tag v6.4.0 | v6.4.0 | Download: [v6.4.0](https://github.com/sing1ee/elasticsearch-jieba-plugin/releases/tag/v6.4.0) |
| 6.0.0 | tag v6.0.0 | v6.0.0 | Download: [v6.0.1](https://github.com/sing1ee/elasticsearch-jieba-plugin/releases/tag/v6.0.1) |
| 5.4.0 | tag v5.4.0 | v5.4.0 | Download: [v5.4.0](https://github.com/sing1ee/elasticsearch-jieba-plugin/releases/tag/v5.4.0) |
| 5.3.0 | tag v5.3.0 | v5.3.0 | Download: [v5.3.0](https://github.com/sing1ee/elasticsearch-jieba-plugin/releases/tag/v5.3.0) |
| 5.2.2 | tag v5.2.2 | v5.2.2 | Download: [v5.2.2](https://github.com/sing1ee/elasticsearch-jieba-plugin/releases/tag/v5.2.2) |
| 5.2.1 | tag v5.2.1 | v5.2.1 | Download: [v5.2.1](https://github.com/sing1ee/elasticsearch-jieba-plugin/releases/tag/v5.2.1) |
| 5.2 | tag v5.2.0 | v5.2.0 | Download: [v5.2.0](https://github.com/sing1ee/elasticsearch-jieba-plugin/releases/tag/v5.2.0) |
| 5.1.2 | tag v5.1.2 | v5.1.2 | Download: [v5.1.2](https://github.com/sing1ee/elasticsearch-jieba-plugin/releases/tag/v5.1.2) |
| 5.1.1 | tag v5.1.1 | v5.1.1 | Download: [v5.1.1](https://github.com/sing1ee/elasticsearch-jieba-plugin/releases/tag/v5.1.1) |
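
To pick a release from a script, the table above can be encoded as a lookup. This is only a sketch: `RELEASE_TAGS` and `release_url` are hypothetical names, and the mapping simply copies the release link column (for 6.4.0 it points at v6.4.1, the recommended release).

```python
# Elasticsearch version -> release tag, copied from the release link column above.
RELEASE_TAGS = {
    "7.4.2": "v7.4.2",
    "7.3.0": "v7.3.0",
    "7.0.0": "v7.0.0",
    "6.4.0": "v6.4.1",  # v6.4.0 also exists; v6.4.1 is the recommended one
    "6.0.0": "v6.0.1",
    "5.4.0": "v5.4.0",
    "5.3.0": "v5.3.0",
    "5.2.2": "v5.2.2",
    "5.2.1": "v5.2.1",
    "5.2.0": "v5.2.0",
    "5.1.2": "v5.1.2",
    "5.1.1": "v5.1.1",
}

BASE_URL = "https://github.com/sing1ee/elasticsearch-jieba-plugin/releases/tag/"


def release_url(es_version):
    """Return the release page URL for an Elasticsearch version listed in the table."""
    tag = RELEASE_TAGS.get(es_version)
    if tag is None:
        raise ValueError("no plugin release listed for Elasticsearch " + es_version)
    return BASE_URL + tag
```

Versions that are not in the table raise `ValueError` instead of guessing a nearby release.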
### More details
- choose the source code branch that matches your Elasticsearch version.
- build the plugin zip:
```shell
gradle pz
```
- copy the zip file to the plugin directory (the file name carries the version you built; 5.1.2 is used here)
```shell
cp build/distributions/elasticsearch-jieba-plugin-5.1.2.zip ${path.home}/plugins
```
- unzip it and remove the zip file
```shell
# run inside the plugin directory the zip was copied to
cd ${path.home}/plugins
unzip elasticsearch-jieba-plugin-5.1.2.zip
rm elasticsearch-jieba-plugin-5.1.2.zip
```
- start elasticsearch
```shell
./bin/elasticsearch
```
### Custom User Dict
Put your dict file, with the suffix ***.dict***, into ${path.home}/plugins/jieba/dic. The file
should look like this, one word and its frequency per line:
```shell
小清新 3
百搭 3
显瘦 3
隨身碟 100
your_word word_freq
```
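
Before dropping a dict file into the plugin directory, it can help to check that every line has the `word freq` shape shown above. A minimal sketch, assuming exactly two whitespace-separated fields per line as in the sample; `parse_user_dict` is a hypothetical helper, and the plugin's own loader may be more lenient.

```python
from typing import Dict


def parse_user_dict(text: str) -> Dict[str, int]:
    """Parse 'word freq' lines into a {word: freq} mapping, skipping blank lines."""
    entries = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        parts = line.split()
        if len(parts) != 2:
            raise ValueError("line %d: expected 'word freq', got %r" % (lineno, line))
        word, freq = parts
        try:
            entries[word] = int(freq)
        except ValueError:
            raise ValueError("line %d: frequency %r is not an integer" % (lineno, freq))
    return entries
```

For example, `parse_user_dict(open("my.dict", encoding="utf-8").read())` fails on the first malformed line, where `my.dict` stands for your own file.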
### Using stopwords
- find stopwords.txt in ${path.home}/plugins/jieba/dic.
- create a folder named ***stopwords*** under ${path.home}/config
```shell
mkdir -p ${path.home}/config/stopwords
```
- copy stopwords.txt into the folder just created
```shell
cp ${path.home}/plugins/jieba/dic/stopwords.txt ${path.home}/config/stopwords
```
- create index:
```shell
PUT http://localhost:9200/jieba_index
```
```json
{
  "settings": {
    "analysis": {
      "filter": {
        "jieba_stop": {
          "type": "stop",
          "stopwords_path": "stopwords/stopwords.txt"
        },
        "jieba_synonym": {
          "type": "synonym",
          "synonyms_path": "synonyms/synonyms.txt"
        }
      },
      "analyzer": {
        "my_ana": {
          "tokenizer": "jieba_index",
          "filter": [
            "lowercase",
            "jieba_stop",
            "jieba_synonym"
          ]
        }
      }
    }
  }
}
```
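
The same index can be created from a script. The sketch below builds the settings body shown above and sends it with the standard library; `build_index_settings` and `create_index` are hypothetical helper names, and the base URL is whatever your cluster listens on (http://localhost:9200 in this README).

```python
import json
import urllib.request


def build_index_settings(tokenizer="jieba_index",
                         stopwords_path="stopwords/stopwords.txt",
                         synonyms_path="synonyms/synonyms.txt"):
    """Return the index settings from this README as a Python dict."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "jieba_stop": {"type": "stop", "stopwords_path": stopwords_path},
                    "jieba_synonym": {"type": "synonym", "synonyms_path": synonyms_path},
                },
                "analyzer": {
                    "my_ana": {
                        "tokenizer": tokenizer,
                        "filter": ["lowercase", "jieba_stop", "jieba_synonym"],
                    }
                },
            }
        }
    }


def create_index(base_url, index, settings):
    """PUT the settings to <base_url>/<index> and return the decoded JSON reply."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/" + index,
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage would be `create_index("http://localhost:9200", "jieba_index", build_index_settings())`. Passing `tokenizer="jieba_search"` gives the other tokenizer this plugin names.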
- test analyzer:
```shell
GET http://localhost:9200/jieba_index/_analyze?analyzer=my_ana&text=中国的伟大时代来临了,欢迎参观北京大学PKU
```
The response is as follows:
```json
{
  "tokens": [
    {
      "token": "中国",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "伟大",
      "start_offset": 3,
      "end_offset": 5,
      "type": "word",
      "position": 2
    },
    {
      "token": "时代",
      "start_offset": 5,
      "end_offset": 7,
      "type": "word",
      "position": 3
    },
    {
      "token": "来临",
      "start_offset": 7,
      "end_offset": 9,
      "type": "word",
      "position": 4
    },
    {
      "token": "欢迎",
      "start_offset": 11,
      "end_offset": 13,
      "type": "word",
      "position": 7
    },
    {
      "token": "参观",
      "start_offset": 13,
      "end_offset": 15,
      "type": "word",
      "position": 8
    },
    {
      "token": "北京",
      "start_offset": 15,
      "end_offset": 17,
      "type": "word",
      "position": 9
    },
    {
      "token": "大学",
      "start_offset": 17,
      "end_offset": 19,
      "type": "word",
      "position": 10
    },
    {
      "token": "北京大",
      "start_offset": 15,
      "end_offset": 18,
      "type": "word",
      "position": 11
    },
    {
      "token": "北京大学",
      "start_offset": 15,
      "end_offset": 19,
      "type": "word",
      "position": 12
    },
    {
      "token": "北大",
      "start_offset": 15,
      "end_offset": 19,
      "type": "SYNONYM",
      "position": 12
    },
    {
      "token": "pku",
      "start_offset": 15,
      "end_offset": 19,
      "type": "SYNONYM",
      "position": 12
    },
    {
      "token": "pku",
      "start_offset": 19,
      "end_offset": 22,
      "type": "word",
      "position": 13
    },
    {
      "token": "北大",
      "start_offset": 19,
      "end_offset": 22,
      "type": "SYNONYM",
      "position": 13
    },
    {
      "token": "北京大学",
      "start_offset": 19,
      "end_offset": 22,
      "type": "SYNONYM",
      "position": 13
    }
  ]
}
```
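
The query-string form of `_analyze` used above works on the 5.x line; as far as I know, Elasticsearch 6.0 and later expect the analyzer and text in a JSON body instead. The sketch below builds that body and adds two small helpers for reading a response like the one above. All three function names are hypothetical.

```python
import json
from collections import defaultdict


def build_analyze_body(analyzer, text):
    """Return the JSON body to POST to /<index>/_analyze."""
    return json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)


def tokens_by_position(response):
    """Group token strings by position; synonyms share the position of their source token."""
    grouped = defaultdict(list)
    for token in response["tokens"]:
        grouped[token["position"]].append(token["token"])
    return dict(sorted(grouped.items()))


def missing_positions(response):
    """List positions with no token, i.e. slots left empty by removed tokens."""
    present = {token["position"] for token in response["tokens"]}
    if not present:
        return []
    return [p for p in range(max(present) + 1) if p not in present]
```

Applied to the response above, position 12 holds 北京大学 together with its synonyms 北大 and pku, and positions 1, 5 and 6 are empty, presumably where 的, 了 and the comma were dropped.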
- Pay attention to ***jieba_synonym***: like ***jieba_stop***, it loads its file from a path under ${path.home}/config (here synonyms/synonyms.txt). The format of synonyms.txt:
```shell
北京大学,北大,pku
清华大学,清华,Tsinghua University
```
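
A quick check of synonyms.txt before creating the index can save a failed index creation. This sketch handles only the comma-separated form shown above, where all terms on a line are equivalent; it rejects the `=>` mapping form rather than interpreting it. `parse_synonyms` is a hypothetical helper, not part of the plugin.

```python
from typing import List


def parse_synonyms(text: str) -> List[List[str]]:
    """Parse comma-separated synonym lines into groups of equivalent terms."""
    groups = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            raise ValueError("line %d: explicit mappings are not handled by this sketch" % lineno)
        terms = [term.strip() for term in line.split(",") if term.strip()]
        if len(terms) < 2:
            raise ValueError("line %d: a synonym group needs at least two terms" % lineno)
        groups.append(terms)
    return groups
```

Terms may contain spaces, so `Tsinghua University` stays one term.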
- create document
```shell
POST http://localhost:9200/jieba_index/fulltext/1
```
```json
{"content":"中国的伟大时代来临了,欢迎参观北京大学PKU"}
```
- search
```shell
POST http://localhost:9200/jieba_index/fulltext/_search
```
Request body:
```json
{
  "query" : { "match" : { "content" : "pku" }},
  "highlight" : {
    "pre_tags" : ["", ""],
    "post_tags" : ["", ""],
    "fields" : {
      "content" : {}
    }
  }
}
```
Response body:
```json
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.52305835,
    "hits": [
      {
        "_index": "jieba_index",
        "_type": "fulltext",
        "_id": "1",
        "_score": 0.52305835,
        "_source": {
          "content": "中国的伟大时代来临了,欢迎参观北京大学PKU"
        },
        "highlight": {
          "content": [
            "中国的伟大时代来临了,欢迎参观北京大学PKU"
          ]
        }
      }
    ]
  }
}
```
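
For use from code, the search request and the interesting parts of its response can be handled with two small helpers. This is a sketch with hypothetical names; it leaves out `pre_tags` and `post_tags`, so Elasticsearch falls back to its default `<em>` highlight tags.

```python
def build_highlight_query(field, text):
    """Build a match query on `field` with highlighting enabled for the same field."""
    return {
        "query": {"match": {field: text}},
        "highlight": {"fields": {field: {}}},
    }


def extract_highlights(response, field):
    """Return (id, score, highlight fragments) for every hit in a search response."""
    results = []
    for hit in response["hits"]["hits"]:
        fragments = hit.get("highlight", {}).get(field, [])
        results.append((hit["_id"], hit["_score"], fragments))
    return results
```

`build_highlight_query("content", "pku")` produces the request body to POST to the `_search` endpoint used above.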
- aggregation example
Query:
```json
{
  "query": {
    "match": {
      "name": "lala"
    }
  },
  "_source": [
    "name"
  ],
  "aggs": {
    "dedup": {
      "terms": {
        "field": "your_agg_field"
      },
      "aggs": {
        "dedup_docs": {
          "top_hits": {
            "sort": [
              {
                "updatedAt": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "includes": [
                "name"
              ]
            },
            "size": 2
          }
        }
      }
    },
    "facets": {
      "terms": {
        "field": "your_facet_field"
      },
      "aggs": {
        "facets_docs": {
          "top_hits": {
            "sort": [
              {
                "updatedAt": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "includes": [
                "name"
              ]
            },
            "size": 1
          }
        }
      }
    }
  }
}
```
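
The response to a query like this nests the `top_hits` documents inside each `terms` bucket. A minimal sketch for pulling them out, assuming the usual response layout (`aggregations.<name>.buckets[].<sub_name>.hits.hits[]`); `top_hits_per_bucket` is a hypothetical helper.

```python
def top_hits_per_bucket(response, agg_name, sub_agg_name):
    """Map each terms bucket key to the _source documents of its top_hits sub-aggregation."""
    result = {}
    for bucket in response["aggregations"][agg_name]["buckets"]:
        hits = bucket[sub_agg_name]["hits"]["hits"]
        result[bucket["key"]] = [hit["_source"] for hit in hits]
    return result
```

For the query above, `top_hits_per_bucket(resp, "dedup", "dedup_docs")` would give up to two documents per `your_agg_field` value, and `top_hits_per_bucket(resp, "facets", "facets_docs")` one document per `your_facet_field` value.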
### NOTE
Migrated from [jieba-solr](https://github.com/sing1ee/jieba-solr).
### Roadmap
I plan to add support for more analyzers:
- Stanford Chinese analyzer
- Fudan NLP analyzer
- ...
If you have ideas, please create an issue, and we can work on it together.