# gse

**Repository Path**: veni0/gse

## Basic Information

- **Project Name**: gse
- **Description**: Efficient text segmentation in Go; supports English, Chinese, Japanese, and more
- **Primary Language**: Go
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 130
- **Forks**: 0
- **Created**: 2017-11-16
- **Last Updated**: 2025-08-07

## Categories & Tags

**Categories**: segment

**Tags**: None

## README

# gse

Efficient multilingual NLP and text segmentation in Go; supports English, Chinese, Japanese, and other languages. It also integrates with [elasticsearch](https://github.com/vcaesar/go-gse-elastic) and [bleve](https://github.com/vcaesar/gse-bleve).

[![Build Status](https://github.com/go-ego/gse/workflows/Go/badge.svg)](https://github.com/go-ego/gse/commits/master)
[![CircleCI Status](https://circleci.com/gh/go-ego/gse.svg?style=shield)](https://circleci.com/gh/go-ego/gse)
[![codecov](https://codecov.io/gh/go-ego/gse/branch/master/graph/badge.svg)](https://codecov.io/gh/go-ego/gse)
[![Build Status](https://travis-ci.org/go-ego/gse.svg)](https://travis-ci.org/go-ego/gse)
[![Go Report Card](https://goreportcard.com/badge/github.com/go-ego/gse)](https://goreportcard.com/report/github.com/go-ego/gse)
[![GoDoc](https://godoc.org/github.com/go-ego/gse?status.svg)](https://godoc.org/github.com/go-ego/gse)
[![GitHub release](https://img.shields.io/github/release/go-ego/gse.svg)](https://github.com/go-ego/gse/releases/latest)
[![Join the chat at https://gitter.im/go-ego/ego](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/go-ego/ego?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

[简体中文](https://github.com/go-ego/gse/blob/master/README_zh.md)

Gse is a Go implementation of jieba, and aims to add NLP support and more features.

## Features:

- Supports multiple word segmentation modes: common, search engine, full, precise, and HMM
- Supports user and embedded dictionaries, part-of-speech (POS) tagging, segment info analysis, and stop and trim words
- Supports multiple languages: English, Chinese, Japanese, and others
- Supports Traditional Chinese
- Supports HMM text cutting using the Viterbi algorithm
- NLP support via TensorFlow (in progress)
- Named Entity Recognition (in progress)
- Integrates with [elasticsearch](https://github.com/vcaesar/go-gse-elastic) and bleve
- Can run as a JSON RPC service

## Algorithm:

- The [Dictionary](https://github.com/go-ego/gse/blob/master/dictionary.go) is implemented with a double-array trie.
- The [Segmenter](https://github.com/go-ego/gse/blob/master/segmenter.go) uses a shortest-path algorithm (based on word frequency and dynamic programming), plus DAG-based and HMM-based word segmentation.

## Text Segmentation speed:

- Single thread: 9.2MB/s
- Concurrent goroutines: 26.8MB/s
- HMM text segmentation, single thread: 3.2MB/s

(Measured on a 2-core, 4-thread MacBook Pro.)

## Binding:

[gse-bind](https://github.com/vcaesar/gse-bind) provides bindings for JavaScript and other languages.

## Install / update

```
go get -u github.com/go-ego/gse
```

## Use

```go
package main

import (
	"fmt"
	"regexp"

	"github.com/go-ego/gse"
	"github.com/go-ego/gse/hmm/pos"
)

var (
	text = "Hello world, Helloworld. Winter is coming! 你好世界."

	// Segmenter created with the "zh" dictionary plus a custom dictionary file
	new, _ = gse.New("zh,testdata/test_dict3.txt", "alpha")

	seg    gse.Segmenter
	posSeg pos.Segmenter
)

func main() {
	// Loading the default dictionary
	seg.LoadDict()
	// Loading the default dictionary with embed
	// seg.LoadDictEmbed()
	//
	// Loading the simplified Chinese dictionary
	// seg.LoadDict("zh_s")
	// seg.LoadDictEmbed("zh_s")
	//
	// Loading the Traditional Chinese dictionary
	// seg.LoadDict("zh_t")
	//
	// Loading the Japanese dictionary
	// seg.LoadDict("jp")
	//
	// Load the dictionary
	// seg.LoadDict("your gopath"+"/src/github.com/go-ego/gse/data/dict/dictionary.txt")

	cut()

	segCut()
}

func cut() {
	hmm := new.Cut(text, true)
	fmt.Println("cut use hmm: ", hmm)

	hmm = new.CutSearch(text, true)
	fmt.Println("cut search use hmm: ", hmm)
	fmt.Println("analyze: ", new.Analyze(hmm, text))

	hmm = new.CutAll(text)
	fmt.Println("cut all: ", hmm)

	reg := regexp.MustCompile(`(\d+年|\d+月|\d+日|[\p{Latin}]+|[\p{Hangul}]+|\d+\.\d+|[a-zA-Z0-9]+)`)
	text1 := `헬로월드 헬로 서울, 2021年09月10日, 3.14`
	hmm = seg.CutDAG(text1, reg)
	fmt.Println("Cut with hmm and regexp: ", hmm, hmm[0], hmm[6])
}

func analyzeAndTrim(cut []string) {
	a := seg.Analyze(cut, "")
	fmt.Println("analyze the segment: ", a)

	cut = seg.Trim(cut)
	fmt.Println("cut all: ", cut)

	fmt.Println(seg.String(text, true))
	fmt.Println(seg.Slice(text, true))
}

func cutPos() {
	po := seg.Pos(text, true)
	fmt.Println("pos: ", po)
	po = seg.TrimPos(po)
	fmt.Println("trim pos: ", po)

	pos.WithGse(seg)
	po = posSeg.Cut(text, true)
	fmt.Println("pos: ", po)

	po = posSeg.TrimWithPos(po, "zg")
	fmt.Println("trim pos: ", po)
}

func segCut() {
	// Text Segmentation
	tb := []byte(text)
	fmt.Println(seg.String(text, true))

	segments := seg.Segment(tb)
	// Handle word segmentation results, search mode
	fmt.Println(gse.ToString(segments, true))
}
```

[Look at a custom dictionary example](/examples/dict/main.go)

```go
package main

import (
	_ "embed"
	"fmt"

	"github.com/go-ego/gse"
)

//go:embed test_dict3.txt
var testDict string

func main() {
	// var seg gse.Segmenter
	// seg.LoadDict("zh, testdata/test_dict.txt, testdata/test_dict1.txt")
	// seg.LoadStop()
	seg, err := gse.NewEmbed("zh, word 20 n"+testDict, "en")
	if err != nil {
		// Stop early if the embedded dictionary could not be loaded
		fmt.Println("load embedded dictionary failed:", err)
		return
	}
	// seg.LoadDictEmbed()
	seg.LoadStopEmbed()

	text1 := "你好世界, Hello world"
	fmt.Println(seg.Cut(text1, true))
	fmt.Println(seg.String(text1, true))

	segments := seg.Segment([]byte(text1))
	fmt.Println(gse.ToString(segments))
}
```

[Look at a Chinese example](/examples/main.go)

[Look at a Japanese example](/examples/jp/main.go)

## Elasticsearch

How to use it with elasticsearch?

[go-gse-elastic](https://github.com/vcaesar/go-gse-elastic)

## [Build-tools](https://github.com/go-ego/re)

```
go get -u github.com/go-ego/re
```

### re gse

To create a new gse application:

```
$ re gse my-gse
```

### re run

To run the application you just created, navigate to the application folder and execute:

```
$ cd my-gse && re run
```

## Authors

- [Maintainers](https://github.com/orgs/go-ego/people)
- [Contributors](https://github.com/go-ego/gse/graphs/contributors)

## License

Gse is primarily distributed under the terms of "both the MIT license and the Apache License (Version 2.0)". See [LICENSE-APACHE](http://www.apache.org/licenses/LICENSE-2.0), [LICENSE-MIT](https://github.com/go-vgo/robotgo/blob/master/LICENSE).

Thanks to [sego](https://github.com/huichen/sego) and [jieba](https://github.com/fxsjy/jieba) ([jiebago](https://github.com/wangbin/jiebago)).