6 Star 5 Fork 3

Gitee 极速下载 / Prose-Go

加入 Gitee
与超过 1200万 开发者一起发现、参与优秀开源项目,私有仓库也完全免费 :)
免费加入
此仓库是为了提升国内下载速度的镜像仓库,每日同步一次。 原始仓库: https://github.com/jdkato/prose
克隆/下载
贡献代码
同步代码
取消
提示: 由于 Git 不支持空文件夾,创建文件夹后会生成空的 .keep 文件
Loading...
README
MIT

prose Build Status GoDoc Coverage Status Go Report Card codebeat badge Awesome

prose is a natural language processing library (English only, at the moment) in pure Go. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.

You can find a more detailed summary on the library's performance here: Introducing prose v2.0.0: Bringing NLP to Go.

Installation

$ go get github.com/jdkato/prose/v2

Usage

Contents

Overview

package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag, tok.Label)
        // Go NNP B-GPE
        // is VBZ O
        // an DT O
        // ...
    }

    // Iterate over the doc's named-entities:
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Go GPE
        // Google GPE
    }

    // Iterate over the doc's sentences:
    for _, sent := range doc.Sentences() {
        fmt.Println(sent.Text)
        // Go is an open-source programming language created at Google.
    }
}

The document-creation process adheres to the following sequence of steps:

tokenization -> POS tagging -> NE extraction
            \
             segmentation

Each step may be disabled (assuming later steps aren't required) by passing the appropriate functional option. To disable named-entity extraction, for example, you'd do the following:

doc, err := prose.NewDocument(
        "Go is an open-source programming language created at Google.",
        prose.WithExtraction(false))

Tokenizing

prose includes a tokenizer capable of processing modern text, including the non-word character spans shown below.

Type Example
Email addresses Jane.Doe@example.com
Hashtags #trending
Mentions @jdkato
URLs https://github.com/jdkato/prose
Emoticons :-), >:(, o_0, etc.
package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("@jdkato, go to http://example.com thanks :).")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag)
        // @jdkato NN
        // , ,
        // go VB
        // to TO
        // http://example.com NN
        // thanks NNS
        // :) SYM
        // . .
    }
}

Segmenting

prose includes one of the most accurate sentence segmenters available, according to the Golden Rules created by the developers of the pragmatic_segmenter.

Name Language License GRS (English) GRS (Other) Speed†
Pragmatic Segmenter Ruby MIT 98.08% (51/52) 100.00% 3.84 s
prose Go MIT 75.00% (39/52) N/A 0.96 s
TactfulTokenizer Ruby GNU GPLv3 65.38% (34/52) 48.57% 46.32 s
OpenNLP Java APLv2 59.62% (31/52) 45.71% 1.27 s
Standford CoreNLP Java GNU GPLv3 59.62% (31/52) 31.43% 0.92 s
Splitta Python APLv2 55.77% (29/52) 37.14% N/A
Punkt Python APLv2 46.15% (24/52) 48.57% 1.79 s
SRX English Ruby GNU GPLv3 30.77% (16/52) 28.57% 6.19 s
Scapel Ruby GNU GPLv3 28.85% (15/52) 20.00% 0.13 s

† The original tests were performed using a MacBook Pro 3.7 GHz Quad-Core Intel Xeon E5 running 10.9.5, while prose was timed using a MacBook Pro 2.9 GHz Intel Core i7 running 10.13.3.

package main

import (
    "fmt"
    "strings"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, _ := prose.NewDocument(strings.Join([]string{
        "I can see Mt. Fuji from here.",
        "St. Michael's Church is on 5th st. near the light."}, " "))

    // Iterate over the doc's sentences:
    sents := doc.Sentences()
    fmt.Println(len(sents)) // 2
    for _, sent := range sents {
        fmt.Println(sent.Text)
        // I can see Mt. Fuji from here.
        // St. Michael's Church is on 5th st. near the light.
    }
}

Tagging

prose includes a tagger based on Textblob's "fast and accurate" POS tagger. Below is a comparison of its performance against NLTK's implementation of the same tagger on the Treebank corpus:

Library Accuracy 5-Run Average (sec)
NLTK 0.893 7.224
prose 0.961 2.538

(See scripts/test_model.py for more information.)

The full list of supported POS tags is given below.

TAG DESCRIPTION
( left round bracket
) right round bracket
, comma
: colon
. period
'' closing quotation mark
`` opening quotation mark
# number sign
$ currency
CC conjunction, coordinating
CD cardinal number
DT determiner
EX existential there
FW foreign word
IN conjunction, subordinating or preposition
JJ adjective
JJR adjective, comparative
JJS adjective, superlative
LS list item marker
MD verb, modal auxiliary
NN noun, singular or mass
NNP noun, proper singular
NNPS noun, proper plural
NNS noun, plural
PDT predeterminer
POS possessive ending
PRP pronoun, personal
PRP$ pronoun, possessive
RB adverb
RBR adverb, comparative
RBS adverb, superlative
RP adverb, particle
SYM symbol
TO infinitival to
UH interjection
VB verb, base form
VBD verb, past tense
VBG verb, gerund or present participle
VBN verb, past participle
VBP verb, non-3rd person singular present
VBZ verb, 3rd person singular present
WDT wh-determiner
WP wh-pronoun, personal
WP$ wh-pronoun, possessive
WRB wh-adverb

NER

prose v2.0.0 includes a much improved version of v1.0.0's chunk package, which can identify people (PERSON) and geographical/political Entities (GPE) by default.

package main

import (
    "github.com/jdkato/prose/v2"
)

func main() {
    doc, _ := prose.NewDocument("Lebron James plays basketball in Los Angeles.")
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Lebron James PERSON
        // Los Angeles GPE
    }
}

However, in an attempt to make this feature more useful, we've made it straightforward to train your own models for specific use cases. See Prodigy + prose: Radically efficient machine teaching in Go for a tutorial.

MIT License Copyright (c) 2017 -2018 Joseph Kato Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

简介

Prose 是一款用于Go 语言的文本处理库(主要是英文),支持 okenization(标记化)、part-of-speech tagging(词性标注)、named-entit 展开 收起
Go 等 4 种语言
MIT
取消

发行版

暂无发行版

贡献者

全部

近期动态

加载更多
不能加载更多了
Go
1
https://gitee.com/mirrors/Prose-Go.git
git@gitee.com:mirrors/Prose-Go.git
mirrors
Prose-Go
Prose-Go
master

搜索帮助