1 Star 0 Fork 0

孙晓雪 / python-stanford-corenlp

加入 Gitee
与超过 1200万 开发者一起发现、参与优秀开源项目,私有仓库也完全免费 :)
免费加入
克隆/下载
贡献代码
同步代码
取消
提示: 由于 Git 不支持空文件夾,创建文件夹后会生成空的 .keep 文件
Loading...
README
MIT

Stanford CoreNLP Python Interface

NOTE: This package is now deprecated. Please use the stanfordnlp package instead.

https://travis-ci.org/stanfordnlp/python-stanford-corenlp.svg?branch=master

This package contains a python interface for Stanford CoreNLP that contains a reference implementation to interface with the Stanford CoreNLP server. The package also contains a base class to expose a python-based annotation provider (e.g. your favorite neural NER system) to the CoreNLP pipeline via a lightweight service.

To use the package, first download the official java CoreNLP release, unzip it, and define an environment variable $CORENLP_HOME that points to the unzipped directory.

You can also install this package from PyPI using pip install stanford-corenlp


Command Line Usage

Probably the easiest way to use this package is through the annotate command-line utility:

usage: annotate [-h] [-i INPUT] [-o OUTPUT] [-f {json}]
                [-a ANNOTATORS [ANNOTATORS ...]] [-s] [-v] [-m MEMORY]
                [-p PROPS [PROPS ...]]

Annotate data

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input file to process; each line contains one document
                        (default: stdin)
  -o OUTPUT, --output OUTPUT
                        File to write annotations to (default: stdout)
  -f {json}, --format {json}
                        Output format
  -a ANNOTATORS [ANNOTATORS ...], --annotators ANNOTATORS [ANNOTATORS ...]
                        A list of annotators
  -s, --sentence-mode   Assume each line of input is a sentence.
  -v, --verbose-server  Server is made verbose
  -m MEMORY, --memory MEMORY
                        Memory to use for the server
  -p PROPS [PROPS ...], --props PROPS [PROPS ...]
                        Properties as a list of key=value pairs

We recommend using annotate in conjuction with the wonderful jq command to process the output. As an example, given a file with a sentence on each line, the following command produces an equivalent space-separated tokens:

cat file.txt | annotate -s -a tokenize | jq '[.tokens[].originalText]' > tokenized.txt

Annotation Server Usage

import corenlp

text = "Chris wrote a simple sentence that he parsed with Stanford CoreNLP."

# We assume that you've downloaded Stanford CoreNLP and defined an environment
# variable $CORENLP_HOME that points to the unzipped directory.
# The code below will launch StanfordCoreNLPServer in the background
# and communicate with the server to annotate the sentence.
with corenlp.CoreNLPClient(annotators="tokenize ssplit pos lemma ner depparse".split()) as client:
  ann = client.annotate(text)

# You can access annotations using ann.
sentence = ann.sentence[0]

# The corenlp.to_text function is a helper function that
# reconstructs a sentence from tokens.
assert corenlp.to_text(sentence) == text

# You can access any property within a sentence.
print(sentence.text)

# Likewise for tokens
token = sentence.token[0]
print(token.lemma)

# Use tokensregex patterns to find who wrote a sentence.
pattern = '([ner: PERSON]+) /wrote/ /an?/ []{0,3} /sentence|article/'
matches = client.tokensregex(text, pattern)
# sentences contains a list with matches for each sentence.
assert len(matches["sentences"]) == 1
# length tells you whether or not there are any matches in this
assert matches["sentences"][0]["length"] == 1
# You can access matches like most regex groups.
matches["sentences"][1]["0"]["text"] == "Chris wrote a simple sentence"
matches["sentences"][1]["0"]["1"]["text"] == "Chris"

# Use semgrex patterns to directly find who wrote what.
pattern = '{word:wrote} >nsubj {}=subject >dobj {}=object'
matches = client.semgrex(text, pattern)
# sentences contains a list with matches for each sentence.
assert len(matches["sentences"]) == 1
# length tells you whether or not there are any matches in this
assert matches["sentences"][0]["length"] == 1
# You can access matches like most regex groups.
matches["sentences"][1]["0"]["text"] == "wrote"
matches["sentences"][1]["0"]["$subject"]["text"] == "Chris"
matches["sentences"][1]["0"]["$object"]["text"] == "sentence"

See test_client.py and test_protobuf.py for more examples. Props to @dan-zheng for tokensregex/semgrex support.

Annotation Service Usage

NOTE: The annotation service allows users to provide a custom annotator to be used by the CoreNLP pipeline. Unfortunately, it relies on experimental code internal to the Stanford CoreNLP project is not yet available for public use.

import corenlp
from .happyfuntokenizer import Tokenizer

class HappyFunTokenizer(Tokenizer, corenlp.Annotator):
    def __init__(self, preserve_case=False):
        Tokenizer.__init__(self, preserve_case)
        corenlp.Annotator.__init__(self)

    @property
    def name(self):
        """
        Name of the annotator (used by CoreNLP)
        """
        return "happyfun"

    @property
    def requires(self):
        """
        Requires has to specify all the annotations required before we
        are called.
        """
        return []

    @property
    def provides(self):
        """
        The set of annotations guaranteed to be provided when we are done.
        NOTE: that these annotations are either fully qualified Java
        class names or refer to nested classes of
        edu.stanford.nlp.ling.CoreAnnotations (as is the case below).
        """
        return ["TextAnnotation",
                "TokensAnnotation",
                "TokenBeginAnnotation",
                "TokenEndAnnotation",
                "CharacterOffsetBeginAnnotation",
                "CharacterOffsetEndAnnotation",
               ]

    def annotate(self, ann):
        """
        @ann: is a protobuf annotation object.
        Actually populate @ann with tokens.
        """
        buf, beg_idx, end_idx = ann.text.lower(), 0, 0
        for i, word in enumerate(self.tokenize(ann.text)):
            token = ann.sentencelessToken.add()
            # These are the bare minimum required for the TokenAnnotation
            token.word = word
            token.tokenBeginIndex = i
            token.tokenEndIndex = i+1

            # Seek into the txt until you can find this word.
            try:
                # Try to update beginning index
                beg_idx = buf.index(word, beg_idx)
            except ValueError:
                # Give up -- this will be something random
                end_idx = beg_idx + len(word)

            token.beginChar = beg_idx
            token.endChar = end_idx

            beg_idx, end_idx = end_idx, end_idx

annotator = HappyFunTokenizer()
# Calling .start() will launch the annotator as a service running on
# port 8432 by default.
annotator.start()

# annotator.properties contains all the right properties for
# Stanford CoreNLP to use this annotator.
with corenlp.CoreNLPClient(properties=annotator.properties, annotators="happyfun ssplit pos".split()) as client:
    ann = client.annotate("RT @ #happyfuncoding: this is a typical Twitter tweet :-)")

    tokens = [t.word for t in ann.sentence[0].token]
    print(tokens)

See test_annotator.py for more examples.

MIT License Copyright (c) 2017 Stanford NLP Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

简介

Python interface to CoreNLP using a bidirectional server-client interface. 展开 收起
MIT
取消

发行版

暂无发行版

贡献者

全部

近期动态

加载更多
不能加载更多了
1
https://gitee.com/sunxiaoxue113/python-stanford-corenlp.git
git@gitee.com:sunxiaoxue113/python-stanford-corenlp.git
sunxiaoxue113
python-stanford-corenlp
python-stanford-corenlp
master

搜索帮助