1 Star 3 Fork 2

Cetteest/pycorrector

加入 Gitee
与超过 1200万 开发者一起发现、参与优秀开源项目,私有仓库也完全免费 :)
免费加入
克隆/下载
贡献代码
同步代码
取消
提示: 由于 Git 不支持空文件夾,创建文件夹后会生成空的 .keep 文件
Loading...
README
Apache-2.0

alt text

PyPI version Downloads GitHub contributors License Apache 2.0 python_vesion GitHub issues Wechat Group

English | 简体中文

pycorrector

中文文本纠错工具。支持中文音似、形似、语法错误纠正,python3开发。

pycorrector实现了Kenlm、ConvSeq2Seq、BERT、MacBERT、ELECTRA、ERNIE、Transformer等多种模型的文本纠错,并在SigHAN数据集评估各模型的效果。

Guide

Question

中文文本纠错任务,常见错误类型:

当然,针对不同业务场景,这些问题并不一定全部存在,比如拼音输入法、语音识别校对关注音似错误;五笔输入法、OCR校对关注形似错误, 搜索引擎query纠错关注所有错误类型。

本项目重点解决其中的"音似、形字、语法、专名错误"等类型。

Solution

规则的解决思路

依据语言模型检测错别字位置,通过拼音音似特征、笔画五笔编辑距离特征及语言模型困惑度特征纠正错别字。

  1. 中文纠错分为两步走,第一步是错误检测,第二步是错误纠正;
  2. 错误检测部分先通过结巴中文分词器切词,由于句子中含有错别字,所以切词结果往往会有切分错误的情况,这样从字粒度和词粒度两方面检测错误, 整合这两种粒度的疑似错误结果,形成疑似错误位置候选集;
  3. 错误纠正部分,是遍历所有的疑似错误位置,并使用音似、形似词典替换错误位置的词,然后通过语言模型计算句子困惑度,对所有候选集结果比较并排序,得到最优纠正词。

深度模型的解决思路

  1. 端到端的深度模型可以避免人工提取特征,减少人工工作量,RNN序列模型对文本任务拟合能力强,RNN Attn在英文文本纠错比赛中取得第一名成绩,证明应用效果不错;
  2. CRF会计算全局最优输出节点的条件概率,对句子中特定错误类型的检测,会根据整句话判定该错误,阿里参赛2016中文语法纠错任务并取得第一名,证明应用效果不错;
  3. Seq2Seq模型是使用Encoder-Decoder结构解决序列转换问题,目前在序列转换任务中(如机器翻译、对话生成、文本摘要、图像描述)使用最广泛、效果最好的模型之一;
  4. BERT/ELECTRA/ERNIE/MacBERT等预训练模型强大的语言表征能力,对NLP届带来翻天覆地的改变,海量的训练数据拟合的语言模型效果无与伦比,基于其MASK掩码的特征,可以简单改造预训练模型用于纠错,加上fine-tune,效果轻松达到最优。

PS:

Feature

  • Kenlm模型:本项目基于Kenlm统计语言模型工具训练了中文NGram语言模型,结合规则方法、混淆集可以纠正中文拼写错误,方法速度快,扩展性强,效果一般
  • MacBERT模型【推荐】:本项目基于PyTorch实现了用于中文文本纠错的MacBERT4CSC模型,模型加入了错误检测和纠正网络,适配中文拼写纠错任务,效果好
  • Seq2Seq模型:本项目基于PyTorch实现了用于中文文本纠错的Seq2Seq模型、ConvSeq2Seq模型,其中ConvSeq2Seq在NLPCC-2018的中文语法纠错比赛中,使用单模型并取得第三名,可以并行训练,模型收敛快,效果一般
  • T5模型:本项目基于PyTorch实现了用于中文文本纠错的T5模型,使用Langboat/mengzi-t5-base的预训练模型fine-tune中文纠错数据集,模型改造的潜力较大,效果好
  • BERT模型:本项目基于PyTorch实现了基于原生BERT的fill-mask能力进行纠正错字的方法,效果差
  • ELECTRA模型:本项目基于PyTorch实现了基于原生ELECTRA的fill-mask能力进行纠正错字的方法,效果差
  • ERNIE_CSC模型:本项目基于PaddlePaddle实现了用于中文文本纠错的ERNIE_CSC模型,模型在ERNIE-1.0上fine-tune,模型结构适配了中文拼写纠错任务,效果好
  • DeepContext模型:本项目基于PyTorch实现了用于文本纠错的DeepContext模型,该模型结构参考Stanford University的NLC模型,2014英文纠错比赛得第一名,效果一般
  • Transformer模型:本项目基于PyTorch的fairseq库调研了Transformer模型用于中文文本纠错,效果一般

思考

  1. 规则的方法,在词粒度的错误召回还不错,但错误纠正的准确率还有待提高,更多优质的纠错集及纠错词库会有提升,我更希望算法模型上有更大的突破。
  2. 现在的文本错误不再局限于字词粒度上的拼写错误,需要提高中文语法错误检测(CGED, Chinese Grammar Error Diagnosis)及纠正能力,列在TODO中,后续调研。

Demo

Official Demo: https://www.mulanai.com/product/corrector/

HuggingFace Demo: https://huggingface.co/spaces/shibing624/pycorrector

run example: examples/gradio_demo.py to see the demo:

python examples/gradio_demo.py

Evaluation

提供评估脚本examples/evaluate_models.py

  • 使用sighan15评估集:SIGHAN2015的测试集pycorrector/data/cn/sighan_2015/test.tsv ,已经转为简体中文。
  • 评估标准:纠错准召率,采用严格句子粒度(Sentence Level)计算方式,把模型纠正之后的与正确句子完成相同的视为正确,否则为错。

评估结果

评估数据集:SIGHAN2015测试集

GPU:Tesla V100,显存 32 GB

模型 Backbone GPU Precision Recall F1 QPS
Rule(pycorrector.correct) kenlm CPU 0.6860 0.1529 0.2500 9
BERT bert-base-chinese GPU 0.8029 0.4052 0.5386 2
BART fnlp/bart-base-chinese GPU 0.6984 0.6354 0.6654 58
T5 byt5-small GPU 0.5220 0.3941 0.4491 111
Mengzi-T5 mengzi-t5-base GPU 0.8321 0.6390 0.7229 214
ConvSeq2Seq ConvSeq2Seq GPU 0.2415 0.1436 0.1801 6
MacBert macbert-base-chinese GPU 0.8254 0.7311 0.7754 224

结论

  • 中文拼写纠错模型效果最好的是MacBert,模型名称是shibing624/macbert4csc-base-chinese,huggingface model:shibing624/macbert4csc-base-chinese
  • 中文语法纠错模型效果最好的是Seq2Seq,模型名称是shibing624/bart4csc-base-chinese,huggingface model:shibing624/bart4csc-base-chinese
  • 最具潜力的模型是T5,模型名称是shibing624/mengzi-t5-base-chinese-correction,huggingface model:shibing624/mengzi-t5-base-chinese-correction,未改变模型结构,仅fine-tune中文纠错数据集,已经在SIGHAN 2015取得接近SOTA的效果

Install

pip install -U pycorrector

or

pip install -r requirements.txt

git clone https://github.com/shibing624/pycorrector.git
cd pycorrector
pip install --no-deps .

通过以上两种方法的任何一种完成安装都可以。如果不想安装依赖包,直接使用docker拉取安装好的部署环境即可。

安装依赖

  • docker使用
docker run -it -v ~/.pycorrector:/root/.pycorrector shibing624/pycorrector:0.0.2

后续调用python使用即可,该镜像已经安装好kenlm、pycorrector等包,具体参见Dockerfile

使用示例:

docker

  • kenlm安装
pip install https://github.com/kpu/kenlm/archive/master.zip

安装kenlm-wiki

  • 其他库包安装
pip install -r requirements.txt

Usage

文本纠错

example: examples/base_demo.py

import pycorrector

corrected_sent, detail = pycorrector.correct('少先队员因该为老人让坐')
print(corrected_sent, detail)

output:

少先队员应该为老人让座 [('因该', '应该', 4, 6), ('坐', '座', 10, 11)]

规则方法默认会从路径~/.pycorrector/datasets/zh_giga.no_cna_cmn.prune01244.klm加载kenlm语言模型文件,如果检测没有该文件, 则程序会自动联网下载。当然也可以手动下载模型文件(2.8G)并放置于该位置。

错误检测

example: examples/detect_demo.py

import pycorrector

idx_errors = pycorrector.detect('少先队员因该为老人让坐')
print(idx_errors)

output:

[['因该', 4, 6, 'word'], ['坐', 10, 11, 'char']]

返回类型是list, [error_word, begin_pos, end_pos, error_type]pos索引位置以0开始。

成语、专名纠错

example: examples/proper_correct_demo.py

import sys

sys.path.append("..")
from pycorrector.proper_corrector import ProperCorrector

m = ProperCorrector()
x = [
    '报应接中迩来',
    '今天在拼哆哆上买了点苹果',
]

for i in x:
    print(i, ' -> ', m.proper_correct(i))

output:

报应接中迩来  ->  ('报应接踵而来', [('接中迩来', '接踵而来', 2, 6)])
今天在拼哆哆上买了点苹果  ->  ('今天在拼多多上买了点苹果', [('拼哆哆', '拼多多', 3, 6)])

自定义混淆集

通过加载自定义混淆集,支持用户纠正已知的错误,包括两方面功能:1)【提升准确率】误杀加白;2)【提升召回率】补充召回。

example: examples/use_custom_confusion.py

import pycorrector

error_sentences = [
    '买iphonex,要多少钱',
    '共同实际控制人萧华、霍荣铨、张旗康',
]
for line in error_sentences:
    print(pycorrector.correct(line))

print('*' * 42)
pycorrector.set_custom_confusion_path_or_dict('./my_custom_confusion.txt')
for line in error_sentences:
    print(pycorrector.correct(line))

output:

('买iphonex,要多少钱', [])   # "iphonex"漏召,应该是"iphoneX"
('共同实际控制人萧华、霍荣铨、张启康', [['张旗康', '张启康', 14, 17]]) # "张启康"误杀,应该不用纠
*****************************************************
('买iphonex,要多少钱', [['iphonex', 'iphoneX', 1, 8]])
('共同实际控制人萧华、霍荣铨、张旗康', [])

其中./my_custom_confusion.txt的内容格式如下,以空格间隔:

iPhone差 iPhoneX
张旗康 张旗康

混淆集功能在correct方法中生效; set_custom_confusion_dict方法的path参数为用户自定义混淆集文件路径(str)或混淆集字典(dict)。

自定义语言模型

默认提供下载并使用的kenlm语言模型zh_giga.no_cna_cmn.prune01244.klm文件是2.8G,内存小的电脑使用pycorrector程序可能会吃力些。

支持用户加载自己训练的kenlm语言模型,或使用2014版人民日报数据训练的模型,模型小(140M),准确率稍低,模型下载地址:people2014corpus_chars.klm(密码o5e9)

example:examples/load_custom_language_model.py

from pycorrector import Corrector
import os

pwd_path = os.path.abspath(os.path.dirname(__file__))
lm_path = os.path.join(pwd_path, './people2014corpus_chars.klm')
model = Corrector(language_model_path=lm_path)

corrected_sent, detail = model.correct('少先队员因该为老人让坐')
print(corrected_sent, detail)

output:

少先队员应该为老人让座 [('因该', '应该', 4, 6), ('坐', '座', 10, 11)]

英文拼写纠错

支持英文单词级别的拼写错误纠正。

example:examples/en_correct_demo.py

import pycorrector

sent = "what happending? how to speling it, can you gorrect it?"
corrected_text, details = pycorrector.en_correct(sent)
print(sent, '=>', corrected_text)
print(details)

output:

what happending? how to speling it, can you gorrect it?
=> what happening? how to spelling it, can you correct it?
[('happending', 'happening', 5, 15), ('speling', 'spelling', 24, 31), ('gorrect', 'correct', 44, 51)]

中文简繁互换

支持中文繁体到简体的转换,和简体到繁体的转换。

example:examples/traditional_simplified_chinese_demo.py

import pycorrector

traditional_sentence = '憂郁的臺灣烏龜'
simplified_sentence = pycorrector.traditional2simplified(traditional_sentence)
print(traditional_sentence, '=>', simplified_sentence)

simplified_sentence = '忧郁的台湾乌龟'
traditional_sentence = pycorrector.simplified2traditional(simplified_sentence)
print(simplified_sentence, '=>', traditional_sentence)

output:

憂郁的臺灣烏龜 => 忧郁的台湾乌龟
忧郁的台湾乌龟 => 憂郁的臺灣烏龜

命令行模式

支持批量文本纠错

python -m pycorrector -h
usage: __main__.py [-h] -o OUTPUT [-n] [-d] input

@description:

positional arguments:
  input                 the input file path, file encode need utf-8.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        the output file path.
  -n, --no_char         disable char detect mode.
  -d, --detail          print detail info

case:

python -m pycorrector input.txt -o out.txt -n -d

输入文件:input.txt;输出文件:out.txt ;关闭字粒度纠错;打印详细纠错信息;纠错结果以\t间隔

Deep Model Usage

本项目的初衷之一是比对、共享各种文本纠错方法,抛砖引玉的作用,如果对大家在文本纠错任务上有一点小小的启发就是我莫大的荣幸了。

主要使用了多种深度模型应用于文本纠错任务,分别是前面模型小节介绍的macbertseq2seqbertelectratransformerernie-cscT5,各模型方法内置于pycorrector文件夹下,有README.md详细指导,各模型可独立运行,相互之间无依赖。

  • 安装依赖
pip install -r requirements-dev.txt

使用方法

各模型均可独立的预处理数据、训练、预测。

MacBert4csc模型[推荐]

基于MacBERT改变网络结构的中文拼写纠错模型,模型已经开源在HuggingFace Models:https://huggingface.co/shibing624/macbert4csc-base-chinese

模型网络结构:

  • 本项目是 MacBERT 改变网络结构的中文文本纠错模型,可支持 BERT 类模型为 backbone。
  • 在原生 BERT 模型上进行了魔改,追加了一个全连接层作为错误检测即 detection , MacBERT4CSC 训练时用 detection 层和 correction 层的 loss 加权得到最终的 loss。预测时用 BERT MLM 的 correction 权重即可。

macbert_network

详细教程参考pycorrector/macbert/README.md

example:examples/macbert_demo.py

使用pycorrector调用纠错:

import sys

sys.path.append("..")
from pycorrector.macbert.macbert_corrector import MacBertCorrector

if __name__ == '__main__':
    error_sentences = [
        '真麻烦你了。希望你们好好的跳无',
        '少先队员因该为老人让坐',
        '机七学习是人工智能领遇最能体现智能的一个分知',
        '一只小鱼船浮在平净的河面上',
        '我的家乡是有明的渔米之乡',
    ]

    m = MacBertCorrector("shibing624/macbert4csc-base-chinese")
    for line in error_sentences:
        correct_sent, err = m.macbert_correct(line)
        print("query:{} => {}, err:{}".format(line, correct_sent, err))

output:

query:真麻烦你了。希望你们好好的跳无 => 真麻烦你了。希望你们好好的跳舞, err:[('无', '舞', 14, 15)]
query:少先队员因该为老人让坐 => 少先队员应该为老人让坐, err:[('因', '应', 4, 5)]
query:机七学习是人工智能领遇最能体现智能的一个分知 => 机器学习是人工智能领域最能体现智能的一个分知, err:[('七', '器', 1, 2), ('遇', '域', 10, 11)]
query:一只小鱼船浮在平净的河面上 => 一只小鱼船浮在平净的河面上, err:[]
query:我的家乡是有明的渔米之乡 => 我的家乡是有名的渔米之乡, err:[('明', '名', 6, 7)]

使用原生transformers库调用纠错:

import operator
import torch
from transformers import BertTokenizerFast, BertForMaskedLM
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = BertTokenizerFast.from_pretrained("shibing624/macbert4csc-base-chinese")
model = BertForMaskedLM.from_pretrained("shibing624/macbert4csc-base-chinese")
model.to(device)

texts = ["今天新情很好", "你找到你最喜欢的工作,我也很高心。"]

text_tokens = tokenizer(texts, padding=True, return_tensors='pt').to(device)
with torch.no_grad():
    outputs = model(**text_tokens)

def get_errors(corrected_text, origin_text):
    sub_details = []
    for i, ori_char in enumerate(origin_text):
        if ori_char in [' ', '“', '”', '‘', '’', '\n', '…', '—', '擤']:
            # add unk word
            corrected_text = corrected_text[:i] + ori_char + corrected_text[i:]
            continue
        if i >= len(corrected_text):
            break
        if ori_char != corrected_text[i]:
            if ori_char.lower() == corrected_text[i]:
                # pass english upper char
                corrected_text = corrected_text[:i] + ori_char + corrected_text[i + 1:]
                continue
            sub_details.append((ori_char, corrected_text[i], i, i + 1))
    sub_details = sorted(sub_details, key=operator.itemgetter(2))
    return corrected_text, sub_details

result = []
for ids, (i, text) in zip(outputs.logits, enumerate(texts)):
    _text = tokenizer.decode((torch.argmax(ids, dim=-1) * text_tokens.attention_mask[i]),
                             skip_special_tokens=True).replace(' ', '')
    corrected_text, details = get_errors(_text, text)
    print(text, ' => ', corrected_text, details)
    result.append((corrected_text, details))
print(result)

output:

今天新情很好  =>  今天心情很好 [('新', '心', 2, 3)]
你找到你最喜欢的工作,我也很高心。  =>  你找到你最喜欢的工作,我也很高兴。 [('心', '兴', 15, 16)]

模型文件:

macbert4csc-base-chinese
    ├── config.json
    ├── added_tokens.json
    ├── pytorch_model.bin
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── vocab.txt

ErnieCSC模型

基于ERNIE的中文拼写纠错模型,模型已经开源在PaddleNLP的 模型库中https://bj.bcebos.com/paddlenlp/taskflow/text_correction/csc-ernie-1.0/csc-ernie-1.0.pdparams

模型网络结构:

详细教程参考pycorrector/ernie_csc/README.md

example:examples/ernie_csc_demo.py

使用pycorrector调用纠错:


from pycorrector.ernie_csc.ernie_csc_corrector import ErnieCSCCorrector

if __name__ == '__main__':
    error_sentences = [
        '真麻烦你了。希望你们好好的跳无',
        '少先队员因该为老人让坐',
        '机七学习是人工智能领遇最能体现智能的一个分知',
        '一只小鱼船浮在平净的河面上',
        '我的家乡是有明的渔米之乡',
    ]
    corrector = ErnieCSCCorrector("csc-ernie-1.0")
    for line in error_sentences:
        result = corrector.ernie_csc_correct(line)[0]
        print("query:{} => {}, err:{}".format(line, result['target'], result['errors']))

output:


query:真麻烦你了。希望你们好好的跳无 => 真麻烦你了。希望你们好好的跳舞, err:[{'position': 14, 'correction': {'无': '舞'}}]
query:少先队员因该为老人让坐 => 少先队员应该为老人让座, err:[{'position': 4, 'correction': {'因': '应'}}, {'position': 10, 'correction': {'坐': '座'}}]
query:机七学习是人工智能领遇最能体现智能的一个分知 => 机器学习是人工智能领域最能体现智能的一个分知, err:[{'position': 1, 'correction': {'七': '器'}}, {'position': 10, 'correction': {'遇': '域'}}]
query:一只小鱼船浮在平净的河面上 => 一只小鱼船浮在平净的河面上, err:[]
query:我的家乡是有明的渔米之乡 => 我的家乡是有名的渔米之乡, err:[{'position': 6, 'correction': {'明': '名'}}]

使用PaddleNLP库调用纠错:

可以使用PaddleNLP提供的Taskflow工具来对输入的文本进行一键纠错,具体使用方法如下:


from paddlenlp import Taskflow

text_correction = Taskflow("text_correction")
text_correction('遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。')
text_correction('人生就是如此,经过磨练才能让自己更加拙壮,才能使自己更加乐观。')

output:


[{'source': '遇到逆竟时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。',
    'target': '遇到逆境时,我们必须勇于面对,而且要愈挫愈勇,这样我们才能朝著成功之路前进。',
    'errors': [{'position': 3, 'correction': {'竟': '境'}}]}]

[{'source': '人生就是如此,经过磨练才能让自己更加拙壮,才能使自己更加乐观。',
    'target': '人生就是如此,经过磨练才能让自己更加茁壮,才能使自己更加乐观。',
    'errors': [{'position': 18, 'correction': {'拙': '茁'}}]}]

Bart模型

from transformers import BertTokenizerFast
from textgen import BartSeq2SeqModel

tokenizer = BertTokenizerFast.from_pretrained('shibing624/bart4csc-base-chinese')
model = BartSeq2SeqModel(
    encoder_type='bart',
    encoder_decoder_type='bart',
    encoder_decoder_name='shibing624/bart4csc-base-chinese',
    tokenizer=tokenizer,
    args={"max_length": 128, "eval_batch_size": 128})
sentences = ["少先队员因该为老人让坐"]
print(model.predict(sentences))

output:

['少先队员应该为老人让座']

如果需要训练Bart模型,请参考 https://github.com/shibing624/textgen/blob/main/examples/seq2seq/training_bartseq2seq_zh_demo.py

Release models

基于SIGHAN+Wang271K中文纠错数据集训练的Bart模型,已经release到HuggingFace Models:

ConvSeq2Seq模型

pycorrector/seq2seq 模型使用示例:

训练

data example:

# train.txt:
你说的是对,跟那些失业的人比起来你也算是辛运的。	你说的是对,跟那些失业的人比起来你也算是幸运的。
cd seq2seq
python train.py

convseq2seq训练sighan数据集(2104条样本),200个epoch,单卡P40GPU训练耗时:3分钟。

预测

python infer.py

output:

result image

  1. 如果训练数据太少(不足万条),深度模型拟合不足,会出现预测结果全为unk的情况,解决方法:增大训练样本集,使用下方提供的纠错熟语料(nlpcc2018+hsk,130万对句子)试试。
  2. 深度模型训练耗时长,有GPU尽量用GPU,加速训练,节省时间。

Release models

基于SIGHAN2015数据集训练的convseq2seq模型,已经release到github:

Dataset

数据集 语料 下载链接 压缩包大小
SIGHAN+Wang271K中文纠错数据集 SIGHAN+Wang271K(27万条) 百度网盘(密码01b9) 106M
原始SIGHAN数据集 SIGHAN13 14 15 官方csc.html 339K
原始Wang271K数据集 Wang271K Automatic-Corpus-Generation dimmywang提供 93M
人民日报2014版语料 人民日报2014版 飞书(密码cHcu) 383M
NLPCC 2018 GEC官方数据集 NLPCC2018-GEC 官方trainingdata 114M
NLPCC 2018+HSK熟语料 nlpcc2018+hsk+CGED 百度网盘(密码m6fg)
飞书(密码gl9y)
215M
NLPCC 2018+HSK原始语料 HSK+Lang8 百度网盘(密码n31j)
飞书(密码Q9LH)
81M

说明:

  • SIGHAN+Wang271K中文纠错数据集(27万条),是通过原始SIGHAN13、14、15年数据集和Wang271K数据集格式转化后得到,json格式,带错误字符位置信息,SIGHAN为test.json, macbert4csc模型训练可以直接用该数据集复现paper准召结果,详见pycorrector/macbert/README.md
  • NLPCC 2018 GEC官方数据集NLPCC2018-GEC, 训练集trainingdata[解压后114.5MB],该数据格式是原始文本,未做切词处理。
  • 汉语水平考试(HSK)和lang8原始平行语料[HSK+Lang8]百度网盘(密码n31j),该数据集已经切词,可用作数据扩增。
  • NLPCC 2018 + HSK + CGED16、17、18的数据,经过以字切分,繁体转简体,打乱数据顺序的预处理后,生成用于纠错的熟语料(nlpcc2018+hsk) ,百度网盘(密码:m6fg) [130万对句子,215MB]

Language Model

什么是语言模型?-wiki

语言模型对于纠错步骤至关重要,当前默认使用的是从千兆中文文本训练的中文语言模型zh_giga.no_cna_cmn.prune01244.klm(2.8G), 提供人民日报2014版语料训练得到的轻量版语言模型people2014corpus_chars.klm(密码o5e9)

大家可以用中文维基(繁体转简体,pycorrector.utils.text_utils下有此功能)等语料数据训练通用的语言模型,或者也可以用专业领域语料训练更专用的语言模型。更适用的语言模型,对于纠错效果会有比较好的提升。

  1. kenlm语言模型训练工具的使用,请见博客:http://blog.csdn.net/mingzai624/article/details/79560063
  2. 附上训练语料<人民日报2014版熟语料>,包括: 1)标准人工切词及词性数据people2014.tar.gz, 2)未切词文本数据people2014_words.txt, 3)kenlm训练字粒度语言模型文件及其二进制文件people2014corpus_chars.arps/klm, 4)kenlm词粒度语言模型文件及其二进制文件people2014corpus_words.arps/klm。

尊重版权,传播请注明出处。

Todo

  • 优化形似字字典,提高形似字纠错准确率
  • 整理中文纠错训练数据,使用seq2seq做深度中文纠错模型
  • 添加中文语法错误检测及纠正能力
  • 规则方法添加用户自定义纠错集,并将其纠错优先度调为最高
  • seq2seq_attention 添加dropout,减少过拟合
  • 在seq2seq模型框架上,新增Pointer-generator network、Beam search、Unknown words replacement、Coverage mechanism等特性
  • 更新bert的fine-tuned使用wiki,适配transformers 2.10.0库
  • 升级代码,兼容TensorFlow 2.0库
  • 升级bert纠错逻辑,提升基于mask的纠错效果
  • 新增基于electra模型的纠错逻辑,参数更小,预测更快
  • 新增专用于纠错任务深度模型,使用bert/ernie预训练模型,加入文本音似、形似特征。

Contact

  • Github Issue(建议):GitHub issues
  • Github discussions:欢迎到讨论区GitHub discussions灌水(不会打扰开发者),公开交流纠错技术和问题
  • 邮件我:xuming: xuming624@qq.com
  • 微信我:加我微信号:xuming624, 进Python-NLP交流群,备注:姓名-公司名-NLP

Citation

如果你在研究中使用了pycorrector,请按如下格式引用:

APA:

Xu, M. Pycorrector: Text error correction tool (Version 0.4.2) [Computer software]. https://github.com/shibing624/pycorrector

BibTeX:

@software{Xu_Pycorrector_Text_error,
author = {Xu, Ming},
title = {Pycorrector: Text error correction tool},
url = {https://github.com/shibing624/pycorrector},
version = {0.4.2}
}

License

pycorrector 的授权协议为 Apache License 2.0,可免费用做商业用途。请在产品说明中附加pycorrector的链接和授权协议。

Contribute

项目代码还很粗糙,如果大家对代码有所改进,欢迎提交回本项目,在提交之前,注意以下两点:

  • tests添加相应的单元测试
  • 使用python -m pytest来运行所有单元测试,确保所有单测都是通过的

之后即可提交PR。

Reference

Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. Copyright [yyyy] [name of copyright owner] Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

简介

文本纠错,用于提升成绩 展开 收起
README
Apache-2.0
取消

发行版

暂无发行版

贡献者

全部

近期动态

不能加载更多了
马建仓 AI 助手
尝试更多
代码解读
代码找茬
代码优化
1
https://gitee.com/cette/pycorrector.git
git@gitee.com:cette/pycorrector.git
cette
pycorrector
pycorrector
master

搜索帮助