1 Star 0 Fork 0

赵涵/JioNLP

加入 Gitee
与超过 1200万 开发者一起发现、参与优秀开源项目,私有仓库也完全免费 :)
免费加入
克隆/下载
贡献代码
同步代码
取消
提示: 由于 Git 不支持空文件夾,创建文件夹后会生成空的 .keep 文件
Loading...
README
Apache-2.0

   ——JioNLP:中文 NLP 预处理工具包 A Python Library for Chinese NLP Preprocessing

   pip install -i https://test.pypi.org/simple/ jionlp

  • 做 NLP 任务,需要清洗、过滤语料?用 JioNLP
  • 做 NLP 任务,需要做信息抽取?用 JioNLP
  • 做 NLP 任务,需要做数据增强?用 JioNLP
  • 做 NLP 任务,需要给模型添加偏旁、拼音、词典、繁体转换信息?用 JioNLP

总之,JioNLP 提供 NLP 任务预处理功能,准确、高效、零使用门槛,并提供一步到位的查阅入口。

功能主要包括:文本清洗,去除HTML标签、异常字符、冗余字符,转换全角字母、数字、空格为半角,抽取及删除E-mail及域名、电话号码、QQ号、括号内容、身份证号、IP地址、URL超链接、货币金额与单位,金额数字转大写汉字,解析身份证号信息、手机号码归属地、座机区号归属地、手机号码运营商,按行快速读写文件,(多功能)停用词过滤,(优化的)分句,地址解析,新闻地域识别,繁简体转换,汉字转拼音,汉字偏旁、字形、四角编码拆解,基于词典的情感分析,色情数据过滤,反动数据过滤,关键短语抽取,抽取式文本摘要,成语接龙,成语词典、歇后语词典、新华字典、新华词典、停用词典、中国地名词典、世界地名词典,基于词典的NER,NER的字、词级别转换,NER的entity和tag格式转换,NER模型的预测阶段加速并行工具集,NER标注和模型预测的结果差异对比,NER标注数据集分割与统计,文本分类标注数据集的分割与统计,回译数据增强。

Update 2021-01-22

更新 地址解析

jio.parse_location

给定一个包含国内地址字符串,识别其中的省、市、县区、乡镇街道、村社等信息。新增旧地名自动转为新地名

>>> import jionlp as jio
>>> text = '长岛县小冯村'
>>> print(jio.parse_location(text, change2new=True, town_village=False))

# {'province': '山东省',
# 'city': '烟台市',
# 'county': '蓬莱区',
# 'detail': '小冯村',
# 'full_location': '山东省烟台市蓬莱区小冯村',
# 'orig_location': '长岛县小冯村'}

Update 2021-01-19

新增 金额数字转汉字

jio.money_num2char

给定一条数字金额,返回其汉字大写结果。

>>> import jionlp as jio
>>> num = 120402810.03
>>> print(jio.money_num2char(num, sim_or_tra='tra'))

# 壹亿贰仟零肆拾萬贰仟捌佰壹拾點零叁

安装 Installation

  • python>=3.6
$ git clone https://github.com/dongrixinyu/JioNLP
$ cd ./JioNLP
$ pip install .
  • pip 安装 有时存在版本滞后
$ pip install -i https://test.pypi.org/simple/ jionlp

使用 Features

  • 导入工具包,查看工具包的主要功能与函数注释
>>> import jionlp as jio
>>> jio.help()  # 输入关键词搜索工具包是否包含某功能,如输入“回译”
>>> dir(jio)
>>> print(jio.extract_parentheses.__doc__)
  • 在 Linux 系统,可使用以下命令做搜索:
$ jio_help

1.小工具集 Gadgets

功能 函数 描述
查找帮助 help 若不知道 JioNLP 有哪些功能,可根据命令行提示键入若干关键词做搜索
关键短语抽取 extract_keyphrase 给定一篇文本,抽取其对应关键短语
抽取式文本摘要 extract_summary 给定一篇文本,抽取其对应文摘
回译数据增强 BackTranslation 给定一篇文本,采用各大厂云平台的机器翻译接口,实现数据增强
停用词过滤 remove_stopwords 给定一个文本被分词后的词 list,去除其中的停用词
分句 split_sentence 对文本按标点分句。
地址解析 parse_location 给定一个包含国内地址字符串,识别其中的省、市、县区、乡镇街道、村社等信息
电话号码归属地
运营商解析
phone_location
cell_phone_location
landline_phone_location
给定一个电话号码字符串,识别其中的省、市、运营商
新闻地名识别 recognize_location 给定新闻文本,识别其中的国内省、市、县,国外国家、城市等信息
身份证号解析 parse_id_card 给定一个身份证号,识别对应的省、市、县、出生年月、
性别、校验码等信息
成语接龙 idiom_solitaire 成语接龙,即前一成语的尾字和后一成语的首字(读音)相同
色情数据过滤
反动数据过滤
繁体转简体 tra2sim 繁体转简体,支持逐字转最大匹配两种模式
简体转繁体 sim2tra 简体转繁体,支持逐字转最大匹配两种模式
汉字转拼音 pinyin 找出中文文本对应的汉语拼音,并可返回声母韵母声调
汉字转偏旁与字形 char_radical 找出中文文本对应的汉字字形结构信息,
包括偏旁部首(“河”氵)、字形结构(“河”左右结构)、
四角编码(“河”31120)、汉字拆解(“河”水可)
金额数字转汉字 money_num2char 给定一条数字金额,返回其汉字大写结果

2、正则抽取与解析

功能 函数 描述
清洗文本 clean_text 去除文本中的异常字符、冗余字符、HTML标签、括号信息、
URL、E-mail、电话号码,全角字母数字转换为半角
抽取 E-mail extract_email 抽取文本中的 E-mail,返回位置域名
抽取 金额 extract_money 抽取文本中的金额,并将其以数字 + 单位标准形式输出
抽取电话号码 extract_phone_number 抽取电话号码(含手机座机),返回域名类型位置
抽取中国身份证 ID extract_id_card 抽取身份证 ID,配合 jio.parse_id_card 返回身份证的
详细信息(省市县出生日期性别校验码)
抽取 QQ extract_qq 抽取 QQ 号,分为严格规则和宽松规则
抽取 URL extract_url 抽取 URL 超链接
抽取 IP地址 extract_ip_address 抽取 IP 地址
抽取括号中的内容 extract_parentheses 抽取括号内容,包括 {}「」[]【】()()<>《》
删除 E-mail remove_email 删除文本中的 E-mail 信息
删除 URL remove_url 删除文本中的 URL 信息
删除 电话号码 remove_phone_number 删除文本中的电话号码
删除 IP地址 remove_ip_address 删除文本中的 IP 地址
删除 身份证号 remove_id_card 删除文本中的身份证信息
删除 QQ remove_qq 删除文本中的 qq 号
删除 HTML标签 remove_html_tag 删除文本中残留的 HTML 标签
删除括号中的内容 remove_parentheses 删除括号内容,包括 {}「」[]【】()()<>《》
删除异常字符 remove_exception_char 删除文本中异常字符,主要保留汉字、常用的标点,
单位计算符号,字母数字等。

3. 文件读写工具

功能 函数 描述
按行读取文件 read_file_by_iter 以迭代器形式方便按行读取文件,节省内存,
支持指定行数跳过空行
按行读取文件 read_file_by_line 按行读取文件,支持指定行数跳过空行
将 list 中元素按行写入文件 write_file_by_line 将 list 中元素按行写入文件
计时工具 TimeIt 统计某一代码段的耗时

4.词典加载与使用

功能 函数 描述
成语词典 chinese_idiom_loader 加载成语词典
歇后语词典 xiehouyu_loader 加载歇后语词典
中国地名词典 china_location_loader 加载中国省、市、县三级词典
区划调整词典 china_location_change_loader 加载 2018 年以来中国县级以上区划调整更名记录
世界地名词典 world_location_loader 加载世界大洲、国家、城市词典
新华字典 chinese_char_dictionary_loader 加载新华字典
新华词典 chinese_word_dictionary_loader 加载新华词典

5.实体识别(NER)算法辅助工具集

功能 函数 描述
基于词典NER LexiconNER 依据指定的实体词典,前向最大匹配实体
entity 转 tag entity2tag 将 json 格式实体转换为模型处理的 tag 序列
tag 转 entity tag2entity 将模型处理的 tag 序列转换为 json 格式实体
字 token 转词 token char2word 将字符级别 token 转换为词汇级别 token
词 token 转字 token word2char 将词汇级别 token 转换为字符级别 token
比较标注与模型预测的实体差异 entity_compare 针对人工标注的实体,与模型预测出的实体结果
,做差异比对
NER模型预测加速 TokenSplitSentence
TokenBreakLongSentence
TokenBatchBucket
对 NER 模型预测并行加速的方法
分割数据集 analyse_dataset 对 NER 标注语料,分为训练集、验证集、测试集,并给出各个子集的实体类型分布统计

6.文本分类

功能 函数 描述
朴素贝叶斯分析类别词汇 analyse_freq_words 对文本分类的标注语料,做朴素贝叶斯词频分析,返回各类
文本的高条件概率词汇
分割数据集 analyse_dataset 对文本分类的标注语料,切分为训练集、验证集、测试集,
并给出各个子集的分类分布统计

7.情感分析

功能 函数 描述
基于词典情感分析 LexiconSentiment 依据人工构建的情感词典,计算文本的情感值,介于0~1之间

初衷

  • 开发 NLP 模型,预处理至关重要,常常占到工程师大部分时间。本工具包能快速辅助工程师完成各种琐碎的预处理操作,加速开发进度,把有限的精力用在思考而非 code 上。
  • 如有功能建议,可以通过 issue 提出。

做NLP不易,欢迎加入自然语言处理交流群 (#^.^#)

image

Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. Copyright [yyyy] [name of copyright owner] Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

简介

暂无描述 展开 收起
README
Apache-2.0
取消

发行版

暂无发行版

贡献者

全部

近期动态

不能加载更多了
马建仓 AI 助手
尝试更多
代码解读
代码找茬
代码优化
1
https://gitee.com/zhao-han-itp/JioNLP.git
git@gitee.com:zhao-han-itp/JioNLP.git
zhao-han-itp
JioNLP
JioNLP
master

搜索帮助