代码拉取完成,页面将自动刷新
Pytorch-NLU是一个只依赖pytorch、transformers、numpy、tensorboardX,专注于文本分类、序列标注的极简自然语言处理工具包。 支持BERT、ERNIE、ROBERTA、NEZHA、ALBERT、XLNET、ELECTRA、GPT-2、TinyBERT、XLM、T5等预训练模型; 支持BCE-Loss、Focal-Loss、Circle-Loss、Prior-Loss、Dice-Loss、LabelSmoothing等损失函数; 具有依赖轻量、代码简洁、注释详细、调试清晰、配置灵活、拓展方便、适配NLP等特性。
pip install Pytorch-NLU
# 清华镜像源
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple Pytorch-NLU
免责声明:以下数据集由公开渠道收集而成, 只做汇总说明; 科学研究、商用请联系原作者; 如有侵权, 请及时联系删除。
1. 文本分类 (txt格式, 每行为一个json):
多类分类格式:
{"text": "人站在地球上为什么没有头朝下的感觉", "label": "教育"}
{"text": "我的小baby", "label": "娱乐"}
{"text": "请问这起交通事故是谁的责任居多小车和摩托车发生事故在无红绿灯", "label": "娱乐"}
多标签分类格式:
{"label": "3|myz|5", "text": "课堂搞东西,没认真听"}
{"label": "3|myz|2", "text": "测验90-94.A-"}
{"label": "3|myz|2", "text": "长江作业未交"}
2. 序列标注 (txt格式, 每行为一个json):
SPAN格式如下:
{"label": [{"type": "ORG", "ent": "市委", "pos": [10, 11]}, {"type": "PER", "ent": "张敬涛", "pos": [14, 16]}], "text": "去年十二月二十四日,市委书记张敬涛召集县市主要负责同志研究信访工作时,提出三问:『假如上访群众是我们的父母姐妹,你会用什么样的感情对待他们?"}
{"label": [{"type": "PER", "ent": "金大中", "pos": [5, 7]}], "text": "今年2月,金大中新政府成立后,社会舆论要求惩治对金融危机负有重大责任者。"}
{"label": [], "text": "与此同时,作者同一题材的长篇侦破小说《鱼孽》也出版发行。"}
CONLL格式如下:
青 B-ORG
岛 I-ORG
海 I-ORG
牛 I-ORG
队 I-ORG
和 O
更多样例sample详情见/test目录
# !/usr/bin/python
# -*- coding: utf-8 -*-
# !/usr/bin/python
# -*- coding: utf-8 -*-
# @time : 2021/2/23 21:34
# @author : Mo
# @function: 多标签分类, 根据label是否有|myz|分隔符判断是多类分类, 还是多标签分类
# 适配linux
import platform
import json
import sys
import os
path_root = os.path.abspath(os.path.join(os.path.dirname(__file__), "../.."))
sys.path.append(os.path.join(path_root, "pytorch_textclassification"))
print(path_root)
# 分类下的引入, pytorch_textclassification
from tcTools import get_current_time
from tcRun import TextClassification
from tcConfig import model_config
evaluate_steps = 320 # 评估步数
save_steps = 320 # 存储步数
# pytorch预训练模型目录, 必填
pretrained_model_name_or_path = "bert-base-chinese"
# 训练-验证语料地址, 可以只输入训练地址
path_corpus = os.path.join(path_root, "corpus", "text_classification", "school")
path_train = os.path.join(path_corpus, "train.json")
path_dev = os.path.join(path_corpus, "dev.json")
if __name__ == "__main__":
model_config["evaluate_steps"] = evaluate_steps # 评估步数
model_config["save_steps"] = save_steps # 存储步数
model_config["path_train"] = path_train # 训练模语料, 必须
model_config["path_dev"] = path_dev # 验证语料, 可为None
model_config["path_tet"] = None # 测试语料, 可为None
# 损失函数类型,
# multi-class: 可选 None(BCE), BCE, BCE_LOGITS, MSE, FOCAL_LOSS, DICE_LOSS, LABEL_SMOOTH
# multi-label: SOFT_MARGIN_LOSS, PRIOR_MARGIN_LOSS, FOCAL_LOSS, CIRCLE_LOSS, DICE_LOSS等
model_config["path_tet"] = "FOCAL_LOSS"
os.environ["CUDA_VISIBLE_DEVICES"] = str(model_config["CUDA_VISIBLE_DEVICES"])
model_config["pretrained_model_name_or_path"] = pretrained_model_name_or_path
model_config["model_save_path"] = "../output/text_classification/model_{}".format(model_type[idx])
model_config["model_type"] = "BERT"
# main
lc = TextClassification(model_config)
lc.process()
lc.train()
# 适配linux
import platform
import json
import sys
import os
path_root = os.path.abspath(os.path.join(os.path.dirname(__file__), "../.."))
path_sys = os.path.join(path_root, "pytorch_sequencelabeling")
sys.path.append(path_sys)
print(path_root)
print(path_sys)
# 分类下的引入, pytorch_textclassification
from slTools import get_current_time
from slRun import SequenceLabeling
from slConfig import model_config
evaluate_steps = 320 # 评估步数
save_steps = 320 # 存储步数
# pytorch预训练模型目录, 必填
pretrained_model_name_or_path = "bert-base-chinese"
# 训练-验证语料地址, 可以只输入训练地址
path_corpus = os.path.join(path_root, "corpus", "sequence_labeling", "ner_china_people_daily_1998_conll")
path_train = os.path.join(path_corpus, "train.conll")
path_dev = os.path.join(path_corpus, "dev.conll")
if __name__ == "__main__":
model_config["evaluate_steps"] = evaluate_steps # 评估步数
model_config["save_steps"] = save_steps # 存储步数
model_config["path_train"] = path_train # 训练模语料, 必须
model_config["path_dev"] = path_dev # 验证语料, 可为None
model_config["path_tet"] = None # 测试语料, 可为None
# 一种格式 文件以.conll结尾, 或者corpus_type=="DATA-CONLL"
# 另一种格式 文件以.span结尾, 或者corpus_type=="DATA-SPAN"
model_config["corpus_type"] = "DATA-CONLL"# 语料数据格式, "DATA-CONLL", "DATA-SPAN"
model_config["task_type"] = "SL-CRF" # 任务类型, "SL-SOFTMAX", "SL-CRF", "SL-SPAN"
model_config["dense_lr"] = 1e-3 # 最后一层的学习率, CRF层学习率/全连接层学习率, 1e-5, 1e-4, 1e-3
model_config["lr"] = 1e-5 # 学习率, 1e-5, 2e-5, 5e-5, 8e-5, 1e-4, 4e-4
model_config["max_len"] = 156 # 最大文本长度, None和-1则为自动获取覆盖0.95数据的文本长度, 0则取训练语料的最大长度, 具体的数值就是强制padding到max_len
model_config["pretrained_model_name_or_path"] = pretrained_model_name_or_path
model_config["model_save_path"] = "../output/sequence_labeling/model_{}".format(model_type[idx])
model_config["model_type"] = model_type[idx]
# main
lc = SequenceLabeling(model_config)
lc.process()
lc.train()
This library is inspired by and references following frameworks and papers.
For citing this work, you can refer to the present GitHub project. For example, with BibTeX:
@software{Pytorch-NLU,
url = {https://github.com/yongzhuo/Pytorch-NLU},
author = {Yongzhuo Mo},
title = {Pytorch-NLU},
year = {2021}
*希望对你有所帮助!
此处可能存在不合适展示的内容,页面不予展示。您可通过相关编辑功能自查并修改。
如您确认内容无涉及 不当用语 / 纯广告导流 / 暴力 / 低俗色情 / 侵权 / 盗版 / 虚假 / 无价值内容或违法国家有关法律法规的内容,可点击提交进行申诉,我们将尽快为您处理。