# NLP_PEMDC

**Repository Path**: coisini667/NLP_PEMDC

## Basic Information

- **Project Name**: NLP_PEMDC
- **Description**: NLP Pretrained Embeddings, Models and Datasets Collections (NLP_PEMDC). The collection is updated continually.
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2022-01-05
- **Last Updated**: 2022-01-05

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# NLP_PEMDC

**NLP** **P**retrained **E**mbeddings, **M**odels and **D**atasets **C**ollections (**NLP_PEMDC**)

A collection of the pretrained word embeddings, models, and datasets for NLP that I come across. The collection is updated continually. These pretrained word vectors and datasets are intended for learning and research purposes only.

Entries are in no particular order beyond the order in which I added them. Each dataset belongs to its original authors; thanks to them! If anything here infringes your rights, please email me and I will remove it.

# Pretrained Chinese Word Vectors (embeddings):

## Word2vec

1. > ### 100+ Chinese Word Vectors
   >
   > [Github](https://github.com/Embedding/Chinese-Word-Vectors)

2. > ### Tencent AI Lab Embedding Corpus for Chinese Words and Phrases
   >
   > [URL](https://ai.tencent.com/ailab/nlp/embedding.html)

## GloVe

TODO

# Chinese Pre-trained Models

1. > ### Chinese-BERT
   >
   > [Github](https://github.com/google-research/bert)

2. > ### Chinese-BERT-wwm
   >
   > [Github](https://github.com/ymcui/Chinese-BERT-wwm)

3. > ### Chinese-XLNet
   >
   > [Github1](https://github.com/brightmart/xlnet_zh)
   >
   > [Github2](https://github.com/ymcui/Chinese-PreTrained-XLNet)

4. > ### Chinese-RoBERTa
   >
   > [Github](https://github.com/brightmart/roberta_zh)

5. > ### Chinese-ALBERT
   >
   > [Github1](https://github.com/brightmart/albert_zh)
   >
   > [Github2](https://github.com/google-research/ALBERT)

# Chinese Corpus:

1. > ### [Collection] Large Scale Chinese Corpus for NLP
   >
   > [Github](https://github.com/brightmart/nlp_chinese_corpus)

2. > ### [Collection] Sogou Labs Corpus Collection
   >
   > [Corpus data](http://www.sogou.com/labs/resource/list_yuliao.php)

3. > ### [Collection] ChineseNlpCorpus
   >
   > [Github](https://github.com/SophonPlus/ChineseNlpCorpus)

4. > ### [Collection] ChineseGLUE
   >
   > [Github](https://github.com/chineseGLUE/chineseGLUE)
   >
   > Currently includes:
   >
   > 1. ##### LCQMC: Semantic Similarity Task (colloquial descriptions) [COLING 2018](https://www.aclweb.org/anthology/C18-1166/)
   >
   > 2. ##### XNLI: Natural Language Inference [EMNLP 2015](https://www.aclweb.org/anthology/D15-1075/)
   >
   > 3. ##### TNEWS: Short Text Classification for News (Toutiao Chinese news)
   >
   > 4. ##### INEWS: Sentiment Analysis for Internet News
   >
   > 5. ##### THUCNEWS: Long Text Classification
   >
   > 6. ##### iFLYTEK: Long Text Classification
   >
   > 7. ##### DRCD: Reading Comprehension for Traditional Chinese
   >
   > 8. ##### CMRC2018: Reading Comprehension for Simplified Chinese
   >
   > 9. ##### BQ: Question Matching for Customer Service [EMNLP 2018](https://www.aclweb.org/anthology/D18-1536/) [Download](http://icrc.hitsz.edu.cn/Article/show/175.html)
   >
   > 10. ##### MSRANER: Named Entity Recognition
   >
   > 11. ##### CHID: Chinese IDiom Dataset for Cloze Test
   >
   > 12. ##### CMNLI: Chinese Multi-Genre NLI

5. > ### LCSTS: A Large Scale Chinese Short Text Summarization Dataset
   >
   > [arXiv](https://arxiv.org/abs/1506.05865)
   >
   > [Download](http://icrc.hitsz.edu.cn/Article/show/139.html)

6. > ### chinese-poetry: The Most Comprehensive Database of Classical Chinese Poetry and Literature
   >
   > [Github](https://github.com/chinese-poetry/chinese-poetry)

7. > ### SentiBridge: Chinese Entity Sentiment Knowledge Base
   >
   > [Github](https://github.com/rainarch/SentiBridge)

# English Corpus:

1. > ### [Collection] GLUE
   >
   > [Download](https://gluebenchmark.com/tasks)
   >
   > **Including:**
   >
   > 1. The Corpus of Linguistic Acceptability
   > 2. The Stanford Sentiment Treebank
   > 3. Microsoft Research Paraphrase Corpus
   > 4. Semantic Textual Similarity Benchmark
   > 5. Quora Question Pairs
   > 6. MultiNLI Matched
   > 7. MultiNLI Mismatched
   > 8. Question NLI
   > 9. Recognizing Textual Entailment
   > 10. Winograd NLI
   > 11. Diagnostics Main

2. > ### [Collection] SuperGLUE
   >
   > [Download](https://super.gluebenchmark.com/tasks)
   >
   > **Including:**
   >
   > 1. Broadcoverage Diagnostics
   > 2. CommitmentBank
   > 3. Choice of Plausible Alternatives
   > 4. Multi-Sentence Reading Comprehension
   > 5. Recognizing Textual Entailment
   > 6. Words in Context
   > 7. The Winograd Schema Challenge
   > 8. BoolQ
   > 9. Reading Comprehension with Commonsense Reasoning
   > 10. Winogender Schema Diagnostics

3. > ### IMDB Large Movie Review Dataset
   >
   > This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
   >
   > [Download](https://ai.stanford.edu/~amaas/data/sentiment/)

4. > ### SQuAD2.0
   >
   > The Stanford Question Answering Dataset
   >
   > [Website](https://rajpurkar.github.io/SQuAD-explorer/)
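
# Example: Reading a Word-Vector File

The word-vector collections listed under Word2vec are, as far as I know, usually shipped as plain text in the word2vec text layout: a header line `<count> <dim>`, then one entry per line consisting of a token followed by `dim` space-separated floats. That layout is an assumption here, so check the README of each download before relying on it. The sketch below uses only the Python standard library; `load_text_vectors` and `cosine` are hypothetical helper names, and the three sample vectors are made up for illustration.

```python
import io
import math
from typing import Dict, List, TextIO


def load_text_vectors(handle: TextIO, limit: int = 0) -> Dict[str, List[float]]:
    """Parse word2vec-style text: a '<count> <dim>' header, then '<token> v1 ... v_dim'.

    The layout is assumed, not guaranteed by any particular download.
    `limit` > 0 stops after that many entries (the real files can be very large).
    """
    header = handle.readline().split()
    if len(header) < 2:
        raise ValueError("expected a '<count> <dim>' header line")
    dim = int(header[1])
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        # Split from the right so a token that itself contains spaces stays intact.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        try:
            vectors[parts[0]] = [float(value) for value in parts[1:]]
        except ValueError:
            continue  # skip lines whose values are not numeric
        if limit and len(vectors) >= limit:
            break
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Made-up 2-dimensional vectors standing in for a real embedding file.
    sample = io.StringIO("3 2\n猫 1.0 0.0\n狗 0.8 0.6\n车 0.0 1.0\n")
    vecs = load_text_vectors(sample)
    print(sorted(vecs))
    print(round(cosine(vecs["猫"], vecs["狗"]), 3))
```

For a real file, replace the `io.StringIO` sample with `open(path, encoding="utf-8")` and pass a `limit` to keep memory use manageable. A dedicated library such as gensim is likely the better choice for anything beyond quick inspection.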