# UER-py
**Repository Path**: lduml/UER-py
## Basic Information
- **Project Name**: UER-py
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-01-02
- **Last Updated**: 2020-12-19
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# UER-py
[Build Status](https://travis-ci.org/dbiir/UER-py)
[codebeat](https://codebeat.co/projects/github-com-dbiir-uer-py-master)

Pre-training has become an essential part of NLP tasks and has led to remarkable improvements. UER-py (Universal Encoder Representations) is a toolkit for pre-training on a general-domain corpus and fine-tuning on downstream tasks. UER-py maintains model modularity and supports research extensibility. It facilitates the use of different pre-training models (e.g. BERT, GPT, ELMO), and provides interfaces for users to further extend upon. With UER-py, we build a model zoo which contains pre-trained models based on different corpora, encoders, and targets.
#### Update: [BERT pretrained on mixed large Chinese corpus (bert-large 24-layers)](https://share.weiyun.com/5G90sMJ) is available now. The model is pre-trained for 500K steps on top of RoBERTa-wwm-ext-large from https://github.com/ymcui/Chinese-BERT-wwm . It achieves SOTA results on [ChineseGLUE](http://106.13.187.75:8003/index). The detailed scripts are provided in the fine-tuning section.
#### Update: [BERT pretrained on mixed large Chinese corpus (bert-base 12-layers)](https://share.weiyun.com/5QOzPqq) is available now.
#### Update: [ELMO pretrained on mixed large Chinese corpus](https://share.weiyun.com/5Qihztq) is available now. It is much faster than BERT and performs well on many classification datasets. One can fine-tune it with the following options: `--encoder bilstm --config_path models/birnn_config.json --learning_rate 5e-4 --pooling mean` (see the example below).
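For example, fine-tuning this ELMO model on the Douban book review dataset could look like the sketch below. The model file name *models/mixed_corpus_elmo_model.bin* is a hypothetical placeholder for the downloaded weights, and the epoch/batch settings are only illustrative:
```
# models/mixed_corpus_elmo_model.bin is a hypothetical name for the downloaded ELMO weights
python3 run_classifier.py --pretrained_model_path models/mixed_corpus_elmo_model.bin --vocab_path models/google_zh_vocab.txt \
                          --train_path datasets/douban_book_review/train.tsv --dev_path datasets/douban_book_review/dev.tsv --test_path datasets/douban_book_review/test.tsv \
                          --epochs_num 5 --batch_size 64 --encoder bilstm --config_path models/birnn_config.json \
                          --learning_rate 5e-4 --pooling mean
```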
#### Update: [BERT-tiny 12x faster](https://share.weiyun.com/5J0oDBw) and [BERT-small 4x faster](https://share.weiyun.com/5nurvlT) pretrained on mixed large Chinese corpus are now available. One can use them by specifying `--config_path models/bert_tiny_config.json` or `--config_path models/bert_small_config.json` (see the example below).
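Similarly, a hedged fine-tuning sketch for the tiny model (the file name *models/bert_tiny_model.bin* is a hypothetical placeholder for the downloaded weights; swap in *bert_small_config.json* for the small model):
```
# models/bert_tiny_model.bin is a hypothetical name for the downloaded BERT-tiny weights
python3 run_classifier.py --pretrained_model_path models/bert_tiny_model.bin --vocab_path models/google_zh_vocab.txt \
                          --config_path models/bert_tiny_config.json \
                          --train_path datasets/douban_book_review/train.tsv --dev_path datasets/douban_book_review/dev.tsv --test_path datasets/douban_book_review/test.tsv \
                          --epochs_num 3 --batch_size 32 --encoder bert
```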
Table of Contents
=================
* [Features](#features)
* [Requirements](#requirements)
* [Quickstart](#quickstart)
* [Datasets](#datasets)
* [Instructions](#instructions)
* [Scripts](#scripts)
* [Experiments](#experiments)
* [Chinese_model_zoo](#chinese_model_zoo)
## Features
UER-py has the following features:
- __Reproducibility.__ UER-py has been tested on several datasets and matches the performance of the original implementations.
- __Multi-GPU.__ UER-py supports CPU mode, single GPU mode, and distributed training mode.
- __Model modularity.__ UER-py is divided into multiple components: subencoder, encoder, target, and downstream task fine-tuning. Ample modules are implemented in each component. Clear and robust interfaces allow users to combine modules with as few restrictions as possible.
- __Efficiency.__ UER-py refines its pre-processing, pre-training, and fine-tuning stages, which largely improves speed and reduces memory usage.
- __Chinese model zoo.__ We are pre-training models with different corpora, encoders, and targets. Selecting proper pre-trained models is beneficial to the performance of downstream tasks.
- __SOTA results.__ Our models further improve the results of Google BERT, providing new baselines for a range of datasets.
## Requirements
* Python 3.6
* torch >= 1.0
* argparse
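As a minimal environment sketch (assuming pip3 is available; argparse ships with the Python 3 standard library, so only PyTorch usually needs installing):
```
pip3 install "torch>=1.0"
```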
## Quickstart
We use the BERT model and the [Douban book review classification dataset](https://embedding.github.io/evaluation/) to demonstrate how to use UER-py. We first pre-train the model on the book review corpus and then fine-tune it on the classification dataset. There are three input files: book review corpus, book review classification dataset, and vocabulary. All files are encoded in UTF-8 and are included in this project.
The format of the corpus for BERT is as follows:
```
doc1-sent1
doc1-sent2
doc1-sent3
doc2-sent1
doc3-sent1
doc3-sent2
```
The book review corpus is obtained from the book review dataset: we remove the labels and split each review into two parts from the middle (see *book_review_bert.txt* in the *corpora* folder).
The format of the classification dataset is as follows (label and instance are separated by \t):
```
label text_a
1 instance1
0 instance2
1 instance3
```
We use Google's Chinese vocabulary file, which contains 21128 Chinese characters. The format of the vocabulary is as follows:
```
word-1
word-2
...
word-n
```
First of all, we preprocess the book review corpus. We need to specify the model's target in the pre-processing stage (--target):
```
python3 preprocess.py --corpus_path corpora/book_review_bert.txt --vocab_path models/google_zh_vocab.txt --dataset_path dataset.pt \
--processes_num 8 --target bert
```
Pre-processing is time-consuming. Using multiple processes can largely accelerate it (--processes_num). The raw text is converted to dataset.pt, which is the input of pretrain.py. Then we download [Google's pre-trained Chinese model](https://share.weiyun.com/5s9AsfQ) and put it into the *models* folder. We load Google's pre-trained model and further train it on the book review corpus. It is recommended to explicitly specify the model's encoder (--encoder) and target (--target). Suppose we have a machine with 8 GPUs:
```
python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt --pretrained_model_path models/google_model.bin \
--output_model_path models/book_review_model.bin --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
--total_steps 20000 --save_checkpoint_steps 5000 --encoder bert --target bert
mv models/book_review_model.bin-20000 models/book_review_model.bin
```
Notice that the model trained by *pretrain.py* is saved with a suffix that records the training step. We can remove the suffix for ease of use.
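If 8 GPUs are not available, a single-GPU run of the same stage would, under the project's distributed-training conventions, look roughly like the sketch below (same file names as above; the exact flag combination is an assumption to be checked against the full instructions):
```
# Single-GPU sketch: --world_size 1 with --gpu_ranks 0 selects one GPU (assumed convention)
python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt --pretrained_model_path models/google_model.bin \
                    --output_model_path models/book_review_model.bin --world_size 1 --gpu_ranks 0 \
                    --total_steps 20000 --save_checkpoint_steps 5000 --encoder bert --target bert
```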
Finally, we do classification. We can use *google_model.bin*:
```
python3 run_classifier.py --pretrained_model_path models/google_model.bin --vocab_path models/google_zh_vocab.txt \
--train_path datasets/douban_book_review/train.tsv --dev_path datasets/douban_book_review/dev.tsv --test_path datasets/douban_book_review/test.tsv \
--epochs_num 3 --batch_size 32 --encoder bert
```
or use our [*book_review_model.bin*](https://share.weiyun.com/52BEFs2), which is the output of pretrain.py:
```
python3 run_classifier.py --pretrained_model_path models/book_review_model.bin --vocab_path models/google_zh_vocab.txt \
--train_path datasets/douban_book_review/train.tsv --dev_path datasets/douban_book_review/dev.tsv --test_path datasets/douban_book_review/test.tsv \
--epochs_num 3 --batch_size 32 --encoder bert
```
It turns out that the accuracy of Google's model is 87.5, while *book_review_model.bin* achieves 88.1. It is also noticeable that we don't need to specify the target in the fine-tuning stage: the pre-training target is replaced with a task-specific target.
BERT includes a next sentence prediction (NSP) target. However, the NSP target is not suitable for sentence-level reviews, since we have to split a review into two parts to construct sentence pairs. UER-py facilitates the use of different targets. Using masked language modeling (MLM) as the target is a more suitable choice for pre-training on reviews:
```
python3 preprocess.py --corpus_path corpora/book_review.txt --vocab_path models/google_zh_vocab.txt --dataset_path dataset.pt \
--processes_num 8 --target mlm
python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt --pretrained_model_path models/google_model.bin \
--output_model_path models/book_review_mlm_model.bin --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
--total_steps 20000 --save_checkpoint_steps 5000 --encoder bert --target mlm
mv models/book_review_mlm_model.bin-20000 models/book_review_mlm_model.bin
python3 run_classifier.py --pretrained_model_path models/book_review_mlm_model.bin --vocab_path models/google_zh_vocab.txt \
--train_path datasets/douban_book_review/train.tsv --dev_path datasets/douban_book_review/dev.tsv --test_path datasets/douban_book_review/test.tsv \
--epochs_num 3 --batch_size 32 --encoder bert
```
It turns out that the result of [*book_review_mlm_model.bin*](https://share.weiyun.com/5ScDjUO) is 88.3.
We could search for proper pre-trained models in the [Chinese model zoo](#chinese_model_zoo) for further improvements. For example, we could download [a model pre-trained on the Amazon corpus (over 4 million reviews) with BERT encoder and classification (CLS) target](https://share.weiyun.com/5XuxtFA). It achieves 88.5 accuracy on the book review dataset.
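Fine-tuning with the downloaded model follows the same pattern as above. The local file name *models/amazon_cls_model.bin* used here is a hypothetical placeholder for wherever the download is saved:
```
# models/amazon_cls_model.bin is a hypothetical name for the downloaded Amazon CLS model
python3 run_classifier.py --pretrained_model_path models/amazon_cls_model.bin --vocab_path models/google_zh_vocab.txt \
                          --train_path datasets/douban_book_review/train.tsv --dev_path datasets/douban_book_review/dev.tsv --test_path datasets/douban_book_review/test.tsv \
                          --epochs_num 3 --batch_size 32 --encoder bert
```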
BERT is really slow. It would be great if we could speed up the model while still achieving competitive performance. We select a 2-layer LSTM encoder as a substitute for the 12-layer Transformer encoder. We can download [a model pre-trained with an LSTM encoder and language modeling (LM) + classification (CLS) targets](https://share.weiyun.com/5B671Ik):
```
python3 run_classifier.py --pretrained_model_path models/lstm_reviews_model.bin --vocab_path models/google_zh_vocab.txt \
--train_path datasets/douban_book_review/train.tsv --dev_path datasets/douban_book_review/dev.tsv --test_path datasets/douban_book_review/test.tsv \
--epochs_num 3 --batch_size 64 --encoder lstm --pooling mean --config_path models/rnn_config.json --learning_rate 1e-3
```
We achieve 86.5 accuracy on the test set, which is also a competitive result. Using an LSTM without pre-training only achieves 80.2 accuracy. In practice, the above model is around 10 times faster than BERT. See the Chinese model zoo section for more detailed information about this pre-trained LSTM model.
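For reference, the 80.2 baseline corresponds to training the LSTM classifier from scratch. Assuming --pretrained_model_path is optional in run_classifier.py (so that omitting it gives a randomly initialized encoder), the command might look like:
```
# Random initialization: no --pretrained_model_path (assumes the flag is optional)
python3 run_classifier.py --vocab_path models/google_zh_vocab.txt \
                          --train_path datasets/douban_book_review/train.tsv --dev_path datasets/douban_book_review/dev.tsv --test_path datasets/douban_book_review/test.tsv \
                          --epochs_num 3 --batch_size 64 --encoder lstm --pooling mean --config_path models/rnn_config.json --learning_rate 1e-3
```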
Besides classification, UER-py also provides scripts for other downstream tasks. We can use *run_ner.py* for named entity recognition:
```
python3 run_ner.py --pretrained_model_path models/google_model.bin --vocab_path models/google_zh_vocab.txt \
--train_path datasets/msra_ner/train.tsv --dev_path datasets/msra_ner/dev.tsv --test_path datasets/msra_ner/test.tsv \
--epochs_num 5 --batch_size 16 --encoder bert
```
We could download [a model pre-trained on RenMinRiBao (also known as People's Daily, a news corpus)](https://share.weiyun.com/5JWVjSE) and fine-tune on it:
```
python3 run_ner.py --pretrained_model_path models/rmrb_model.bin --vocab_path models/google_zh_vocab.txt \
--train_path datasets/msra_ner/train.tsv --dev_path datasets/msra_ner/dev.tsv --test_path datasets/msra_ner/test.tsv \
--epochs_num 5 --batch_size 16 --encoder bert
```
It turns out that the F1 of Google's model is 92.6, while *rmrb_model.bin* achieves 94.4.
## Datasets
This project includes a range of Chinese datasets: XNLI, LCQMC, MSRA-NER, ChnSentiCorp, and NLPCC-DBQA are obtained from [Baidu ERNIE](https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE); Douban book review is obtained from [BNU](https://embedding.github.io/evaluation/); the online shopping review datasets are organized by ourselves; THUCNews is obtained from [here](https://github.com/gaussic/text-classification-cnn-rnn); Sina Weibo review is obtained from [here](https://github.com/SophonPlus/ChineseNlpCorpus). More large-scale datasets can be found in [glyph's github project](https://github.com/zhangxiangxiao/glyph).
## Scripts
UER-py provides the following tool scripts in the *scripts* folder:

Script | Function description
--- | ---
average_model.py | Take the average of pre-trained models. A frequently-used ensemble strategy for deep learning models
build_vocab.py | Build vocabulary (multi-processing supported)
check_model.py | Check the model (single GPU or multiple GPUs)
cloze_test.py | Randomly mask a word and predict it; top n words are returned
convert_bert_from_uer_to_google.py | Convert BERT from UER format to Google format (TF)
convert_bert_from_uer_to_huggingface.py | Convert BERT from UER format to Huggingface format (PyTorch)
convert_bert_from_google_to_uer.py | Convert BERT from Google format (TF) to UER format
convert_bert_from_huggingface_to_uer.py | Convert BERT from Huggingface format (PyTorch) to UER format
diff_vocab.py | Compare two vocabularies
dynamic_vocab_adapter.py | Change the pre-trained model according to the vocabulary. It can save memory in the fine-tuning stage since the task-specific vocabulary is much smaller than the general-domain vocabulary
extract_embedding.py | Extract the embeddings of the pre-trained model
extract_feature.py | Extract the hidden states of the last encoder layer of the pre-trained model
topn_words_indep.py | Find nearest neighbours with context-independent word embeddings
topn_words_dep.py | Find nearest neighbours with context-dependent word embeddings
### Cloze test
cloze_test.py predicts masked words. Top n words are returned.
```
usage: cloze_test.py [-h] [--pretrained_model_path PRETRAINED_MODEL_PATH]
[--vocab_path VOCAB_PATH] [--input_path INPUT_PATH]
[--output_path OUTPUT_PATH] [--config_path CONFIG_PATH]
[--batch_size BATCH_SIZE] [--seq_length SEQ_LENGTH]
[--encoder {bert,lstm,gru,cnn,gatedcnn,attn,rcnn,crnn,gpt}]
[--bidirectional] [--target {bert,lm,cls,mlm,nsp,s2s}]
[--subword_type {none,char}] [--sub_vocab_path SUB_VOCAB_PATH]
[--subencoder_type {avg,lstm,gru,cnn}]
[--tokenizer {bert,char,word,space}] [--topn TOPN]
```
An example of using cloze_test.py:
```
python3 scripts/cloze_test.py --input_path datasets/cloze_input.txt --pretrained_model_path models/google_model.bin \
--vocab_path models/google_zh_vocab.txt --output_path output.txt
```
### Feature extractor
extract_feature.py extracts hidden states of the last encoder layer.
```
usage: extract_feature.py [-h] --input_path INPUT_PATH --pretrained_model_path
PRETRAINED_MODEL_PATH --vocab_path VOCAB_PATH
--output_path OUTPUT_PATH [--seq_length SEQ_LENGTH]
[--batch_size BATCH_SIZE]
[--config_path CONFIG_PATH]
[--embedding {bert,word}]
[--encoder {bert,lstm,gru,cnn,gatedcnn,attn,rcnn,crnn,gpt}]
[--bidirectional] [--subword_type {none,char}]
[--sub_vocab_path SUB_VOCAB_PATH]
[--subencoder {avg,lstm,gru,cnn}]
[--sub_layers_num SUB_LAYERS_NUM]
[--tokenizer {bert,char,space}]
```
An example of using extract_feature.py:
```
python3 scripts/extract_feature.py --input_path datasets/cloze_input.txt --vocab_path models/google_zh_vocab.txt \
--pretrained_model_path models/google_model.bin --output_path feature_output.pt
```
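The saved features can then be inspected from Python. A minimal sketch, assuming *feature_output.pt* is a standard PyTorch-serialized object (its exact structure is not documented here):
```
# Inspect the serialized features (assumes a torch.save'd object)
python3 -c "import torch; feats = torch.load('feature_output.pt'); print(type(feats))"
```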
### Finding nearest neighbours
Pre-trained models can learn high-quality word embeddings. Traditional word embeddings such as word2vec and GloVe assign each word a fixed vector (context-independent word embedding). However, polysemy is a pervasive phenomenon in human language, and the meanings of a polysemous word depend on the context. To this end, we use the hidden state in pre-trained models to represent a word. It is noticeable that Google BERT is a character-based model. To obtain real word embeddings (not character embeddings), users should download our [word-based BERT model](https://share.weiyun.com/5s4HVMi) and [vocabulary](https://share.weiyun.com/5NWYbYn).
An example of using scripts/topn_words_indep.py to find nearest neighbours with context-independent word embeddings (character-based and word-based models):
```
python3 scripts/topn_words_indep.py --pretrained_model_path models/google_model.bin --vocab_path models/google_zh_vocab.txt \
--cand_vocab_path models/google_zh_vocab.txt --target_words_path target_words.txt
python3 scripts/topn_words_indep.py --pretrained_model_path models/bert_wiki_word_model.bin --vocab_path models/wiki_word_vocab.txt \
--cand_vocab_path models/wiki_word_vocab.txt --target_words_path target_words.txt
```
Context-independent word embeddings are obtained from the model's embedding layer.
The format of the target_words.txt is as follows:
```
word-1
word-2
...
word-n
```
An example of using scripts/topn_words_dep.py to find nearest neighbours with context-dependent word embeddings (character-based and word-based models):
```
python3 scripts/topn_words_dep.py --pretrained_model_path models/google_model.bin --vocab_path models/google_zh_vocab.txt \
--cand_vocab_path models/google_zh_vocab.txt --sent_path target_words_with_sentences.txt --config_path models/bert_base_config.json \
--batch_size 256 --seq_length 32 --tokenizer bert
python3 scripts/topn_words_dep.py --pretrained_model_path models/bert_wiki_word_model.bin --vocab_path models/wiki_word_vocab.txt \
--cand_vocab_path models/wiki_word_vocab.txt --sent_path target_words_with_sentences.txt --config_path models/bert_base_config.json \
--batch_size 256 --seq_length 32 --tokenizer space
```
We substitute the target word with other words in the vocabulary and feed the sentences into the pre-trained model. The hidden state is used as the context-dependent embedding of a word. Users should do word segmentation manually and use the space tokenizer if a word-based model is used. The format of target_words_with_sentences.txt is as follows:
```
sent1 word1
sent2 word2
...
sentn wordn
```
Sentence and word are separated by \t.
### Text generator
We could use *generate.py* to generate text. Given a few words or sentences, *generate.py* can continue writing. An example of using *generate.py*:
```
python3 scripts/generate.py --pretrained_model_path models/gpt_model.bin --vocab_path models/google_zh_vocab.txt \
--input_path story_beginning.txt --output_path story_full.txt --config_path models/bert_base_config.json \
--encoder gpt --target lm --seq_length 128
```
where *story_beginning.txt* contains the beginning of a text. One can use any model pre-trained with the LM target, such as [GPT trained on mixed large corpus](https://share.weiyun.com/51nTP8V). For now we only provide a vanilla version of the generator; more mechanisms will be added for better performance and efficiency.
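As a hypothetical illustration, *story_beginning.txt* would simply hold the plain-text prefix to be continued, e.g. a single opening sentence (roughly, "Last night, I had a strange dream."):
```
昨天晚上，我做了一个奇怪的梦。
```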
## Chinese_model_zoo
We release the following pre-trained models with different corpora, encoders, and targets:

Pre-trained model | Link | Description
--- | --- | ---
Wikizh+BertEncoder+BertTarget | https://share.weiyun.com/5s9AsfQ | The training corpus is Wiki_zh, trained by Google
Wikizh(word-based)+BertEncoder+BertTarget | Model: https://share.weiyun.com/5s4HVMi Vocab: https://share.weiyun.com/5NWYbYn | Word-based BERT model trained on Wikizh. Training steps: 500,000
RenMinRiBao+BertEncoder+BertTarget | https://share.weiyun.com/5JWVjSE | The training corpus is news data from People's Daily (1946-2017). It is suitable for news-related datasets, e.g. F1 on MSRA-NER is improved from 92.6 to 94.4 (compared with Google BERT). Training steps: 500,000
Webqa2019+BertEncoder+BertTarget | https://share.weiyun.com/5HYbmBh | The training corpus is WebQA, which is suitable for datasets related with social media, e.g. accuracy (dev/test) on LCQMC is improved from 88.8/87.0 to 89.6/87.4 and on XNLI from 78.1/77.2 to 79.0/78.8 (compared with Google BERT). Training steps: 500,000
Weibo+BertEncoder+BertTarget | https://share.weiyun.com/5ZDZi4A | The training corpus is Weibo. Training steps: 200,000
Mixedlarge corpus+GptEncoder+LmTarget | https://share.weiyun.com/51nTP8V | Mixedlarge corpus contains baidubaike + wiki + webqa + RenMinRiBao + literature + reviews. Training steps: 500,000 (sequence length 128) + 100,000 (sequence length 512)
Google-BERT-en-uncased-base | Model: https://share.weiyun.com/5hWivED Vocab: https://share.weiyun.com/5gBxBYD | Provided by Google.
Google-BERT-en-cased-base | Model: https://share.weiyun.com/5SltATz Vocab: https://share.weiyun.com/5ouUo2q | Provided by Google.
Reviews+LstmEncoder+LmTarget | https://share.weiyun.com/57dZhqo | The training corpus is Amazon reviews + JDbinary reviews + Dianping reviews (11.4M reviews in total). The language model target is used. It is suitable for review-related datasets and achieves over 5 percent improvement on some review datasets compared with random initialization. Set hidden_size in models/rnn_config.json to 512 before using it. Training steps: 200,000; sequence length: 128
(Mixedlarge corpus & Amazon reviews)+LstmEncoder+(LmTarget & ClsTarget) | https://share.weiyun.com/5B671Ik | The model is first trained on Mixedlarge corpus (baidubaike + wiki + webqa + RenMinRiBao) with the language model target, and then trained on Amazon reviews with the language model and classification targets. It is suitable for review-related datasets and can achieve results comparable with BERT on some of them. Training steps: 500,000 + 100,000; sequence length: 128
IfengNews+BertEncoder+BertTarget | https://share.weiyun.com/5HVcUWO | The training corpus is news data from the Ifeng website. We use news titles to predict news abstracts. Training steps: 100,000; sequence length: 128
jdbinary+BertEncoder+ClsTarget | https://share.weiyun.com/596k2bu | The training corpus is review data from JD (Jingdong). The classification target is used for pre-training. It is suitable for shopping-review datasets, e.g. accuracy on shopping datasets is improved from 96.3 to 97.2 (compared with Google BERT). Training steps: 50,000; sequence length: 128
jdfull+BertEncoder+MlmTarget | https://share.weiyun.com/5L6EkUF | The training corpus is review data from JD (Jingdong). The masked LM target is used for pre-training. Training steps: 50,000; sequence length: 128
Amazonreview+BertEncoder+ClsTarget | https://share.weiyun.com/5XuxtFA | The training corpus is review data from Amazon (including book reviews, movie reviews, etc.). The classification target is used for pre-training. It is suitable for review-related datasets, e.g. accuracy on the Douban book review dataset is improved from 87.6 to 88.5 (compared with Google BERT). Training steps: 20,000; sequence length: 128
XNLI+BertEncoder+ClsTarget | https://share.weiyun.com/5oXPugA | Infersent with BertEncoder
We release classification models on five large-scale datasets, i.e. Ifeng, Chinanews, Dianping, JDbinary, and JDfull. Users can use these models to reproduce results, or regard them as pre-training models for other datasets.
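As a hedged sketch of the second use case (both the local model file name and the dataset paths below are hypothetical placeholders), such a model would be passed to run_classifier.py like any other pre-trained weights:
```
# Both the model file name and the dataset paths are hypothetical placeholders
python3 run_classifier.py --pretrained_model_path models/dianping_classifier_model.bin --vocab_path models/google_zh_vocab.txt \
                          --train_path datasets/your_dataset/train.tsv --dev_path datasets/your_dataset/dev.tsv --test_path datasets/your_dataset/test.tsv \
                          --epochs_num 3 --batch_size 32 --encoder bert
```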