# Contents
We release the code to explore the frontier of training large models with billions or even trillions of parameters. Leveraging MindSpore's parallel features, we adopt efficient model-parallel and data-parallel technologies, such as operator-level parallelism, to minimize communication cost and maximize computation efficiency. The code scales to thousands of NPUs and trillions of parameters with little modification.
Meanwhile, we apply our parallel training to a language model named PanGu-Alpha to demonstrate that large models can be trained easily with our parallel setting. The training tricks we adopt include data parallelism, operator-level model parallelism, optimizer parallelism, recomputation and pipeline parallelism.
The above features can be found here. More amazing features are still under development.
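As a quick illustration of how these parallel features are turned on in MindSpore, the sketch below sets the relevant context options and an operator-level shard strategy. It is a minimal example under assumed settings (8 Ascend devices, placeholder shard numbers), not the training entry point of this repository.

```python
# Minimal sketch of enabling MindSpore parallel features (not this repo's train.py).
# Assumes an 8-device Ascend setup; the shard strategy below is only illustrative.
from mindspore import context, ops
from mindspore.context import ParallelMode

# The model supports only the graph mode (as noted in the environment table below).
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")

# Semi-auto parallel: operators follow user-specified shard strategies.
context.set_auto_parallel_context(
    parallel_mode=ParallelMode.SEMI_AUTO_PARALLEL,
    device_num=8,
    full_batch=True,                 # every rank reads the full batch
    enable_parallel_optimizer=True,  # optimizer (weight) parallelism
    pipeline_stages=1,               # >1 enables pipeline parallelism
)

# Operator-level model parallelism: split the second matmul input across 8 devices.
matmul = ops.MatMul().shard(((1, 1), (1, 8)))
```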
The technical report and checkpoint file can be found here.
The architecture of PanGu-α is based on Transformer, which has been extensively used as the backbone of a variety of pretrained language models such as BERT and GPT. Different from them, we develop an additional query layer on top of the Transformer layers to predict the next token. The diagram of the model is shown in Figure 1.
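To make the query-layer idea concrete, here is a minimal NumPy sketch (not the released MindSpore implementation): the attention queries come from a learned positional top-query embedding, while the keys and values come from the last Transformer layer's output.

```python
# Minimal NumPy sketch of the query-layer idea (illustrative; not the MindSpore code).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def query_layer(hidden, top_query_embedding, w_q, w_k, w_v):
    """hidden: [seq, dim] output of the last Transformer layer.
    top_query_embedding: [seq, dim] learned embedding indexed by position."""
    q = top_query_embedding @ w_q          # queries from the positional top-query embedding
    k = hidden @ w_k                       # keys from the Transformer output
    v = hidden @ w_v                       # values from the Transformer output
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.triu(np.ones_like(scores), k=1) * -1e9   # causal mask: attend only to earlier positions
    return softmax(scores + mask) @ v      # [seq, dim] features used to predict the next token

seq, dim = 8, 16
rng = np.random.default_rng(0)
out = query_layer(rng.normal(size=(seq, dim)), rng.normal(size=(seq, dim)),
                  *(rng.normal(size=(dim, dim)) for _ in range(3)))
print(out.shape)  # (8, 16)
```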
The above dataset is preprocessed into examples of 1024 tokens each. The default column key in dataset.py is input_ids.
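For reference, a mindrecord file produced this way can be inspected with a few lines of MindSpore code; the output file name below is a placeholder.

```python
# Minimal sketch for inspecting the generated mindrecord data (file name is a placeholder).
import mindspore.dataset as ds

dataset = ds.MindDataset(
    dataset_files="output/mindrecord_0",   # hypothetical output file name
    columns_list=["input_ids"],            # must match the column key used in dataset.py
    shuffle=False,
)
for item in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
    print(item["input_ids"].shape)         # token ids of one example
    break
```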
The following table gives a description of the tested environment, scripts and MindSpore version. Note the model supports only the graph mode.
Parallel Mode | MindSpore Version | GPU(V100) | Ascend (Ascend 910) |
---|---|---|---|
data parallel | 1.5.0 | Supported | Supported |
model parallel | 1.5.0 | Supported | Supported |
optimizer parallel | 1.5.0 | Supported | Supported |
recompute | 1.5.0 | Supported | Supported |
pipeline parallel | 1.5.0 | Not Supported | Supported |
To obtain the pangu_alpha scripts, you need git to clone the MindSpore repository as follows:
git clone https://gitee.com/mindspore/mindspore.git -b master
cd mindspore/model_zoo/official/nlp/pangu_alpha
For requirements, please refer to Requirements to install the dependencies.
As the format of downstream tasks can vary, preprocess.py provides a basic example of how to process your raw text files. Please prepare your data in the following format, where each line of each file is a piece of continuous text:
今天是一个好天气,小明很高兴的背起书包上学去。但是...
突然刮起了狂风暴雨!
Suppose the text data is under ./data and each text file ends with 'txt'; we can run the following command to generate mindrecord files with seq_length=1025.
python -m src.preprocess --input_glob 'data/*.txt' --tokenizer gpt --eot 50256 --data_column_name input_ids --seq_length 1025
The script will chunk each line into pieces of 1025 tokens. Chunks with fewer than 1025 tokens will be ignored.
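The chunking rule can be illustrated with a few lines of Python; this is only a sketch of the logic described above, not the code in src/preprocess.py.

```python
# Illustrative sketch of the chunking rule described above (not src/preprocess.py itself).
def chunk_tokens(token_ids, seq_length=1025):
    """Split one tokenized line into fixed-length chunks; drop the short remainder."""
    chunks = []
    for start in range(0, len(token_ids), seq_length):
        piece = token_ids[start:start + seq_length]
        if len(piece) == seq_length:   # chunks with fewer than 1025 tokens are ignored
            chunks.append(piece)
    return chunks

print(len(chunk_tokens(list(range(2300)))))  # 2 full chunks; the 250-token tail is dropped
```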
The output files are under ./output. The default tokenizer is the tokenizer from transformers. Note that the vocab_size is determined by the vocab file.
The --tokenizer option can be gpt (requires transformers) or jieba. Note that the gpt tokenizer requires transformers together with pytorch or tensorflow. The jieba tokenizer requires two additional files, vocab.model and vocab.vocab. Click here to download them.
For users who want to do incremental training on the checkpoints released by PCL-Platform, please download vocab.model and vocab.vocab from here. Then run the following command to tokenize the raw text with the same vocab used for pre-training (using the jieba tokenizer).
python -m src.preprocess --input_glob data/*.txt --tokenizer jieba --vocab_file vocab.vocab --model_file vocab.model --eot 6
The vocab size of vocab.vocab is 40000, and the eod id is 6.
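For intuition, the jieba tokenization path can be approximated with the jieba and sentencepiece libraries directly; this is only a sketch of the idea, not the tokenizer shipped in src/tokenization_jieba.py, and the vocab path is a placeholder.

```python
# Rough sketch of jieba-based tokenization with a sentencepiece vocab
# (illustrative only; the repository ships its own src/tokenization_jieba.py).
import jieba
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("vocab.model")                    # the released vocab.model (path is a placeholder)

text = "今天是一个好天气,小明很高兴的背起书包上学去。"
segmented = " ".join(jieba.cut(text))     # word segmentation with jieba
ids = sp.EncodeAsIds(segmented)           # map segmented text to ids from the 40000-word vocab
ids.append(6)                             # append the eod id described above
print(ids[:10])
```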
Currently the scripts provide three default configurations: 2.6B, 13B and 200B. The following command will start training the 2.6B model on 8 Ascend cards.
# run distributed training example
bash scripts/run_distribute_train.sh DATASET RANK_TABLE RANK_SIZE TYPE MODE STAGE_NUM MICRO_SIZE PER_BATCH RANK_START
#example:
bash scripts/run_distribute_train.sh /data/pangu_30_step_ba64/ /root/hccl_8p.json 8 fp32 2.6B 1 1 8 0 8
The above command involves some args described below:

- DATASET: the path to the mindrecord files, e.g. /home/work/mindrecord/.
- RANK_TABLE: the hccl rank table json file, which describes the device id, service ip and rank of each device.
- RANK_SIZE: the total number of devices.
- TYPE: the parameter initialization type, fp32 by default; it can be replaced with fp16. This will save a little memory used on the device.
- MODE: the configuration mode. 2.6B sets the hidden size and layers to make the parameter number near 2.6 billion. The other modes are 13B (hidden size 5120 and layers 40, which needs at least 16 cards to train) and 200B. (A quick parameter-count check is given after this list.)
- STAGE_NUM: when stage_num is larger than 1, the pipeline parallel mode is applied. This value indicates the number of sub-graphs in pipeline parallel mode.
- MICRO_SIZE: the number of micro batches in pipeline parallel mode; it should be larger than stage_num.
- PER_BATCH: the batch size for each data-parallel way.
- RANK_START: the start rank id of the current machine.
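As a quick sanity check of the MODE names, the rough parameter-count arithmetic can be written out. The 13B numbers (hidden 5120, 40 layers) are stated above; the 2.6B numbers (hidden 2560, 32 layers) and the 40000 vocab size are assumptions taken from the released configuration.

```python
# Rough parameter-count check for the MODE names (decoder-only Transformer estimate).
# Assumptions: vocab_size 40000; 2.6B uses hidden 2560 / 32 layers; 13B uses hidden 5120 / 40 layers.
def approx_params(hidden, layers, vocab=40000):
    per_layer = 12 * hidden * hidden      # attention (~4 h^2) + feed-forward (~8 h^2), biases ignored
    embeddings = vocab * hidden           # word embedding table
    return layers * per_layer + embeddings

print(f"2.6B mode ~ {approx_params(2560, 32) / 1e9:.2f}B parameters")   # ~2.62B
print(f"13B  mode ~ {approx_params(5120, 40) / 1e9:.2f}B parameters")   # ~12.79B
```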
The following commands will train the 2.6B model on one or two Ascend machines:
# run distributed training example in one ascend machine
bash scripts/run_distribute_train.sh /path/dataset /path/hccl.json 8 fp32 2.6B 1 1 8 0 8
# run distributed training example in two ascend machines
# machine A
bash scripts/run_distribute_train.sh /path/dataset /path/hccl.json 16 fp32 2.6B 2 4 8 0 8
# machine B
bash scripts/run_distribute_train.sh /path/dataset /path/hccl.json 16 fp32 2.6B 2 4 8 8 8
For distributed training, an hccl configuration file in JSON format needs to be created in advance. Please follow the instructions at https://gitee.com/mindspore/models/tree/r1.5/utils/hccl_tools.
The script will launch the GPU training through mpirun; the user can run the following command on any machine to start training.
bash scripts/run_distributed_train_gpu.sh RANK_SIZE HOSTFILE DATASET PER_BATCH MODE
The above command involves some args described below:

- RANK_SIZE: the total number of devices.
- HOSTFILE: a text file describing the host ip and the number of devices (slots) on each host, as used by mpirun.
- DATASET: the path to the mindrecord files, e.g. /home/work/mindrecord/.
- PER_BATCH: the batch size for each data-parallel way.
- MODE: can be 2.6B, 13B or 200B.

Currently the scripts provide three default configurations: 2.6B, 13B and 200B. Currently, only the Ascend device target is supported.
# run distributed training example
bash scripts/run_distribute_train_moe_host_device.sh DATASET RANK_TABLE RANK_SIZE TYPE MODE STAGE_NUM MICRO_SIZE PER_BATCH RANK_START LOCAL_DEVICE_NUM EXPERT_NUM_PER_EP
The above command involves some args described below:

- DATASET: the path to the mindrecord files, e.g. /home/work/mindrecord/.
- RANK_TABLE: the hccl rank table json file, which describes the device id, service ip and rank of each device.
- RANK_SIZE: the total number of devices.
- TYPE: the parameter initialization type, fp32 by default; it can be replaced with fp16. This will save a little memory used on the device.
- MODE: the configuration mode. 2.6B sets the hidden size and layers to make the parameter number near 2.6 billion. The other modes are 13B (hidden size 5120 and layers 40, which needs at least 16 cards to train) and 200B.
- STAGE_NUM: when stage_num is larger than 1, the pipeline parallel mode is applied. This value indicates the number of sub-graphs in pipeline parallel mode.
- MICRO_SIZE: the number of micro batches in pipeline parallel mode; it should be larger than stage_num.
- PER_BATCH: the batch size for each data-parallel way.
- RANK_START: the start rank id of the current machine.
- LOCAL_DEVICE_NUM: the number of devices on the local machine.
- EXPERT_NUM_PER_EP: the number of experts in one data-parallel dimension.
The following command will train a 60B model using 8 NPUs. Mode 2.6B only means that the base model configuration is the same as the 2.6B model without MoE. Running the 60B model on 8 NPUs in one server requires that the server has at least 1 TB of host memory.
# run distributed training example in one ascend machine
bash run_distributed_train_moe_host_device.sh /path/dataset /path/hccl.json 8 fp32 2.6B 1 1 1 0 8 36
Before we start incremental training, the following two steps must be done:

1. Process the dataset with the vocab released for pre-training, as described in the dataset generation section above.
2. Download the checkpoint and strategy files according to the [Download Checkpoint](#Download Checkpoint). Each host should own the complete checkpoint files.

Then run the following command to start incremental training with the 2.6B configuration:
export FILE_PATH=/home/your_path/ckpts
bash scripts/run_distribute_incremental_train.sh DATASET RANK_TABLE 8 fp32 2.6B 8 ${FILE_PATH}/strategy_load_ckpt/strategy.ckpt ${FILE_PATH}/checkpoint_file filitered
Please refer to the website to download the following parts:
Here we suppose the downloaded checkpoint, tokenizer and strategy files are organized as follows:
ckpts
├── checkpoint_file
│ ├── filtered_*.ckpt
│ ├── word_embedding.npy
│ ├── top_query_embedding.npy
│ └── position_embedding.npy
├── strategy_load_ckpt
│ └── strategy.ckpt
└── tokenizer
├── vocab10.model
└── vocab10.vocab
We provide two predict methods. The first one is the normal way, which needs to pad the input to a certain length in every iteration. Due to the redundant computation, the latency of this method is quite high. To accelerate inference, we provide a second method based on state reuse (incremental inference).
The state reuse method is the default mode; you can disable it by setting the argument use_past to False.
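To illustrate why state reuse reduces latency, the sketch below caches attention keys and values so each new token only computes attention for one position. It is a generic NumPy illustration of incremental inference, not the use_past implementation in this repository.

```python
# Generic NumPy illustration of state reuse (incremental inference): keys/values of
# previous tokens are cached, so each step attends with a single new query vector.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class AttentionCache:
    def __init__(self, dim):
        self.keys = np.zeros((0, dim))
        self.values = np.zeros((0, dim))

    def step(self, q, k, v):
        """Append the new token's key/value, then attend over everything cached so far."""
        self.keys = np.concatenate([self.keys, k[None, :]])
        self.values = np.concatenate([self.values, v[None, :]])
        scores = self.keys @ q / np.sqrt(q.shape[-1])   # one score per cached position
        return softmax(scores) @ self.values            # weighted sum of cached values

dim, rng = 16, np.random.default_rng(0)
cache = AttentionCache(dim)
for _ in range(5):                                      # 5 decoding steps, one new token each
    q, k, v = rng.normal(size=(3, dim))
    out = cache.step(q, k, v)
print(out.shape)  # (16,)
```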
The following script will run prediction on 8 Ascend cards.
export FILE_PATH=/home/your_path/ckpts
bash scripts/run_distribute_predict.sh 8 /home/config/rank_table_8p.json ${FILE_PATH}/strategy_load_ckpt/strategy.ckpt \
${FILE_PATH}/tokenizer/ ${FILE_PATH}/checkpoint_file filitered 2.6B fp32
The following script will run prediction on 1 Ascend card or 1 NVIDIA GPU. The difference is that the network is initialized with the float16 type.
export FILE_PATH=/home/your_path/ckpts
export DEVICE_TARGET=Ascend # or GPU
bash scripts/run_standalone_predict.sh ${FILE_PATH}/strategy_load_ckpt/strategy.ckpt \
${FILE_PATH}/tokenizer/ ${FILE_PATH}/checkpoint_file filitered 2.6B $DEVICE_TARGET
Pip install MindSpore and MindSpore Serving 1.5 or later.
Pip install flask, flask-apscheduler, jieba, sentencepiece and other whl packages if needed.
Download the PanGu-Alpha repository; we will need pangu-alpha/strategy_load_ckpt and pangu-alpha/tokenizer in the following process.
Download the 13B or 2.6B checkpoint files and *embedding files from the PanGu-Alpha repository.
For 13B, we will need 13B_part0 to 13B_part3, 13B_word_embedding, 13B_top_query_embedding and 13B_position_embedding.
For 2.6B, we will need 2.6B_part0 to 2.6B_part3, 2.6B_word_embedding, 2.6B_top_query_embedding and 2.6B_position_embedding.
Decompress all the 13B_part* or 2.6B_part* tar files; a large number of *ckpt files will be generated. Move all *embedding files to the same directory as the *.ckpt files.
Use scripts/run_standalone_export.sh to export MindIR models, and move all device_0/* to 'serving_increment/pangu_standalone/pangu/1/'.
>>> cd scripts
>>> bash run_standalone_export.sh ${strategy_file_path} ${ckpt_dir_path}
Update the parameter MODE in run_standalone_export.sh from 13B to 2.6B if we want to export the 2.6B model.
Update the parameter DEVICE_TARGET in run_standalone_export.sh from Ascend to GPU when running in a GPU environment.
The ${strategy_file_path} is the file path of pangu-alpha/strategy_load_ckpt/angu_alpha_13B_cktp_strategy.ckpt for 13B and pangu-alpha/strategy_load_ckpt/angu_alpha_2.6B_cktp_strategy.ckpt for 2.6B.
The ${ckpt_dir_path} is the directory of the *ckpt files generated by decompression and the *embedding files.
Exporting the model takes several minutes. Check the log device_0/log0.log and confirm that there is no exception at the end. Confirm that MindIR files have been generated in device_0/, which means the model has been exported successfully.
>>> ls device_0
pangu_alpha_1024_graph.mindir pangu_alpha_1024_variables pangu_alpha_1_graph.mindir pangu_alpha_1_variables
>>> cd - && mkdir serving_increment/pangu_standalone/pangu/1/
>>> mv scripts/device_0/* serving_increment/pangu_standalone/pangu/1/
>>> cd serving_increment
Copy pangu-alpha/tokenizer to the directory serving_increment/pangu_standalone/pangu/tokenizer.
The directory hierarchy of the required files is shown below. The pangu_alpha_1024_variables and pangu_alpha_1_variables are folded for easy display.
>>> tree pangu_standalone
pangu_standalone/
├── pangu
│ ├── 1
│ │ ├── pangu_alpha_1024_graph.mindir
│ │ ├── pangu_alpha_1024_variables/
│ │ ├── pangu_alpha_1_graph.mindir
│ │ └── pangu_alpha_1_variables/
│ ├── servable_config.py
│ ├── tokenization_jieba.py
│ └── tokenizer
│ ├── vocab.model
│ └── vocab.vocab
└── serving_server.py
Run bash start_pangu_standalone.sh to start a new execution, and wait until the serving and flask servers are started successfully.
If any error happens, the logs can be viewed in serving_server.log, serving_logs/*.log and flask.log.
If everything is all right, access {ip}:5000 in a browser. It will take some time to return the reply.
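As an optional check from the command line instead of a browser, the snippet below sends a plain HTTP request to the flask server address mentioned above; the IP is a placeholder, and the long timeout reflects the note that the reply can take some time.

```python
# Optional connectivity check against the started flask server ({ip}:5000 as noted above).
import requests

SERVER_IP = "127.0.0.1"  # placeholder: replace with the serving machine's address
resp = requests.get(f"http://{SERVER_IP}:5000", timeout=300)  # the reply may take a while
print(resp.status_code, len(resp.text))
```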
Run bash stop_pangu.sh to stop the existing execution.
Generate the rank table file.
# Use models/utils/hccl_tools/hccl_tools.py to generate the rank table file.
>>> python3 models/utils/hccl_tools/hccl_tools.py --device_num "[0,8]"
>>> mv hccl_8p_01234567*.json serving_increment/pangu_distributed/hccl_8p.json
Use scripts/run_distribute_export.sh to export MindIR models, and move all device* to 'serving_increment/pangu_distributed/models/'.
>>> cd scripts
>>> bash run_distribute_export.sh ${strategy_file_path} ${ckpt_dir_path}
Update the parameter MODE in run_distribute_export.sh from 13B to 2.6B if we want to export the 2.6B model.
The ${strategy_file_path} is the file path of pangu-alpha/strategy_load_ckpt/angu_alpha_13B_cktp_strategy.ckpt for 13B and pangu-alpha/strategy_load_ckpt/angu_alpha_2.6B_cktp_strategy.ckpt for 2.6B.
The ${ckpt_dir_path} is the directory of the *ckpt files generated by decompression and the *embedding files.
Exporting the model takes several minutes. Check the logs device_[0-7]/log[0-7].log and confirm that there is no exception at the end. Confirm that MindIR files have been generated in device_[0-7]/, which means the model has been exported successfully.
>>> cd - && mkdir serving_increment/pangu_distributed/models/
>>> mv scripts/device_* serving_increment/pangu_distributed/models/
>>> cd serving_increment
Update the MindIR file names in serving_increment/pangu_distributed/serving_agent.py if needed.
Copy pangu-alpha/tokenizer to the directory serving_increment/pangu_distributed/pangu/tokenizer.
The directory hierarchy of the required files is shown below. The device_1 to device_7 are folded for easy display.
>>> tree pangu_distributed
pangu_distributed/
├── hccl_8p.json
├── models
│ ├── device_0
│ │ ├── pangu_alpha_1024_graph.mindir
│ │ ├── pangu_alpha_1024_variables
│ │ │ ├── data_0
│ │ │ ├── data_1
│ │ │ ├── data_2
│ │ │ ├── data_3
│ │ │ └── data_4
│ │ ├── pangu_alpha_1_graph.mindir
│ │ └── pangu_alpha_1_variables
│ │ ├── data_0
│ │ ├── data_1
│ │ ├── data_2
│ │ ├── data_3
│ │ └── data_4
│ ├── device_1/
│ ├── device_2/
│ ├── device_3/
│ ├── device_4/
│ ├── device_5/
│ ├── device_6/
│ └── device_7/
├── pangu
│ ├── servable_config.py
│ ├── tokenization_jieba.py
│ └── tokenizer
│ ├── vocab.model
│ └── vocab.vocab
├── serving_agent.py
└── serving_server.py
Run bash start_pangu_distributed.sh to start a new execution, and wait until the serving and flask servers are started successfully.
If any error happens, the logs can be viewed in serving_server.log, serving_agent.log, serving_logs/*.log and flask.log.
If everything is all right, access {ip}:5000 in a browser. It will take some time to return the reply.
Run bash stop_pangu.sh to stop the existing execution.
>>> cd scripts
>>> bash run_cluster_export.sh ${strategy_file_path} ${ckpt_dir_path} ${rank_table_file} ${rank_size} ${rank_start}
Update the parameter MODE in run_distribute_export.sh from 200B to 13B if we want to export the 13B model.
The ${rank_start} is the first rank id on every machine, such as 0, 8, 16, 24.
Exporting the model takes several minutes. Check the logs device_[0-7]/log[0-7].log and confirm that there is no exception at the end. Confirm that MindIR files have been generated in device_[0-7]/, which means the model has been exported successfully.
>>> cd - && mkdir serving_increment/pangu_distributed/models/
>>> mv scripts/device_* serving_increment/pangu_distributed/models/
>>> cd serving_increment
In the first machine, update the parameters rank_size and stage_size (pipeline stage size) in serving_increment/pangu_distributed/pangu/servable_config.py.
In the first machine, update the parameter rank_table_json_file in serving_increment/pangu_distributed/serving_server.py.
In every machine, update the MindIR file names in serving_increment/pangu_distributed/serving_agent.py if needed.
In every machine, update the parameter distributed_address in serving_increment/pangu_distributed/serving_agent.py and serving_increment/pangu_distributed/serving_server.py to the first machine's IP address.
In the first machine, copy pangu-alpha/tokenizer to the directory serving_increment/pangu_distributed/pangu/tokenizer.
In the first machine, run bash start_pangu_distributed.sh to start a new execution.
Meanwhile, in the other machines, run python serving_agent.py to start the serving agent process.
>>> unset http_proxy && unset https_proxy
>>> python pangu_distributed/serving_agent.py > serving_agent.log 2>&1 &
Wait until the serving and flask server are started successfully.
If any error happens, the logs can be viewed in serving_server.log, serving_agent.log, serving_logs/*.log and flask.log.
If everything is all right, access {first_machine_ip}:5000 in a browser. It will take some time to return the reply.
Run bash stop_pangu.sh to stop the existing execution on every machine.
.
├── docs
│ └── model.png
├── predict.py
├── README.md
├── scripts
│ ├── run_distribute_predict.sh
│ └── run_distribute_train.sh
├── src
│ ├── dataset.py
│ ├── generate.py
│ ├── pangu_alpha_config.py
│ ├── pangu_alpha.py
│ ├── pangu_alpha_wrapcell.py
│ ├── preprocess.py
│ ├── tokenization_jieba.py
│ └── utils.py
└── train.py
Please check the official homepage.
For Serving and flask server, extra requirements:
Q: Unexpected error. MindRecordOp init failed, illegal column list.

A: This is because the feature column name in dataset.py is not consistent with the column name in the mindrecord files. Please pass the argument --data_column_name your_feature_name to run_distribute_train.sh.
Q: ERROR: device_num must be the power of 2.

A: The number of cards must be a power of 2 when using parallel training. For example, to train the 2.6B model, the number of cards should be 2, 4, 8, 16 and so on.