diff --git a/examples/bloom/README.md b/examples/bloom/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e8b6fcb447723daf31b3055ace142a4bedf210f2
--- /dev/null
+++ b/examples/bloom/README.md
@@ -0,0 +1,145 @@
+# BLOOM
+
+This directory contains some of the scripts that were used to produce the results in AscendSpeed. These scripts show, by example, how to run BLOOM from a terminal.
+
+The BLOOM model comes from: [A 176B-Parameter Open-Access Multilingual Language Model](https://arxiv.org/abs/2211.05100).
+
+> Scao, Teven Le, et al. "Bloom: A 176b-parameter open-access multilingual language model." arXiv preprint arXiv:2211.05100 (2022).
+
+# Contents
+
+- [Contents](#contents)
+- [Pre-Training](#pre-training)
+  - [Datasets](#datasets)
+  - [Script](#script)
+  - [Performance](#performance)
+- [Fine-tune and Evaluation](#fine-tune-and-evaluation)
+- [Inference](#inference)
+  - [Model weights](#model-weights)
+  - [Script](#script-1)
+  - [Samples](#samples)
+- [Citation](#citation)
+
+## Pre-Training
+
+BLOOM's architecture is very similar to GPT-3, with a few added improvements that are discussed later in this document.
+
+Here is a quick summary of training BLOOM:
+
+| | |
+| :----- | :------------- |
+| Hardware | 96 Atlas 910B NPUs (64GB) |
+| Software | AscendSpeed |
+| Architecture | GPT-3 w/ extras |
+| Dataset | xxxxxxxxxx |
+| Training time | xxxxxxxxxx |
+
+### Datasets
+
+Another important feature from Megatron-LM is the efficient data loader. During start-up of the initial training, each dataset is split into samples of the requested sequence length (2048 for BLOOM), and an index is created to number each sample. Based on the training parameters, the number of epochs for a dataset is calculated, and an ordering for that many epochs is created and then shuffled. For example, if a dataset has 10 samples and should be gone through twice, the system first lays out the sample indices in order [0, ..., 9, 0, ..., 9] and then shuffles that order to create the final global order for the dataset. Notice that this means that training will not simply go through the entire dataset and then repeat: it is possible to see the same sample twice before seeing another sample at all, but at the end of training the model will have seen each sample exactly twice. This helps ensure a smooth training curve through the entire training process. These indices, including the offsets into the base dataset of each sample, are saved to a file to avoid recomputing them each time a training process is started. Several of these datasets can then be blended with varying weights into the final data seen by the training process. A minimal sketch of this indexing and shuffling logic is shown after the dataset summary below.
+
+- 46 languages in 1.5TB of deduplicated, massively cleaned-up text, converted into 350B unique tokens
+- The vocabulary size of the model is 250,680 tokens
+- For full details please see *The BigScience Corpus: A 1.6TB Composite Multilingual Dataset*
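+The behaviour described above can be illustrated with a short, self-contained sketch. The helper name `build_sample_order` is made up for this illustration; it is not the actual AscendSpeed/Megatron-LM data-loader code:
+
+```python
+# Minimal sketch, assuming a dataset of `num_samples` samples that should be
+# seen `num_epochs` times: lay out all epochs up front, shuffle once globally,
+# and cache the result so that restarts do not recompute it.
+import numpy as np
+
+def build_sample_order(num_samples: int, num_epochs: int, seed: int = 1234) -> np.ndarray:
+    order = np.tile(np.arange(num_samples, dtype=np.int64), num_epochs)  # e.g. [0..9, 0..9]
+    np.random.RandomState(seed).shuffle(order)  # one global shuffle across all epochs
+    return order
+
+# A dataset with 10 samples that should be seen twice:
+order = build_sample_order(num_samples=10, num_epochs=2)
+# Every sample index appears exactly twice, although a repeat may occur
+# before the whole dataset has been seen once.
+assert sorted(order.tolist()) == sorted(list(range(10)) * 2)
+np.save("sample_order_idx.npy", order)  # cached so a restarted run can reuse it
+```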
+### Script
+
+To launch the environment, use:
+
+```Shell
+source $six_ALL_CCFRWORK/code/tr11-176B-ml/bigscience/train/tr11-176B-ml/start-tr11-176B-ml
+```
+
+There is an hourly pulse-checking script running that checks that the training is either running or scheduled.
+
+The training log will look like this:
+
+```Shell
+XXXXX
+```
+
+### Performance
+
+#### Machine performance
+
+The performance of the NPUs in XXXXX (configuration) and of the GPUs is:
+
+TODO: present the throughput and the parallel configuration in a table.
+
+#### Accuracy of the loss
+
+NPU vs. GPU loss. XXXX (explain more).
+
+![NPU-LOSS](./images/7b_lm_loss.png)
+
+NPU vs. GPU loss relative error. XXXX (explain more).
+
+![NPU-Relative-Error](./images/relative_error.png)
+
+## Fine-tune and Evaluation
+
+TODO: provide the fine-tuning procedure: first load the weights, then run the fine-tuning script, in the same format as pre-training; task evaluation results still need to be provided (under development).
+
+## Inference
+
+We support AscendSpeed inference for text generation with BLOOM 7B1.
+
+### Model weights
+
+Download the BLOOM model checkpoint from [here](TODO: XXXXX), make sure all chunks are downloaded completely, then use the following commands to merge them into a single archive file and extract it:
+
+```bash
+cat bloom-7b1.tar.part_* > bloom-7b1.tar
+tar xvf bloom-7b1.tar
+```
+
+Set `CHECKPOINTS_PATH` in `generate_bloom_7b1.sh` to the path of the extracted folder. Since the checkpoint file is large, it is recommended to put it on an SSD or a RAM disk to reduce the checkpoint loading time. Since the checkpoint we distribute uses 8-way tensor parallelism, a conversion script is also provided in case you need to change the tensor-parallel dimension:
+
+```bash
+# TODO: add the convert_tp tool.
+
+python tools/convert_tp.py \
+    --input-folder \
+    --output-folder \
+    --target-tp
+```
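+Conceptually, changing the tensor-parallel dimension means merging the per-rank shards of each partitioned weight and re-splitting them for the new degree. The snippet below is only a rough illustration of that idea; the helper `resplit_column_parallel` is hypothetical and is not the `tools/convert_tp.py` implementation:
+
+```python
+# Rough sketch, assuming a column-parallel weight partitioned along dim 0.
+import torch
+
+def resplit_column_parallel(shards, target_tp):
+    """Merge tensor-parallel shards, then split them for a new TP size."""
+    full = torch.cat(shards, dim=0)
+    assert full.shape[0] % target_tp == 0
+    return list(torch.chunk(full, target_tp, dim=0))
+
+# Example: 8 shards of shape [512, 1024] re-split for tensor-parallel size 2.
+shards_tp8 = [torch.randn(512, 1024) for _ in range(8)]
+shards_tp2 = resplit_column_parallel(shards_tp8, target_tp=2)
+print([tuple(s.shape) for s in shards_tp2])  # [(2048, 1024), (2048, 1024)]
+```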
+### Script
+
+We generate text samples using the `generate_bloom` script. Inference differs from pre-training in a few ways: for example, we need to load the pre-trained checkpoint and set the length of the output samples:
+
+```shell
+bash ./examples/bloom/generate_bloom_7b1.sh
+```
+
+Alternatively, you can also use DeepSpeed from source:
+
+```Shell
+TODO: XXXX
+```
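+The generation script's `--temperature` and `--top_p` arguments control temperature scaling and nucleus (top-p) sampling. For reference, the snippet below is a minimal conceptual sketch of that sampling step only; it is not the actual `generate_bloom.py` implementation:
+
+```python
+# Minimal sketch of temperature + top-p (nucleus) sampling for a single step.
+import torch
+
+def sample_next_token(logits, temperature=1.0, top_p=0.9):
+    probs = torch.softmax(logits / temperature, dim=-1)
+    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
+    cumulative = torch.cumsum(sorted_probs, dim=-1)
+    # Keep the smallest set of tokens whose cumulative probability reaches top_p.
+    keep = cumulative - sorted_probs < top_p
+    keep[0] = True  # always keep the most likely token
+    filtered = sorted_probs * keep
+    choice = torch.multinomial(filtered / filtered.sum(), num_samples=1)
+    return int(sorted_ids[choice])
+
+# Example with random logits over BLOOM's padded vocabulary of 250,880 tokens.
+next_id = sample_next_token(torch.randn(250880), temperature=1.0, top_p=0.9)
+```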
+### Samples
+
+<details>
+<summary>Left-To-Right Generation Examples</summary>
+
+#### Example 1
+
+Input: Who is the greatest artist? The greatest artist is ?
+
+Output: TODO: XXXX.
+
+#### Example 2 (Chinese)
+
+Input: 问题:冬天,中国哪座城市最适合避寒?问题描述:能推荐一些国内适合冬天避寒的城市吗?回答用户:旅游爱好者 回答:?
+
+Output: 问题: XXXX.
+
+</details>
+
+All the provided scripts are tested on 8 Atlas 910B NPUs (64GB) for BLOOM 7B1 (fp16). These scripts might not work for other models or a different number of NPUs.
+
+> Note: Sometimes NPU memory is not freed when an inference deployment crashes. You can free this memory by killing all leftover python processes in the terminal, e.g. with `pkill -9 python`.
+
+## Citation
+
+You may also consider citing the original work in your references:
+
+@article{scao2022bloom,
+  title={Bloom: A 176b-parameter open-access multilingual language model},
+  author={Scao, Teven Le and Fan, Angela and Akiki, Christopher and Pavlick, Ellie and Ili{\'c}, Suzana and Hesslow, Daniel and Castagn{\'e}, Roman and Luccioni, Alexandra Sasha and Yvon, Fran{\c{c}}ois and Gall{\'e}, Matthias and others},
+  journal={arXiv preprint arXiv:2211.05100},
+  year={2022}
+}
diff --git a/examples/bloom_task/generate_text_bloom.py b/examples/bloom/generate_bloom.py
similarity index 100%
rename from examples/bloom_task/generate_text_bloom.py
rename to examples/bloom/generate_bloom.py
diff --git a/examples/bloom_task/generate_text_bloom.sh b/examples/bloom/generate_bloom_7b1.sh
similarity index 80%
rename from examples/bloom_task/generate_text_bloom.sh
rename to examples/bloom/generate_bloom_7b1.sh
index 6301de0052fb604e2fa39811400a25852cd9dfe1..6a75fc9d814fdaf663912a9685cc4544f343e392 100644
--- a/examples/bloom_task/generate_text_bloom.sh
+++ b/examples/bloom/generate_bloom_7b1.sh
@@ -1,18 +1,21 @@
 #!/bin/bash
+# Environment
 export LD_LIBRARY_PATH=/usr/local/lib:/root/miniconda3/lib:$LD_LIBRARY_PATH
 export HCCL_CONNECT_TIMEOUT=1200
 source /usr/local/Ascend/ascend-toolkit/set_env.sh
+# Distributed setting
 DISTRIBUTED_ARGS="--nproc_per_node 8 --nnodes 2 --node_rank 0 --master_addr localhost --master_port 6001"
-CHECK_POINTS=
-TOKENIZER_NAME_OR_PATH=
+CHECKPOINTS_PATH=
+TOKENIZER_PATH=
-python -m torch.distributed.run $DISTRIBUTED_ARGS ./examples/bloom_task/generate_text_bloom.py \
-       --load $CHECK_POINTS \
+# Launch inference
+python -m torch.distributed.run $DISTRIBUTED_ARGS ./examples/bloom/generate_bloom.py \
+       --load $CHECKPOINTS_PATH \
        --tokenizer-type PretrainedFromHF \
-       --tokenizer-name-or-path $TOKENIZER_NAME_OR_PATH \
+       --tokenizer-name-or-path $TOKENIZER_PATH \
        --tensor-model-parallel-size 8 \
        --pipeline-model-parallel-size 1 \
        --embed-layernorm \
@@ -34,4 +37,4 @@ python -m torch.distributed.run $DISTRIBUTED_ARGS ./examples/bloom_task/generate
        --out-seq-length 1024 \
        --temperature 1.0 \
        --top_p 0.9 \
-       --recompute \
\ No newline at end of file
+       --recompute
diff --git a/examples/bloom_task/images/7b_lm_loss.png b/examples/bloom/images/7b_lm_loss.png
similarity index 100%
rename from examples/bloom_task/images/7b_lm_loss.png
rename to examples/bloom/images/7b_lm_loss.png
diff --git a/examples/bloom_task/images/relative_error.png b/examples/bloom/images/relative_error.png
similarity index 100%
rename from examples/bloom_task/images/relative_error.png
rename to examples/bloom/images/relative_error.png
diff --git a/examples/bloom_task/pretrain_bloom_550m.sh b/examples/bloom/pretrain_bloom_550m.sh
similarity index 100%
rename from examples/bloom_task/pretrain_bloom_550m.sh
rename to examples/bloom/pretrain_bloom_550m.sh
diff --git a/examples/bloom_task/pretrain_bloom_7b1.sh b/examples/bloom/pretrain_bloom_7b1.sh
similarity index 96%
rename from examples/bloom_task/pretrain_bloom_7b1.sh
rename to examples/bloom/pretrain_bloom_7b1.sh
index 9dede337487404a4ca25f77040d6659a6b175950..a0ca468dfeb923ba9616c2d7dc3d2df8ad908692 100644
---
a/examples/bloom_task/pretrain_bloom_7b1.sh +++ b/examples/bloom/pretrain_bloom_7b1.sh @@ -15,8 +15,6 @@ MASTER_ADDR=localhost MASTER_PORT=5999 GPUS_PER_NODE=8 NNODES=1 -PP_SIZE=1 -TP_SIZE=8 MICRO_BATCH_SIZE=4 GLOBAL_BATCH_SIZE=512 @@ -76,8 +74,8 @@ TRANSFORMERS_OFFLINE=1 \ --attention-dropout 0 \ --hidden-dropout 0 \ --pad-vocab-size-to 250880 \ - --tensor-model-parallel-size $TP_SIZE \ - --pipeline-model-parallel-size $PP_SIZE \ + --tensor-model-parallel-size 8 \ + --pipeline-model-parallel-size 1 \ --num-layers $NLAYERS \ --hidden-size $NHIDDEN \ --num-attention-heads $NHEADS \ @@ -119,6 +117,3 @@ TRANSFORMERS_OFFLINE=1 \ --zero-stage ${ZERO_STAGE} \ --deepspeed-activation-checkpointing \ --distributed-backend nccl - - - # --finetune \ \ No newline at end of file diff --git a/examples/bloom_task/README.md b/examples/bloom_task/README.md deleted file mode 100644 index 4141f7cdd41064d03dc32a8e91a4c542dce7177d..0000000000000000000000000000000000000000 --- a/examples/bloom_task/README.md +++ /dev/null @@ -1,49 +0,0 @@ -# text generation using bloom - -`bash examples/bloom_task/generate_text_bloom.sh` - -> We generate text samples using largely the Bloom pretraining script. Few changes need to make, such as we need to provide the path to the pretrained checkpoint, the length of the output samples. - -```shell -DISTRIBUTED_ARGS="--nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6001" - -CHECK_POINTS= -TOKENIZER_NAME_OR_PATH= - -python -m torch.distributed.run $DISTRIBUTED_ARGS ./examples/bloom_task/generate_text_bloom.py \ - --load $CHECK_POINTS \ - --tokenizer-type PretrainedFromHF \ - --tokenizer-name-or-path $TOKENIZER_NAME_OR_PATH \ - --tensor-model-parallel-size 8 \ - --pipeline-model-parallel-size 1 \ - --embed-layernorm \ - --position-embedding-type alibi \ - --num-layers 30 \ - --hidden-size 4096 \ - --attention-dropout 0 \ - --hidden-dropout 0 \ - --num-attention-heads 32 \ - --micro-batch-size 1 \ - --seq-length 2048 \ - --max-position-embeddings 2048 \ - --init-method-std 0.0048 \ - --log-interval 1 \ - --layernorm-epsilon 1e-6 \ - --fp16 \ - --no-load-optim \ - --no-load-rng \ - --out-seq-length 1024 \ - --temperature 1.0 \ - --top_p 0.9 \ - --recompute \ -``` - -# bloom 7b 1000 step perftest - -**NPU loss** - -![NPU-LOSS](./images/7b_lm_loss.png) - -**NPU GPU loss relative error** - -![NPU-Relative-Error](./images/relative_error.png) \ No newline at end of file diff --git a/examples/gpt_task/pretrain_gpt_1p.sh b/examples/gpt/pretrain_gpt_1p.sh similarity index 100% rename from examples/gpt_task/pretrain_gpt_1p.sh rename to examples/gpt/pretrain_gpt_1p.sh diff --git a/examples/gpt_task/pretrain_gpt_ptd_8p.sh b/examples/gpt/pretrain_gpt_ptd_8p.sh similarity index 100% rename from examples/gpt_task/pretrain_gpt_ptd_8p.sh rename to examples/gpt/pretrain_gpt_ptd_8p.sh diff --git a/examples/gpt_task/pretrain_gpt_td_8p.sh b/examples/gpt/pretrain_gpt_td_8p.sh similarity index 100% rename from examples/gpt_task/pretrain_gpt_td_8p.sh rename to examples/gpt/pretrain_gpt_td_8p.sh diff --git a/examples/llama_task/README.md b/examples/llama/README.md similarity index 67% rename from examples/llama_task/README.md rename to examples/llama/README.md index aaedcda4432d4ab8964c6d73221b3ba165ba7c38..28746ad76c959009bd4bad5a9d3ac90dbe9fe805 100644 --- a/examples/llama_task/README.md +++ b/examples/llama/README.md @@ -1,20 +1,24 @@ # Contents + - [Contents](#contents) - [Evaluation and Tasks](#evaluation-and-tasks) - - [Datastes](#datastes) + - [Dataset](#datastes) - 
[LLama Evaluation](#llama-evaluation)
-  - [Zeroshot Task](#zeroshot-task)
+  - [Zero-shot Task](#zero-shot-task)
+
 # Evaluation and Tasks
-## Datastes
-First of all, You must download the evaluation datasets for the [BoolQ](https://storage.googleapis.com/boolq/dev.jsonl), PIQA ([1](https://yonatanbisk.com/piqa/data/valid.jsonl), [2](https://yonatanbisk.com/piqa/data/valid-labels.lst)), [HellaSwag](https://github.com/rowanz/hellaswag/tree/master/data/hellaswag_val.jsonl) tasks.
+## Dataset
+
+First of all, you must download the evaluation datasets for the [BoolQ](https://storage.googleapis.com/boolq/dev.jsonl), PIQA ([1](https://yonatanbisk.com/piqa/data/valid.jsonl), [2](https://yonatanbisk.com/piqa/data/valid-labels.lst)), and [HellaSwag](https://github.com/rowanz/hellaswag/tree/master/data/hellaswag_val.jsonl) tasks.
 ## LLama Evaluation
-We include zeroshot example scripts for llama evaluation on [BoolQ](https://storage.googleapis.com/boolq/dev.jsonl), PIQA ([1](https://yonatanbisk.com/piqa/data/valid.jsonl), [2](https://yonatanbisk.com/piqa/data/valid-labels.lst)), and [HellaSwag](https://github.com/rowanz/hellaswag/tree/master/data/hellaswag_val.jsonl) accuracy.
+We include zero-shot example scripts for LLaMA evaluation on [BoolQ](https://storage.googleapis.com/boolq/dev.jsonl), PIQA ([1](https://yonatanbisk.com/piqa/data/valid.jsonl), [2](https://yonatanbisk.com/piqa/data/valid-labels.lst)), and [HellaSwag](https://github.com/rowanz/hellaswag/tree/master/data/hellaswag_val.jsonl) accuracy.
 For example, you can use the following command to run the BoolQ zero-shot task on a LLaMA-7B model.
-
+
+```Shell
 WORLD_SIZE=8
 
 DISTRIBUTED_ARGS="--nproc_per_node $WORLD_SIZE \
@@ -47,12 +51,12 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS ./tasks/main.py \
                --fp16 \
                --no-load-optim \
                --no-load-rng
-
+```
-## Zeroshot Task
+## Zero-shot Task
-The following table shows the NPU and [LLama Paper](https://arxiv.org/abs/2302.13971) accuracy achieved by the Zeroshot task of the Llama model.
+The following table shows the accuracy achieved by the LLaMA model on these zero-shot tasks, as measured on NPU and as reported in the [LLaMA paper](https://arxiv.org/abs/2302.13971).
 | Model Size | BoolQ | PIQA | HellaSwag |
 | :---: | :---: | :---: | :---: |
diff --git a/examples/llama_task/evaluate_llama_zeroshot_7b.sh b/examples/llama/evaluate_llama_zeroshot_7b.sh
similarity index 100%
rename from examples/llama_task/evaluate_llama_zeroshot_7b.sh
rename to examples/llama/evaluate_llama_zeroshot_7b.sh
diff --git a/examples/llama_task/pretrain_llama_1p.sh b/examples/llama/pretrain_llama_1p.sh
similarity index 100%
rename from examples/llama_task/pretrain_llama_1p.sh
rename to examples/llama/pretrain_llama_1p.sh
diff --git a/examples/llama_task/pretrain_llama_ptd_8p.sh b/examples/llama/pretrain_llama_ptd_8p.sh
similarity index 100%
rename from examples/llama_task/pretrain_llama_ptd_8p.sh
rename to examples/llama/pretrain_llama_ptd_8p.sh
diff --git a/examples/llama_task/pretrain_llama_td_8p.sh b/examples/llama/pretrain_llama_td_8p.sh
similarity index 100%
rename from examples/llama_task/pretrain_llama_td_8p.sh
rename to examples/llama/pretrain_llama_td_8p.sh
diff --git a/examples/llama_task/pretrain_llama_td_8p_sp.sh b/examples/llama/pretrain_llama_td_8p_sp.sh
similarity index 100%
rename from examples/llama_task/pretrain_llama_td_8p_sp.sh
rename to examples/llama/pretrain_llama_td_8p_sp.sh