# ModelLink-0712-weight-mcore-2

**Repository Path**: liuyanghan/ModelLink-0712-weight-mcore-2

## Basic Information

- **Project Name**: ModelLink-0712-weight-mcore-2
- **Description**: Ascend large model repository
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: 1.1
- **Homepage**: https://gitee.com/ascend/ModelLink/
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1207
- **Created**: 2024-07-19
- **Last Updated**: 2024-07-20

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README


ModelLink aims to provide end-to-end large language model solutions for Huawei Ascend chips, including models, algorithms, and downstream tasks.

## ModelLink Solution Overview

---

### Supported Features

ModelLink currently supports the following features for large model usage:

* [Dataset Preparation for Pre-training](#jump11) / [Instruction Dataset Preparation for Fine-tuning](#jump12)
* [Pre-training](#jump13) / [Full-parameter Fine-tuning](#jump14) / [Low-parameter Fine-tuning](#jump15)
* [Inference: human-machine dialogue](#jump16)
* [Evaluation with numerous benchmarks](#jump17)
* [Utilizing Acceleration Features (Acceleration Algorithms + Fusion Operators)](#jump18)
* [Profiling data based on Ascend chips](#jump19)
* [Converting checkpoints between Hugging Face and Megatron](#jump20) (see the sketch after this list)
* [Enabling deterministic computing for Ascend](#jump21)

More novel and useful features for LLM training on Ascend are under development ...
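The checkpoint conversion feature deserves a concrete illustration. The sketch below shows what a Hugging Face → Megatron conversion call can look like; the script path, loader/saver names, and flag values are assumptions for illustration only, so consult the per-model READMEs under `examples` for the exact interface:

```bash
# Illustrative sketch only: the script path and flag names are assumptions,
# not a verified interface; see the per-model READMEs for the real invocation.
python tools/checkpoint/convert_ckpt.py \
    --model-type GPT \
    --loader llama2_hf \
    --saver megatron \
    --target-tensor-parallel-size 8 \
    --target-pipeline-parallel-size 1 \
    --load-dir ./model_from_hf/llama2-7b-hf \
    --save-dir ./model_weights/llama2-7b-mcore
```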
### Supported Models

ModelLink currently supports pre-training and fine-tuning for the following models:

| Model | Parameters | Sequence Length | Pretrain | Inference | LoRA | SFT | Chat | Evaluation | Contributor |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Aquila | 7B | 2K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Aquila2 | 7B | 2K | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| Aquila2 | 34B | 4K | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| Baichuan | 7B | 4K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Baichuan | 13B | 4K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Baichuan2 | 7B | 4K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Baichuan2 | 13B | 4K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Bloom | 7B1 | 2K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Bloom | 176B | 2K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| ChatGLM3 | 6B | 8K | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| CodeLlama | 34B | 4K | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| InternLM | 7B | 2K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| InternLM | 65B | 2K | pretrain | -- | -- | -- | -- | -- | 【Ascend】 |
| LLaMA | 7B | 2K | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| LLaMA | 13B | 2K | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| LLaMA | 33B | 2K | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| LLaMA | 65B | 2K | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| LLaMA2 | 7B | 4K | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| LLaMA2 | 13B | 4K | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| LLaMA2 | 34B | 4K | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| LLaMA2 | 70B | 4K | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| LLaMA3 | 8B | 8K | pretrain | generate | -- | -- | chat | eval | 【Ascend】 |
| LLaMA3 | 70B | 8K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Qwen | 7B | 8K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Qwen | 14B | 2K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Qwen | 72B | 8K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Qwen1.5 | 0.5B | 8K | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| Qwen1.5 | 1.8B | 8K | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| Qwen1.5 | 4B | 8K | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| Qwen1.5 | 7B | 8K | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| Qwen1.5 | 14B | 8K | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| Qwen1.5 | 32B | 8K | pretrain | generate | lora | -- | -- | eval | 【Community】 |
| Qwen1.5 | 72B | 8K | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| Yi | 34B | 4K | pretrain | generate | -- | -- | -- | eval | 【Community】 |
| Mixtral | 8x7B | 32K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Mistral | 7B | 32K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Gemma | 2B | 8K | pretrain | generate | -- | -- | -- | eval | 【Ascend】 |
| Gemma | 7B | 8K | pretrain | generate | lora | -- | -- | eval | 【Ascend】 |
| GPT3 | 175B | 2K | pretrain | -- | -- | -- | -- | -- | 【Community】 |
### Script Naming Rules

| Script | Rule |
|:-----------------:|:-------------------:|
| pretrain_xxx.sh | Pre-training Script |
| tune_xxx.sh | Fine-tuning Script |
| generate_xxx.sh | Inference Script |
| xxx_chat_xxx.sh | Chat Script |
| evaluation_xxx.sh | Evaluation Script |

---

# Model Usage Guide and Version Notes

For the supported models listed above, we provide training scripts and README instructions in the `examples` folder, which contain detailed processes for model training, inference, and evaluation.

【Please note the corresponding environment versions for model usage, as follows】

| Software | [Version](https://www.hiascend.com/zh/) |
| :-----------------------: |:----------------------------------:|
| Python | 3.8 |
| driver | Ascend HDK 24.1.RC2 |
| firmware | Ascend HDK 24.1.RC2 |
| CANN | CANN 8.0.RC2 |
| torch | 2.1.0, 2.2.0 |
| torch_npu | release v6.0.RC2 |
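As a hypothetical illustration of these naming rules (the actual file names under `examples` differ per model; check the corresponding README), a LLaMA2-7B workflow might be driven like this:

```bash
# Hypothetical script names that follow the naming rules above;
# the real files live under examples/<model>/ in this repository.
bash examples/llama2/pretrain_llama2_7b.sh     # pre-training
bash examples/llama2/tune_llama2_7b.sh         # fine-tuning
bash examples/llama2/generate_llama2_7b.sh     # inference: human-machine dialogue
bash examples/llama2/evaluation_llama2_7b.sh   # benchmark evaluation
```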
【Based on the current version of Megatron, the performance statistics from our testing are as follows (hardware: Atlas 900 A2 PODc; throughput in tokens/s/p)】

| Model | Parameters | Sequence Length | Cluster Scale | Precision Mode | Performance | Reference Performance |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Aquila | 7B | 2K | 1x8 | BF16 | 2849 | 2874 |
| Aquila2 | 7B | 2K | 1x8 | FP16 | 3323 | 2673 |
| Aquila2 | 34B | 4K | 2x8 | BF16 | 854 | 732 |
| Baichuan | 7B | 4K | 1x8 | FP16 | 2685 | 2036 |
| Baichuan | 13B | 4K | 1x8 | FP16 | 1213 | 862 |
| Baichuan2 | 7B | 4K | 1x8 | BF16 | 2664 | 3969 |
| Baichuan2 | 13B | 4K | 1x8 | BF16 | 1668 | 2062 |
| Bloom | 7B1 | 2K | 1x8 | FP16 | 2034 | 2525 |
| Bloom | 176B | 2K | 12x8 | BF16 | 100 | 107 |
| ChatGLM3 | 6B | 8K | 1x8 | FP16 | 4297 | 4267 |
| CodeLlama | 34B | 4K | 2x8 | BF16 | 837 | 762 |
| InternLM | 7B | 2K | 1x8 | BF16 | 2776 | 2854 |
| InternLM | 65B | 2K | 4x8 | BF16 | 341 | 414 |
| LLaMA | 7B | 2K | 1x8 | FP16 | 3600 | 3804 |
| LLaMA | 13B | 2K | 1x8 | FP16 | 1895 | 2012 |
| LLaMA | 33B | 2K | 4x8 | FP16 | 621 | 776 |
| LLaMA | 65B | 2K | 4x8 | BF16 | 348 | 426 |
| LLaMA2 | 7B | 4K | 1x8 | BF16 | 4200 | 3850 |
| LLaMA2 | 13B | 4K | 1x8 | BF16 | 1990 | 1920 |
| LLaMA2 | 34B | 4K | 2x8 | BF16 | 749 | 796 |
| LLaMA2 | 70B | 4K | 4x8 | BF16 | 420 | 430 |
| LLaMA3 | 8B | 8K | 1x8 | BF16 | 2483 | 2674 |
| LLaMA3 | 70B | 8K | 8x8 | BF16 | 283 | 355 |
| Qwen | 7B | 8K | 1x8 | BF16 | 2499 | 2867 |
| Qwen | 14B | 2K | 1x8 | BF16 | 1560 | 1578 |
| Qwen | 72B | 8K | 16x8 | BF16 | 285 | 345 |
| Qwen1.5 | 0.5B | 8K | 1x8 | BF16 | 22834 | 25306 |
| Qwen1.5 | 1.8B | 8K | 1x8 | BF16 | 13029 | 12181 |
| Qwen1.5 | 4B | 8K | 1x8 | BF16 | 5033 | 5328 |
| Qwen1.5 | 7B | 8K | 1x8 | BF16 | 2862 | 2621 |
| Qwen1.5 | 14B | 8K | 1x8 | BF16 | 1717 | 1702 |
| Qwen1.5 | 32B | 8K | 4x8 | BF16 | 751 | 708 |
| Qwen1.5 | 72B | 8K | 8x8 | BF16 | 301 | 317 |
| Yi | 34B | 4K | 2x8 | BF16 | 809 | 730 |
| Mixtral | 8x7B | 32K | 2x8 | BF16 | 487 | 610 |
| Mistral | 7B | 32K | 1x8 | BF16 | 2806 | 2734 |
| Gemma | 2B | 8K | 1x8 | BF16 | 6821 | 7602 |
| Gemma | 7B | 8K | 1x8 | BF16 | 2938 | 2607 |
| GPT3 | 175B | 2K | 16x8 | FP16 | 153 | -- |
---

### Acceleration Features

ModelLink supports various acceleration algorithms such as tensor parallelism, pipeline parallelism, sequence parallelism, recomputation, distributed optimizer, and more. The table below shows the enable switches corresponding to each acceleration feature:

| Acceleration Feature | Enable Parameter |
|:------------------------------------:|:------------------------------:|
| Tensor Parallel | --tensor-model-parallel-size |
| Pipeline Parallel | --pipeline-model-parallel-size |
| Dynamic division for PP | --num-layer-list |
| Sequence Parallel | --sequence-parallel |
| Recomputation | --recompute-granularity |
| Distributed Optimizer | --use-distributed-optimizer |
| Overlap DDP allreduce | --overlap-grad-reduce |
| Flash attention | --use-flash-attn |
| Fused rmsnorm | --use-fused-rmsnorm |
| Fused swiglu | --use-fused-swiglu |
| mc2 | --use-mc2 |
| Fused rotary position embedding | --use-fused-rotary-pos-emb |
| Sliding window attention | --sliding-window |
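These switches are passed directly to the training entry point and compose freely. In the launch command below, `--num-layer-list` assigns an explicit number of layers to each pipeline stage instead of an even split, and full recomputation is applied block-wise: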
```bash
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    --tensor-model-parallel-size ${TP} \
    --pipeline-model-parallel-size ${PP} \
    --num-layer-list 1,2,2,2,1 \
    --sequence-parallel \
    --recompute-granularity full \
    --recompute-method block \
    --recompute-num-layers 72 \
    --use-distributed-optimizer \
    --use-flash-attn \
    --use-fused-rmsnorm \
    --use-fused-swiglu \
    --overlap-grad-reduce \
    --use-fused-rotary-pos-emb \
    --use-mc2 \
    --sliding-window 4096 \
    ... \
    ...
```

Note: To enable mc2, ensure the following:

1. The environment version matches the description on the repository homepage;
2. Comment out line 283 in the `validate_args_decorator` function within `modellink/arguments.py`: `#args.use_mc2 = False`

## Analyze profiling data based on Ascend chips

ModelLink supports collecting profiling data on Ascend chips, which is useful for performance modeling:

```bash
--profile                          # enable profiling
--profile-step-start 5             # the start step
--profile-step-end 6               # the end step
--profile-ranks 0 1 2 3 4          # ranks to profile
--profile-level level2             # profiling level: level0, level1, or level2
--profile-with-cpu                 # profile CPU information
--profile-with-stack               # profile stack information
--profile-with-memory              # profile memory information
--profile-record-shapes            # profile shape information
--profile-save-path ./profile_dir  # path to save data
```

## Enable deterministic computing based on Ascend chips

- Add the switch to the training script:

```shell
--use-deter-comp
```

- Add the environment variable:

```shell
export HCCL_DETERMINISTIC=True
```

## Acknowledgments

---

ModelLink is jointly contributed by the following departments of Huawei Corporation:

- Ascend Computing Product Unit
- Algorithm Unit of Computing Product Unit
- Research Unit of Computing Product Unit
- Open Computing Kit of Computing Product Unit
- General Development Department
- Global Technical Service Department

We appreciate every PR from the community and welcome contributions to ModelLink.

## Appendix

---

- Safety Statement: [Safety Statement](https://gitee.com/ascend/ModelLink/wikis/%E5%AE%89%E5%85%A8%E5%A3%B0%E6%98%8E)