diff --git a/docs/api/api_python/mindformers/mindformers.AutoConfig.rst b/docs/api/api_python/mindformers/mindformers.AutoConfig.rst index fb5d183d18b0bc40428ced173eaea970dcc411d5..b07bc172c01493f13cd55a93e76c9885bcef2ba2 100644 --- a/docs/api/api_python/mindformers/mindformers.AutoConfig.rst +++ b/docs/api/api_python/mindformers/mindformers.AutoConfig.rst @@ -30,7 +30,7 @@ mindformers.AutoConfig 这个API正处于实验阶段,在下一个版本中可能会有一些突破性的变化。 参数: - - **model_type** (str) - 模型简称,类似'llama3_1'。 + - **model_type** (str) - 模型简称,类似'qwen2_5'。 - **config** (PretrainedConfig) - 用于注册的类。 - **exist_ok** (bool, 可选) - 为True时,若model_type已存在也不报错。默认值: ``False`` 。 diff --git a/mindformers/models/auto/configuration_auto.py b/mindformers/models/auto/configuration_auto.py index 8b27300b6cf1131d20094cdb657eeb034767e0d0..963eed1752f48fd359081cd557224a58eca11b89 100644 --- a/mindformers/models/auto/configuration_auto.py +++ b/mindformers/models/auto/configuration_auto.py @@ -417,7 +417,7 @@ class AutoConfig: The API is experimental and may have some slight breaking changes in the next releases. Args: - model_type (str): The model type like "llama3_1". + model_type (str): The model type like "qwen2_5". config (PretrainedConfig): The config to register. exist_ok (bool, optional): If set to True, no error will be raised even if model_type already exists. Default: ``False``. 
diff --git a/research/llama3_1/README.md b/research/llama3_1/README.md deleted file mode 100644 index 30561b2dcbf0dec83361a87d5a2105aeae191348..0000000000000000000000000000000000000000 --- a/research/llama3_1/README.md +++ /dev/null @@ -1,453 +0,0 @@ -# Llama 3.1 - -## 模型描述 - -Llama 3.1,是开源Llama系列的最新产品,目前有三个版本:Llama 3.1-8B,Llama 3.1-70B,Llama 3.1-405B。 -Llama 3.1在来自公开可用来源的超过15T的数据上进行了预训练。微调数据包括公开可用的指令数据集,以及超过1000万个人工标注的示例。 -模型支持上下文窗口长度128K,并使用了新的分词器,词汇表大小达到128256个,采用了分组查询注意力机制(GQA)。 -Llama 3.1模型是类GPT模型,是一个生成式的语言模型,主要是用于预测下一个单词。 -目前Mindformers支持Llama 3.1-8B,Llama 3.1-70B,敬请期待Llama 3.1-405B。 - -## 模型性能 - -以下模型性能均由Atlas 800T A2硬件环境下测试得出。 - -| Config | Task | Datasets | SeqLength | Performance | Phase | -|:-------------------------------------------------------|:---------------:|:--------:|:---------:|:------------:|:-------:| -| [llama3_1_8b](llama3_1_8b/predict_llama3_1_8b.yaml) | text_generation | - | 2048 | 591 tokens/s | Predict | -| [llama3_1_70b](llama3_1_70b/predict_llama3_1_70b.yaml) | text_generation | - | 4096 | 509 tokens/s | Predict | - -以下模型性能均由Atlas 900 A2 PoDc硬件环境下测试得出。 - -| Config | Task | Datasets | SeqLength | Performance | Phase | -|:--------------------------------------------------------|:---------------:|:--------:|:---------:|:---------------:|:--------:| -| [llama3_1_8b](llama3_1_8b/finetune_llama3_1_8b.yaml) | text_generation | alpaca | 8192 | 2703 tokens/s/p | Finetune | -| [llama3_1_70b](llama3_1_70b/finetune_llama3_1_70b.yaml) | text_generation | alpaca | 8192 | 337 tokens/s/p | Finetune | - -## 模型文件 - -`Llama 3.1` 基于 `mindformers` 实现,主要涉及的文件有: - -1. 模型具体实现: - - ```text - mindformers/models/llama - ├── __init__.py - ├── llama.py # 模型实现 - ├── llama_config.py # 模型配置项 - ├── llama_layer.py # llama网络层定义 - ├── llama_processor.py # llama预处理 - └── llama_transformer.py # transformer层实现 - ``` - -2. 
模型配置: - - ```text - research/llama3_1 - ├──llama3_1_8b - │ ├── predict_llama3_1_8b.yaml # 8B推理配置 - │ └── finetune_llama3_1_8b.yaml # 8B全量微调启动配置 - └──llama3_1_70b - ├── predict_llama3_1_70b.yaml # 70B推理配置 - └── finetune_llama3_1_70b.yaml # 70B全量微调启动配置 - ``` - -3. 数据预处理脚本和任务启动脚本: - - ```text - research/llama3_1 - ├── llama3_1_tokenizer.py # llama3_1 tokenizer处理脚本 - ├── llama3_1_conversation.py # 微调数据集处理,将原始alpaca转换为对话形式alpaca - └── llama3_1_preprocess.py # llama模型的mindrecord数据处理脚本 - ``` - -## 环境及数据准备 - -### 安装环境 - -MindFormers软硬件配套关系以及安装参考[环境安装指南](../../README_CN.md#源码编译安装) -和[版本匹配关系](../../README_CN.md#版本匹配关系)。 - -### 数据集及权重准备 - -#### 数据集下载 - -MindFormers提供**alpaca**作为[微调](#微调)数据集。 - -| 数据集名称 | 适用模型 | 适用阶段 | 下载链接 | -|:--------|:------------------------------:|:--------:|:-------------------------------------------------------------------------------:| -| alpaca | llama3_1-8b
llama3_1-70b | Finetune | [Link](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json) | - -数据预处理中所用的`tokenizer.model`可以参考[模型权重下载](#模型权重下载)进行下载。 - -- **alpaca 数据预处理** - - 1. 执行`mindformers/tools/dataset_preprocess/llama/alpaca_converter.py`,使用fastchat工具添加prompts模板,将原始数据集转换为多轮对话格式。 - - ```shell - python alpaca_converter.py \ - --data_path /{path}/alpaca_data.json \ - --output_path /{path}/alpaca-data-conversation.json - - # 参数说明 - data_path: 输入下载的文件路径 - output_path: 输出文件的保存路径 - ``` - - 2. 执行`research/llama3_1/llama3_1_preprocess.py`,生成Mindrecord数据,将带有prompt模板的数据转换为mindrecord格式。 - - ```shell - # 此工具依赖fschat工具包解析prompt模板, 请提前安装fschat >= 0.2.13(要求python >= 3.9) - python llama3_1_preprocess.py \ - --dataset_type qa \ - --input_glob /{path}/alpaca-data-conversation.json \ - --model_file /{path}/tokenizer.model \ - --seq_length 8192 \ - --output_file /{path}/alpaca-fastchat8192.mindrecord - - # 参数说明 - dataset_type: 预处理数据类型 - input_glob: 转换后的alpaca的文件路径 - model_file: 模型tokenizer.model文件路径 - seq_length: 输出数据的序列长度 - output_file: 输出文件的保存路径 - ``` - -> 数据处理时注意bos,eos,pad等特殊`ids`要和配置文件中`model_config`里保持一致。 - -#### 模型权重下载 - -MindFormers暂时没有提供权重,用户可以下载HuggingFace官方权重经过[模型权重转换](#模型权重转换)后进行使用。 - -词表下载链接:[tokenizer.model](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) - -| 模型名称 | MindSpore权重 | HuggingFace权重 | -|:-------------|:-----------:|:------------------------------------------------------------:| -| Llama3_1-8B | - | [Link](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) | -| Llama3_1-70B | - | [Link](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B) | - -> 注: 请自行申请huggingface上llama3_1使用权限,并安装transformers==4.40版本 - -#### 模型权重转换 - -下载完成后,运行`mindformers/convert_weight.py`转换脚本,将huggingface的权重转换为完整的ckpt权重。 - -```shell -python convert_weight.py --model llama --input_path TORCH_CKPT_DIR --output_path {path}/MS_CKPT_NAME --dtype bf16 - -# 参数说明 -model: 模型名称 -input_path: 下载HuggingFace权重的文件夹路径 -output_path: 转换后的MindSpore权重文件保存路径 -dtype: 转换权重的精度 -``` - -## 微调 
- -### 全参微调 - -MindFormers提供`Llama3_1-8b`单机多卡以及`Llama3_1-70b`多机多卡的微调示例,过程中使用`alpaca` -数据集对模型进行微调,数据集可以参考[数据集下载](#数据集下载)获得。 - -#### 单机训练 - -以Llama3_1-8b为例,在Atlas 800T A2上训练,支持**单机/多机训练**。 - -使用`finetune_llama3_1_8b.yaml`进行训练,或修改默认配置文件中的`model_config.seq_length` -,使训练配置与数据集的`seq_length`保持一致。 - -执行以下命令,在单机上拉起微调任务。 - -```shell -# 单机8卡默认快速启动 -bash scripts/msrun_launcher.sh "run_mindformer.py \ - --register_path research/llama3_1 \ - --config research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml \ - --load_checkpoint model_dir/xxx.ckpt \ - --auto_trans_ckpt True \ - --use_parallel True \ - --run_mode finetune \ - --train_data dataset_dir" - -# 参数说明 -config: 配置文件路径 -load_checkpoint: 权重文件路径 -auto_trans_ckpt: 自动权重转换开关 -run_mode: 运行模式, 微调时设置为finetune -train_data: 训练数据集路径 -``` - -#### 多机训练 - -以llama3_1-70b为例,使用`finetune_llama3_1_70b.yaml`配置文件,执行8机64卡微调。需要先对权重进行切分,切分权重可以参见[权重切分与合并](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/ckpt.html#%E6%9D%83%E9%87%8D%E5%88%87%E5%88%86%E4%B8%8E%E5%90%88%E5%B9%B6)(如果是共享盘也可以开启自动权重转换,使用完整权重)。 - -多机多卡执行脚本进行分布式训练需要分别在不同节点运行脚本,并将参数MASTER_ADDR设置为主节点的ip地址,所有节点设置的ip地址相同,不同节点之间仅参数NODE_RANK不同,各个参数位置含义参见[使用指南](../../README_CN.md#三使用指南)。 - -在每台机器上运行以下命令,多机运行命令在每台机器上仅`node_num` 不同,从0开始计数,命令中主节点ip为第0个节点ip。 - -```shell -# 节点0,设0节点ip为192.168.1.1,作为主节点ip,总共64卡且每个节点8卡 -# 节点0、节点1、...节点7 依次修改node_num,比如8机,node_num为0~7。 -bash scripts/msrun_launcher.sh "run_mindformer.py \ - --register_path research/llama3_1 \ - --config research/llama3_1/llama3_1_70b/finetune_llama3_1_70b.yaml \ - --load_checkpoint model_dir/xxx.ckpt \ - --train_data dataset_dir \ - --auto_trans_ckpt False \ - --use_parallel True \ - --run_mode finetune" \ - 64 8 {主节点ip} 8118 {node_num} output/msrun_log False 300 -``` - -## 推理 - -MindFormers提供`Llama3_1-8b`的快速推理脚本,脚本主要通过generate高阶接口实现,支持单卡推理。推理输入默认不添加bos字符,如果需要添加可在config中增加add_bos_token选项。 - -```shell -# 脚本使用 -bash scripts/examples/llama3/run_llama3_predict.sh PARALLEL CONFIG_PATH CKPT_PATH VOCAB_FILE DEVICE_NUM - -# 参数说明 
-PARALLEL: 是否使用多卡推理, 'single'表示单卡推理, 'parallel'表示多卡推理 -CONFIG_PATH: 模型配置文件路径 -CKPT_PATH: 模型权重文件路径 -VOCAB_FILE: 词表路径 -DEVICE_NUM: 使用卡数, 仅开启多卡推理时生效 -``` - -### 单卡推理 - -以`Llama3_1-8b`单卡推理为例。 - -```shell -bash scripts/examples/llama3/run_llama3_predict.sh single \ - research/llama3_1/llama3_1_8b/predict_llama3_1_8b.yaml \ - path/to/llama3_1_8b.ckpt \ - path/to/tokenizer.model -``` - -### 多卡推理 - -以`Llama3_1-70b`4卡推理为例。Llama3_1-70b权重较大,建议先进行权重切分,参见[权重切分与合并](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/ckpt.html#%E6%9D%83%E9%87%8D%E5%88%87%E5%88%86%E4%B8%8E%E5%90%88%E5%B9%B6)。 - -```shell -bash scripts/examples/llama3/run_llama3_predict.sh parallel \ - research/llama3_1/llama3_1_70b/predict_llama3_1_70b.yaml \ - path/to/model_dir \ - path/to/tokenizer.model 4 -``` - -## 基于MindIE的服务化推理 - -MindIE,全称Mind Inference Engine,是华为昇腾针对AI全场景业务的推理加速套件。 - -MindFormers承载在模型应用层MindIE-LLM中,MindIE-LLM是大语言模型推理框架,提供API支持大模型推理能力。 - -MindIE安装流程请参考[MindIE服务化部署文档](https://www.mindspore.cn/mindformers/docs/zh-CN/master/guide/deployment.html)。 - -以下例子默认已完成MindIE安装部署且仅适用于**MindIE RC3版本**,且安装路径均为默认路径`/usr/local/Ascend/`。 - -### 单卡推理 - -此例子使用llama3_1-8B模型演示。 - -#### 修改MindIE启动配置 - -打开mindie-service中的config.json文件,修改server相关配置。 - -```bash -vim /usr/local/Ascend/mindie/1.0.RC3/mindie-service/conf/config.json -``` - -需要关注以下字段的配置 - -1. `ModelDeployConfig.ModelConfig.backendType` - - 该配置为对应的后端类型,必填"ms"。 - - ```json - "backendType": "ms" - ``` - - 2. 
`ModelDeployConfig.ModelConfig.modelWeightPath` - - 该配置为模型配置文件目录,放置模型和tokenizer等相关文件。 - - 以llama3_1-8B为例,`modelWeightPath`的组织结构如下: - - ```text - mf_model - └── llama3_1_8b - ├── config.json # 模型json配置文件 - ├── tokenizer.model # 模型vocab文件,hf上对应模型下载 - ├── predict_llama3_1_8b.yaml # 模型yaml配置文件 - ├── llama3_1_tokenizer.py # 模型tokenizer文件,从mindformers仓中research目录下找到对应模型复制 - └── llama3_1_8b.ckpt # 单卡模型权重文件 - ``` - - predict_llama3_1_8b.yaml需要关注以下配置: - - ```yaml - load_checkpoint: '/mf_model/llama3_1_8b/llama3_1_8b.ckpt' # 模型单卡权重文件的存放路径 - use_parallel: False - model: - model_config: - type: LlamaConfig - auto_map: - AutoTokenizer: [llama3_1_tokenizer.Llama3Tokenizer, null] - processor: - tokenizer: - vocab_file: "/mf_model/llama3_1_8b/tokenizer.model" # vocab文件路径 - ``` - - 模型的config.json文件可以使用`save_pretrained`接口生成,示例如下: - - ```python - from mindformers import AutoConfig - - model_config = AutoConfig.from_pretrained("/mf_model/llama3_1_8b/predict_llama3_1_8b.yaml") - model_config.save_pretrained(save_directory="/mf_model/llama3_1_8b", save_json=True) - ``` - - 模型权重下载和转换可参考 [权重格式转换](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore/mindspore.ckpt_to_safetensors.html)。 - - 准备好模型配置目录后,设置参数`modelWeightPath`为该目录路径。 -```json - "modelWeightPath": "/mf_model/llama3_1_8b" -``` - -最终修改完后的config.json如下: - -```json -{ - "Version": "1.0.0", - "LogConfig" : - { - "logLevel" : "Info", - "logFileSize" : 20, - "logFileNum" : 20, - "logPath" : "logs/mindservice.log" - }, - - "ServerConfig" : - { - "ipAddress" : "127.0.0.1", - "managementIpAddress": "127.0.0.2", - "port" : 1025, - "managementPort" : 1026, - "metricsPort" : 1027, - "maxLinkNum" : 1000, - "httpsEnabled" : false, - "fullTextEnabled" : false, - "tlsCaPath" : "security/ca/", - "tlsCaFile" : ["ca.pem"], - "tlsCert" : "security/certs/server.pem", - "tlsPk" : "security/keys/server.key.pem", - "tlsPkPwd" : "security/pass/key_pwd.txt", - "tlsCrl" : "security/certs/server_crl.pem", - "managementTlsCaFile" : 
["management_ca.pem"], - "managementTlsCert" : "security/certs/management/server.pem", - "managementTlsPk" : "security/keys/management/server.key.pem", - "managementTlsPkPwd" : "security/pass/management/key_pwd.txt", - "managementTlsCrl" : "security/certs/management/server_crl.pem", - "kmcKsfMaster" : "tools/pmt/master/ksfa", - "kmcKsfStandby" : "tools/pmt/standby/ksfb", - "inferMode" : "standard", - "pdInterNodeTLSEnabled": false, - "pdCommunicationPort": 1121, - "interNodeTlsCaFile" : "security/grpc/ca/ca.pem", - "interNodeTlsCert" : "security/grpc/certs/server.pem", - "interNodeTlsPk" : "security/grpc/keys/server.key.pem", - "interNodeTlsPkPwd" : "security/grpc/pass/key_pwd.txt", - "interCommTlsCrl" : "security/certs/server_crl.pem", - "interNodeKmcKsfMaster": "tools/pmt/master/ksfa", - "interNodeKmcKsfStandby": "tools/pmt/standby/ksfb" - }, - - "BackendConfig": { - "backendName" : "mindieservice_llm_engine", - "modelInstanceNumber" : 1, - "npuDeviceIds" : [[0]], - "tokenizerProcessNumber" : 8, - "multiNodesInferEnabled": false, - "multiNodesInferPort": 1120, - "interNodeTLSEnabled": true, - "interNodeTlsCaFile": "security/grpc/ca/ca.pem", - "interNodeTlsCert": "security/grpc/certs/server.pem", - "interNodeTlsPk": "security/grpc/keys/server.key.pem", - "interNodeTlsPkPwd": "security/grpc/pass/mindie_server_key_pwd.txt", - "interNodeTlsCrl" : "security/grpc/certs/server_crl.pem", - "interNodeKmcKsfMaster": "tools/pmt/master/ksfa", - "interNodeKmcKsfStandby": "tools/pmt/standby/ksfb", - "ModelDeployConfig": - { - "maxSeqLen" : 2560, - "maxInputTokenLen" : 2048, - "truncation" : false, - "ModelConfig" : [ - { - "modelInstanceType": "Standard", - "modelName" : "llama3_1_8b", - "modelWeightPath" : "/mf_model/llama3_1_8b", - "worldSize" : 1, - "cpuMemSize" : 16, - "npuMemSize" : 16, - "backendType": "ms" - } - ] - }, - - "ScheduleConfig": - { - "templateType": "Standard", - "templateName" : "Standard_LLM", - "cacheBlockSize" : 128, - - "maxPrefillBatchSize" : 50, - 
"maxPrefillTokens" : 8192, - "prefillTimeMsPerReq" : 150, - "prefillPolicyType" : 0, - - "decodeTimeMsPerReq" : 50, - "decodePolicyType" : 0, - - "maxBatchSize" : 200, - "maxIterTimes" : 512, - "maxPreemptCount" : 0, - "supportSelectBatch" : false, - "maxQueueDelayMicroseconds" : 5000 - } - } -} -``` - -> 注:为便于测试,`httpsEnabled`参数设置为`false`,忽略后续https通信相关参数。 - -#### 启动服务 - -```bash -cd /usr/local/Ascend/mindie/1.0.RC3/mindie-service -nohup ./bin/mindieservice_daemon > output.log 2>&1 & -tail -f output.log -``` - -打印如下信息,说明启动成功。 - -```text -Daemon start success! -``` - -#### 请求测试 - -服务启动成功后,可使用curl命令发送请求验证,端口与上述config.json中`port`保持一致,样例如下: - -```bash -curl -w "\ntime_total=%{time_total}\n" -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"inputs": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n请介绍一下自己<|im_end|>\n<|im_start|>assistant\n","stream": false}' http://127.0.0.1:1025/generate -``` - -返回推理结果验证成功: - -```json -{"generated_text":"我叫小助手,专门为您服务的。<|im_end|>\n<"} -``` diff --git a/research/llama3_1/infer/layers.py b/research/llama3_1/infer/layers.py deleted file mode 100644 index 3af3965e1e40c24a52cf684bff67e5584ccd0a61..0000000000000000000000000000000000000000 --- a/research/llama3_1/infer/layers.py +++ /dev/null @@ -1,523 +0,0 @@ -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-# ============================================================================ -""" -DEPRECATED MODULE - -This module is deprecated and will be removed in future releases. -Layers -""" -import mindspore.common.dtype as mstype -import mindspore.ops.functional as F -import mindspore.ops.operations as P -from mindspore import Parameter, Tensor, mint, nn, ops -from mindspore.common.initializer import initializer - -from mindformers.parallel_core.inference.parallel_state import default_pgs -from mindformers.parallel_core.inference.tensor_parallel.mappings import (gather_from_model_parallel_region, - reduce_from_model_parallel_region, - reduce_scatter_to_model_parallel_region, - scatter_to_model_parallel_region) -from mindformers.parallel_core.inference.tensor_parallel.random import (TENSOR_PARALLEL_GENERATOR, - get_rng_tracer) -from mindformers.parallel_core.inference.utils import divide -from mindformers.version_control import check_valid_gmm_op -from mindformers.models.utils import jit - - -class ColumnParallelLinear(nn.Cell): - """ - The dense layer with weight sliced on second dimension by tensor parallel size. - This layer implements the operation as: - - .. math:: - \\text{outputs} = \\text{inputs} * \\text{weight} + \\text{bias}, - - where :math:`inputs` is the input tensors, :math:`\\text{weight}` is a weight matrix created by the layer, - and :math:`\\text{bias}` is a bias vector created by the layer (only if has_bias is True). - - Args: - input_size (int): The number of channels in the input space. - output_size (int): The number of channels in the output space. - config (dict): Parallel configuration. - weight_init (Union[Tensor, str, Initializer, numbers.Number]): The trainable weight_init parameter. The values - of str refer to the function `initializer`. Default: 'normal'. - bias_init (Union[Tensor, str, Initializer, numbers.Number]): The trainable bias_init parameter. The values - of str refer to the function `initializer`. Default: 'zeros'. 
- bias (bool): Specifies whether the layer uses a bias vector. Default: True. - gather_output (bool): Specifies whether to gather the output on each tensor parallel rank. Default: False. - skip_weight_param_allocation (bool): Specifies whether to skip the initialization of the weight parameter. - When set to True, a weight tensor should be passed to the construct function. Default: False. - is_expert (bool): Specifies whether this linear layer is an expert. Default: False. - transpose_b (bool): Specifies whether the weight parameter will be initialized as a transposed shape. - param_init_type (dtype.Number): The parameter initialization type. Default: mstype.float32. - compute_dtype (dtype.Number): The computation type. Default: mstype.float16. - expert_num (int): The number of experts. Default: 1. - tp_group (ProcessGroup): The process group used by this linear layer. Default: default_pgs. - - Inputs: - - **x** (Tensor) - Tensor of shape :math:`(*, in\\_channels)`. The `input_size` in `Args` should be equal - to :math:`in\\_channels` in `Inputs`. - - Outputs: - Tensor of shape :math:`(*, out\\_channels)`. - - Raises: - ValueError: If `skip_weight_param_allocation=True` but no weight tensor is passed to the construct function. 
- - Supported Platforms: - ``Ascend`` - """ - - def __init__( - self, - input_size, - output_size, - config, - weight_init="normal", - bias_init="zeros", - bias=True, - gather_output=False, - stride=1, - keep_master_weight_for_test=False, - skip_bias_add=False, - skip_weight_param_allocation=False, - embedding_activation_buffer=None, - grad_output_buffer=None, - is_expert=False, - tp_comm_buffer_name=None, - disable_grad_reduce=False, - transpose_b=True, - param_init_type=mstype.float32, - compute_dtype=mstype.float16, - expert_num=1, - tp_group=default_pgs, - ): - super(ColumnParallelLinear, self).__init__() - if stride > 1: - raise NotImplementedError("For ColumnParallelLinear, `stride > 1` is not supported for now, " - "but got `stride={}`".format(stride)) - if keep_master_weight_for_test: - raise NotImplementedError("For ColumnParallelLinear, `keep_master_weight_for_test=True` " - "is not supported for now.") - if skip_bias_add: - raise NotImplementedError("For ColumnParallelLinear, `skip_bias_add=True` is not supported for now.") - if embedding_activation_buffer: - raise NotImplementedError("For ColumnParallelLinear, `embedding_activation_buffer` is not supported " - "for now.") - if grad_output_buffer: - raise NotImplementedError("For ColumnParallelLinear, `grad_output_buffer` is not supported for now.") - if tp_comm_buffer_name: - raise NotImplementedError("For ColumnParallelLinear, `tp_comm_buffer_name` is not supported for now.") - if disable_grad_reduce: - raise NotImplementedError("For ColumnParallelLinear, `disable_grad_reduce=True` is not supported for now.") - - self.input_size = input_size - self.output_size = output_size - self.has_bias = bias - self.gather_output = gather_output - self.tp_group = tp_group - self.tensor_parallel_group_size = self.tp_group.size - - self.output_size_per_partition = divide(output_size, self.tensor_parallel_group_size) - self.is_expert = is_expert - self.expert_num = expert_num - self.skip_weight_param_allocation = 
skip_weight_param_allocation - self.parallel_config = config - self.compute_dtype = compute_dtype - - self.sequence_parallel = self.parallel_config.use_sequence_parallel - self.transpose_b = transpose_b if self.expert_num <= 1 else False - - if self.sequence_parallel and self.tensor_parallel_group_size <= 1: - self.sequence_parallel = False - - weight_shape = (self.output_size_per_partition, self.input_size) if self.transpose_b else ( - self.input_size, self.output_size_per_partition) - if self.is_expert and self.expert_num > 1: - weight_shape = (self.expert_num,) + weight_shape - if check_valid_gmm_op(gmm_version='GroupedMatmulV4'): - self.matmul = ops.auto_generate.GroupedMatmulV4() - else: - self.matmul = ops.auto_generate.GroupedMatmul(split_item=3, group_type=0) - else: - self.matmul = P.MatMul(transpose_b=self.transpose_b) - with get_rng_tracer().rng_fork(TENSOR_PARALLEL_GENERATOR): - if not self.skip_weight_param_allocation: - self.weight = Parameter(initializer(weight_init, weight_shape, param_init_type), name="weight") - - if self.has_bias: - self.bias = Parameter( - initializer( - bias_init, (self.output_size_per_partition), param_init_type - ), - name="bias", - ) - self.bias_add = P.Add() - - self.cast = P.Cast() - self.shape = P.Shape() - self.reshape = P.Reshape() - - @jit - def construct(self, input_parallel, weight=None, group_list=None): - """ - Forward of ColumnParallelLinear. - Performs a linear transformation considering various parallel modes and data type conversions. 
- """ - - if weight is None and self.skip_weight_param_allocation: - raise ValueError("For ColumnParallelLinear, when skip_weight_param_allocation=True," - " weight should be passed to construct(), but got None.") - - origin_dtype = F.dtype(input_parallel) - if self.skip_weight_param_allocation: - weight = self.cast(weight, self.compute_dtype) - else: - weight = self.cast(self.weight, self.compute_dtype) - input_parallel = self.cast(input_parallel, self.compute_dtype) - - if self.sequence_parallel: - input_parallel = input_parallel.swapaxes(0, 1).contiguous() - input_parallel = self.gather_from_sp_region(input_parallel) - input_parallel = input_parallel.swapaxes(0, 1).contiguous() - - output_shape = self.shape(input_parallel)[:-1] + (self.output_size_per_partition,) - input_parallel = self.reshape(input_parallel, (-1, self.input_size)) - if self.is_expert and self.expert_num > 1: - if check_valid_gmm_op(gmm_version='GroupedMatmulV4'): - output_parallel = self.matmul([input_parallel], [weight], None, None, None, None, None, None, - group_list, split_item=3, group_type=0, group_list_type=1)[0] - else: - output_parallel = self.matmul([input_parallel], [weight], None, None, None, None, None, - group_list)[0] - - else: - output_parallel = self.matmul(input_parallel, weight) - if self.has_bias: - output_parallel = self.bias_add( - output_parallel, self.cast(self.bias, self.compute_dtype) - ) - output_parallel = self.cast(output_parallel, origin_dtype) - output_parallel = self.reshape(output_parallel, output_shape) - - if self.gather_output: - output = gather_from_model_parallel_region(output_parallel, self.tp_group) - else: - output = output_parallel - return output - - def sharded_state_dict(self): - """provide the sharded state dict based on the config""" - w_shard = (self.tensor_parallel_group_size, 1) if self.transpose_b else (1, self.tensor_parallel_group_size) - - if self.is_expert and self.expert_num > 1: - w_shard = (1, self.tensor_parallel_group_size, 1) if 
self.transpose_b \ - else (1, 1, self.tensor_parallel_group_size) - - state_dict = {} - if not self.skip_weight_param_allocation: - state_dict[self.weight.name] = {'shape': self.weight.shape, - 'shard': w_shard} - if self.has_bias: - state_dict[self.bias.name] = {'shape': self.bias.shape, - 'shard': (self.tensor_parallel_group_size,)} - return state_dict - - -class RowParallelLinear(nn.Cell): - r""" - The dense layer with weight sliced on first dimension by tensor parallel size. - This layer implements the operation as: - - .. math:: - \text{outputs} = \text{inputs} * \text{weight} + \text{bias}, - - where :math:`inputs` is the input tensors, :math:`\text{weight}` is a weight matrix created by the layer, - and :math:`\text{bias}` is a bias vector created by the layer (only if has_bias is True). - - Args: - input_size (int): The number of channels in the input space. - output_size (int): The number of channels in the output space. - config (dict): Parallel configuration. - input_is_parallel (bool): Specifies whether the input tensor has already been sliced on last dimension. - weight_init (Union[Tensor, str, Initializer, numbers.Number]): The trainable weight_init parameter. The values - of str refer to the function `initializer`. Default: 'normal'. - bias_init (Union[Tensor, str, Initializer, numbers.Number]): The trainable bias_init parameter. The values - of str refer to the function `initializer`. Default: 'zeros'. - bias (bool): Specifies whether the layer uses a bias vector. Default: True. - skip_bias_add (bool): Specifies whether the layer doesn't need to add bias. Default: False. - is_expert (bool): Specifies whether this linear layer is an expert. Default: False. - transpose_b (bool): Specifies whether the weight parameter will be initialized as a transposed shape. - param_init_type (dtype.Number): The parameter initialization type. Default: mstype.float32. - compute_dtype (dtype.Number): The computation type. Default: mstype.float16. 
- expert_num (int): The number of experts. Default: 1. - tp_group (ProcessGroup): The process group used by this linear layer. Default: default_pgs. - - Inputs: - - **x** (Tensor) - Tensor of shape :math:`(*, in\_channels)`. The `input_size` in `Args` should be equal - to :math:`in\_channels` in `Inputs`. - - Outputs: - Tensor of shape :math:`(*, out\_channels)`. - - Supported Platforms: - ``Ascend`` - """ - - def __init__( - self, - input_size, - output_size, - config, - input_is_parallel, - weight_init="normal", - bias_init="zeros", - bias=True, - skip_bias_add=False, - stride=1, - keep_master_weight_for_test=False, - is_expert=False, - tp_comm_buffer_name=None, - transpose_b=True, - param_init_type=mstype.float32, - compute_dtype=mstype.float16, - expert_num=1, - delay_allreduce=False, - tp_group=default_pgs, - ): - super(RowParallelLinear, self).__init__() - if stride > 1: - raise NotImplementedError("For RowParallelLinear, `stride > 1` is not supported for now, " - "but got `stride={}`".format(stride)) - if keep_master_weight_for_test: - raise NotImplementedError("For RowParallelLinear, `keep_master_weight_for_test=True` " - "is not supported for now.") - if tp_comm_buffer_name: - raise NotImplementedError("For RowParallelLinear, `tp_comm_buffer_name` is not supported for now.") - - self.input_size = input_size - self.output_size = output_size - self.has_bias = bias - self.skip_bias_add = skip_bias_add - self.input_is_parallel = input_is_parallel - self.tp_group = tp_group - self.tensor_parallel_group_size = self.tp_group.size - self.input_size_per_partition = divide(input_size, self.tensor_parallel_group_size) - self.parallel_config = config - self.compute_dtype = compute_dtype - self.sequence_parallel = self.parallel_config.use_sequence_parallel - self.expert_num = expert_num - self.is_expert = is_expert - self.transpose_b = transpose_b if self.expert_num <= 1 else False - self.delay_allreduce = delay_allreduce - - if self.sequence_parallel and not 
self.input_is_parallel: - raise RuntimeError( - "To enable `sequence_parallel`, `input_is_parallel` must be `True`" - ) - - if self.delay_allreduce and self.has_bias: - raise RuntimeError( - "In RowParallelLinear, `delay_allreduce` and `has_bias` cannot be enabled simultaneously, " - "otherwise the accuracy will be incorrect" - ) - - weight_shape = (self.output_size, self.input_size_per_partition) if self.transpose_b else ( - self.input_size_per_partition, self.output_size) - bias_shape = (self.output_size,) - if self.is_expert and self.expert_num > 1: - weight_shape = (self.expert_num,) + weight_shape - bias_shape = (1, self.expert_num, 1) + bias_shape - if check_valid_gmm_op(gmm_version='GroupedMatmulV4'): - self.matmul = ops.auto_generate.GroupedMatmulV4() - else: - self.matmul = ops.auto_generate.GroupedMatmul(split_item=3, group_type=0) - else: - self.matmul = P.MatMul(transpose_b=self.transpose_b) - with get_rng_tracer().rng_fork(TENSOR_PARALLEL_GENERATOR): - self.weight = Parameter( - initializer( - weight_init, - weight_shape, - param_init_type, - ), - name="weight", - ) - - if self.has_bias: - self.bias = Parameter(initializer(bias_init, bias_shape, param_init_type), name="bias") - self.bias_add = P.Add() - - self.shape = P.Shape() - self.reshape = P.Reshape() - self.cast = P.Cast() - - def construct(self, input_, group_list=None): - """ - Forward of RowParallelLinear. - Performs a linear transformation considering various parallel modes and data type conversions. 
- """ - - if self.input_is_parallel: - input_parallel = input_ - else: - input_parallel = scatter_to_model_parallel_region(input_, self.tp_group) - - origin_dtype = F.dtype(input_parallel) - weight = self.cast(self.weight, self.compute_dtype) - input_parallel = self.cast(input_parallel, self.compute_dtype) - output_shape = self.shape(input_parallel)[:-1] + (self.output_size,) - input_parallel = self.reshape(input_parallel, (-1, self.input_size_per_partition)) - if self.is_expert and self.expert_num > 1: - if check_valid_gmm_op(gmm_version='GroupedMatmulV4'): - output_parallel = self.matmul([input_parallel], [weight], None, None, None, None, None, None, - group_list, split_item=3, group_type=0, group_list_type=1)[0] - else: - output_parallel = self.matmul([input_parallel], [weight], None, None, None, None, None, - group_list)[0] - else: - output_parallel = self.matmul(input_parallel, weight) - - if self.sequence_parallel: - output_parallel = output_parallel.swapaxes(0, 1).contiguous() - output = reduce_scatter_to_model_parallel_region(output_parallel, self.tp_group) - output = output.swapaxes(0, 1).contiguous() - else: - if self.delay_allreduce or self.skip_bias_add: - output = output_parallel - else: - output = reduce_from_model_parallel_region(output_parallel, self.tp_group) - - if self.has_bias and not self.skip_bias_add: - output = self.bias_add(output, self.cast(self.bias, self.compute_dtype)) - output = self.cast(output, origin_dtype) - output = self.reshape(output, output_shape) - return output - - def sharded_state_dict(self): - """provide the sharded state dict based on the config""" - w_shard = (1, self.tensor_parallel_group_size) if self.transpose_b else (self.tensor_parallel_group_size, 1) - - if self.is_expert and self.expert_num > 1: - w_shard = (1, 1, self.tensor_parallel_group_size) if self.transpose_b \ - else (1, self.tensor_parallel_group_size, 1) - - state_dict = {} - state_dict[self.weight.name] = {'shape': self.weight.shape, - 'shard': w_shard} 
- if self.has_bias: - state_dict[self.bias.name] = {'shape': self.bias.shape, - 'shard': (1,)} - return state_dict - - -class VocabParallelEmbedding(nn.Cell): - """ - Embedding parallelized in the vocabulary dimension. - - Args: - num_embeddings: vocabulary size. - embedding_dim: size of hidden state. - parallel_config (Optional[Union[dict, ParallelContextConfig]]): - Parallel Config For Running Environment. Default: None. - init_method (Union[Tensor, str, Initializer, numbers.Number]): The trainable weight_init parameter. The values - of str refer to the function `initializer`. Default: 'normal'. - init_type (dtype.Number): The parameter initialization type. Default: mstype.float32. - tp_group (ProcessGroup): The process_group this linear layer used. Default: default_pgs. - """ - - def __init__( - self, - num_embeddings, - embedding_dim, - parallel_config, - init_method="normal", - init_type=mstype.float32, - tp_group=default_pgs, - ): - super().__init__() - self.num_embeddings = num_embeddings - self.embedding_dim = embedding_dim - self.sequence_parallel = parallel_config.use_sequence_parallel - - self.tp_group = tp_group - self.tensor_parallel_group_size = self.tp_group.size - rank = self.tp_group.rank - - self.vocab_start_index, self.vocab_end_index = self._vocab_range_from_global_vocab_size( - self.num_embeddings, rank, self.tensor_parallel_group_size) - self.num_embeddings_per_partition = self.vocab_end_index - self.vocab_start_index - - with get_rng_tracer().rng_fork(): - self.embedding_weight = Parameter( - initializer( - init=init_method, - shape=(self.num_embeddings_per_partition, self.embedding_dim), - dtype=init_type, - ), - name="embedding_weight", - ) - self.max_index_per_partition = Tensor(self.num_embeddings_per_partition - 1, dtype=mstype.int32) - self.expand_dims = ops.ExpandDims() - self.gather = ops.Gather() - - def construct(self, x): - """ - Forward of VocabParallelEmbedding. 
- Computes embeddings with optional masking and parallel reduction based on the model parallel size. - """ - - if self.tensor_parallel_group_size > 1: - displaced_x = mint.sub(x, self.vocab_start_index) - down_truncated_x = mint.nn.functional.relu(displaced_x) - truncated_x = mint.minimum(down_truncated_x, self.max_index_per_partition) - input_mask = mint.eq(displaced_x, truncated_x) - input_mask = self.expand_dims(input_mask, -1) - else: - input_mask = None - truncated_x = x - # Get the embeddings. - # 'embedding' has dynamic shape issue, use gather instead now. - output_parallel = self.gather(self.embedding_weight, truncated_x, 0) - # Mask the output embedding. - if self.tensor_parallel_group_size > 1: - output_parallel = mint.mul(output_parallel, input_mask) - - if self.sequence_parallel: - output_parallel = output_parallel.swapaxes(0, 1).contiguous() - output = reduce_scatter_to_model_parallel_region(output_parallel, self.tp_group) - output = output.swapaxes(0, 1).contiguous() - else: - # Reduce across all the model parallel devices. 
-            output = reduce_from_model_parallel_region(output_parallel, self.tp_group)
-        return output
-
-    def _vocab_range_from_global_vocab_size(self, global_vocab_size, rank, world_size):
-        if global_vocab_size % world_size != 0:
-            raise ValueError(f"The vocabulary size is {global_vocab_size}, "
-                             f"which is not divisible by the tensor parallel size ({world_size}).")
-        per_partition_vocab_size = divide(global_vocab_size, world_size)
-        index_f = rank * per_partition_vocab_size
-        index_l = index_f + per_partition_vocab_size
-        return index_f, index_l
-
-    def sharded_state_dict(self):
-        """provide the sharded state dict based on the config"""
-        w_shard = (self.tensor_parallel_group_size, 1)
-        state_dict = {}
-        state_dict[self.embedding_weight.name] = {'shape': self.embedding_weight.shape,
-                                                  'shard': w_shard}
-
-        return state_dict
diff --git a/research/llama3_1/infer/norm.py b/research/llama3_1/infer/norm.py
deleted file mode 100644
index 7c7e5d53fe02e9ebe53d6c65764ed31ff3f10ecb..0000000000000000000000000000000000000000
--- a/research/llama3_1/infer/norm.py
+++ /dev/null
@@ -1,92 +0,0 @@
-# Copyright 2025 Huawei Technologies Co., Ltd
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ============================================================================
-"""
-DEPRECATED MODULE

-This module is deprecated and will be removed in future releases.
-Normalization
-"""
-import mindspore.common.dtype as mstype
-import mindspore.ops.operations as P
-from mindspore import Parameter, nn
-from mindspore.common.initializer import initializer
-
-from mindformers.version_control import check_rmsnorm_big_kernel_valid
-
-
-class RMSNorm(nn.Cell):
-    r"""
-    A self-defined RMSNorm operation using reduce mean.
-
-    Args:
-        dim (int): The size of the normalized dimension (hidden size).
-        eps (float): The epsilon value of the denominator. Default: 1e-6.
-        compute_type: The compute type. Default: mstype.float32.
-    Inputs:
-        - **x** (Tensor) - Tensor of shape :math:`(batch, seq\_length, hidden\_size)`.
-
-    Outputs:
-        Tensor of shape :math:`(batch, seq\_length, hidden\_size)`.
-    """
-
-    def __init__(self, dim, eps=1e-6, compute_type=mstype.float32):
-        super().__init__()
-        self.eps = eps
-        self.compute_type = compute_type
-        self.weight = Parameter(initializer('ones', (dim,), dtype=self.compute_type), parallel_optimizer=False)
-
-        if check_rmsnorm_big_kernel_valid():
-            self.norm = P.RmsNorm(eps)
-            self.rms_norm = self._rms_norm
-            self.self_define = False
-            self.cast = P.Cast()
-            self.rcast = P.Cast()
-        else:
-            self.cast = P.Cast()
-            self.mul = P.Mul()
-            self.mul2 = P.Mul()
-            self.square = P.Square()
-            self.mean = P.ReduceMean(keep_dims=True)
-            self.add = P.Add()
-            self.rsqrt = P.Rsqrt()
-            self.rms_norm = self._self_norm
-            self.self_define = True
-
-    def _self_norm(self, x):
-        original_type = x.dtype
-        norm_factor = self.square(self.cast(x, self.compute_type))
-        norm_factor = self.mean(norm_factor, -1)
-        norm_factor = self.add(norm_factor, self.eps)
-        norm_factor = self.rsqrt(norm_factor)
-        output = self.mul(x, self.cast(norm_factor, original_type))
-        output = self.mul2(output, self.cast(self.weight, original_type))
-        return output
-
-    def _rms_norm(self, x):
-        original_type = x.dtype
-        output = self.norm(self.cast(x, self.compute_type), self.weight)[0]
-        return self.rcast(output, original_type)
-
-    def construct(self, x):
-        """Forward of RMSNorm."""
-        return self.rms_norm(x)
-
-    
def sharded_state_dict(self): - """provide the sharded state dict based on the config""" - w_shard = (1,) - state_dict = {} - state_dict[self.weight.name] = {'shape': self.weight.shape, - 'shard': w_shard} - return state_dict diff --git a/research/llama3_1/infer/parallel_paged_attention_mgr.py b/research/llama3_1/infer/parallel_paged_attention_mgr.py deleted file mode 100644 index 76751f1fa7e3a617e2fb43a271e8edd01c856ed3..0000000000000000000000000000000000000000 --- a/research/llama3_1/infer/parallel_paged_attention_mgr.py +++ /dev/null @@ -1,91 +0,0 @@ -# Copyright 2025 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# ============================================================================ -""" -DEPRECATED MODULE - -This module is deprecated and will be removed in future releases. -Paged Attention Manager for inference. 
-""" -import math - -import mindspore.common.dtype as mstype -from mindspore import ops as P -from mindspore import nn -from mindformers.parallel_core.inference.utils import create_empty_parameter - - -class ParallelPagedAttentionMgr(nn.Cell): - """Paged Attention Manager.""" - def __init__(self, - n_heads, - head_dim, - n_kv_heads, - kv_shape, - seq_length=-1, - compute_dtype=mstype.float16, - npu_mem_size=2): - super().__init__() - self.n_heads = n_heads - self.head_dim = head_dim - self.n_kv_heads = n_kv_heads - self.seq_length = seq_length - self.is_first_iteration = True - self.scale_value = 1 / math.sqrt(self.head_dim) - self.key_cache = None - self.value_cache = None - self.npu_mem_size = npu_mem_size - if self.npu_mem_size > 0: - self.key_cache = create_empty_parameter( - shape=kv_shape, - dtype=compute_dtype, - device="Ascend", - name="key_cache", - requires_grad=False, - ) - self.value_cache = create_empty_parameter( - shape=kv_shape, - dtype=compute_dtype, - device="Ascend", - name="value_cache", - requires_grad=False, - ) - - self.reshape_and_cache = P.auto_generate.ReshapeAndCache() - self.paged_attention = P.auto_generate.PagedAttention(self.n_heads, - self.scale_value, - self.n_kv_heads) - self.paged_attention_with_alibi = P.auto_generate.PagedAttentionMask(self.n_heads, - self.scale_value, - self.n_kv_heads) - - def construct(self, key, value, slot_mapping, _, key_cache=None, value_cache=None): - """The forward compute of KVCache for Paged Attention.""" - if self.npu_mem_size == -1: - return self.reshape_and_cache(key, value, key_cache, value_cache, slot_mapping) - return self.reshape_and_cache(key, value, self.key_cache, self.value_cache, slot_mapping) - - def paged_attn(self, query, batch_valid_length, block_tables, attn_mask=None, q_seq_lens=None, - key_cache=None, value_cache=None): - if self.npu_mem_size == -1: - return self._paged_attn(query, batch_valid_length, block_tables, attn_mask, q_seq_lens, - key_cache, value_cache) - return 
self._paged_attn(query, batch_valid_length, block_tables, attn_mask, q_seq_lens, - self.key_cache, self.value_cache) - - def _paged_attn(self, query, batch_valid_length, block_tables, attn_mask=None, q_seq_lens=None, - key_cache=None, value_cache=None): - """The forward compute of Paged Attention.""" - return self.paged_attention(query, key_cache, value_cache, block_tables, batch_valid_length, - None, None, attn_mask, q_seq_lens) diff --git a/research/llama3_1/infer/random.py b/research/llama3_1/infer/random.py deleted file mode 100644 index 66b3f91595a653544d1e5b073194ddb16251778b..0000000000000000000000000000000000000000 --- a/research/llama3_1/infer/random.py +++ /dev/null @@ -1,100 +0,0 @@ -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# ============================================================================ -""" -DEPRECATED MODULE - -This module is deprecated and will be removed in future releases. 
-RNGStateTracer
-"""
-from contextlib import contextmanager
-try:
-    from mindspore import manual_seed, get_rng_state, set_rng_state
-except ImportError:
-    from mindspore.nn.generator import manual_seed, get_rng_state, set_rng_state
-
-
-DATA_PARALLEL_GENERATOR = "dp_rng_generator"
-TENSOR_PARALLEL_GENERATOR = "tp_rng_generator"
-EXPERT_PARALLEL_GENERATOR = "exp_rng_generator"
-IS_SEED_SET = False
-CANDIDATE_MODES = [DATA_PARALLEL_GENERATOR, TENSOR_PARALLEL_GENERATOR, EXPERT_PARALLEL_GENERATOR]
-
-
-class RNGStateTracer:
-    """
-    Examples:
-        >>> with rngstatetracer.rng_fork():
-        >>>     tensor = mint.normal(mean, std)
-        >>>     ...
-    """
-    def __init__(self):
-        self.reset()
-
-    def reset(self):
-        self._states = {}
-
-    def set_state(self, states):
-        self._states = states
-
-    def get_state(self):
-        states = {}
-        for mode in self._states:
-            states[mode] = self._states[mode]
-        return states
-
-    def init_mode(self, mode, seed):
-        "initialize a mode with a seed; the mode must not already exist, otherwise reset should be called first"
-        if mode in self._states:
-            # if the mode already exists, raise an exception
-            raise ValueError(f"cannot init generator with already existing mode {mode}")
-        # save current state, set and record target state, then restore old state
-        orig_rng_state = get_rng_state()
-        manual_seed(seed)
-        self._states[mode] = get_rng_state()
-        set_rng_state(orig_rng_state)
-
-    # pylint: disable=W0101
-    @contextmanager
-    def rng_fork(self, mode=TENSOR_PARALLEL_GENERATOR):
-        "fork the rng state if the seed is already set, otherwise keep the rng state unchanged"
-        if not IS_SEED_SET:
-            yield
-            return
-        # if the mode does not exist, raise an exception
-        if mode not in self._states:
-            raise ValueError(f"tracer is not initialized or the parallel mode {mode} does not exist")
-        # save current state, then set target state
-        orig_rng_state = get_rng_state()
-        set_rng_state(self._states[mode])
-        try:
-            # yield to do job
-            yield
-        finally:
-            # restore old state
-            self._states[mode] = get_rng_state()
-            set_rng_state(orig_rng_state)
-
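The fork/restore contract of `RNGStateTracer` above (seed a dedicated stream once, then temporarily swap it in and advance it on every `rng_fork`) is framework-agnostic. Below is a minimal sketch of the same pattern using Python's stdlib `random` module; `TinyRNGTracer` and the `"tp"` mode name are hypothetical illustrations, not the deleted module's API:

```python
import random
from contextlib import contextmanager


class TinyRNGTracer:
    """Minimal analog of RNGStateTracer, backed by Python's `random` module."""

    def __init__(self):
        self._states = {}

    def init_mode(self, mode, seed):
        if mode in self._states:
            raise ValueError(f"mode {mode!r} already initialized")
        orig = random.getstate()            # save the caller's stream
        random.seed(seed)                   # derive a dedicated stream
        self._states[mode] = random.getstate()
        random.setstate(orig)               # caller's stream is untouched

    @contextmanager
    def rng_fork(self, mode):
        if mode not in self._states:
            raise ValueError(f"mode {mode!r} not initialized")
        orig = random.getstate()
        random.setstate(self._states[mode])  # swap in the dedicated stream
        try:
            yield
        finally:
            self._states[mode] = random.getstate()  # advance the forked stream
            random.setstate(orig)                   # restore the caller's stream


tracer = TinyRNGTracer()
tracer.init_mode("tp", seed=123)
with tracer.rng_fork("tp"):
    a = random.random()   # first draw of the dedicated stream
with tracer.rng_fork("tp"):
    b = random.random()   # continues that stream, not the caller's
```

Each `rng_fork` block draws from the dedicated stream and leaves the caller's global stream exactly where it was, which is what lets tensor-parallel initialization share one seed without perturbing other randomness.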
-default_rng_tracer_ = None
-
-
-def _init_default_rng_tracer():
-    global default_rng_tracer_
-    default_rng_tracer_ = RNGStateTracer()
-
-
-def get_rng_tracer():
-    if default_rng_tracer_ is None:
-        _init_default_rng_tracer()
-    return default_rng_tracer_
diff --git a/research/llama3_1/infer/scale_mask_softmax.py b/research/llama3_1/infer/scale_mask_softmax.py
deleted file mode 100644
index 1bd2c7561c1470cba4b8d0a4f4627b0037bbc5ce..0000000000000000000000000000000000000000
--- a/research/llama3_1/infer/scale_mask_softmax.py
+++ /dev/null
@@ -1,68 +0,0 @@
-# Copyright 2025 Huawei Technologies Co., Ltd
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ============================================================================
-"""
-DEPRECATED MODULE
-
-This module is deprecated and will be removed in future releases.
-ScaleMaskSoftmax
-"""
-import mindspore.ops.functional as F
-
-from mindspore import mint, ops, nn
-from mindspore.common import dtype as mstype
-
-
-class ScaleMaskSoftmax(nn.Cell):
-    r"""
-    fused operation: scaling + mask + softmax
-
-    Args:
-        mask_func: mask function to be applied.
-        scale: scaling factor used in input tensor scaling.
-        softmax_compute_type: the precision in which the softmax is computed.
-
-    Inputs:
-        - **x** (Tensor) - The input tensor
-        - **mask** (Tensor) - The mask tensor
-
-    Outputs:
-        - The output tensor.
- """ - - def __init__(self, mask_func, scale=None, softmax_compute_type=mstype.float32): - super().__init__() - self.mask_func = mask_func - self.softmax_compute_type = softmax_compute_type - self.scale = scale - - if self.scale is not None and self.softmax_compute_type != mstype.float32: - raise ValueError("softmax should be in fp32 when scaled") - - def construct(self, x, mask): - """construct method""" - origin_dtype = F.dtype(x) - if self.softmax_compute_type != origin_dtype: - x = ops.cast(x, self.softmax_compute_type) - - if self.scale is not None: - x = x * self.scale - masked_input = self.mask_func(x, mask) if mask is not None else x - - probs = mint.nn.functional.softmax(masked_input, dim=-1) - - if self.softmax_compute_type != origin_dtype: - probs = ops.cast(probs, origin_dtype) - - return probs diff --git a/research/llama3_1/infer/transformer.py b/research/llama3_1/infer/transformer.py deleted file mode 100644 index 45a194e2ec54c4f41964b36127d7a45c0523abee..0000000000000000000000000000000000000000 --- a/research/llama3_1/infer/transformer.py +++ /dev/null @@ -1,888 +0,0 @@ -# Copyright 2025 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# ============================================================================ -""" -DEPRECATED MODULE - -This module is deprecated and will be removed in future releases. 
-For transformer -""" -import math -import os - -import numpy as np - -import mindspore.common.dtype as mstype -from mindspore import Parameter, Tensor, mint, nn, ops -from mindspore.common.initializer import initializer - -from mindformers.parallel_core.inference.utils import divide, get_attn_mask_func -from mindformers.parallel_core.inference.transformer.activation import get_act_func -from mindformers.parallel_core.process_group_config import default_model_comm_pgs -from mindformers.modules.flash_attention import FlashAttention -from mindformers.modules.infer_attention import InferRotaryEmbedding -from mindformers.modules.layers import FreqsMgr, RotaryEmbedding -from mindformers.modules.transformer import LowerTriangularMaskWithDynamic -from mindformers.version_control import need_nz - -from research.llama3_1.infer.norm import RMSNorm -from research.llama3_1.infer.parallel_paged_attention_mgr import ParallelPagedAttentionMgr -from research.llama3_1.infer.scale_mask_softmax import ScaleMaskSoftmax -from research.llama3_1.infer.layers import ColumnParallelLinear, RowParallelLinear, VocabParallelEmbedding - - -class VocabEmbedding(nn.Cell): - """ - Embedding Layer. - - Args: - - **num_embeddings** (int): Size of the dictionary of embeddings. - - **embedding_dim** (int): The size of each embedding vector. - - **param_init_type** (mstype): The param init type, default mstype.float32. - - **param_init** (Union[Tensor, str, Initializer, numbers.Number]): Initializer for the embedding_table. - Refer to class `initializer` for the values of string when a string - is specified. Default: 'normal'. - Inputs: - - **input_ids** (Tensor) - The tokenized inputs with datatype int32 with shape (batch_size, seq_length) - - Outputs: - - **output** (Tensor) - The embedding vector for the input with shape (batch_size, - seq_length, embedding_size). 
-    """
-
-    def __init__(self, num_embeddings, embedding_dim, param_init_type=mstype.float32, param_init='normal',
-                 parallel_optimizer=False):
-        super().__init__()
-        self.num_embeddings = num_embeddings
-        self.embedding_dim = embedding_dim
-        self.embedding_weight = Parameter(
-            initializer(param_init, [self.num_embeddings, self.embedding_dim], dtype=param_init_type),
-            name='embedding_weight', parallel_optimizer=parallel_optimizer)
-        self.gather = ops.Gather()
-
-    def construct(self, input_ids):
-        """Forward of vocab embedding."""
-        # 'embedding' has dynamic shape issue, use gather instead now.
-        output = self.gather(self.embedding_weight, input_ids, 0)
-        return output
-
-
-class ParallelMLP(nn.Cell):
-    r"""
-    Implementation of parallel feedforward block.
-
-    Args:
-        config (dict): Configuration.
-        is_expert (bool): Whether this block is an expert block. Default: False.
-        model_comm_pgs (ModelCommProcessGroups, optional): Model communication process group.
-            Default: default_model_comm_pgs.
-
-    Inputs:
-        - **hidden_states** (Tensor) - Tensor of shape :math:`(B, S, H)`.
-
-    Outputs:
-        - **output** (Tensor) - Output tensor of shape :math:`(B, S, H)`.
- - Supported Platforms: - ``Ascend`` - """ - - def __init__(self, config, is_expert=False, model_comm_pgs=default_model_comm_pgs): - super().__init__(config) - if is_expert: - raise NotImplementedError("For ParallelMLP, `is_expert` is not supported for now.") - self.config = config - self.has_bias = self.config.mlp_has_bias - self.hidden_size = self.config.hidden_size - self.ffn_hidden_size = self.config.ffn_hidden_size - self.mlp_has_gate = self.config.mlp_has_gate - self.ffn_concat = self.config.ffn_concat - - self.tp = model_comm_pgs.tp - tp_group_size = self.tp.size - self.ffn_hidden_size_per_partition = divide(self.ffn_hidden_size, tp_group_size) - - if self.mlp_has_gate: - if self.ffn_concat: - self.w_gate_hidden = ColumnParallelLinear( - self.hidden_size, - self.ffn_hidden_size * 2, - config=self.config.parallel_config, - bias=self.has_bias, - transpose_b=True, - gather_output=False, - is_expert=is_expert, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - tp_group=self.tp, - ) - else: - self.w1 = ColumnParallelLinear( - self.hidden_size, - self.ffn_hidden_size, - config=self.config.parallel_config, - bias=self.has_bias, - transpose_b=True, - gather_output=False, - is_expert=is_expert, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - tp_group=self.tp, - ) - self.w3 = ColumnParallelLinear( - self.hidden_size, - self.ffn_hidden_size, - config=self.config.parallel_config, - bias=self.has_bias, - transpose_b=True, - gather_output=False, - is_expert=is_expert, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - tp_group=self.tp, - ) - else: - self.w1 = ColumnParallelLinear( - self.hidden_size, - self.ffn_hidden_size, - config=self.config.parallel_config, - bias=self.has_bias, - transpose_b=True, - gather_output=False, - is_expert=is_expert, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - 
tp_group=self.tp,
-            )
-
-        self.act_type = self.config.hidden_act
-        self.act_func = get_act_func(self.act_type)
-
-        # Project back to h.
-        self.w2 = RowParallelLinear(
-            self.ffn_hidden_size,
-            self.hidden_size,
-            input_is_parallel=True,
-            config=self.config.parallel_config,
-            bias=self.has_bias,
-            transpose_b=True,
-            is_expert=is_expert,
-            param_init_type=self.config.param_init_dtype,
-            compute_dtype=self.config.compute_dtype,
-            tp_group=self.tp,
-        )
-        self.mul = ops.Mul()
-        self.reshape = ops.Reshape()
-
-    def construct(self, x):
-        """ Construct function of mlp block. """
-        # [B, S, H] -> [B, S, ffn_H]
-        if self.mlp_has_gate:
-            if self.ffn_concat:
-                gate_hidden_out = self.w_gate_hidden(x)  # dp,1 -> dp, mp
-                gate_hidden_out_shape = gate_hidden_out.shape
-                reshape_out = self.reshape(gate_hidden_out,
-                                           (*gate_hidden_out_shape[:-1], self.ffn_hidden_size_per_partition, 2))
-                gate, hidden = mint.split(reshape_out,
-                                          (1, 1), -1)
-                gate = self.reshape(gate, (*gate_hidden_out_shape[:-1], self.ffn_hidden_size_per_partition))
-                hidden = self.reshape(hidden, (*gate_hidden_out_shape[:-1], self.ffn_hidden_size_per_partition))
-            else:
-                gate = self.w1(x)  # dp,1 -> dp, mp
-                hidden = self.w3(x)  # dp,1 -> dp, mp
-            gate = self.act_func(gate)
-            hidden = mint.mul(hidden, gate)
-        else:
-            hidden = self.w1(x)
-            hidden = self.act_func(hidden)
-
-        # [B, S, ffn_H] -> [B, S, H]
-        output = self.w2(hidden)
-        return output
-
-
-class CoreAttention(nn.Cell):
-    r"""
-    Get the weighted score along the seq_length.
-
-    Args:
-        layer_number (int): Number which indicates the index of this transformer layer in the
-            whole transformer block.
-        config (dict): Configuration.
-        attn_mask_type: Attention mask type. Not supported for now. Default: None.
-
-    Inputs:
-        - **query** (Tensor) - Tensor of query matrix.
-        - **key** (Tensor) - Tensor of key matrix.
-        - **value** (Tensor) - Tensor of value matrix.
-        - **attention_mask** (Tensor) - Tensor of attention mask matrix.
- - Outputs: - - **attn_output** (Tensor) - Tensor of shape :math:`(B, S, H)`. - - Supported Platforms: - ``Ascend`` - """ - - def __init__(self, layer_number, config, attn_mask_type=None): - super(CoreAttention, self).__init__() - if attn_mask_type: - raise NotImplementedError("For CoreAttention, `attn_mask_type` is not supported for now.") - self.config = config - self.layer_index = max(1, layer_number) - self.compute_dtype = self.config.compute_dtype - self.softmax_compute_dtype = self.config.softmax_compute_dtype - self.sequence_parallel = self.config.parallel_config.use_sequence_parallel - self.apply_query_key_layer_scaling = self.config.apply_query_key_layer_scaling - self.num_heads = self.config.num_heads - self.hidden_size = self.config.hidden_size - self.head_dim = divide(self.hidden_size, self.num_heads) - - coeff = None - norm_factor = math.sqrt(self.head_dim) - if self.apply_query_key_layer_scaling: - coeff = self.layer_index - norm_factor *= coeff - self.inv_norm_factor = Tensor(1.0 / norm_factor, dtype=self.compute_dtype) - - self.mask_func = get_attn_mask_func(self.config.mask_func_type) - self.scale_mask_softmax = ScaleMaskSoftmax(self.mask_func, - softmax_compute_type=self.softmax_compute_dtype) - - self.attention_dropout = mint.nn.Dropout(p=self.config.attention_dropout_rate) - - def construct(self, query_layer, key_layer, value_layer, attention_mask): - """ - Computes the attention scores, applies the attention mask, and returns the weighted - sum of the value layer based on the attention probabilities. - - Inputs: - ---------- - query_layer : Tensor - The query tensor of shape [B, N, S_q, D]. - key_layer : Tensor - The key tensor of shape [B, N, S_k, D]. - value_layer : Tensor - The value tensor of shape [B, N, S_k, D]. - attention_mask : Tensor - The attention mask tensor of shape [B, N, S_q, S_k]. - - Returns: - ------- - Tensor - The attention output tensor of shape [B, N, S_q, D]. 
-        """
-        # score shape: [B, N, S_q, S_k]
-        score = ops.bmm(query_layer, key_layer.transpose(0, 1, 3, 2))
-        score = mint.mul(score, self.inv_norm_factor)
-
-        # attention scores and attention mask [B, N, S_q, S_k]
-        attention_probs = self.scale_mask_softmax(score, attention_mask)
-
-        attention_probs = self.attention_dropout(attention_probs)
-
-        # [B, N, S_q, S_k] * [B, N, S_v, D] -> [B, N, S_q, D]
-        weighted_values = ops.bmm(attention_probs, value_layer)
-
-        return weighted_values
-
-
-class ParallelAttention(nn.Cell):
-    r"""
-    Parallel attention block.
-
-    Args:
-        layer_number (int): Number which indicates the index of this transformer layer in the
-            whole transformer block.
-        config (dict): Configuration.
-        attention_type (str): Attention type. Support ['self_attn', 'cross_attn']. Default: 'self_attn'.
-        model_comm_pgs (ModelCommProcessGroups, optional): Model communication process group.
-            Default: default_model_comm_pgs.
-
-    Inputs:
-        - **hidden_states** (Tensor) - Tensor of shape :math:`(B, S, H)`.
-        - **attention_mask** (Tensor) - Tensor of attention mask.
-        - **encoder_output** (Tensor) - Tensor of encoder output used for cross attention. Default: None.
-        - **rotary_pos_emb** (Tensor) - Tensor of rotary position embedding. Default: None.
-
-    Outputs:
-        - **output** (Tensor) - Tensor of shape :math:`(B, S, H)`.
- - Supported Platforms: - ``Ascend`` - """ - - def __init__(self, config, layer_number, attention_type="self_attn", attn_mask_type=None, - model_comm_pgs=default_model_comm_pgs): - super().__init__(config) - if attn_mask_type: - raise NotImplementedError("For ParallelAttention, `attn_mask_type` is not supported for now.") - self.config = config - self.layer_index = max(1, layer_number) - self.param_init_dtype = self.config.param_init_dtype - self.compute_dtype = self.config.compute_dtype - self.is_first_iteration = True - self.use_past = self.config.use_past - self.qkv_concat = self.config.qkv_concat - - self.attn_type = attention_type - self.num_heads = self.config.num_heads - self.kv_num_heads = self.num_heads if config.n_kv_heads is None else config.n_kv_heads - self.hidden_size = self.config.hidden_size - self.head_dim = divide(self.hidden_size, self.num_heads) - self.kv_hidden_size = self.head_dim * self.kv_num_heads - self.n_rep = divide(self.num_heads, self.kv_num_heads) - - self.sequence_parallel = self.config.parallel_config.use_sequence_parallel - self.use_flash_attention = self.config.use_flash_attention - self.norm_factor = math.sqrt(self.head_dim) - - self.tp = model_comm_pgs.tp - self.tp_group_size = self.tp.size - self.num_heads_per_partition = divide(self.num_heads, self.tp_group_size) - - self.use_gqa = (self.num_heads != self.kv_num_heads) - - if self.use_gqa: - self._check_gqa_valid() - self.kv_num_heads_per_partition = divide(self.kv_num_heads, self.tp_group_size) - self.repeat_num = divide(self.num_heads, self.kv_num_heads) - else: - self.kv_num_heads_per_partition = self.num_heads_per_partition - - if self.attn_type == "self_attn": - self._init_self_attn() - elif self.attn_type == "cross_attn": - self._init_cross_attn() - else: - raise NotImplementedError( - f"attention_type(str) should be 'self_attn' or 'cross_attn', but got {self.attn_type}") - self.reshape = ops.Reshape() - self.cast = ops.Cast() - self.wo = RowParallelLinear( - 
self.hidden_size, - self.hidden_size, - input_is_parallel=True, - config=self.config.parallel_config, - bias=self.config.out_proj_has_bias, - transpose_b=True, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - tp_group=self.tp, - ) - - if self.use_flash_attention: - input_layout = "TH" if self.use_past else "BNSD" - self.flash_attention = FlashAttention(head_num=self.num_heads_per_partition, - scale_value=1.0 / self.norm_factor, - next_tokens=0, - input_layout=input_layout) - else: - self.core_attention = CoreAttention(self.layer_index, self.config) - - if self.use_past: - if need_nz(): - kv_shape = (self.config.num_blocks, self.config.block_size, - self.kv_num_heads_per_partition * self.head_dim) - else: - kv_shape = (self.config.num_blocks, self.config.block_size, - self.kv_num_heads_per_partition, self.head_dim) - self.npu_mem_size = config.npu_mem_size if hasattr(config, "npu_mem_size") else 2 - self.paged_attention_mgr = ParallelPagedAttentionMgr(self.num_heads_per_partition, - self.head_dim, - self.kv_num_heads_per_partition, - kv_shape, - config.seq_length, - compute_dtype=self.compute_dtype, - npu_mem_size=self.npu_mem_size) - self.rotary_embedding = InferRotaryEmbedding(rotary_cos_format=2) - else: - self.apply_rotary_emb = RotaryEmbedding(self.head_dim, config.rotary_dtype) - - def construct(self, x, batch_valid_length, block_tables, slot_mapping, freqs_cis=None, - attn_mask=None, alibi_mask=None, encoder_output=None, prefix_keys_values=None, - q_seq_lens=None, key_cache=None, value_cache=None): - """Construct function of attention block.""" - # hidden states: [B, S, H] - # apply query, key, value projection - if self.attn_type == "self_attn": - if self.qkv_concat: - qkv = self.cast(self.w_qkv(x), self.compute_dtype) - reshape_qkv = self.reshape(qkv, - (-1, - self.kv_num_heads_per_partition, - (self.n_rep + 2) * self.head_dim)) - query, key, value = mint.split(reshape_qkv, - (self.head_dim * self.n_rep, - 
self.head_dim, - self.head_dim), -1) - if self.use_past: - query = self.reshape(query, (-1, self.hidden_size_per_partition)) - key = self.reshape(key, (-1, self.kv_hidden_size_per_partition)) - value = self.reshape(value, (-1, self.kv_hidden_size_per_partition)) - else: - query = self.cast(self.wq(x), self.compute_dtype) - key = self.cast(self.wk(x), self.compute_dtype) - value = self.cast(self.wv(x), self.compute_dtype) - if not self.use_past: - # [B, S, H] --> [B, S, N, D] - bs, seq_len, _ = x.shape - query = self.reshape(query, (bs, seq_len, self.num_heads_per_partition, self.head_dim)) - key = self.reshape(key, (bs, seq_len, self.kv_num_heads_per_partition, self.head_dim)) - value = self.reshape(value, (bs, seq_len, self.kv_num_heads_per_partition, self.head_dim)) - else: - query = self.cast(self.wq(x), self.compute_dtype) - if self.qkv_concat: - kv = self.cast(self.w_kv(encoder_output), self.compute_dtype) - key, value = mint.split(kv, (self.kv_hidden_size_per_partition, self.kv_hidden_size_per_partition), -1) - else: - key = self.cast(self.wk(encoder_output), self.compute_dtype) - value = self.cast(self.wv(encoder_output), self.compute_dtype) - - # qkv shape: [B, S, H] - if self.use_past: - if freqs_cis is not None: - query, key = self.rotary_embedding(query, key, freqs_cis, batch_valid_length) - - if prefix_keys_values is not None: - prefix_len = prefix_keys_values.shape[2] - slot_mapping = slot_mapping + self.cast(mint.ne(slot_mapping, -1), mstype.int32) * prefix_len - if self.is_first_iteration: - key, value = self._cat_prefix(key, value, prefix_keys_values) - - key_out = self.paged_attention_mgr(key, value, slot_mapping, batch_valid_length, - key_cache=key_cache, value_cache=value_cache) - query = ops.depend(query, key_out) - - if self.is_first_iteration: - if self.use_flash_attention: - context_layer = self.flash_attention(query, key, value, attn_mask, alibi_mask, None, None, - q_seq_lens, batch_valid_length) - else: - bs, seq_len, _ = x.shape - # [B, S, 
H] --> [B, S, N, D] - query = query.reshape(bs, seq_len, -1, self.head_dim) - key = key.reshape(bs, seq_len, -1, self.head_dim) - value = value.reshape(bs, seq_len, -1, self.head_dim) - # [B, S, N_kv, D] --> [B, S, N, D] - if self.use_gqa: - key = mint.repeat_interleave(key, repeats=self.repeat_num, dim=2) - value = mint.repeat_interleave(value, repeats=self.repeat_num, dim=2) - # [B, S, N, D] --> [B, N, S, D] - query = query.transpose(0, 2, 1, 3) - key = key.transpose(0, 2, 1, 3) - value = value.transpose(0, 2, 1, 3) - context_layer = self.core_attention(query, key, value, attn_mask) - # [B, N, S, D] --> [B, S, H] - context_layer = context_layer.transpose(0, 2, 1, 3).reshape( - bs, seq_len, self.hidden_size_per_partition) - else: - context_layer = self.paged_attention_mgr.paged_attn(query, batch_valid_length, block_tables, - attn_mask, q_seq_lens, key_cache, value_cache) - - # qkv shape: [B, S, N, D] - else: - bs, seq_len, _ = x.shape - # [B, S, N, D] --> [B, N, S, D] - query = query.transpose(0, 2, 1, 3) - key = key.transpose(0, 2, 1, 3) - value = value.transpose(0, 2, 1, 3) - if freqs_cis is not None: - query, key = self.apply_rotary_emb(query, key, freqs_cis) - if self.use_flash_attention: - if os.getenv('RUN_MODE') == 'predict': - raise NotImplementedError( - "Conflict detected in predict mode: " - "Flash Attention is incompatible when use_past=False") - context_layer = self.flash_attention(query, key, value, attn_mask) - else: - # [B, N_kv, S, D] --> [B, N, S, D] - if self.use_gqa: - key = mint.repeat_interleave(key, repeats=self.repeat_num, axis=1) - value = mint.repeat_interleave(value, repeats=self.repeat_num, axis=1) - context_layer = self.core_attention(query, key, value, attn_mask) - # [B, N, S, D] --> [B, S, H] - context_layer = context_layer.transpose(0, 2, 1, 3).reshape( - bs, seq_len, self.hidden_size_per_partition) - - # apply output projection - output = self.wo(context_layer) - output = self.cast(output, x.dtype) - - return output - - def 
_cat_prefix(self, key, value, prefix_keys_values): - """ - concat prefix_keys_values to key and value - prefix_keys_values: shape(2, bs, pre_len, num_heads * kv_channels) - """ - if prefix_keys_values is not None: - past_key = prefix_keys_values[0] - past_value = prefix_keys_values[1] - past_key = self.cast(past_key, key.dtype) - past_value = self.cast(past_value, value.dtype) - key = ops.concat((past_key, key), 1) - value = ops.concat((past_value, value), 1) - return key, value - - def _check_gqa_valid(self): - """check whether the config is valid for grouped-query-attention""" - if self.num_heads % self.kv_num_heads != 0: - raise ValueError( - f"num_heads must be divisible by kv_num_heads, " - f"but got num_heads {self.num_heads} and kv_num_heads {self.kv_num_heads}" - ) - if self.kv_num_heads % self.tp_group_size != 0: - raise ValueError( - f"kv_num_heads must be divisible by tp_group_size, " - f"but got kv_num_heads {self.kv_num_heads} and tp_group_size {self.tp_group_size}" - ) - - def _init_self_attn(self): - """init qkv linears of self-attention""" - self.hidden_size_per_partition = divide(self.hidden_size, self.tp_group_size) - self.kv_hidden_size_per_partition = divide(self.kv_hidden_size, self.tp_group_size) - if self.qkv_concat: - self.w_qkv = ColumnParallelLinear( - self.hidden_size, - self.hidden_size + 2 * self.kv_hidden_size, - config=self.config.parallel_config, - bias=self.config.qkv_has_bias, - gather_output=False, - transpose_b=True, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - tp_group=self.tp, - ) - else: - self.wq = ColumnParallelLinear( - self.hidden_size, - self.hidden_size, - config=self.config.parallel_config, - bias=self.config.qkv_has_bias, - gather_output=False, - transpose_b=True, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - tp_group=self.tp, - ) - self.wk = ColumnParallelLinear( - self.hidden_size, - self.kv_hidden_size, -
config=self.config.parallel_config, - bias=self.config.qkv_has_bias, - gather_output=False, - transpose_b=True, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - tp_group=self.tp, - ) - self.wv = ColumnParallelLinear( - self.hidden_size, - self.kv_hidden_size, - config=self.config.parallel_config, - bias=self.config.qkv_has_bias, - gather_output=False, - transpose_b=True, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - tp_group=self.tp, - ) - - def _init_cross_attn(self): - """init qkv linears of cross-attention""" - if self.hidden_size != self.kv_hidden_size: - raise ValueError("hidden_size must be equal to kv_hidden_size!") - self.wq = ColumnParallelLinear( - self.hidden_size, - self.hidden_size, - config=self.config.parallel_config, - bias=self.config.qkv_has_bias, - gather_output=False, - transpose_b=True, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - ) - if self.qkv_concat: - self.w_kv = ColumnParallelLinear( - self.hidden_size, - 2 * self.kv_hidden_size, - config=self.config.parallel_config, - bias=self.config.qkv_has_bias, - gather_output=False, - transpose_b=True, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - ) - else: - self.wk = ColumnParallelLinear( - self.hidden_size, - self.kv_hidden_size, - config=self.config.parallel_config, - bias=self.config.qkv_has_bias, - gather_output=False, - transpose_b=True, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - ) - self.wv = ColumnParallelLinear( - self.hidden_size, - self.kv_hidden_size, - config=self.config.parallel_config, - bias=self.config.qkv_has_bias, - gather_output=False, - transpose_b=True, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - ) - - -class ParallelTransformerLayer(nn.Cell): - r""" - Single parallel transformer layer. 
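The self-attention init above sizes the fused QKV projection as `hidden_size + 2 * kv_hidden_size` output rows when `qkv_concat` is enabled. A minimal pure-Python sketch (with a hypothetical `split_qkv` helper and toy sizes, not the actual MindSpore tensor ops) of how such a fused output is split back into query, key, and value:

```python
def split_qkv(qkv_row, hidden_size, kv_hidden_size):
    """Split one fused-QKV output row of length hidden_size + 2 * kv_hidden_size."""
    assert len(qkv_row) == hidden_size + 2 * kv_hidden_size
    q = qkv_row[:hidden_size]                                  # query slice
    k = qkv_row[hidden_size:hidden_size + kv_hidden_size]      # key slice
    v = qkv_row[hidden_size + kv_hidden_size:]                 # value slice
    return q, k, v

q, k, v = split_qkv(list(range(8)), hidden_size=4, kv_hidden_size=2)
# q = [0, 1, 2, 3], k = [4, 5], v = [6, 7]
```

With GQA, `kv_hidden_size` is smaller than `hidden_size`, which is why the split sizes are asymmetric.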
- - Args: - config (dict): Configuration. - layer_number (int): Number which indicates the index of this transformer layer in the - whole transformer block. - model_comm_pgs (ModelCommProcessGroups, optional): Model communication process group. - Default: default_model_comm_pgs. - - Inputs: - - **x** (Tensor) - Tensor of shape :math:`(B, S, H)`. - - **attention_mask** (Tensor) - Tensor of attention mask. - - **rotary_pos_emb** (Tensor) - Tensor of rotary position embedding. Default: None. - - Outputs: - - **output** (Tensor) - Tensor of shape :math:`(B, S, H)`. - - Supported Platforms: - ``Ascend`` - """ - - def __init__( - self, - config, - layer_number: int, - layer_type=None, - self_attn_mask_type=None, - drop_path_rate: float = 0.0, - model_comm_pgs=default_model_comm_pgs, - ): - super().__init__(config) - if layer_type: - raise NotImplementedError("For ParallelTransformerLayer, only a decoder-only structure is supported for now.") - if self_attn_mask_type: - raise NotImplementedError("For ParallelTransformerLayer, `self_attn_mask_type` is not supported for now.") - if drop_path_rate > 0.0: - raise NotImplementedError( - "For ParallelTransformerLayer, `drop_path_rate > 0` is not supported for now, " - "but got `drop_path_rate={}`".format(drop_path_rate) - ) - self.config = config - self.apply_residual_connection_post_norm = self.config.apply_residual_connection_post_norm - # Normalize the input data. - self.attention_norm = RMSNorm(dim=config.hidden_size, - eps=config.layernorm_epsilon, - compute_type=config.layernorm_compute_dtype) - # Attention.
- self.attention = ParallelAttention(config, layer_number, model_comm_pgs=model_comm_pgs) - # Normalize the attention output - self.ffn_norm = RMSNorm(dim=config.hidden_size, - eps=config.layernorm_epsilon, - compute_type=config.layernorm_compute_dtype) - # MLP - self.feed_forward = ParallelMLP(config, model_comm_pgs=model_comm_pgs) - - def construct(self, x, freqs_cis=None, mask=None, batch_valid_length=None, block_tables=None, - slot_mapping=None, prefix_keys_values=None, q_seq_lens=None, key_cache=None, value_cache=None): - """Construct function of transformer layer.""" - # hidden states: [B, S, H] - # norm at the beginning of the transformer layer. - norm_output = self.attention_norm(x) - # attention. - attention_output = self.attention(norm_output, batch_valid_length, block_tables, slot_mapping, freqs_cis, - mask, prefix_keys_values=prefix_keys_values, - q_seq_lens=q_seq_lens, key_cache=key_cache, value_cache=value_cache) - # residual-connection. - if self.apply_residual_connection_post_norm: - residual = norm_output - else: - residual = x - norm_input = ops.add(residual, attention_output) - # layernorm post attention. - norm_output = self.ffn_norm(norm_input) - # MLP. - mlp_output = self.feed_forward(norm_output) - # residual-connection. - if self.apply_residual_connection_post_norm: - residual = norm_output - else: - residual = norm_input - output = ops.add(residual, mlp_output) - return output - - -class ParallelTransformer(nn.Cell): - r""" - Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`ParallelTransformerLayer`] - Args: - config: the config of transformer - model_comm_pgs (ModelCommProcessGroups, optional): Model communication process group. - Default: default_model_comm_pgs. 
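The pre-norm residual dataflow implemented by `ParallelTransformerLayer.construct` above (norm, attention, residual add, norm, MLP, residual add) can be sketched with scalar stand-ins for the sub-modules; the `layer_forward` helper and the identity/constant lambdas below are illustrative only, not the MindSpore cells:

```python
def layer_forward(x, norm, attention, mlp, post_norm_residual=False):
    """Pre-norm transformer layer: residual taken before (default) or after the norm."""
    norm_out = norm(x)
    attn_out = attention(norm_out)
    residual = norm_out if post_norm_residual else x   # apply_residual_connection_post_norm
    norm_in = residual + attn_out
    norm_out2 = norm(norm_in)
    mlp_out = mlp(norm_out2)
    residual = norm_out2 if post_norm_residual else norm_in
    return residual + mlp_out

# Identity norm with constant attention/MLP outputs makes the dataflow easy to trace:
out = layer_forward(1.0, norm=lambda v: v, attention=lambda v: 10.0, mlp=lambda v: 100.0)
# x + attn = 11.0, then + mlp = 111.0
```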
- - Returns: - output: Tensor, the output of the transformer layer - """ - - def __init__( - self, - config, - model_type=None, - layer_type=None, - self_attn_mask_type=None, - post_norm: bool = True, - pre_process=False, - post_process=False, - drop_path_rate: float = 0.0, - model_comm_pgs=default_model_comm_pgs, - ): - super().__init__(config) - if model_type: - raise NotImplementedError("For ParallelTransformer, 'model_type' is not supported for now.") - if layer_type: - raise NotImplementedError("For ParallelTransformer, 'layer_type' is not supported for now.") - if self_attn_mask_type: - raise NotImplementedError("For ParallelTransformer, 'self_attn_mask_type' is not supported for now.") - if pre_process: - raise NotImplementedError("For ParallelTransformer, 'pre_process' is not supported for now.") - if post_process: - raise NotImplementedError("For ParallelTransformer, 'post_process' is not supported for now.") - if drop_path_rate: - raise NotImplementedError("For ParallelTransformer, 'drop_path_rate' is not supported for now.") - self.config = config - self.post_norm = post_norm - self.head_dim = config.hidden_size // config.num_heads - self.num_layers = config.num_layers - self.use_past = config.use_past - self.is_first_iteration = True - self.use_flash_attention = config.use_flash_attention - self.compute_dtype = config.compute_dtype - - self.cast = ops.Cast() - self.shape = ops.Shape() - self.concat = ops.Concat(axis=-1)  # used by construct to prepend the prefix mask - - self.freqs_mgr = FreqsMgr(head_dim=self.head_dim, - seq_length=config.seq_length, - max_position_embedding=config.max_position_embedding, - rotary_dtype=config.rotary_dtype, - theta=config.theta, - scaling_factor=config.scaling_factor, - extend_method=config.extend_method, - parallel_config=config.parallel_config, - is_dynamic=config.is_dynamic) - self.casual_mask = LowerTriangularMaskWithDynamic(seq_length=config.seq_length, - compute_type=config.compute_dtype, - is_dynamic=config.is_dynamic, - pad_token_id=config.pad_token_id, - use_flash_attention=config.use_flash_attention, -
use_attn_mask_compression=config.use_attn_mask_compression, - use_past=config.use_past) - - self.tp = model_comm_pgs.tp - self.tp_group_size = self.tp.size - if config.parallel_config.vocab_emb_dp or self.tp_group_size == 1: - self.tok_embeddings = VocabEmbedding( - num_embeddings=config.vocab_size, - embedding_dim=config.hidden_size, - param_init_type=config.param_init_dtype, - param_init="normal", - ) - else: - self.tok_embeddings = VocabParallelEmbedding(num_embeddings=config.vocab_size, - embedding_dim=config.hidden_size, - parallel_config=config.parallel_config, - init_method="normal", - init_type=config.param_init_dtype, - tp_group=self.tp) - - self.layers = nn.CellList() - for layer_id in range(config.num_layers): - layer = ParallelTransformerLayer( - config=self.config, - layer_number=layer_id + 1, - model_comm_pgs=model_comm_pgs - ) - self.layers.append(layer) - - if self.post_norm: - # final layernorm before output. - self.norm_out = RMSNorm(dim=config.hidden_size, - eps=config.layernorm_epsilon, - compute_type=config.layernorm_compute_dtype) - - def construct(self, tokens: Tensor, batch_valid_length=None, batch_index=None, zactivate_len=None, - block_tables=None, slot_mapping=None, prefix_keys_values=None, position_ids=None, attention_mask=None, - q_seq_lens=None, key_cache=None, value_cache=None): - """ - Forward of ParallelTransformer. - - Args: - tokens: the tokenized inputs with datatype int32 - batch_valid_length (Tensor): the number of already-computed valid tokens in each sequence, with datatype int32; used for incremental - prediction. Tensor of shape :math:`(batch_size,)`. Default None. - block_tables (Tensor[int64]): Store mapping tables for each sequence. - slot_mapping (Tensor[int32]): Store token cache physical slot index.
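With flattened (packed) inputs, `batch_valid_length` documented above also determines where each sequence ends: `cumsum(batch_valid_length) - 1` gives the index of the last valid token per sequence, which is what the causal-LM head later gathers before computing logits. A hypothetical stdlib-only sketch of that indexing:

```python
def last_token_indices(batch_valid_length):
    """Flattened layout: sequence i occupies positions [cum_{i-1}, cum_i)."""
    indices, total = [], 0
    for n in batch_valid_length:
        total += n
        indices.append(total - 1)  # cumsum - 1
    return indices

last_token_indices([5, 3, 4])  # -> [4, 7, 11]
```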
- Returns: - output: Tensor, the output of ParallelTransformer - """ - # preprocess - mask = attention_mask - if self.use_past: - if self.is_first_iteration: - freqs_cis = self.freqs_mgr.prefill() - - if prefix_keys_values is not None: - bs, seq_len = self.shape(tokens) - if mask is None: - mask = self.casual_mask(tokens) - prefix_length = prefix_keys_values[0].shape[2] - prefix_mask = Tensor(np.zeros((bs, 1, seq_len, prefix_length)), dtype=mask.dtype) - mask = self.concat((prefix_mask, mask)) - else: - freqs_cis = self.freqs_mgr.chunk_with_decode(position_ids) - else: - bs, seq_len = self.shape(tokens) - mask = self.casual_mask(tokens) - freqs_cis = self.freqs_mgr(seq_len) - if prefix_keys_values is not None: - prefix_length = prefix_keys_values[0].shape[2] - prefix_mask = Tensor(np.zeros((bs, 1, seq_len, prefix_length)), dtype=mask.dtype) - mask = self.concat((prefix_mask, mask)) - - # tokens shape: [bs, seq / 1] - hidden_states = self.cast(self.tok_embeddings(tokens), self.compute_dtype) - # hidden states shape: [bs, seq / 1, hidden_dim] - for i in range(self.num_layers): - prefix_kv = prefix_keys_values[i] if prefix_keys_values is not None else None - key_cache_i = key_cache[i] if key_cache is not None else None - value_cache_i = value_cache[i] if value_cache is not None else None - hidden_states = self.layers[i](hidden_states, freqs_cis, mask, batch_valid_length=batch_valid_length, - block_tables=block_tables, slot_mapping=slot_mapping, - prefix_keys_values=prefix_kv, q_seq_lens=q_seq_lens, - key_cache=key_cache_i, value_cache=value_cache_i) - - if self.post_norm: - hidden_states = self.norm_out(hidden_states) - return hidden_states diff --git a/research/llama3_1/llama.py b/research/llama3_1/llama.py deleted file mode 100644 index 42dee2d2bf210dd09073a04f6e9fc0843a3cb9ea..0000000000000000000000000000000000000000 --- a/research/llama3_1/llama.py +++ /dev/null @@ -1,434 +0,0 @@ -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache 
License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# ============================================================================ -""" -DEPRECATED MODULE - -This module is deprecated and will be removed in future releases. -LLaMA models' APIs. -""" -from multiprocessing.managers import DictProxy -from multiprocessing.synchronize import Condition - -from safetensors import safe_open -import numpy as np - -import mindspore as ms -import mindspore.common.dtype as mstype -from mindspore import Tensor, ops, mint, mutable -from mindspore.communication._comm_helper import _is_initialized as mindspore_comm_has_init - -from mindspore.communication import get_group_size -from mindformers.parallel_core.process_group_config import ModelCommProcessGroups -from mindformers.parallel_core.inference.parallel_state import initialize_model_parallel, is_initialized -from mindformers.models.llama.llama import LlamaPreTrainedModel -from mindformers.modules import Linear -from mindformers.utils import deprecated -from mindformers.tools.register.register import MindFormerModuleType, MindFormerRegister -from mindformers.tools.utils import get_predict_run_mode -from mindformers.tools.logger import logger -from mindformers.models.utils import jit -from mindformers.generation.utils import convert_pin -from research.llama3_1.infer.layers import ColumnParallelLinear -from research.llama3_1.infer.transformer import ParallelTransformer -from research.llama3_1.utils import convert_model_config - - -@deprecated(reason="This method is 
rotten.", version="1.6.0") -@MindFormerRegister.register(MindFormerModuleType.MODELS) -class ParallelLlamaForCausalLM(LlamaPreTrainedModel): - r""" - Provide llama training loss or logits through network. - - Args: - config (LlamaConfig): The config of llama model. - - Returns: - output: Tensor, the output of llama decoderlayer - - """ - - def __init__(self, config): - super().__init__(config, auto_prefix=True) - self.config = convert_model_config(config) - if not is_initialized() and mindspore_comm_has_init(): - initialize_model_parallel(get_group_size(), order='tp') - if is_initialized(): - model_comm_pgs = ModelCommProcessGroups.use_parallel_state_groups(required_groups=['tp']) - else: - model_comm_pgs = ModelCommProcessGroups.get_default_model_comm_pgs() - self.ignore_token_id = config.ignore_token_id - self.pad_token_id = config.pad_token_id - self.use_past = config.use_past - self.vocab_size = config.vocab_size - self.is_first_iteration = True - - self.shape = ops.Shape() - self.reshape = ops.Reshape() - self.cast = ops.Cast() - self.slice = ops.StridedSlice() - self.not_equal = ops.NotEqual() - self.mul = ops.Mul() - self.add = ops.Add() - self.ones = ops.Ones() - self.gather = ops.Gather() - self.sub_batch_valid_len = ops.Sub() - self.model = ParallelTransformer(config=config, model_comm_pgs=model_comm_pgs) - if config.parallel_config.vocab_emb_dp: - self.lm_head = Linear( - in_channels=config.hidden_size, - out_channels=config.vocab_size, - weight_init="normal", - has_bias=False, - param_init_type=config.param_init_type, - compute_dtype=config.compute_dtype - ) - else: - self.lm_head = ColumnParallelLinear( - input_size=config.hidden_size, - output_size=config.vocab_size, - config=config.parallel_config, - bias=False, - gather_output=True, - param_init_type=config.param_init_dtype, - compute_dtype=config.compute_dtype, - tp_group=model_comm_pgs.tp, - ) - - self.load_checkpoint(config) - self.predict_run_mode = get_predict_run_mode() - - self.use_past = 
config.use_past - self.npu_mem_size = config.npu_mem_size if hasattr(config, "npu_mem_size") else 2 - - def prepare_inputs_for_predict_layout(self, input_ids, **kwargs): - """Get Llama model input tuple for transform ckpt.""" - input_ids = Tensor(input_ids, mstype.int32) - labels = Tensor(kwargs["labels"]) if "labels" in kwargs else None - bs, seq = input_ids.shape[0], input_ids.shape[1] - slot_mapping = Tensor(np.ones(shape=tuple([bs * seq])), mstype.int32) - prefix_keys_values = Tensor(kwargs["prefix_keys_values"]) if "prefix_keys_values" in kwargs else None - return input_ids, labels, None, None, None, None, None, None, None, None, None, slot_mapping, prefix_keys_values - - def prepare_inputs_for_generation(self, input_ids, **kwargs): - """ - prepare inputs for generation. - A model class needs to define a `prepare_inputs_for_generation` method - in order to use `.generate()` - - """ - model_inputs = {"input_ids": Tensor.from_numpy(input_ids.astype(np.int32))} - batch_valid_length = kwargs.get("valid_length_each_example") - prefill = kwargs.get("prefill") - - if self.config.is_dynamic and prefill and "origin_inputs" in kwargs: - origin_inputs = kwargs["origin_inputs"] - slot_mapping = kwargs.get("slot_mapping") - model_inputs = self._prepare_inputs_for_prefill_flatten(origin_inputs, - batch_valid_length, - slot_mapping, - model_inputs,) - position_ids = batch_valid_length.astype(np.int32) - 1 - model_inputs["position_ids"] = ms.Tensor(position_ids, dtype=ms.int32).reshape(-1) - - if not prefill: - q_seq_lens = np.ones(batch_valid_length.shape, dtype=np.int32).reshape(-1) - else: - q_seq_lens = batch_valid_length.astype(np.int32).reshape(-1) - model_inputs["q_seq_lens"] = convert_pin(Tensor.from_numpy(q_seq_lens)) - - model_inputs["attention_mask"] = self.model.casual_mask.gen_attention_mask(prefill) - model_inputs["need_flatten"] = True - return model_inputs - - def set_dynamic_inputs(self, **kwargs): - """Prepare inputs for dynamic shape.""" - dynamic_input_ids 
= Tensor(shape=[None], dtype=mstype.int32) - dynamic_batch_valid_length = Tensor(shape=[None], dtype=mstype.int32) - dynamic_block_tables = Tensor(shape=[None, None], dtype=mstype.int32) - dynamic_slot_mapping = Tensor(shape=[None], dtype=mstype.int32) - dynamic_position_ids = Tensor(shape=[None], dtype=mstype.int32) - dynamic_q_seq_lens = Tensor(shape=[None], dtype=mstype.int32) - dynamic_attention_mask = Tensor(shape=[None, None], dtype=self.config.compute_dtype) - have_prefix_keys_values = kwargs.get("have_prefix_keys_values", False)  # kwargs is a dict, so use .get(), not getattr() - - def get_input(): - if self.npu_mem_size > 0: - return None - cache_list = [] - for _ in self.model.layers: - cache_list.append(Tensor(shape=[None, None, None, None], dtype=self.config.compute_dtype)) - return mutable(cache_list) - key_cache = get_input() - value_cache = get_input() - if have_prefix_keys_values: - dynamic_prefix_keys_values = Tensor(shape=[2, None, None, None, None], dtype=mstype.float16) - self.set_inputs(dynamic_input_ids, None, None, None, None, None, None, - dynamic_batch_valid_length, None, None, dynamic_block_tables, - dynamic_slot_mapping, dynamic_prefix_keys_values, None, key_cache, value_cache) - else: - self.set_inputs(dynamic_input_ids, None, None, dynamic_position_ids, dynamic_attention_mask, None, None, - dynamic_batch_valid_length, None, None, dynamic_block_tables, - dynamic_slot_mapping, None, None, key_cache, value_cache, dynamic_q_seq_lens) - logger.info("Set dynamic input for llama.") - - def add_flags_custom(self, is_first_iteration): - """Add customized attributes for specific cells in the model.""" - self.add_flags(is_first_iteration=is_first_iteration) - self.model.add_flags(is_first_iteration=is_first_iteration) - for layer in self.model.layers: - layer.add_flags(is_first_iteration=is_first_iteration) - layer.attention.add_flags(is_first_iteration=is_first_iteration) - layer.attention.paged_attention_mgr.add_flags(is_first_iteration=is_first_iteration) - - @jit - def construct(self,
input_ids, labels=None, input_position=None, position_ids=None, attention_mask=None, - input_embeds=None, init_reset=None, batch_valid_length=None, batch_index=None, zactivate_len=None, - block_tables=None, slot_mapping=None, prefix_keys_values=None, llm_boost_inputs=None, - key_cache=None, value_cache=None, q_seq_lens=None): - """ - Forward of llama model. - """ - output = self.model(input_ids, batch_valid_length, batch_index, zactivate_len, block_tables, - slot_mapping, prefix_keys_values, key_cache=key_cache, value_cache=value_cache, - position_ids=position_ids, attention_mask=attention_mask, q_seq_lens=q_seq_lens) - pre_gather = (not self.use_past or self.is_first_iteration) and batch_valid_length is not None - if pre_gather: - batch_valid_length = mint.cumsum(batch_valid_length, 0) - output = self.gather(output, self.sub_batch_valid_len(batch_valid_length, 1), 0) - logits = self.lm_head(output) - - logits = self.cast(logits, mstype.float32) - if self.predict_run_mode: - return self.reshape(logits, (-1, logits.shape[-1])) - input_mask = self.cast(self.not_equal(input_ids, self.pad_token_id), mstype.float32) - return logits, input_ids, input_mask - - def kvcache(self, layer_idx): - key_cache = self.model.layers[layer_idx].attention.paged_attention_mgr.key_cache - value_cache = self.model.layers[layer_idx].attention.paged_attention_mgr.value_cache - return key_cache, value_cache - - @classmethod - def convert_name(cls, weight_name): - """convert HuggingFace weight name to MindFormers weight name""" - origin_name = weight_name - weight_name = weight_name.replace('embed_tokens.', 'tok_embeddings.') - weight_name = weight_name.replace('.self_attn.q_proj.', '.attention.wq.') - weight_name = weight_name.replace('.self_attn.k_proj.', '.attention.wk.') - weight_name = weight_name.replace('.self_attn.v_proj.', '.attention.wv.') - weight_name = weight_name.replace('.self_attn.o_proj.', '.attention.wo.') - weight_name = weight_name.replace('.mlp.gate_proj.', 
'.feed_forward.w1.') - weight_name = weight_name.replace('.mlp.down_proj.', '.feed_forward.w2.') - weight_name = weight_name.replace('.mlp.up_proj.', '.feed_forward.w3.') - weight_name = weight_name.replace('.input_layernorm.', '.attention_norm.') - weight_name = weight_name.replace('.post_attention_layernorm.', '.ffn_norm.') - weight_name = weight_name.replace('.norm.', '.norm_out.') - weight_name = weight_name.replace('output.', 'lm_head.') - weight_name = weight_name.replace('.tok_embeddings.weight', '.tok_embeddings.embedding_weight') - if weight_name == origin_name: - logger.warning(f"weight name '{weight_name}' does not change after conversion. " - f"Please check if it is as expected.") - return weight_name - - @classmethod - def convert_weight_dict(cls, source_dict, **kwargs): - """convert HuggingFace weight dict to MindFormers weight dict""" - model_config = kwargs.get("model_config") - qkv_concat = model_config.qkv_concat - target_dict = {} - wq_keys = [] - wk_keys = [] - wv_keys = [] - w1_keys = [] - w3_keys = [] - - for k, v in source_dict.items(): - k = cls.convert_name(k) - target_dict.update({k: v}) - if qkv_concat: - part = k.split('.') - if part[-2] == 'wq': - wq_keys.append(k) - if part[-2] == 'wk': - wk_keys.append(k) - if part[-2] == 'wv': - wv_keys.append(k) - if part[-2] == 'w1': - w1_keys.append(k) - if part[-2] == 'w3': - w3_keys.append(k) - - if qkv_concat: - qkv_dict = kwargs.get('qkv_dict', None) - if not isinstance(qkv_dict, DictProxy): - raise ValueError(f'qkv_queue must be a queue, when qkv_concat is True, but got {qkv_dict}.') - condition = kwargs.get('condition', None) - if not isinstance(condition, Condition): - raise ValueError(f'condition must be a Condition, when qkv_concat is True, but got {condition}.') - _concat_qkv_weight(wq_keys, wk_keys, wv_keys, model_config, qkv_dict, condition, target_dict) - _concat_ffn_weight(w1_keys, w3_keys, model_config, qkv_dict, condition, target_dict) - - return target_dict - - @classmethod - def 
convert_map_dict(cls, source_dict, **kwargs): - """convert HuggingFace map dict to MindFormers map dict""" - qkv_concat = kwargs.pop("qkv_concat", False) - target_dict = {} - wq_keys = [] - w1_keys = [] - - for k, v in source_dict.items(): - k = cls.convert_name(k) - target_dict.update({k: v}) - if qkv_concat: - part = k.split('.') - if part[-2] == 'wq': - wq_keys.append(k) - if part[-2] == 'w1': - w1_keys.append(k) - - if qkv_concat: - for wq_key in wq_keys: - wk_key = wq_key.replace('wq', 'wk') - wv_key = wq_key.replace('wq', 'wv') - wq_value = target_dict.pop(wq_key) - target_dict.pop(wk_key) - target_dict.pop(wv_key) - - w_qkv_key = wq_key.replace('wq', 'w_qkv') - w_qkv_value = wq_value - target_dict.update({w_qkv_key: w_qkv_value}) - for w1_key in w1_keys: - w3_key = w1_key.replace('w1', 'w3') - w1_value = target_dict.pop(w1_key) - target_dict.pop(w3_key) - - w_gate_hidden_key = w1_key.replace('w1', 'w_gate_hidden') - w_gate_hidden_value = w1_value - target_dict.update({w_gate_hidden_key: w_gate_hidden_value}) - - return target_dict - - @classmethod - def obtain_qkv_ffn_concat_keys(cls): - qkv_key = "w_qkv" - ffn_key = "w_gate_hidden" - concat_keys = [qkv_key, ffn_key] - logger.info(f"{cls.__name__} qkv/ffn concat keys are {concat_keys}") - return concat_keys - - @classmethod - def obtain_name_map(cls, load_checkpoint_files): - name_map = dict() - for checkpoint_file in load_checkpoint_files: - with safe_open(checkpoint_file, framework="np") as f: - for k in f.keys(): - name_map.update({cls.convert_name(k): k}) - return name_map - - def clear_kv_cache(self): - return self.model.clear_kv_cache() - - -def _concat_qkv_weight(wq_keys, wk_keys, wv_keys, model_config, qkv_dict, condition, target_dict): - """concat qkv weight from dicts""" - from mindformers.utils.convert_utils import qkv_concat_hf2mg - - num_heads = model_config.num_heads - n_kv_heads = model_config.n_kv_heads or num_heads - hidden_size = model_config.hidden_size - - # pop extra weight to shared 
dict if there is no corresponding weight for concat in the target dict - for wk_key in wk_keys: - wq_key = wk_key.replace('wk', 'wq') - if wq_key not in wq_keys: - with condition: - qkv_dict[wk_key] = target_dict.pop(wk_key) # add extra weight to shared dict - condition.notify_all() - for wv_key in wv_keys: - wq_key = wv_key.replace('wv', 'wq') - if wq_key not in wq_keys: - with condition: - qkv_dict[wv_key] = target_dict.pop(wv_key) # add extra weight to shared dict - condition.notify_all() - - # concat qkv - for wq_key in wq_keys: - wk_key = wq_key.replace('wq', 'wk') - wv_key = wq_key.replace('wq', 'wv') - wq_value = target_dict.pop(wq_key) - wk_value = target_dict.pop(wk_key, None) - wv_value = target_dict.pop(wv_key, None) - - # get missing weight from shared dict - if wk_value is None: - with condition: - condition.wait_for(lambda: wk_key in qkv_dict.keys()) - wk_value = qkv_dict.pop(wk_key) - if wv_value is None: - with condition: - condition.wait_for(lambda: wv_key in qkv_dict.keys()) - wv_value = qkv_dict.pop(wv_key) - - w_qkv_key = wq_key.replace('wq', 'w_qkv') - w_qkv_value = np.concatenate((wq_value, wk_value, wv_value), 0) - # qkv weight format: hf -> mg - w_qkv_value_mg = qkv_concat_hf2mg(w_qkv_value, num_heads, n_kv_heads, hidden_size) - target_dict.update({w_qkv_key: w_qkv_value_mg}) - - -def _concat_ffn_weight(w1_keys, w3_keys, model_config, qkv_dict, condition, target_dict): - """concat ffn weight from dicts""" - from mindformers.utils.convert_utils import ffn_concat_hf2mg - - intermediate_size = model_config.intermediate_size - ffn_dim_multiplier = model_config.ffn_dim_multiplier - multiple_of = model_config.multiple_of or 256 - ffn_hidden_size = model_config.hidden_size * 4 - if intermediate_size is not None: - ffn_hidden_size = intermediate_size - else: - if ffn_dim_multiplier is not None: - ffn_hidden_size = int((ffn_dim_multiplier + 0.01) * ffn_hidden_size) - ffn_hidden_size = int(2 * ffn_hidden_size / 3) - ffn_hidden_size = multiple_of * \ - 
((ffn_hidden_size + multiple_of - 1) // multiple_of) - - # pop extra weight to shared dict if there is no corresponding weight for concat in the target dict - for w3_key in w3_keys: - w1_key = w3_key.replace('w3', 'w1') - if w1_key not in w1_keys: - with condition: - qkv_dict[w3_key] = target_dict.pop(w3_key) # add extra weight to shared dict - condition.notify_all() - - # concat ffn - for w1_key in w1_keys: - w3_key = w1_key.replace('w1', 'w3') - w1_value = target_dict.pop(w1_key) - w3_value = target_dict.pop(w3_key, None) - - # get missing weight from shared dict - if w3_value is None: - with condition: - condition.wait_for(lambda: w3_key in qkv_dict.keys()) - w3_value = qkv_dict.pop(w3_key) - - w_gate_hidden_key = w1_key.replace('w1', 'w_gate_hidden') - w_gate_hidden_value = np.concatenate((w1_value, w3_value), 0) - # ffn weight format: hf -> mg - w_gate_hidden_value_mg = ffn_concat_hf2mg(w_gate_hidden_value, ffn_hidden_size) - target_dict.update({w_gate_hidden_key: w_gate_hidden_value_mg}) diff --git a/research/llama3_1/llama3_1_70b/finetune_llama3_1_70b.yaml b/research/llama3_1/llama3_1_70b/finetune_llama3_1_70b.yaml deleted file mode 100644 index 71fba283777a8f4c6f17e431809c40409ae10885..0000000000000000000000000000000000000000 --- a/research/llama3_1/llama3_1_70b/finetune_llama3_1_70b.yaml +++ /dev/null @@ -1,163 +0,0 @@ -seed: 0 -output_dir: './output' # path to save checkpoint/strategy -load_checkpoint: '' -src_strategy_path_or_dir: '' -auto_trans_ckpt: False # If true, auto transform load_checkpoint to load in distributed model -only_save_strategy: False -resume_training: False -run_mode: 'train' -load_ckpt_format: 'ckpt' # recommend use 'safetensors' - -# trainer config -trainer: - type: CausalLanguageModelingTrainer - model_name: 'llama3_1_70b' - -# runner config -runner_config: - epochs: 2 - batch_size: 1 - sink_mode: True - sink_size: 1 - -# optimizer -optimizer: - type: AdamW - betas: [0.9, 0.999] - eps: 1.e-8 - -# lr schedule -lr_schedule: - type: 
CosineWithWarmUpLR - learning_rate: 1.e-5 - lr_end: 0.0 - warmup_ratio: 0.03 - total_steps: -1 # -1 means it will load the total steps of the dataset - -# dataset -train_dataset: &train_dataset - data_loader: - type: MindDataset - dataset_dir: "" - shuffle: True - input_columns: ["input_ids", "labels"] # "input_ids", "labels" , labels are used in instruction finetune. - num_parallel_workers: 8 - python_multiprocessing: False - drop_remainder: True - numa_enable: False - prefetch_size: 1 -train_dataset_task: - type: CausalLanguageModelDataset - dataset_config: *train_dataset - -use_parallel: True -# parallel context config -parallel: - parallel_mode: 1 # 0-data parallel, 1-semi-auto parallel, 2-auto parallel, 3-hybrid parallel - gradients_mean: False - enable_alltoall: False - full_batch: True - search_mode: "sharding_propagation" - enable_parallel_optimizer: True - strategy_ckpt_save_file: "./ckpt_strategy.ckpt" - parallel_optimizer_config: - gradient_accumulation_shard: False - parallel_optimizer_threshold: 64 -# default parallel of device num = 8 for Atlas 800T A2 -parallel_config: - data_parallel: 1 - model_parallel: 8 - pipeline_stage: 8 - use_seq_parallel: True - micro_batch_num: 256 - vocab_emb_dp: False - gradient_aggregation_group: 4 -# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process. 
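As a sanity check on the parallel settings above (this follows the usual MindFormers convention for semi-auto parallel with gradient accumulation; the formula is an assumption, not stated in this YAML): the effective global batch size is roughly `batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num`.

```python
def global_batch_size(batch_size, data_parallel, micro_batch_num,
                      micro_batch_interleave_num=1):
    """Effective global batch size under dp + micro-batch accumulation (assumed formula)."""
    return batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num

global_batch_size(batch_size=1, data_parallel=1, micro_batch_num=256)
# -> 256 for the 70B finetune config in this file
```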
-micro_batch_interleave_num: 1 - -# recompute config -recompute_config: - recompute: False - select_recompute: [10,8,6,4,2,0,0,0] - select_comm_recompute: [10,8,6,4,2,0,0,0] - parallel_optimizer_comm_recompute: False - mp_comm_recompute: True - recompute_slice_activation: True - -# callbacks -callbacks: - - type: MFLossMonitor - - type: CheckpointMonitor - prefix: "llama3_1_70b" - checkpoint_format: "ckpt" # recommend use 'safetensors' - save_checkpoint_steps: 10000 - integrated_save: False - -# mindspore context init config -context: - mode: 0 #0--Graph Mode; 1--Pynative Mode - device_target: "Ascend" - max_call_depth: 10000 - max_device_memory: "52.5GB" - mempool_block_size: "52.5GB" - save_graphs: False - save_graphs_path: "./graph" - device_id: 0 - jit_config: - jit_level: "O1" - memory_optimize_level: "O0" - -# model config -model: - model_config: - type: LlamaConfig - batch_size: 1 # add for increase predict - seq_length: 8192 - hidden_size: 8192 - num_layers: 80 - num_heads: 64 - n_kv_heads: 8 - ffn_dim_multiplier: 1.3 - multiple_of: 256 - vocab_size: 128256 - rms_norm_eps: 1.0e-5 - bos_token_id: 128000 - eos_token_id: 128001 - pad_token_id: 128002 - ignore_token_id: -100 - compute_dtype: "bfloat16" - layernorm_compute_type: "float32" - softmax_compute_type: "float32" - rotary_dtype: "float32" - param_init_type: "float32" - use_past: False - scaling_factor: 1.0 - theta: 500000 - extend_method: "None" # support "None", "PI", "NTK" - use_flash_attention: True # FA can accelerate training or finetune - offset: 0 - fine_grain_interleave: 2 - checkpoint_name_or_path: "" - repetition_penalty: 1 - max_decode_length: 512 - top_k: 3 - top_p: 1 - do_sample: False - arch: - type: LlamaForCausalLM - -# wrapper cell config -runner_wrapper: - type: MFTrainOneStepCell - scale_sense: 1.0 - use_clip_grad: True - -profile: False -profile_start_step: 4 -profile_stop_step: 8 -init_start_profile: False -profile_communication: False -profile_memory: True -layer_scale: False 
-layer_decay: 0.65 -lr_scale_factor: 256 diff --git a/research/llama3_1/llama3_1_70b/predict_llama3_1_70b.yaml b/research/llama3_1/llama3_1_70b/predict_llama3_1_70b.yaml deleted file mode 100644 index 58cffa961446efbee9fd24e12b16dc288d67e8d4..0000000000000000000000000000000000000000 --- a/research/llama3_1/llama3_1_70b/predict_llama3_1_70b.yaml +++ /dev/null @@ -1,135 +0,0 @@ -seed: 0 -output_dir: './output' # path to save checkpoint/strategy -load_checkpoint: '' -src_strategy_path_or_dir: '' -auto_trans_ckpt: False # If true, auto transform load_checkpoint to load in distributed model -only_save_strategy: False -resume_training: False -run_mode: 'predict' - -# trainer config -trainer: - type: CausalLanguageModelingTrainer - model_name: 'llama3_1_70b' - -# runner config -runner_config: - epochs: 2 - batch_size: 1 - sink_mode: True - sink_size: 1 - -use_parallel: True -# parallel context config -parallel: - parallel_mode: 1 # 0-data parallel, 1-semi-auto parallel, 2-auto parallel, 3-hybrid parallel - gradients_mean: False - enable_alltoall: False - full_batch: True - search_mode: "sharding_propagation" - enable_parallel_optimizer: False - strategy_ckpt_save_file: "./ckpt_strategy.ckpt" - parallel_optimizer_config: - gradient_accumulation_shard: False - parallel_optimizer_threshold: 64 -# default parallel of device num = 8 for Atlas 800T A2 -parallel_config: - data_parallel: 1 - model_parallel: 4 - pipeline_stage: 1 - use_seq_parallel: False - micro_batch_num: 1 - vocab_emb_dp: True - gradient_aggregation_group: 4 -# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process. 
-micro_batch_interleave_num: 1 - -# mindspore context init config -context: - mode: 0 #0--Graph Mode; 1--Pynative Mode - device_target: "Ascend" - max_call_depth: 10000 - max_device_memory: "58GB" - save_graphs: False - save_graphs_path: "./graph" - device_id: 0 - -# model config -model: - model_config: - type: LlamaConfig - batch_size: 1 # add for increase predict - seq_length: 8192 - hidden_size: 8192 - num_layers: 80 - num_heads: 64 - n_kv_heads: 8 - ffn_dim_multiplier: 1.3 - multiple_of: 256 - vocab_size: 128256 - rms_norm_eps: 1.0e-5 - bos_token_id: 128000 - eos_token_id: 128001 - pad_token_id: 128002 - ignore_token_id: -100 - compute_dtype: "float16" - layernorm_compute_type: "float32" - softmax_compute_type: "float32" - rotary_dtype: "float32" - param_init_type: "bfloat16" - is_dynamic: True - theta: 500000 - max_position_embedding: 131072 - extend_method: "LLAMA3" # support "None", "PI", "NTK", "LLAMA3" - scaling_factor: - factor: 8.0 - low_freq_factor: 1.0 - high_freq_factor: 4.0 - original_max_position_embeddings: 8192 - use_past: True - use_flash_attention: True # FA can accelerate training or finetune - offset: 0 - checkpoint_name_or_path: "" - repetition_penalty: 1 - max_decode_length: 512 - block_size: 16 - num_blocks: 512 - top_k: 3 - top_p: 1 - do_sample: False - auto_map: - AutoTokenizer: [ llama3_1_tokenizer.Llama3Tokenizer, null ] - arch: - type: LlamaForCausalLM - -processor: - return_tensors: ms - tokenizer: - model_max_length: 8192 - vocab_file: "/path/tokenizer.model" - pad_token: "<|reserved_special_token_0|>" - type: Llama3Tokenizer - auto_register: llama3_1_tokenizer.Llama3Tokenizer - type: LlamaProcessor - -# metric -metric: - type: PerplexityMetric - - -auto_tune: False -filepath_prefix: './autotune' -autotune_per_step: 10 - -profile: False -profile_start_step: 4 -profile_stop_step: 8 -init_start_profile: False -profile_communication: False -profile_memory: True -layer_scale: False -layer_decay: 0.65 -lr_scale_factor: 256 - -# aicc 
-remote_save_url: "Please input obs url on AICC platform." diff --git a/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml b/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml deleted file mode 100644 index d3ee286e8a9a8853f2ffeac34c694db2b5902eca..0000000000000000000000000000000000000000 --- a/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml +++ /dev/null @@ -1,163 +0,0 @@ -seed: 0 -output_dir: './output' # path to save checkpoint/strategy -load_checkpoint: '' -src_strategy_path_or_dir: '' -auto_trans_ckpt: False # If true, auto transform load_checkpoint to load in distributed model -only_save_strategy: False -resume_training: False -run_mode: 'train' -load_ckpt_format: 'ckpt' # recommend use 'safetensors' - -# trainer config -trainer: - type: CausalLanguageModelingTrainer - model_name: 'llama3_1_8b' - -# runner config -runner_config: - epochs: 2 - batch_size: 1 - sink_mode: True - sink_size: 1 - -# optimizer -optimizer: - type: AdamW - betas: [0.9, 0.95] - eps: 1.e-8 - -# lr schedule -lr_schedule: - type: CosineWithWarmUpLR - learning_rate: 1.e-5 - lr_end: 0.0 - warmup_ratio: 0.03 - total_steps: -1 # -1 means it will load the total steps of the dataset - -# dataset -train_dataset: &train_dataset - data_loader: - type: MindDataset - dataset_dir: "" - shuffle: True - input_columns: ["input_ids", "labels"] # "input_ids", "labels" , labels are used in instruction finetune. - num_parallel_workers: 8 - python_multiprocessing: False - drop_remainder: True - numa_enable: False - prefetch_size: 1 -train_dataset_task: - type: CausalLanguageModelDataset - dataset_config: *train_dataset -# if True, do evaluate during the training process. if false, do nothing. -# note that the task trainer should support _evaluate_in_training function. 
- -use_parallel: True -# parallel context config -parallel: - parallel_mode: 1 # 0-data parallel, 1-semi-auto parallel, 2-auto parallel, 3-hybrid parallel - gradients_mean: False - enable_alltoall: False - full_batch: True - search_mode: "sharding_propagation" - enable_parallel_optimizer: True - strategy_ckpt_save_file: "./ckpt_strategy.ckpt" - parallel_optimizer_config: - gradient_accumulation_shard: False - parallel_optimizer_threshold: 64 -# default parallel of device num = 8 for Atlas 800T A2 -parallel_config: - data_parallel: 8 - model_parallel: 1 - pipeline_stage: 1 - use_seq_parallel: False - micro_batch_num: 1 - vocab_emb_dp: True - gradient_aggregation_group: 4 -# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process. -micro_batch_interleave_num: 1 - -# recompute config -recompute_config: - recompute: True - select_recompute: False - parallel_optimizer_comm_recompute: False - mp_comm_recompute: True - recompute_slice_activation: True - -# callbacks -callbacks: - - type: MFLossMonitor - - type: CheckpointMonitor - checkpoint_format: "ckpt" # recommend use 'safetensors' - prefix: "llama3_1_8b" - save_checkpoint_steps: 10000 - integrated_save: False - -# mindspore context init config -context: - mode: 0 #0--Graph Mode; 1--Pynative Mode - device_target: "Ascend" - max_call_depth: 10000 - max_device_memory: "58GB" - save_graphs: False - save_graphs_path: "./graph" - device_id: 0 - jit_config: - jit_level: "O1" - memory_optimize_level: "O0" - -# model config -model: - model_config: - type: LlamaConfig - batch_size: 1 # add for increase predict - seq_length: 8192 - hidden_size: 4096 - num_layers: 32 - num_heads: 32 - n_kv_heads: 8 - vocab_size: 128256 - intermediate_size: 14336 - rms_norm_eps: 1.0e-5 - bos_token_id: 128000 - eos_token_id: 128001 - pad_token_id: 128002 - ignore_token_id: -100 - compute_dtype: "bfloat16" - layernorm_compute_type: "float32" - softmax_compute_type: "float32" - 
rotary_dtype: "float32" - param_init_type: "float16" # for stable training, suggest configuring float32 - embedding_init_type: "bfloat16" - use_past: False - scaling_factor: 1.0 - theta: 500000 - extend_method: "None" # support "None", "PI", "NTK" - use_flash_attention: True # FA can accelerate training or finetune - offset: 0 - fine_grain_interleave: 1 - checkpoint_name_or_path: "" - repetition_penalty: 1 - max_decode_length: 512 - top_k: 3 - top_p: 1 - do_sample: False - arch: - type: LlamaForCausalLM - -# wrapper cell config -runner_wrapper: - type: MFTrainOneStepCell - scale_sense: 1.0 - use_clip_grad: True - -profile: False -profile_start_step: 4 -profile_stop_step: 8 -init_start_profile: False -profile_communication: False -profile_memory: True -layer_scale: False -layer_decay: 0.65 -lr_scale_factor: 256 diff --git a/research/llama3_1/llama3_1_8b/predict_llama3_1_8b.yaml b/research/llama3_1/llama3_1_8b/predict_llama3_1_8b.yaml deleted file mode 100644 index 4b8d1c2b13949e6217a2def983bb8013ebed3dc0..0000000000000000000000000000000000000000 --- a/research/llama3_1/llama3_1_8b/predict_llama3_1_8b.yaml +++ /dev/null @@ -1,135 +0,0 @@ -seed: 0 -output_dir: './output' # path to save checkpoint/strategy -load_checkpoint: '' -src_strategy_path_or_dir: '' -auto_trans_ckpt: False # If true, auto transform load_checkpoint to load in distributed model -only_save_strategy: False -resume_training: False -run_mode: 'predict' - -# trainer config -trainer: - type: CausalLanguageModelingTrainer - model_name: 'llama3_1_8b' - -# runner config -runner_config: - epochs: 2 - batch_size: 1 - sink_mode: True - sink_size: 1 - -use_parallel: False -# parallel context config -parallel: - parallel_mode: 1 # 0-data parallel, 1-semi-auto parallel, 2-auto parallel, 3-hybrid parallel - gradients_mean: False - enable_alltoall: False - full_batch: True - search_mode: "sharding_propagation" - enable_parallel_optimizer: False - strategy_ckpt_save_file: "./ckpt_strategy.ckpt" - 
parallel_optimizer_config: - gradient_accumulation_shard: False - parallel_optimizer_threshold: 64 -# default parallel of device num = 8 for Atlas 800T A2 -parallel_config: - data_parallel: 1 - model_parallel: 1 - pipeline_stage: 1 - use_seq_parallel: False - micro_batch_num: 1 - vocab_emb_dp: True - gradient_aggregation_group: 4 -# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process. -micro_batch_interleave_num: 1 - -# mindspore context init config -context: - mode: 0 #0--Graph Mode; 1--Pynative Mode - device_target: "Ascend" - max_call_depth: 10000 - max_device_memory: "58GB" - save_graphs: False - save_graphs_path: "./graph" - device_id: 0 - -# model config -model: - model_config: - type: LlamaConfig - batch_size: 1 # add for increase predict - seq_length: 512 - hidden_size: 4096 - num_layers: 32 - num_heads: 32 - n_kv_heads: 8 - vocab_size: 128256 - intermediate_size: 14336 - rms_norm_eps: 1.0e-5 - bos_token_id: 128000 - eos_token_id: 128001 - pad_token_id: 128002 - ignore_token_id: -100 - max_position_embedding: 131072 - compute_dtype: "float16" - layernorm_compute_type: "float32" - softmax_compute_type: "float32" - rotary_dtype: "float32" - param_init_type: "bfloat16" - use_past: True - is_dynamic: True - theta: 500000 - extend_method: "LLAMA3" # support "None", "PI", "NTK", "LLAMA3" - scaling_factor: - factor: 8.0 - low_freq_factor: 1.0 - high_freq_factor: 4.0 - original_max_position_embeddings: 8192 - use_flash_attention: True # FA can accelerate training or finetune - offset: 0 - fine_grain_interleave: 1 - checkpoint_name_or_path: "" - repetition_penalty: 1 - max_decode_length: 512 - block_size: 16 - num_blocks: 512 - top_k: 3 - top_p: 1 - do_sample: False - auto_map: - AutoTokenizer: [ llama3_1_tokenizer.Llama3Tokenizer, null ] - arch: - type: LlamaForCausalLM - -processor: - return_tensors: ms - tokenizer: - model_max_length: 8192 - vocab_file: "/path/tokenizer.model" - pad_token: 
"<|reserved_special_token_0|>" - type: Llama3Tokenizer - auto_register: llama3_1_tokenizer.Llama3Tokenizer - type: LlamaProcessor - -# metric -metric: - type: PerplexityMetric - - -auto_tune: False -filepath_prefix: './autotune' -autotune_per_step: 10 - -profile: False -profile_start_step: 4 -profile_stop_step: 8 -init_start_profile: False -profile_communication: False -profile_memory: True -layer_scale: False -layer_decay: 0.65 -lr_scale_factor: 256 - -# aicc -remote_save_url: "Please input obs url on AICC platform." diff --git a/research/llama3_1/llama3_1_conversation.py b/research/llama3_1/llama3_1_conversation.py deleted file mode 100644 index c4e466a411a664c2eac4f47b411ea52de0dec96d..0000000000000000000000000000000000000000 --- a/research/llama3_1/llama3_1_conversation.py +++ /dev/null @@ -1,184 +0,0 @@ -# Adapted from lm-sys@FastChat. Below is the original copyright: -# Copyright 2023 Wei-Lin Chiang, Lianmin Zheng, Ying Sheng -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -"""conversation prompt templates""" - -import dataclasses -from enum import auto, Enum -from typing import List, Any - - -class SeparatorStyle(Enum): - """Different separator style.""" - - ADD_COLON_SINGLE = auto() - ADD_COLON_TWO = auto() - NO_COLON_SINGLE = auto() - BAIZE = auto() - DOLLY = auto() - RWKV = auto() - - -@dataclasses.dataclass -class Conversation: - """A class that keeps all conversation history.""" - - # System prompts - system: str - # Two roles - roles: List[str] - # All messages - messages: List[List[str]] - # Offset of few shot examples - offset: int - # Separator - sep_style: SeparatorStyle - sep: str - sep2: str = None - # Stop criteria (the default one is EOS token) - stop_str: str = None - # Stops generation if meeting any token in this list - stop_token_ids: List[int] = None - - # Used for the state in the gradio servers. - conv_id: Any = None - skip_next: bool = False - model_name: str = None - - def get_prompt(self): - """Get the prompt for generation.""" - if self.sep_style == SeparatorStyle.ADD_COLON_SINGLE: - ret = self.system + self.sep - for role, message in self.messages: - if message: - ret += role + ": " + message + self.sep - else: - ret += role + ":" - return ret - if self.sep_style == SeparatorStyle.ADD_COLON_TWO: - seps = [self.sep, self.sep2] - ret = self.system + seps[0] - for i, (role, message) in enumerate(self.messages): - if message: - ret += role + ": " + message + seps[i % 2] - else: - ret += role + ":" - return ret - raise ValueError(f"Invalid style: {self.sep_style}") - - def append_message(self, role, message): - """Append a new message.""" - self.messages.append([role, message]) - - def to_openai_api_messages(self): - """Convert the conversation to OpenAI chat completion format.""" - ret = [{"role": "system", "content": self.system}] - - for i, (_, msg) in enumerate(self.messages[self.offset:]): - if i % 2 == 0: - ret.append({"role": "user", "content": msg}) - else: - if msg is not None: - ret.append({"role": 
"assistant", "content": msg}) - return ret - - def copy(self): - return Conversation( - system=self.system, - roles=self.roles, - messages=[[x, y] for x, y in self.messages], - offset=self.offset, - sep_style=self.sep_style, - sep=self.sep, - sep2=self.sep2, - stop_str=self.stop_str, - stop_token_ids=self.stop_token_ids, - conv_id=self.conv_id, - model_name=self.model_name, - ) - - def dict(self): - return { - "system": self.system, - "roles": self.roles, - "messages": self.messages, - "offset": self.offset, - "conv_id": self.conv_id, - "model_name": self.model_name, - } - - -# A template with one conversation example -conv_one_shot = Conversation( - system="A chat between a curious human and an artificial intelligence assistant. " - "The assistant gives helpful, detailed, and polite answers to the human's questions.", - roles=("Human", "Assistant"), - messages=( - ( - "Human", - "What are the key differences between renewable and non-renewable energy sources?", - ), - ( - "Assistant", - "Renewable energy sources are those that can be replenished naturally in a relatively " - "short amount of time, such as solar, wind, hydro, geothermal, and biomass. " - "Non-renewable energy sources, on the other hand, are finite and will eventually be " - "depleted, such as coal, oil, and natural gas. Here are some key differences between " - "renewable and non-renewable energy sources:\n" - "1. Availability: Renewable energy sources are virtually inexhaustible, while non-renewable " - "energy sources are finite and will eventually run out.\n" - "2. Environmental impact: Renewable energy sources have a much lower environmental impact " - "than non-renewable sources, which can lead to air and water pollution, greenhouse gas emissions, " - "and other negative effects.\n" - "3. Cost: Renewable energy sources can be more expensive to initially set up, but they typically " - "have lower operational costs than non-renewable sources.\n" - "4. 
Reliability: Renewable energy sources are often more reliable and can be used in more remote " - "locations than non-renewable sources.\n" - "5. Flexibility: Renewable energy sources are often more flexible and can be adapted to different " - "situations and needs, while non-renewable sources are more rigid and inflexible.\n" - "6. Sustainability: Renewable energy sources are more sustainable over the long term, while " - "non-renewable sources are not, and their depletion can lead to economic and social instability.", - ), - ), - offset=2, - sep_style=SeparatorStyle.ADD_COLON_SINGLE, - sep="\n### ", - stop_str="###", -) - - -# Vicuna v1.1 template -conv_vicuna_v1_1 = Conversation( - system="A chat between a curious user and an artificial intelligence assistant. " - "The assistant gives helpful, detailed, and polite answers to the user's questions.", - roles=("USER", "ASSISTANT"), - messages=(), - offset=0, - sep_style=SeparatorStyle.ADD_COLON_TWO, - sep=" ", - sep2="", -) - -conv_templates = { - "conv_one_shot": conv_one_shot, - "vicuna_v1.1": conv_vicuna_v1_1, -} - - -def get_default_conv_template(model_name): - model_name = model_name.lower() - if "vicuna" in model_name or "output" in model_name: - return conv_vicuna_v1_1 - return conv_one_shot diff --git a/research/llama3_1/llama3_1_preprocess.py b/research/llama3_1/llama3_1_preprocess.py deleted file mode 100644 index d201f46ed8b7b9055954dad9bb11537127ed151e..0000000000000000000000000000000000000000 --- a/research/llama3_1/llama3_1_preprocess.py +++ /dev/null @@ -1,228 +0,0 @@ -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# ============================================================================ - -""" -Transform the wikitext-2, wikitext-103, lambada, and openwebtext datasets to MindRecord. -""" -import argparse -import json -import os -import numpy as np -from mindspore.mindrecord import FileWriter -from mindformers.tools import logger -from mindformers.dataset.dataloader.datareaders import wikitext_clean -from llama3_1_tokenizer import Llama3Tokenizer -from llama3_1_conversation import get_default_conv_template - - -IGNORE_TOKEN_ID = -100 - - -def chunks(lst, n): - """Yield n-sized chunks from a list.""" - for i in range(0, len(lst), n): - yield lst[i:i + n] - - -def preprocess(sources, tokenizer, seq_length): - """conversation preprocess.""" - conv = get_default_conv_template("vicuna").copy() - roles = {"human": conv.roles[0], "gpt": conv.roles[1]} - - # Apply prompt templates - conversations = [] - for i, source in enumerate(sources): - if roles.get(source[0].get("from")) != conv.roles[0]: - # Skip the first message if it is not from a human - source = source[1:] - - conv.messages = [] - for j, sentence in enumerate(source): - role = roles.get(sentence.get("from")) - if role != conv.roles[j % 2]: - raise ValueError(f"sources[{i}] has an unexpected role order.") - conv.append_message(role, sentence["value"]) - conversations.append(conv.get_prompt()) - - sep = conv.sep + conv.roles[1] + ": " - # Tokenize conversations - input_ids = [] - targets = [] - for conversation in conversations: - rounds = conversation.split(conv.sep2) - ids = [tokenizer.bos_token_id] - mask = [1] - for _, rou in
enumerate(rounds): - if rou == "": - break - conv_out = tokenizer(rou) - ids.extend(conv_out['input_ids'][1:]) - mask.extend(conv_out['attention_mask'][1:]) - d = {'input_ids': ids, 'attention_mask': mask} - # pylint: disable=W0212 - d = tokenizer._pad(d, max_length=seq_length, padding_strategy='max_length') - input_ids.append(d['input_ids'][:seq_length]) - - target = np.array(d['input_ids']) - total_len = int(np.not_equal(target, tokenizer.pad_token_id).sum()) - cur_len = 1 - target[:cur_len] = IGNORE_TOKEN_ID - for _, rou in enumerate(rounds): - if rou == "": - break - parts = rou.split(sep) - if len(parts) != 2: - break - parts[0] += sep - round_len = len(tokenizer(rou)['input_ids']) - 1 - instruction_len = len(tokenizer(parts[0])['input_ids']) - 3 - - target[cur_len: cur_len + instruction_len] = IGNORE_TOKEN_ID - - cur_len += round_len - target[cur_len:] = IGNORE_TOKEN_ID - - if cur_len < seq_length: - if cur_len != total_len: - target[:] = IGNORE_TOKEN_ID - else: - target = target[:seq_length] - targets.append(target.tolist()) - - input_ids = np.array(input_ids, dtype=np.int32) - targets = np.array(targets, dtype=np.int32) - - return dict( - input_ids=input_ids, - labels=targets, - ) - - -class SupervisedDataset: - """Dataset for supervised fine-tuning.""" - - def __init__(self, raw_data, tokenizer, seq_length): - super(SupervisedDataset, self).__init__() - - sources = [example["conversations"] for example in raw_data] - data_dict = preprocess(sources, tokenizer, seq_length) - - self.input_ids = data_dict.get("input_ids") - self.labels = data_dict.get("labels") - - def __len__(self): - return len(self.input_ids) - - def __getitem__(self, i): - return dict( - input_ids=self.input_ids[i], - labels=self.labels[i] - ) - - -def tokenize_wiki(tokenizer, file_path, seq_length, repeat): - """tokenize wikitext-2/wikitext-103 dataset""" - content = [] - with open(file_path, 'r', encoding='utf-8') as f: - for para in wikitext_clean(f.read()).split("\n\n"): - if para and 
para.strip().startswith('=') is False: - content += tokenizer(para)['input_ids'] - content_out = [] - for _ in range(repeat): - content_out.extend(content) - content = content_out - for chunk in chunks(content, seq_length): - sample = {} - if len(chunk) == seq_length: - sample['input_ids'] = np.array(chunk, dtype=np.int32) - yield sample - - -# pylint: disable=C0111 -# pylint: disable=W0703 -def tokenize_qa(tokenizer, file_path, seq_length): - file = None - raw_data = None - try: - file = open(file_path, "r") - raw_data = json.load(file) - except FileNotFoundError as file_not_found_error: - logger.error(file_not_found_error) - except UnicodeDecodeError as decode_error: - logger.error(decode_error) - except IOError as io_error: - logger.error(io_error) - except Exception as exception: - logger.error(exception) - finally: - if file is not None: - file.close() - dataset_cls = SupervisedDataset(raw_data, tokenizer, seq_length) - for i, _ in enumerate(dataset_cls): - yield dataset_cls[i] - - -if __name__ == '__main__': - parser = argparse.ArgumentParser() - parser.add_argument('--dataset_type', type=str, default='wiki', choices=['wiki', 'qa']) - parser.add_argument('--input_glob', type=str, default='./dataset/wikitext-2/wiki.train.tokens') - parser.add_argument('--output_file', type=str, default='./dataset/wiki8192/wiki8192') - parser.add_argument('--tokenizer', type=str, default='llama3', choices=['llama3']) - parser.add_argument('--model_file', type=str, default='./ckpt/llama3/tokenizer.model') - parser.add_argument('--file_partition', type=int, default=1) - parser.add_argument('--repeat', type=int, default=1) - parser.add_argument('--seq_length', type=int, default=8192) - args = parser.parse_args() - # pylint: disable=C0326 - out_dir, out_file = os.path.split(os.path.abspath(args.output_file)) - if not os.path.exists(out_dir): - os.mkdir(out_dir) - if args.dataset_type == 'wiki': - schema = {'input_ids': {"type": "int32", "shape": [-1]}, } - elif args.dataset_type == 
'qa': - schema = {'input_ids': {"type": "int32", "shape": [-1]}, 'labels': {"type": "int32", "shape": [-1]}} - writer = FileWriter(file_name=args.output_file, - shard_num=args.file_partition) - writer.add_schema(schema, args.dataset_type) - - # Load the tokenizer - if not os.path.exists(args.model_file): - raise FileNotFoundError(f"file {args.model_file} does not exist.") - - transforms_count = 0 - word_tokenizer = Llama3Tokenizer(vocab_file=args.model_file) - if hasattr(word_tokenizer, 'add_bos_token'): - word_tokenizer.add_bos_token = True - if hasattr(word_tokenizer, 'add_eos_token'): - word_tokenizer.add_eos_token = True - if args.dataset_type == 'wiki': - for x in tokenize_wiki(word_tokenizer, args.input_glob, args.seq_length + 1, args.repeat): - transforms_count += 1 - writer.write_raw_data([x]) - print("Transformed {} records.".format(transforms_count)) - elif args.dataset_type == 'qa': - for x in tokenize_qa(word_tokenizer, args.input_glob, args.seq_length + 1): - transforms_count += 1 - writer.write_raw_data([x]) - print("Transformed {} records.".format(transforms_count)) - else: - raise ValueError( - "Unsupported dataset type: {}".format(args.dataset_type)) - - writer.commit() - out_file = args.output_file - if args.file_partition > 1: - out_file += '0' - print("Transform finished, output file: {}".format(out_file)) diff --git a/research/llama3_1/llama3_1_tokenizer.py b/research/llama3_1/llama3_1_tokenizer.py deleted file mode 100644 index bcc083c87bfd444cdb42a1d5947491c831e0b186..0000000000000000000000000000000000000000 --- a/research/llama3_1/llama3_1_tokenizer.py +++ /dev/null @@ -1,244 +0,0 @@ -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# ============================================================================ -"""llama3 tokenizer APIs.""" - -import base64 -from typing import Collection, Dict, List, Set, Union -import json -import unicodedata - -from mindformers.models.tokenization_utils import AddedToken, PreTrainedTokenizer -from mindformers.tools.register import MindFormerRegister, MindFormerModuleType -from mindformers.tools.utils import check_file - -try: - import tiktoken -except ImportError as e: - raise ImportError("Package 'tiktoken' is required to run Llama3. Please install it with pip.") from e - -PAT_STR = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| " \ - r"?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+" - - -def _load_tiktoken_bpe(tiktoken_bpe_file: str) -> Dict[bytes, int]: - with open(tiktoken_bpe_file, "rb") as f: - contents = f.read() - return { - base64.b64decode(token): int(rank) - for token, rank in (line.split() for line in contents.splitlines() if line) - } - - -def _load_tokenizer_json(json_file): - with open(json_file, "rb") as f: - contents = json.loads(f.read()) - return { - bytes(token, encoding='utf8'): int(rank) - for token, rank in contents['model']['vocab'].items() - } - - -@MindFormerRegister.register(MindFormerModuleType.TOKENIZER) -class Llama3Tokenizer(PreTrainedTokenizer): - """Llama3 Tokenizer""" - VOCAB_FILES = {'vocab_file': 'tokenizer.json'} - FILE_LIST = [] - special_tokens: Dict[str, int] - - def __init__(self, - vocab_file, - bos_token="<|begin_of_text|>", - eos_token="<|end_of_text|>", -
pad_token="<|reserved_special_token_0|>", - add_bos_token=False, - add_eos_token=False, - errors="replace", - num_reserved_special_tokens=256, - **kwargs): - pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token - - self.errors = errors - self.vocab_file = vocab_file - check_file(vocab_file, "tokenizer") - self.add_bos_token = add_bos_token - self.add_eos_token = add_eos_token - if vocab_file.split('.')[-1] == 'json': - self.mergeable_ranks = _load_tokenizer_json(vocab_file) - else: - self.mergeable_ranks = _load_tiktoken_bpe(vocab_file) # type: dict[bytes, int] - num_base_tokens = len(self.mergeable_ranks) - special_tokens = [ - "<|begin_of_text|>", - "<|end_of_text|>", - "<|reserved_special_token_0|>", - "<|reserved_special_token_1|>", - "<|reserved_special_token_2|>", - "<|reserved_special_token_3|>", - "<|start_header_id|>", - "<|end_header_id|>", - "<|reserved_special_token_4|>", - "<|eot_id|>", # end of turn - ] + [ - f"<|reserved_special_token_{i}|>" - for i in range(5, num_reserved_special_tokens - 5) - ] - self.special_tokens = { - token: num_base_tokens + i - for i, token in enumerate(special_tokens) - } - - self.tokenizer = tiktoken.Encoding( - "Llama3", - pat_str=PAT_STR, - mergeable_ranks=self.mergeable_ranks, - special_tokens=self.special_tokens, - ) - - self.decoder = { - v: k - for k, v in self.mergeable_ranks.items() - } # type: dict[int, bytes|str] - self.decoder.update({ - v: k - for k, v in self.special_tokens.items() - }) - - bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token - eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token - pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token - - super().__init__(bos_token=bos_token, - eos_token=eos_token, - pad_token=pad_token, - **kwargs) - - @property - def vocab_size(self): - return 
self.tokenizer.n_vocab - - def get_vocab(self): - """Returns vocab as a dict""" - vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)} - vocab.update(self.added_tokens_encoder) - return vocab - - # override Tokenizer.convert_tokens_to_string() - def convert_tokens_to_string(self, tokens: List[Union[bytes, str]]) -> str: - """ - Converts a sequence of tokens into a single string. - """ - text = "" - temp = b"" - for t in tokens: - if isinstance(t, str): - if temp: - text += temp.decode("utf-8", errors=self.errors) - temp = b"" - text += t - elif isinstance(t, bytes): - temp += t - else: - raise TypeError("token should only be of type bytes or str") - if temp: - text += temp.decode("utf-8", errors=self.errors) - return text - - # called by Tokenizer.convert_tokens_to_ids() & SpecialTokensMixin - def _convert_tokens_to_ids( - self, tokens: Union[bytes, str, List[Union[bytes, str]]] - ) -> Union[int, List[int]]: - """Convert the tokens to ids using vocab mapping""" - if isinstance(tokens, (str, bytes)): - return self._convert_token_to_id(tokens) - - ids = [] - for token in tokens: - ids.append(self._convert_token_to_id(token)) - return ids - - def _convert_token_to_id(self, token: Union[bytes, str]) -> int: - """Converts a token to an id using the vocab, special tokens included""" - if token in self.special_tokens: - return self.special_tokens[token] - if token in self.mergeable_ranks: - return self.mergeable_ranks[token] - raise ValueError("unknown token") - - # required by Tokenizer.convert_ids_to_tokens() of mindformers<=0.6 - def _convert_ids_to_tokens(self, input_id: int): - return self._convert_id_to_token(input_id) - - # called by Tokenizer.convert_ids_to_tokens() - def _convert_id_to_token(self, index: int) -> Union[bytes, str]: - """Converts an id to a token, special tokens included""" - if index in self.decoder: - return self.decoder[index] - raise ValueError("unknown ids") - - # pylint: disable=W0613 - def tokenize( - self, - text: str, - 
allowed_special: Union[Set, str] = "all", - disallowed_special: Union[Collection, str] = (), - **kwargs, - ) -> List[Union[bytes, str]]: - """ - Converts a string into a sequence of tokens. - - Args: - text (`str`): - The sequence to be encoded. - allowed_special (`Literal["all"]` or `set`): - The surface forms of the tokens to be encoded as special tokens in regular texts. - Defaults to "all". - disallowed_special (`Literal["all"]` or `Collection`): - The surface forms of the tokens that should not be in regular texts and trigger errors. - Defaults to an empty tuple. - - kwargs (additional keyword arguments, *optional*): - Will be passed to the underlying model specific encode method. - - Returns: - `List[bytes|str]`: The list of tokens. - """ - tokens = [] - text = unicodedata.normalize("NFC", text) - if self.add_bos_token: - tokens.insert(0, self.decoder[self.bos_token_id]) - - # this implementation takes a detour: text -> token id -> token surface forms - for t in self.tokenizer.encode( - text, allowed_special=allowed_special, disallowed_special=disallowed_special - ): - tokens.append(self.decoder[t]) - if self.add_eos_token: - tokens.append(self.decoder[self.eos_token_id]) - return tokens - - # pylint: disable=W0613 - def _decode( - self, - token_ids: Union[int, List[int]], - skip_special_tokens: bool = False, - errors: str = None, - **kwargs, - ) -> str: - """override Tokenizer._decode(), called by PreTrainedTokenizerBase.decode()""" - if isinstance(token_ids, int): - token_ids = [token_ids] - if skip_special_tokens: - token_ids = [i for i in token_ids if i != self.pad_token_id and i not in self.special_tokens.values()] - return self.tokenizer.decode(token_ids, errors=errors or self.errors) diff --git a/research/llama3_1/utils.py b/research/llama3_1/utils.py deleted file mode 100644 index 075d04ba1431beb6f8e662936c5f2d8d977d1789..0000000000000000000000000000000000000000 --- a/research/llama3_1/utils.py +++ /dev/null @@ -1,67 +0,0 @@ -# Copyright 2024 Huawei 
Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# ============================================================================ -""" -DEPRECATED MODULE - -This module is deprecated and will be removed in future releases. -LLaMA models' utils. -""" - - -def convert_model_config(configs): - """convert model config to dynamic-infer style""" - ffn_hidden_size = configs.hidden_size * 4 - if configs.intermediate_size is not None: - ffn_hidden_size = configs.intermediate_size - else: - if configs.ffn_dim_multiplier is not None: - ffn_hidden_size = int((configs.ffn_dim_multiplier + 0.01) * ffn_hidden_size) - ffn_hidden_size = int(2 * ffn_hidden_size / 3) - ffn_hidden_size = configs.multiple_of * ((ffn_hidden_size + configs.multiple_of - 1) // configs.multiple_of) - - configs.apply_query_key_layer_scaling = False - configs.apply_residual_connection_post_norm = False - configs.attention_dropout_rate = 0.0 - configs.attention_type = 'self_attn' - configs.ffn_hidden_size = ffn_hidden_size - configs.hidden_act = "silu" - configs.hidden_dropout_rate = 0.0 - configs.kv_num_heads = configs.num_heads if configs.n_kv_heads is None else configs.n_kv_heads - configs.layernorm_epsilon = configs.rms_norm_eps - configs.mask_func_type = "attn_mask_add" - configs.mlp_has_bias = False - configs.normalization = "RMSNorm" - configs.num_experts = None - configs.out_proj_has_bias = False - configs.param_init_dtype = configs.param_init_type - configs.layernorm_compute_dtype = 
configs.layernorm_compute_type - configs.residual_connection_dtype = configs.softmax_compute_type - configs.share_embedding_weight = False - configs.softmax_compute_dtype = configs.softmax_compute_type - configs.use_gqa = False - configs.mlp_has_gate = True - configs.post_norm = True - configs.recompute_granularity = None - configs.ffn_concat = configs.qkv_concat - configs.is_dynamic = True - - parallel_config = configs.parallel_config - parallel_config.tensor_parallel = parallel_config.model_parallel - parallel_config.expert_parallel = 1 - parallel_config.use_sequence_parallel = False - parallel_config.use_zero3 = False - configs.parallel_config = parallel_config - - return configs diff --git a/run_mindformer.py b/run_mindformer.py index 5d4476257c786bb90cef2365b2477e1630d6d7dd..3071803c3223d89bec49f455b34853ea13bcd1a7 100644 --- a/run_mindformer.py +++ b/run_mindformer.py @@ -66,7 +66,7 @@ def main(config): build_context(config) trainer = Trainer(config) - if config.run_mode == 'train' or config.run_mode == 'finetune': + if config.run_mode in ('train', 'finetune'): trainer.train() elif config.run_mode == 'eval': trainer.evaluate(eval_checkpoint=config.load_checkpoint) @@ -132,8 +132,6 @@ if __name__ == "__main__": parser.add_argument( '--load_checkpoint', default=None, type=str, help="load model checkpoint to train/finetune/eval/predict, " - "it is also support input model name, such as 'llama3_1_8b', " - "please refer to https://gitee.com/mindspore/mindformers#%E4%BB%8B%E7%BB%8D." "Default: None") parser.add_argument( '--src_strategy_path_or_dir', default=None, type=str, @@ -217,7 +215,7 @@ if __name__ == "__main__": for item in rest_args_ for i in item.split("=")] if len(rest_args_) % 2 != 0: - raise ValueError(f"input arg key-values are not in pair, please check input args. ") + raise ValueError("input arg key-values are not in pair, please check input args. 
") if args_.config is not None and not os.path.isabs(args_.config): args_.config = os.path.join(work_path, args_.config) diff --git a/tests/st/test_grace_exit_save_ckpt/__init__.py b/tests/st/test_grace_exit_save_ckpt/__init__.py deleted file mode 100644 index 39250f7a209c43909f413f55827e4bf534a72d25..0000000000000000000000000000000000000000 --- a/tests/st/test_grace_exit_save_ckpt/__init__.py +++ /dev/null @@ -1,15 +0,0 @@ -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# ============================================================================ -"""test resume.""" diff --git a/tests/st/test_grace_exit_save_ckpt/grace_exit_save_ckpt.py b/tests/st/test_grace_exit_save_ckpt/grace_exit_save_ckpt.py deleted file mode 100644 index 441dc470ae6d557a0c0ce60e584b48a97c33e02c..0000000000000000000000000000000000000000 --- a/tests/st/test_grace_exit_save_ckpt/grace_exit_save_ckpt.py +++ /dev/null @@ -1,83 +0,0 @@ -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and -# limitations under the License. -# ============================================================================ -""" -Test module for testing resume training from specified checkpoint. -How to run this: -pytest tests/st/test_grace_exit_save_ckpt/test_parallel_grace_exit_save_ckpt.py -""" -import os -from glob import glob -import numpy as np - -from mindspore.dataset import GeneratorDataset - -from mindformers import build_context -from mindformers.tools.utils import ( - get_epoch_and_step_from_ckpt_name -) -from mindformers.trainer import Trainer -from mindformers.models.llama import LlamaForCausalLM, LlamaConfig -from mindformers.tools.register import MindFormerConfig - - -SEED = 42 -NUM_LAYERS = 2 -NUM_HEADS = 4 -HIDDEN_SIZE = 512 -SEQ_LENGTH = 1024 -DATA_SIZE = 1024 - - -def generator(): - """dataset generator""" - for i in range(DATA_SIZE): - np.random.seed(SEED + i) - input_ids = np.random.randint(low=0, high=DATA_SIZE, size=(SEQ_LENGTH + 1,)).astype(np.int32) - yield input_ids - - -def get_checkpoints_path(checkpoint_dir): - """get checkpoints path""" - checkpoints_path = glob(os.path.join(checkpoint_dir, "*.ckpt")) - checkpoints_path.sort(key=get_epoch_and_step_from_ckpt_name) - return checkpoints_path - - -def llama_trainer_train_from_instance(): - """ - Feature: Create Trainer From Instance - Description: Test Trainer API to train from self-define instance API. - Expectation: TypeError - """ - # Config definition - - config = MindFormerConfig("./test_grace_exit_save_ckpt.yaml") - build_context(config) - - model_config = LlamaConfig(num_layers=NUM_LAYERS, seq_length=SEQ_LENGTH, - num_heads=NUM_HEADS, hidden_size=HIDDEN_SIZE, - parallel_config=config.parallel_config) - model = LlamaForCausalLM(model_config) - - # Training using first dataset. 
- dataset = GeneratorDataset(generator, column_names=["input_ids"]) - dataset = dataset.batch(batch_size=8) - - trainer = Trainer(model=model, args=config, train_dataset=dataset) - - trainer.train(train_checkpoint=False) - - -llama_trainer_train_from_instance() diff --git a/tests/st/test_grace_exit_save_ckpt/graceful_exit.json b/tests/st/test_grace_exit_save_ckpt/graceful_exit.json deleted file mode 100644 index 65610ba3421930f83d086999aaa78df0c1bbeaf9..0000000000000000000000000000000000000000 --- a/tests/st/test_grace_exit_save_ckpt/graceful_exit.json +++ /dev/null @@ -1,3 +0,0 @@ -{ - "GracefulExit": 1 -} \ No newline at end of file diff --git a/tests/st/test_grace_exit_save_ckpt/msrun_launch.sh b/tests/st/test_grace_exit_save_ckpt/msrun_launch.sh deleted file mode 100644 index 1358c6368f9d4f033e086daad614a771f1788edc..0000000000000000000000000000000000000000 --- a/tests/st/test_grace_exit_save_ckpt/msrun_launch.sh +++ /dev/null @@ -1,26 +0,0 @@ -#!/bin/bash -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-# ============================================================================ -set -e -BASE_PATH=$(cd "$(dirname $0)"; pwd) -USE_DEVICE_NUM=$1 - -msrun --worker_num=${USE_DEVICE_NUM} \ - --local_worker_num=${USE_DEVICE_NUM} \ - --master_port=8118 \ - --log_dir=msrun_log \ - --join=True \ - --cluster_time_out=300 \ - ${BASE_PATH}/grace_exit_save_ckpt.py > grace_exit_save_ckpt.log 2>&1 diff --git a/tests/st/test_grace_exit_save_ckpt/test_grace_exit_save_ckpt.yaml b/tests/st/test_grace_exit_save_ckpt/test_grace_exit_save_ckpt.yaml deleted file mode 100644 index db9222f64da500816c5f8e30486259c4e73defa8..0000000000000000000000000000000000000000 --- a/tests/st/test_grace_exit_save_ckpt/test_grace_exit_save_ckpt.yaml +++ /dev/null @@ -1,160 +0,0 @@ -seed: 0 -output_dir: '' # path to save checkpoint/strategy -load_checkpoint: "" -src_strategy_path_or_dir: '' -auto_trans_ckpt: False # If true, auto transform load_checkpoint to load in distributed model -only_save_strategy: False -resume_training: False -use_graceful_exit: True -run_mode: 'train' - -# trainer config -trainer: - type: CausalLanguageModelingTrainer - model_name: 'llama3_1_8b' - -# runner config -runner_config: - epochs: 2 - batch_size: 1 - sink_mode: True - sink_size: 1 - -# optimizer -optimizer: - type: AdamW - betas: [0.9, 0.95] - eps: 1.e-8 - -# lr schedule -lr_schedule: - type: CosineWithWarmUpLR - learning_rate: 1.e-5 - lr_end: 0.0 - warmup_ratio: 0.03 - total_steps: -1 # -1 means it will load the total steps of the dataset - -# dataset -train_dataset: &train_dataset - data_loader: - type: MindDataset - dataset_dir: "" - shuffle: True - input_columns: ["input_ids", "labels"] # "input_ids", "labels" , labels are used in instruction finetune. 
- num_parallel_workers: 8 - python_multiprocessing: False - drop_remainder: True - batch_size: 6 - repeat: 1 - numa_enable: False - prefetch_size: 1 -train_dataset_task: - type: CausalLanguageModelDataset - dataset_config: *train_dataset -# if True, do evaluate during the training process. if false, do nothing. -# note that the task trainer should support _evaluate_in_training function. -do_eval: False - -use_parallel: True -# parallel context config -parallel: - parallel_mode: 1 # 0-data parallel, 1-semi-auto parallel, 2-auto parallel, 3-hybrid parallel - gradients_mean: False - enable_alltoall: False - full_batch: True - search_mode: "sharding_propagation" - enable_parallel_optimizer: False - strategy_ckpt_save_file: "./ckpt_strategy.ckpt" - parallel_optimizer_config: - gradient_accumulation_shard: False - parallel_optimizer_threshold: 64 -# default parallel of device num = 8 for Atlas 800T A2 -parallel_config: - data_parallel: 2 - model_parallel: 4 - pipeline_stage: 1 - use_seq_parallel: False - micro_batch_num: 1 - vocab_emb_dp: True - gradient_aggregation_group: 4 -# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process. 
-micro_batch_interleave_num: 1 - -# recompute config -recompute_config: - recompute: True - select_recompute: False - parallel_optimizer_comm_recompute: False - mp_comm_recompute: True - recompute_slice_activation: True - -# callbacks -callbacks: - - type: MFLossMonitor - - type: OnRequestExit - save_ckpt: True - save_mindir: False - file_name: Llama - directory: "./grace_ckpt/" - config_file: "./graceful_exit.json" - - -# mindspore context init config -context: - mode: 0 #0--Graph Mode; 1--Pynative Mode - device_target: "Ascend" - max_call_depth: 10000 - max_device_memory: "58GB" - save_graphs: False - save_graphs_path: "./graph" - device_id: 0 - jit_config: - jit_level: "O1" - memory_optimize_level: "O0" - -# model config -model: - model_config: - type: LlamaConfig - batch_size: 1 # add for increase predict - seq_length: 1024 - hidden_size: 512 - num_layers: 2 - num_heads: 4 - n_kv_heads: 8 - vocab_size: 128256 - intermediate_size: 14336 - rms_norm_eps: 1.0e-5 - bos_token_id: 128000 - eos_token_id: 128001 - pad_token_id: 128002 - ignore_token_id: -100 - compute_dtype: "bfloat16" - layernorm_compute_type: "float32" - softmax_compute_type: "float32" - rotary_dtype: "float32" - param_init_type: "float16" - embedding_init_type: "bfloat16" - use_past: False - scaling_factor: 1.0 - theta: 500000 - extend_method: "None" # support "None", "PI", "NTK" - use_flash_attention: True # FA can accelerate training or finetune - offset: 0 - fine_grain_interleave: 1 - checkpoint_name_or_path: "" - repetition_penalty: 1 - max_decode_length: 512 - top_k: 3 - top_p: 1 - do_sample: False - arch: - type: LlamaForCausalLM - - -# wrapper cell config -runner_wrapper: - type: MFTrainOneStepCell - scale_sense: 1.0 - use_clip_grad: True - diff --git a/tests/st/test_grace_exit_save_ckpt/test_parallel_grace_exit_save_ckpt.py b/tests/st/test_grace_exit_save_ckpt/test_parallel_grace_exit_save_ckpt.py deleted file mode 100644 index 
0d172c7856ff33112e52054da1d30b7f9dde9eb8..0000000000000000000000000000000000000000 --- a/tests/st/test_grace_exit_save_ckpt/test_parallel_grace_exit_save_ckpt.py +++ /dev/null @@ -1,49 +0,0 @@ -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# ============================================================================ -""" -Test module for testing resume training from specified checkpoint. -How to run this: -pytest tests/st/test_grace_exit_save_ckpt/test_parallel_grace_exit_save_ckpt.py -""" -import os - -class TestSaveTtpCkpt: - """A test class for testing save_ttp_ckpt.""" - - def test_train(self): - """ - Feature: Trainer.train() - Description: Test parallel grace_exit_save_ckpt for train. 
- Expectation: AssertionError - """ - sh_path = os.path.split(os.path.realpath(__file__))[0] - ret = os.system(f"bash {sh_path}/msrun_launch.sh 8") - assert ret == 0 - checkpoint_dir = "./grace_ckpt" - path = "./msrun_log/worker_0.log" - flag = False - assert os.path.exists(path) - with open(path, 'r') as file: - content = file.read() - if "Graceful exit is triggered, stop training" in content: - flag = True - assert flag - for _, _, filenames in os.walk(checkpoint_dir): - for filename in filenames: - assert filename.endswith('.ckpt') - for _, _, filenames in os.walk(checkpoint_dir): - for filename in filenames: - if os.path.exists(filename): - os.remove(filename) diff --git a/tests/st/test_ut/test_models/test_build_config.py b/tests/st/test_ut/test_models/test_build_config.py deleted file mode 100644 index 4a2d3481ad82545739bcbf8008e652643c14dac6..0000000000000000000000000000000000000000 --- a/tests/st/test_ut/test_models/test_build_config.py +++ /dev/null @@ -1,28 +0,0 @@ -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-# ============================================================================ -"""test build config.""" -from mindformers import MindFormerConfig -from mindformers.models.build_config import build_model_config -from mindformers.models.llama import LlamaConfig - - -class TestBuildModelConfig: - """A test class for testing build_model_config() method.""" - - def test_build_llama_config(self): - """test build llama config from yaml.""" - config = MindFormerConfig("research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml") - model_config = build_model_config(config.model.model_config) - assert isinstance(model_config, LlamaConfig)
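
Review note: the deleted `_load_tiktoken_bpe` helper in `research/llama3_1/llama3_tokenizer.py` parses a tiktoken vocabulary file in which each line is a base64-encoded token followed by its integer rank. A minimal standalone sketch of that parsing logic, with a hypothetical function name and a fake two-entry vocabulary (not a real tokenizer file):

```python
import base64
from typing import Dict


def load_tiktoken_bpe_bytes(contents: bytes) -> Dict[bytes, int]:
    """Parse tiktoken BPE contents: one '<base64 token> <rank>' pair per line."""
    return {
        base64.b64decode(token): int(rank)
        for token, rank in (line.split() for line in contents.splitlines() if line)
    }


# Fake two-entry vocabulary for illustration only.
sample = b"\n".join([
    base64.b64encode(b"hello") + b" 0",
    base64.b64encode(b" world") + b" 1",
])
ranks = load_tiktoken_bpe_bytes(sample)
```

The resulting `bytes -> rank` mapping is what the deleted tokenizer handed to `tiktoken.Encoding` as `mergeable_ranks`, with the special tokens appended after the base vocabulary.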
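
Review note: the deleted `convert_model_config` in `research/llama3_1/utils.py` derives the FFN width the Llama way: start from 4x the hidden size, optionally scale by `ffn_dim_multiplier`, cut to 2/3, then round up to a multiple of `multiple_of`, with an explicit `intermediate_size` overriding everything. A standalone sketch of that arithmetic (hypothetical function name; the original reads these fields from a config object):

```python
def compute_ffn_hidden_size(hidden_size, intermediate_size=None,
                            ffn_dim_multiplier=None, multiple_of=256):
    """Replicate the Llama-style FFN width derivation from convert_model_config."""
    if intermediate_size is not None:
        # An explicit intermediate_size wins outright.
        return intermediate_size
    ffn = hidden_size * 4
    if ffn_dim_multiplier is not None:
        ffn = int((ffn_dim_multiplier + 0.01) * ffn)
    ffn = int(2 * ffn / 3)
    # Round up to the nearest multiple of `multiple_of`.
    return multiple_of * ((ffn + multiple_of - 1) // multiple_of)
```

With Llama-3-style values (`hidden_size=4096`, `ffn_dim_multiplier=1.3`, `multiple_of=1024`) this yields 14336, matching the `intermediate_size` in the test YAML above.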