diff --git a/docs/api/api_python/mindformers/mindformers.AutoConfig.rst b/docs/api/api_python/mindformers/mindformers.AutoConfig.rst index fb5d183d18b0bc40428ced173eaea970dcc411d5..b07bc172c01493f13cd55a93e76c9885bcef2ba2 100644 --- a/docs/api/api_python/mindformers/mindformers.AutoConfig.rst +++ b/docs/api/api_python/mindformers/mindformers.AutoConfig.rst @@ -30,7 +30,7 @@ mindformers.AutoConfig 这个API正处于实验阶段,在下一个版本中可能会有一些突破性的变化。 参数: - - **model_type** (str) - 模型简称,类似'llama3_1'。 + - **model_type** (str) - 模型简称,类似'qwen2_5'。 - **config** (PretrainedConfig) - 用于注册的类。 - **exist_ok** (bool, 可选) - 为True时,若model_type已存在也不报错。默认值: ``False`` 。 diff --git a/mindformers/models/auto/configuration_auto.py b/mindformers/models/auto/configuration_auto.py index 8b27300b6cf1131d20094cdb657eeb034767e0d0..963eed1752f48fd359081cd557224a58eca11b89 100644 --- a/mindformers/models/auto/configuration_auto.py +++ b/mindformers/models/auto/configuration_auto.py @@ -417,7 +417,7 @@ class AutoConfig: The API is experimental and may have some slight breaking changes in the next releases. Args: - model_type (str): The model type like "llama3_1". + model_type (str): The model type like "qwen2_5". config (PretrainedConfig): The config to register. exist_ok (bool, optional): If set to True, no error will be raised even if model_type already exists. Default: ``False``. 
diff --git a/research/llama3_1/README.md b/research/llama3_1/README.md deleted file mode 100644 index 30561b2dcbf0dec83361a87d5a2105aeae191348..0000000000000000000000000000000000000000 --- a/research/llama3_1/README.md +++ /dev/null @@ -1,453 +0,0 @@ -# Llama 3.1 - -## 模型描述 - -Llama 3.1,是开源Llama系列的最新产品,目前有三个版本:Llama 3.1-8B,Llama 3.1-70B,Llama 3.1-405B。 -Llama 3.1在来自公开可用来源的超过15T的数据上进行了预训练。微调数据包括公开可用的指令数据集,以及超过1000万个人工标注的示例。 -模型支持上下文窗口长度128K,并使用了新的分词器,词汇表大小达到128256个,采用了分组查询注意力机制(GQA)。 -Llama 3.1模型是类GPT模型,是一个生成式的语言模型,主要是用于预测下一个单词。 -目前Mindformers支持Llama 3.1-8B,Llama 3.1-70B,敬请期待Llama 3.1-405B。 - -## 模型性能 - -以下模型性能均由Atlas 800T A2硬件环境下测试得出。 - -| Config | Task | Datasets | SeqLength | Performance | Phase | -|:-------------------------------------------------------|:---------------:|:--------:|:---------:|:------------:|:-------:| -| [llama3_1_8b](llama3_1_8b/predict_llama3_1_8b.yaml) | text_generation | - | 2048 | 591 tokens/s | Predict | -| [llama3_1_70b](llama3_1_70b/predict_llama3_1_70b.yaml) | text_generation | - | 4096 | 509 tokens/s | Predict | - -以下模型性能均由Atlas 900 A2 PoDc硬件环境下测试得出。 - -| Config | Task | Datasets | SeqLength | Performance | Phase | -|:--------------------------------------------------------|:---------------:|:--------:|:---------:|:---------------:|:--------:| -| [llama3_1_8b](llama3_1_8b/finetune_llama3_1_8b.yaml) | text_generation | alpaca | 8192 | 2703 tokens/s/p | Finetune | -| [llama3_1_70b](llama3_1_70b/finetune_llama3_1_70b.yaml) | text_generation | alpaca | 8192 | 337 tokens/s/p | Finetune | - -## 模型文件 - -`Llama 3.1` 基于 `mindformers` 实现,主要涉及的文件有: - -1. 模型具体实现: - - ```text - mindformers/models/llama - ├── __init__.py - ├── llama.py # 模型实现 - ├── llama_config.py # 模型配置项 - ├── llama_layer.py # llama网络层定义 - ├── llama_processor.py # llama预处理 - └── llama_transformer.py # transformer层实现 - ``` - -2. 
模型配置: - - ```text - research/llama3_1 - ├──llama3_1_8b - │ ├── predict_llama3_1_8b.yaml # 8B推理配置 - │ └── finetune_llama3_1_8b.yaml # 8B全量微调启动配置 - └──llama3_1_70b - ├── predict_llama3_1_70b.yaml # 70B推理配置 - └── finetune_llama3_1_70b.yaml # 70B全量微调启动配置 - ``` - -3. 数据预处理脚本和任务启动脚本: - - ```text - research/llama3_1 - ├── llama3_1_tokenizer.py # llama3_1 tokenizer处理脚本 - ├── llama3_1_conversation.py # 微调数据集处理,将原始alpaca转换为对话形式alpaca - └── llama3_1_preprocess.py # llama模型的mindrecord数据处理脚本 - ``` - -## 环境及数据准备 - -### 安装环境 - -MindFormers软硬件配套关系以及安装参考[环境安装指南](../../README_CN.md#源码编译安装) -和[版本匹配关系](../../README_CN.md#版本匹配关系)。 - -### 数据集及权重准备 - -#### 数据集下载 - -MindFormers提供**alpaca**作为[微调](#微调)数据集。 - -| 数据集名称 | 适用模型 | 适用阶段 | 下载链接 | -|:--------|:------------------------------:|:--------:|:-------------------------------------------------------------------------------:| -| alpaca | llama3_1-8b
llama3_1-70b | Finetune | [Link](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json) | - -数据预处理中所用的`tokenizer.model`可以参考[模型权重下载](#模型权重下载)进行下载。 - -- **alpaca 数据预处理** - - 1. 执行`mindformers/tools/dataset_preprocess/llama/alpaca_converter.py`,使用fastchat工具添加prompts模板,将原始数据集转换为多轮对话格式。 - - ```shell - python alpaca_converter.py \ - --data_path /{path}/alpaca_data.json \ - --output_path /{path}/alpaca-data-conversation.json - - # 参数说明 - data_path: 输入下载的文件路径 - output_path: 输出文件的保存路径 - ``` - - 2. 执行`research/llama3_1/llama3_1_preprocess.py`,生成Mindrecord数据,将带有prompt模板的数据转换为mindrecord格式。 - - ```shell - # 此工具依赖fschat工具包解析prompt模板, 请提前安装fschat >= 0.2.13(要求python >= 3.9) - python llama3_1_preprocess.py \ - --dataset_type qa \ - --input_glob /{path}/alpaca-data-conversation.json \ - --model_file /{path}/tokenizer.model \ - --seq_length 8192 \ - --output_file /{path}/alpaca-fastchat8192.mindrecord - - # 参数说明 - dataset_type: 预处理数据类型 - input_glob: 转换后的alpaca的文件路径 - model_file: 模型tokenizer.model文件路径 - seq_length: 输出数据的序列长度 - output_file: 输出文件的保存路径 - ``` - -> 数据处理时注意bos,eos,pad等特殊`ids`要和配置文件中`model_config`里保持一致。 - -#### 模型权重下载 - -MindFormers暂时没有提供权重,用户可以下载HuggingFace官方权重经过[模型权重转换](#模型权重转换)后进行使用。 - -词表下载链接:[tokenizer.model](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) - -| 模型名称 | MindSpore权重 | HuggingFace权重 | -|:-------------|:-----------:|:------------------------------------------------------------:| -| Llama3_1-8B | - | [Link](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) | -| Llama3_1-70B | - | [Link](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B) | - -> 注: 请自行申请huggingface上llama3_1使用权限,并安装transformers==4.40版本 - -#### 模型权重转换 - -下载完成后,运行`mindformers/convert_weight.py`转换脚本,将huggingface的权重转换为完整的ckpt权重。 - -```shell -python convert_weight.py --model llama --input_path TORCH_CKPT_DIR --output_path {path}/MS_CKPT_NAME --dtype bf16 - -# 参数说明 -model: 模型名称 -input_path: 下载HuggingFace权重的文件夹路径 -output_path: 转换后的MindSpore权重文件保存路径 -dtype: 转换权重的精度 -``` - -## 微调 
- -### 全参微调 - -MindFormers提供`Llama3_1-8b`单机多卡以及`Llama3_1-70b`多机多卡的微调示例,过程中使用`alpaca` -数据集对模型进行微调,数据集可以参考[数据集下载](#数据集下载)获得。 - -#### 单机训练 - -以Llama3_1-8b为例,在Atlas 800T A2上训练,支持**单机/多机训练**。 - -使用`finetune_llama3_1_8b.yaml`进行训练,或修改默认配置文件中的`model_config.seq_length` -,使训练配置与数据集的`seq_length`保持一致。 - -执行以下命令,在单机上拉起微调任务。 - -```shell -# 单机8卡默认快速启动 -bash scripts/msrun_launcher.sh "run_mindformer.py \ - --register_path research/llama3_1 \ - --config research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml \ - --load_checkpoint model_dir/xxx.ckpt \ - --auto_trans_ckpt True \ - --use_parallel True \ - --run_mode finetune \ - --train_data dataset_dir" - -# 参数说明 -config: 配置文件路径 -load_checkpoint: 权重文件路径 -auto_trans_ckpt: 自动权重转换开关 -run_mode: 运行模式, 微调时设置为finetune -train_data: 训练数据集路径 -``` - -#### 多机训练 - -以llama3_1-70b为例,使用`finetune_llama3_1_70b.yaml`配置文件,执行8机64卡微调。需要先对权重进行切分,切分权重可以参见[权重切分与合并](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/ckpt.html#%E6%9D%83%E9%87%8D%E5%88%87%E5%88%86%E4%B8%8E%E5%90%88%E5%B9%B6)(如果是共享盘也可以开启自动权重转换,使用完整权重)。 - -多机多卡执行脚本进行分布式训练需要分别在不同节点运行脚本,并将参数MASTER_ADDR设置为主节点的ip地址,所有节点设置的ip地址相同,不同节点之间仅参数NODE_RANK不同,各个参数位置含义参见[使用指南](../../README_CN.md#三使用指南)。 - -在每台机器上运行以下命令,多机运行命令在每台机器上仅`node_num` 不同,从0开始计数,命令中主节点ip为第0个节点ip。 - -```shell -# 节点0,设0节点ip为192.168.1.1,作为主节点ip,总共64卡且每个节点8卡 -# 节点0、节点1、...节点7 依次修改node_num,比如8机,node_num为0~7。 -bash scripts/msrun_launcher.sh "run_mindformer.py \ - --register_path research/llama3_1 \ - --config research/llama3_1/llama3_1_70b/finetune_llama3_1_70b.yaml \ - --load_checkpoint model_dir/xxx.ckpt \ - --train_data dataset_dir \ - --auto_trans_ckpt False \ - --use_parallel True \ - --run_mode finetune" \ - 64 8 {主节点ip} 8118 {node_num} output/msrun_log False 300 -``` - -## 推理 - -MindFormers提供`Llama3_1-8b`的快速推理脚本,脚本主要通过generate高阶接口实现,支持单卡推理。推理输入默认不添加bos字符,如果需要添加可在config中增加add_bos_token选项。 - -```shell -# 脚本使用 -bash scripts/examples/llama3/run_llama3_predict.sh PARALLEL CONFIG_PATH CKPT_PATH VOCAB_FILE DEVICE_NUM - -# 参数说明 
-PARALLEL: 是否使用多卡推理, 'single'表示单卡推理, 'parallel'表示多卡推理 -CONFIG_PATH: 模型配置文件路径 -CKPT_PATH: 模型权重文件路径 -VOCAB_FILE: 词表路径 -DEVICE_NUM: 使用卡数, 仅开启多卡推理时生效 -``` - -### 单卡推理 - -以`Llama3_1-8b`单卡推理为例。 - -```shell -bash scripts/examples/llama3/run_llama3_predict.sh single \ - research/llama3_1/llama3_1_8b/predict_llama3_1_8b.yaml \ - path/to/llama3_1_8b.ckpt \ - path/to/tokenizer.model -``` - -### 多卡推理 - -以`Llama3_1-70b`4卡推理为例。Llama3_1-70b权重较大,建议先进行权重切分,参见[权重切分与合并](https://www.mindspore.cn/mindformers/docs/zh-CN/master/feature/ckpt.html#%E6%9D%83%E9%87%8D%E5%88%87%E5%88%86%E4%B8%8E%E5%90%88%E5%B9%B6)。 - -```shell -bash scripts/examples/llama3/run_llama3_predict.sh parallel \ - research/llama3_1/llama3_1_70b/predict_llama3_1_70b.yaml \ - path/to/model_dir \ - path/to/tokenizer.model 4 -``` - -## 基于MindIE的服务化推理 - -MindIE,全称Mind Inference Engine,是华为昇腾针对AI全场景业务的推理加速套件。 - -MindFormers承载在模型应用层MindIE-LLM中,MindIE-LLM是大语言模型推理框架,提供API支持大模型推理能力。 - -MindIE安装流程请参考[MindIE服务化部署文档](https://www.mindspore.cn/mindformers/docs/zh-CN/master/guide/deployment.html)。 - -以下例子默认已完成MindIE安装部署且仅适用于**MindIE RC3版本**,且安装路径均为默认路径`/usr/local/Ascend/`。 - -### 单卡推理 - -此例子使用llama3_1-8B模型演示。 - -#### 修改MindIE启动配置 - -打开mindie-service中的config.json文件,修改server相关配置。 - -```bash -vim /usr/local/Ascend/mindie/1.0.RC3/mindie-service/conf/config.json -``` - -需要关注以下字段的配置 - -1. `ModelDeployConfig.ModelConfig.backendType` - - 该配置为对应的后端类型,必填"ms"。 - - ```json - "backendType": "ms" - ``` - - 2. 
`ModelDeployConfig.ModelConfig.modelWeightPath` - - 该配置为模型配置文件目录,放置模型和tokenizer等相关文件。 - - 以llama3_1-8B为例,`modelWeightPath`的组织结构如下: - - ```text - mf_model - └── llama3_1_8b - ├── config.json # 模型json配置文件 - ├── tokenizer.model # 模型vocab文件,hf上对应模型下载 - ├── predict_llama3_1_8b.yaml # 模型yaml配置文件 - ├── llama3_1_tokenizer.py # 模型tokenizer文件,从mindformers仓中research目录下找到对应模型复制 - └── llama3_1_8b.ckpt # 单卡模型权重文件 - ``` - - predict_llama3_1_8b.yaml需要关注以下配置: - - ```yaml - load_checkpoint: '/mf_model/llama3_1_8b/llama3_1_8b.ckpt' # 模型单卡权重文件的存放路径 - use_parallel: False - model: - model_config: - type: LlamaConfig - auto_map: - AutoTokenizer: [llama3_1_tokenizer.Llama3Tokenizer, null] - processor: - tokenizer: - vocab_file: "/mf_model/llama3_1_8b/tokenizer.model" # vocab文件路径 - ``` - - 模型的config.json文件可以使用`save_pretrained`接口生成,示例如下: - - ```python - from mindformers import AutoConfig - - model_config = AutoConfig.from_pretrained("/mf_model/llama3_1_8b/predict_llama3_1_8b.yaml") - model_config.save_pretrained(save_directory="/mf_model/llama3_1_8b", save_json=True) - ``` - - 模型权重下载和转换可参考 [权重格式转换](https://www.mindspore.cn/docs/zh-CN/master/api_python/mindspore/mindspore.ckpt_to_safetensors.html)。 - - 准备好模型配置目录后,设置参数`modelWeightPath`为该目录路径。 -```json - "modelWeightPath": "/mf_model/llama3_1_8b" -``` - -最终修改完后的config.json如下: - -```json -{ - "Version": "1.0.0", - "LogConfig" : - { - "logLevel" : "Info", - "logFileSize" : 20, - "logFileNum" : 20, - "logPath" : "logs/mindservice.log" - }, - - "ServerConfig" : - { - "ipAddress" : "127.0.0.1", - "managementIpAddress": "127.0.0.2", - "port" : 1025, - "managementPort" : 1026, - "metricsPort" : 1027, - "maxLinkNum" : 1000, - "httpsEnabled" : false, - "fullTextEnabled" : false, - "tlsCaPath" : "security/ca/", - "tlsCaFile" : ["ca.pem"], - "tlsCert" : "security/certs/server.pem", - "tlsPk" : "security/keys/server.key.pem", - "tlsPkPwd" : "security/pass/key_pwd.txt", - "tlsCrl" : "security/certs/server_crl.pem", - "managementTlsCaFile" : 
["management_ca.pem"], - "managementTlsCert" : "security/certs/management/server.pem", - "managementTlsPk" : "security/keys/management/server.key.pem", - "managementTlsPkPwd" : "security/pass/management/key_pwd.txt", - "managementTlsCrl" : "security/certs/management/server_crl.pem", - "kmcKsfMaster" : "tools/pmt/master/ksfa", - "kmcKsfStandby" : "tools/pmt/standby/ksfb", - "inferMode" : "standard", - "pdInterNodeTLSEnabled": false, - "pdCommunicationPort": 1121, - "interNodeTlsCaFile" : "security/grpc/ca/ca.pem", - "interNodeTlsCert" : "security/grpc/certs/server.pem", - "interNodeTlsPk" : "security/grpc/keys/server.key.pem", - "interNodeTlsPkPwd" : "security/grpc/pass/key_pwd.txt", - "interCommTlsCrl" : "security/certs/server_crl.pem", - "interNodeKmcKsfMaster": "tools/pmt/master/ksfa", - "interNodeKmcKsfStandby": "tools/pmt/standby/ksfb" - }, - - "BackendConfig": { - "backendName" : "mindieservice_llm_engine", - "modelInstanceNumber" : 1, - "npuDeviceIds" : [[0]], - "tokenizerProcessNumber" : 8, - "multiNodesInferEnabled": false, - "multiNodesInferPort": 1120, - "interNodeTLSEnabled": true, - "interNodeTlsCaFile": "security/grpc/ca/ca.pem", - "interNodeTlsCert": "security/grpc/certs/server.pem", - "interNodeTlsPk": "security/grpc/keys/server.key.pem", - "interNodeTlsPkPwd": "security/grpc/pass/mindie_server_key_pwd.txt", - "interNodeTlsCrl" : "security/grpc/certs/server_crl.pem", - "interNodeKmcKsfMaster": "tools/pmt/master/ksfa", - "interNodeKmcKsfStandby": "tools/pmt/standby/ksfb", - "ModelDeployConfig": - { - "maxSeqLen" : 2560, - "maxInputTokenLen" : 2048, - "truncation" : false, - "ModelConfig" : [ - { - "modelInstanceType": "Standard", - "modelName" : "llama3_1_8b", - "modelWeightPath" : "/mf_model/llama3_1_8b", - "worldSize" : 1, - "cpuMemSize" : 16, - "npuMemSize" : 16, - "backendType": "ms" - } - ] - }, - - "ScheduleConfig": - { - "templateType": "Standard", - "templateName" : "Standard_LLM", - "cacheBlockSize" : 128, - - "maxPrefillBatchSize" : 50, - 
"maxPrefillTokens" : 8192, - "prefillTimeMsPerReq" : 150, - "prefillPolicyType" : 0, - - "decodeTimeMsPerReq" : 50, - "decodePolicyType" : 0, - - "maxBatchSize" : 200, - "maxIterTimes" : 512, - "maxPreemptCount" : 0, - "supportSelectBatch" : false, - "maxQueueDelayMicroseconds" : 5000 - } - } -} -``` - -> 注:为便于测试,`httpsEnabled`参数设置为`false`,忽略后续https通信相关参数。 - -#### 启动服务 - -```bash -cd /usr/local/Ascend/mindie/1.0.RC3/mindie-service -nohup ./bin/mindieservice_daemon > output.log 2>&1 & -tail -f output.log -``` - -打印如下信息,说明启动成功。 - -```text -Daemon start success! -``` - -#### 请求测试 - -服务启动成功后,可使用curl命令发送请求验证,端口与上述config.json中`port`保持一致,样例如下: - -```bash -curl -w "\ntime_total=%{time_total}\n" -H "Accept: application/json" -H "Content-type: application/json" -X POST -d '{"inputs": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n请介绍一下自己<|im_end|>\n<|im_start|>assistant\n","stream": false}' http://127.0.0.1:1025/generate -``` - -返回推理结果验证成功: - -```json -{"generated_text":"我叫小助手,专门为您服务的。<|im_end|>\n<"} -``` diff --git a/research/llama3_1/infer/layers.py b/research/llama3_1/infer/layers.py deleted file mode 100644 index 3af3965e1e40c24a52cf684bff67e5584ccd0a61..0000000000000000000000000000000000000000 --- a/research/llama3_1/infer/layers.py +++ /dev/null @@ -1,523 +0,0 @@ -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-# ============================================================================ -""" -DEPRECATED MODULE - -This module is deprecated and will be removed in future releases. -Layers -""" -import mindspore.common.dtype as mstype -import mindspore.ops.functional as F -import mindspore.ops.operations as P -from mindspore import Parameter, Tensor, mint, nn, ops -from mindspore.common.initializer import initializer - -from mindformers.parallel_core.inference.parallel_state import default_pgs -from mindformers.parallel_core.inference.tensor_parallel.mappings import (gather_from_model_parallel_region, - reduce_from_model_parallel_region, - reduce_scatter_to_model_parallel_region, - scatter_to_model_parallel_region) -from mindformers.parallel_core.inference.tensor_parallel.random import (TENSOR_PARALLEL_GENERATOR, - get_rng_tracer) -from mindformers.parallel_core.inference.utils import divide -from mindformers.version_control import check_valid_gmm_op -from mindformers.models.utils import jit - - -class ColumnParallelLinear(nn.Cell): - """ - The dense layer with weight sliced on second dimension by tensor parallel size. - This layer implements the operation as: - - .. math:: - \\text{outputs} = \\text{inputs} * \\text{weight} + \\text{bias}, - - where :math:`inputs` is the input tensors, :math:`\\text{weight}` is a weight matrix created by the layer, - and :math:`\\text{bias}` is a bias vector created by the layer (only if has_bias is True). - - Args: - input_size (int): The number of channels in the input space. - output_size (int): The number of channels in the output space. - config (dict): Parallel configuration. - weight_init (Union[Tensor, str, Initializer, numbers.Number]): The trainable weight_init parameter. The values - of str refer to the function `initializer`. Default: 'normal'. - bias_init (Union[Tensor, str, Initializer, numbers.Number]): The trainable bias_init parameter. The values - of str refer to the function `initializer`. Default: 'zeros'. 
- bias (bool): Specifies whether the layer uses a bias vector. Default: True. - gather_output (bool): Specifies whether to gather the output on each tensor parallel rank. Default: False. - skip_weight_param_allocation (bool): Specifies whether to skip the initialization of the weight parameter. - When set to True, a weight tensor should be passed to the construct function. Default: False. - is_expert (bool): Specifies whether this linear layer is an expert. Default: False. - transpose_b (bool): Specifies whether the weight parameter will be initialized as a transposed shape. - param_init_type (dtype.Number): The parameter initialization type. Default: mstype.float32. - compute_dtype (dtype.Number): The computation type. Default: mstype.float16. - expert_num (int): The number of experts. Default: 1. - tp_group (ProcessGroup): The process group used by this linear layer. Default: default_pgs. - - Inputs: - - **x** (Tensor) - Tensor of shape :math:`(*, in\\_channels)`. The `input_size` in `Args` should be equal - to :math:`in\\_channels` in `Inputs`. - - Outputs: - Tensor of shape :math:`(*, out\\_channels)`. - - Raises: - ValueError: If `skip_weight_param_allocation=True` but no weight tensor is passed to the construct function. 
- - Supported Platforms: - ``Ascend`` - """ - - def __init__( - self, - input_size, - output_size, - config, - weight_init="normal", - bias_init="zeros", - bias=True, - gather_output=False, - stride=1, - keep_master_weight_for_test=False, - skip_bias_add=False, - skip_weight_param_allocation=False, - embedding_activation_buffer=None, - grad_output_buffer=None, - is_expert=False, - tp_comm_buffer_name=None, - disable_grad_reduce=False, - transpose_b=True, - param_init_type=mstype.float32, - compute_dtype=mstype.float16, - expert_num=1, - tp_group=default_pgs, - ): - super(ColumnParallelLinear, self).__init__() - if stride > 1: - raise NotImplementedError("For ColumnParallelLinear, `stride > 1` is not supported for now, " - "but got `stride={}`".format(stride)) - if keep_master_weight_for_test: - raise NotImplementedError("For ColumnParallelLinear, `keep_master_weight_for_test=True` " - "is not supported for now.") - if skip_bias_add: - raise NotImplementedError("For ColumnParallelLinear, `skip_bias_add=True` is not supported for now.") - if embedding_activation_buffer: - raise NotImplementedError("For ColumnParallelLinear, `embedding_activation_buffer` is not supported " - "for now.") - if grad_output_buffer: - raise NotImplementedError("For ColumnParallelLinear, `grad_output_buffer` is not supported for now.") - if tp_comm_buffer_name: - raise NotImplementedError("For ColumnParallelLinear, `tp_comm_buffer_name` is not supported for now.") - if disable_grad_reduce: - raise NotImplementedError("For ColumnParallelLinear, `disable_grad_reduce=True` is not supported for now.") - - self.input_size = input_size - self.output_size = output_size - self.has_bias = bias - self.gather_output = gather_output - self.tp_group = tp_group - self.tensor_parallel_group_size = self.tp_group.size - - self.output_size_per_partition = divide(output_size, self.tensor_parallel_group_size) - self.is_expert = is_expert - self.expert_num = expert_num - self.skip_weight_param_allocation = 
skip_weight_param_allocation - self.parallel_config = config - self.compute_dtype = compute_dtype - - self.sequence_parallel = self.parallel_config.use_sequence_parallel - self.transpose_b = transpose_b if self.expert_num <= 1 else False - - if self.sequence_parallel and self.tensor_parallel_group_size <= 1: - self.sequence_parallel = False - - weight_shape = (self.output_size_per_partition, self.input_size) if self.transpose_b else ( - self.input_size, self.output_size_per_partition) - if self.is_expert and self.expert_num > 1: - weight_shape = (self.expert_num,) + weight_shape - if check_valid_gmm_op(gmm_version='GroupedMatmulV4'): - self.matmul = ops.auto_generate.GroupedMatmulV4() - else: - self.matmul = ops.auto_generate.GroupedMatmul(split_item=3, group_type=0) - else: - self.matmul = P.MatMul(transpose_b=self.transpose_b) - with get_rng_tracer().rng_fork(TENSOR_PARALLEL_GENERATOR): - if not self.skip_weight_param_allocation: - self.weight = Parameter(initializer(weight_init, weight_shape, param_init_type), name="weight") - - if self.has_bias: - self.bias = Parameter( - initializer( - bias_init, (self.output_size_per_partition), param_init_type - ), - name="bias", - ) - self.bias_add = P.Add() - - self.cast = P.Cast() - self.shape = P.Shape() - self.reshape = P.Reshape() - - @jit - def construct(self, input_parallel, weight=None, group_list=None): - """ - Forward of ColumnParallelLinear. - Performs a linear transformation considering various parallel modes and data type conversions. 
- """ - - if weight is None and self.skip_weight_param_allocation: - raise ValueError("For ColumnParallelLinear, when skip_weight_param_allocation=True," - " weight should be passed to construct(), but got None.") - - origin_dtype = F.dtype(input_parallel) - if self.skip_weight_param_allocation: - weight = self.cast(weight, self.compute_dtype) - else: - weight = self.cast(self.weight, self.compute_dtype) - input_parallel = self.cast(input_parallel, self.compute_dtype) - - if self.sequence_parallel: - input_parallel = input_parallel.swapaxes(0, 1).contiguous() - input_parallel = self.gather_from_sp_region(input_parallel) - input_parallel = input_parallel.swapaxes(0, 1).contiguous() - - output_shape = self.shape(input_parallel)[:-1] + (self.output_size_per_partition,) - input_parallel = self.reshape(input_parallel, (-1, self.input_size)) - if self.is_expert and self.expert_num > 1: - if check_valid_gmm_op(gmm_version='GroupedMatmulV4'): - output_parallel = self.matmul([input_parallel], [weight], None, None, None, None, None, None, - group_list, split_item=3, group_type=0, group_list_type=1)[0] - else: - output_parallel = self.matmul([input_parallel], [weight], None, None, None, None, None, - group_list)[0] - - else: - output_parallel = self.matmul(input_parallel, weight) - if self.has_bias: - output_parallel = self.bias_add( - output_parallel, self.cast(self.bias, self.compute_dtype) - ) - output_parallel = self.cast(output_parallel, origin_dtype) - output_parallel = self.reshape(output_parallel, output_shape) - - if self.gather_output: - output = gather_from_model_parallel_region(output_parallel, self.tp_group) - else: - output = output_parallel - return output - - def sharded_state_dict(self): - """provide the sharded state dict based on the config""" - w_shard = (self.tensor_parallel_group_size, 1) if self.transpose_b else (1, self.tensor_parallel_group_size) - - if self.is_expert and self.expert_num > 1: - w_shard = (1, self.tensor_parallel_group_size, 1) if 
self.transpose_b \ - else (1, 1, self.tensor_parallel_group_size) - - state_dict = {} - if not self.skip_weight_param_allocation: - state_dict[self.weight.name] = {'shape': self.weight.shape, - 'shard': w_shard} - if self.has_bias: - state_dict[self.bias.name] = {'shape': self.bias.shape, - 'shard': (self.tensor_parallel_group_size,)} - return state_dict - - -class RowParallelLinear(nn.Cell): - r""" - The dense layer with weight sliced on first dimension by tensor parallel size. - This layer implements the operation as: - - .. math:: - \text{outputs} = \text{inputs} * \text{weight} + \text{bias}, - - where :math:`inputs` is the input tensors, :math:`\text{weight}` is a weight matrix created by the layer, - and :math:`\text{bias}` is a bias vector created by the layer (only if has_bias is True). - - Args: - input_size (int): The number of channels in the input space. - output_size (int): The number of channels in the output space. - config (dict): Parallel configuration. - input_is_parallel (bool): Specifies whether the input tensor has already been sliced on last dimension. - weight_init (Union[Tensor, str, Initializer, numbers.Number]): The trainable weight_init parameter. The values - of str refer to the function `initializer`. Default: 'normal'. - bias_init (Union[Tensor, str, Initializer, numbers.Number]): The trainable bias_init parameter. The values - of str refer to the function `initializer`. Default: 'zeros'. - bias (bool): Specifies whether the layer uses a bias vector. Default: True. - skip_bias_add (bool): Specifies whether the layer doesn't need to add bias. Default: False. - is_expert (bool): Specifies whether this linear layer is an expert. Default: False. - transpose_b (bool): Specifies whether the weight parameter will be initialized as a transposed shape. - param_init_type (dtype.Number): The parameter initialization type. Default: mstype.float32. - compute_dtype (dtype.Number): The computation type. Default: mstype.float16. 
- expert_num (int): The number of experts. Default: 1. - tp_group (ProcessGroup): The process group used by this linear layer. Default: default_pgs. - - Inputs: - - **x** (Tensor) - Tensor of shape :math:`(*, in\_channels)`. The `input_size` in `Args` should be equal - to :math:`in\_channels` in `Inputs`. - - Outputs: - Tensor of shape :math:`(*, out\_channels)`. - - Supported Platforms: - ``Ascend`` - """ - - def __init__( - self, - input_size, - output_size, - config, - input_is_parallel, - weight_init="normal", - bias_init="zeros", - bias=True, - skip_bias_add=False, - stride=1, - keep_master_weight_for_test=False, - is_expert=False, - tp_comm_buffer_name=None, - transpose_b=True, - param_init_type=mstype.float32, - compute_dtype=mstype.float16, - expert_num=1, - delay_allreduce=False, - tp_group=default_pgs, - ): - super(RowParallelLinear, self).__init__() - if stride > 1: - raise NotImplementedError("For RowParallelLinear, `stride > 1` is not supported for now, " - "but got `stride={}`".format(stride)) - if keep_master_weight_for_test: - raise NotImplementedError("For RowParallelLinear, `keep_master_weight_for_test=True` " - "is not supported for now.") - if tp_comm_buffer_name: - raise NotImplementedError("For RowParallelLinear, `tp_comm_buffer_name` is not supported for now.") - - self.input_size = input_size - self.output_size = output_size - self.has_bias = bias - self.skip_bias_add = skip_bias_add - self.input_is_parallel = input_is_parallel - self.tp_group = tp_group - self.tensor_parallel_group_size = self.tp_group.size - self.input_size_per_partition = divide(input_size, self.tensor_parallel_group_size) - self.parallel_config = config - self.compute_dtype = compute_dtype - self.sequence_parallel = self.parallel_config.use_sequence_parallel - self.expert_num = expert_num - self.is_expert = is_expert - self.transpose_b = transpose_b if self.expert_num <= 1 else False - self.delay_allreduce = delay_allreduce - - if self.sequence_parallel and not 
self.input_is_parallel: - raise RuntimeError( - "To enable `sequence_parallel`, `input_is_parallel` must be `True`" - ) - - if self.delay_allreduce and self.has_bias: - raise RuntimeError( - "In RowParallelLinear, `delay_allreduce` and `has_bias` cannot be enabled simultaneously, " - "otherwise the accuracy will be incorrect" - ) - - weight_shape = (self.output_size, self.input_size_per_partition) if self.transpose_b else ( - self.input_size_per_partition, self.output_size) - bias_shape = (self.output_size,) - if self.is_expert and self.expert_num > 1: - weight_shape = (self.expert_num,) + weight_shape - bias_shape = (1, self.expert_num, 1) + bias_shape - if check_valid_gmm_op(gmm_version='GroupedMatmulV4'): - self.matmul = ops.auto_generate.GroupedMatmulV4() - else: - self.matmul = ops.auto_generate.GroupedMatmul(split_item=3, group_type=0) - else: - self.matmul = P.MatMul(transpose_b=self.transpose_b) - with get_rng_tracer().rng_fork(TENSOR_PARALLEL_GENERATOR): - self.weight = Parameter( - initializer( - weight_init, - weight_shape, - param_init_type, - ), - name="weight", - ) - - if self.has_bias: - self.bias = Parameter(initializer(bias_init, bias_shape, param_init_type), name="bias") - self.bias_add = P.Add() - - self.shape = P.Shape() - self.reshape = P.Reshape() - self.cast = P.Cast() - - def construct(self, input_, group_list=None): - """ - Forward of RowParallelLinear. - Performs a linear transformation considering various parallel modes and data type conversions. 
- """ - - if self.input_is_parallel: - input_parallel = input_ - else: - input_parallel = scatter_to_model_parallel_region(input_, self.tp_group) - - origin_dtype = F.dtype(input_parallel) - weight = self.cast(self.weight, self.compute_dtype) - input_parallel = self.cast(input_parallel, self.compute_dtype) - output_shape = self.shape(input_parallel)[:-1] + (self.output_size,) - input_parallel = self.reshape(input_parallel, (-1, self.input_size_per_partition)) - if self.is_expert and self.expert_num > 1: - if check_valid_gmm_op(gmm_version='GroupedMatmulV4'): - output_parallel = self.matmul([input_parallel], [weight], None, None, None, None, None, None, - group_list, split_item=3, group_type=0, group_list_type=1)[0] - else: - output_parallel = self.matmul([input_parallel], [weight], None, None, None, None, None, - group_list)[0] - else: - output_parallel = self.matmul(input_parallel, weight) - - if self.sequence_parallel: - output_parallel = output_parallel.swapaxes(0, 1).contiguous() - output = reduce_scatter_to_model_parallel_region(output_parallel, self.tp_group) - output = output.swapaxes(0, 1).contiguous() - else: - if self.delay_allreduce or self.skip_bias_add: - output = output_parallel - else: - output = reduce_from_model_parallel_region(output_parallel, self.tp_group) - - if self.has_bias and not self.skip_bias_add: - output = self.bias_add(output, self.cast(self.bias, self.compute_dtype)) - output = self.cast(output, origin_dtype) - output = self.reshape(output, output_shape) - return output - - def sharded_state_dict(self): - """provide the sharded state dict based on the config""" - w_shard = (1, self.tensor_parallel_group_size) if self.transpose_b else (self.tensor_parallel_group_size, 1) - - if self.is_expert and self.expert_num > 1: - w_shard = (1, 1, self.tensor_parallel_group_size) if self.transpose_b \ - else (1, self.tensor_parallel_group_size, 1) - - state_dict = {} - state_dict[self.weight.name] = {'shape': self.weight.shape, - 'shard': w_shard} 
- if self.has_bias: - state_dict[self.bias.name] = {'shape': self.bias.shape, - 'shard': (1,)} - return state_dict - - -class VocabParallelEmbedding(nn.Cell): - """ - Embedding parallelized in the vocabulary dimension. - - Args: - num_embeddings: vocabulary size. - embedding_dim: size of hidden state. - parallel_config (Optional[Union[dict, ParallelContextConfig]]): - Parallel Config For Running Environment. Default: None. - init_method (Union[Tensor, str, Initializer, numbers.Number]): The trainable weight_init parameter. The values - of str refer to the function `initializer`. Default: 'normal'. - init_type (dtype.Number): The parameter initialization type. Default: mstype.float32. - tp_group (ProcessGroup): The process_group this linear layer used. Default: default_pgs. - """ - - def __init__( - self, - num_embeddings, - embedding_dim, - parallel_config, - init_method="normal", - init_type=mstype.float32, - tp_group=default_pgs, - ): - super().__init__() - self.num_embeddings = num_embeddings - self.embedding_dim = embedding_dim - self.sequence_parallel = parallel_config.use_sequence_parallel - - self.tp_group = tp_group - self.tensor_parallel_group_size = self.tp_group.size - rank = self.tp_group.rank - - self.vocab_start_index, self.vocab_end_index = self._vocab_range_from_global_vocab_size( - self.num_embeddings, rank, self.tensor_parallel_group_size) - self.num_embeddings_per_partition = self.vocab_end_index - self.vocab_start_index - - with get_rng_tracer().rng_fork(): - self.embedding_weight = Parameter( - initializer( - init=init_method, - shape=(self.num_embeddings_per_partition, self.embedding_dim), - dtype=init_type, - ), - name="embedding_weight", - ) - self.max_index_per_partition = Tensor(self.num_embeddings_per_partition - 1, dtype=mstype.int32) - self.expand_dims = ops.ExpandDims() - self.gather = ops.Gather() - - def construct(self, x): - """ - Forward of VocabParallelEmbedding. 
- Computes embeddings with optional masking and parallel reduction based on the model parallel size. - """ - - if self.tensor_parallel_group_size > 1: - displaced_x = mint.sub(x, self.vocab_start_index) - down_truncated_x = mint.nn.functional.relu(displaced_x) - truncated_x = mint.minimum(down_truncated_x, self.max_index_per_partition) - input_mask = mint.eq(displaced_x, truncated_x) - input_mask = self.expand_dims(input_mask, -1) - else: - input_mask = None - truncated_x = x - # Get the embeddings. - # 'embedding' has dynamic shape issue, use gather instead now. - output_parallel = self.gather(self.embedding_weight, truncated_x, 0) - # Mask the output embedding. - if self.tensor_parallel_group_size > 1: - output_parallel = mint.mul(output_parallel, input_mask) - - if self.sequence_parallel: - output_parallel = output_parallel.swapaxes(0, 1).contiguous() - output = reduce_scatter_to_model_parallel_region(output_parallel, self.tp_group) - output = output.swapaxes(0, 1).contiguous() - else: - # Reduce across all the model parallel devices. 
-            output = reduce_from_model_parallel_region(output_parallel, self.tp_group)
-        return output
-
-    def _vocab_range_from_global_vocab_size(self, global_vocab_size, rank, world_size):
-        if global_vocab_size % world_size != 0:
-            raise ValueError(f"The vocabulary size is {global_vocab_size}, "
-                             f"which is not divisible by the tensor parallel size ({world_size}).")
-        per_partition_vocab_size = divide(global_vocab_size, world_size)
-        index_f = rank * per_partition_vocab_size
-        index_l = index_f + per_partition_vocab_size
-        return index_f, index_l
-
-    def sharded_state_dict(self):
-        """provide the sharded state dict based on the config"""
-        w_shard = (self.tensor_parallel_group_size, 1)
-        state_dict = {}
-        state_dict[self.embedding_weight.name] = {'shape': self.embedding_weight.shape,
-                                                  'shard': w_shard}
-
-        return state_dict
diff --git a/research/llama3_1/infer/norm.py b/research/llama3_1/infer/norm.py
deleted file mode 100644
index 7c7e5d53fe02e9ebe53d6c65764ed31ff3f10ecb..0000000000000000000000000000000000000000
--- a/research/llama3_1/infer/norm.py
+++ /dev/null
@@ -1,92 +0,0 @@
-# Copyright 2025 Huawei Technologies Co., Ltd
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ============================================================================
-"""
-DEPRECATED MODULE

-This module is deprecated and will be removed in future releases.
-Normalization
-"""
-import mindspore.common.dtype as mstype
-import mindspore.ops.operations as P
-from mindspore import Parameter, nn
-from mindspore.common.initializer import initializer
-
-from mindformers.version_control import check_rmsnorm_big_kernel_valid
-
-
-class RMSNorm(nn.Cell):
-    r"""
-    A self-defined RMSNorm operation using reduce mean.
-
-    Args:
-        dim (int): The size of the normalized dimension (hidden size).
-        eps (float): The epsilon value of the denominator. Default: 1e-6.
-        compute_type: The compute type. Default: mstype.float32.
-    Inputs:
-        - **x** (Tensor) - Tensor of shape :math:`(batch, seq\_length, hidden\_size)`.
-
-    Outputs:
-        Tensor of shape :math:`(batch, seq\_length, hidden\_size)`.
-    """
-
-    def __init__(self, dim, eps=1e-6, compute_type=mstype.float32):
-        super().__init__()
-        self.eps = eps
-        self.compute_type = compute_type
-        self.weight = Parameter(initializer('ones', (dim,), dtype=self.compute_type), parallel_optimizer=False)
-
-        if check_rmsnorm_big_kernel_valid():
-            self.norm = P.RmsNorm(eps)
-            self.rms_norm = self._rms_norm
-            self.self_define = False
-            self.cast = P.Cast()
-            self.rcast = P.Cast()
-        else:
-            self.cast = P.Cast()
-            self.mul = P.Mul()
-            self.mul2 = P.Mul()
-            self.square = P.Square()
-            self.mean = P.ReduceMean(keep_dims=True)
-            self.add = P.Add()
-            self.rsqrt = P.Rsqrt()
-            self.rms_norm = self._self_norm
-            self.self_define = True
-
-    def _self_norm(self, x):
-        original_type = x.dtype
-        norm_factor = self.square(self.cast(x, self.compute_type))
-        norm_factor = self.mean(norm_factor, -1)
-        norm_factor = self.add(norm_factor, self.eps)
-        norm_factor = self.rsqrt(norm_factor)
-        output = self.mul(x, self.cast(norm_factor, original_type))
-        output = self.mul2(output, self.cast(self.weight, original_type))
-        return output
-
-    def _rms_norm(self, x):
-        original_type = x.dtype
-        output = self.norm(self.cast(x, self.compute_type), self.weight)[0]
-        return self.rcast(output, original_type)
-
-    def construct(self, x):
-        """Forward of RMSNorm."""
-        return self.rms_norm(x)
-
-    
def sharded_state_dict(self): - """provide the sharded state dict based on the config""" - w_shard = (1,) - state_dict = {} - state_dict[self.weight.name] = {'shape': self.weight.shape, - 'shard': w_shard} - return state_dict diff --git a/research/llama3_1/infer/parallel_paged_attention_mgr.py b/research/llama3_1/infer/parallel_paged_attention_mgr.py deleted file mode 100644 index 76751f1fa7e3a617e2fb43a271e8edd01c856ed3..0000000000000000000000000000000000000000 --- a/research/llama3_1/infer/parallel_paged_attention_mgr.py +++ /dev/null @@ -1,91 +0,0 @@ -# Copyright 2025 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# ============================================================================ -""" -DEPRECATED MODULE - -This module is deprecated and will be removed in future releases. -Paged Attention Manager for inference. 
-""" -import math - -import mindspore.common.dtype as mstype -from mindspore import ops as P -from mindspore import nn -from mindformers.parallel_core.inference.utils import create_empty_parameter - - -class ParallelPagedAttentionMgr(nn.Cell): - """Paged Attention Manager.""" - def __init__(self, - n_heads, - head_dim, - n_kv_heads, - kv_shape, - seq_length=-1, - compute_dtype=mstype.float16, - npu_mem_size=2): - super().__init__() - self.n_heads = n_heads - self.head_dim = head_dim - self.n_kv_heads = n_kv_heads - self.seq_length = seq_length - self.is_first_iteration = True - self.scale_value = 1 / math.sqrt(self.head_dim) - self.key_cache = None - self.value_cache = None - self.npu_mem_size = npu_mem_size - if self.npu_mem_size > 0: - self.key_cache = create_empty_parameter( - shape=kv_shape, - dtype=compute_dtype, - device="Ascend", - name="key_cache", - requires_grad=False, - ) - self.value_cache = create_empty_parameter( - shape=kv_shape, - dtype=compute_dtype, - device="Ascend", - name="value_cache", - requires_grad=False, - ) - - self.reshape_and_cache = P.auto_generate.ReshapeAndCache() - self.paged_attention = P.auto_generate.PagedAttention(self.n_heads, - self.scale_value, - self.n_kv_heads) - self.paged_attention_with_alibi = P.auto_generate.PagedAttentionMask(self.n_heads, - self.scale_value, - self.n_kv_heads) - - def construct(self, key, value, slot_mapping, _, key_cache=None, value_cache=None): - """The forward compute of KVCache for Paged Attention.""" - if self.npu_mem_size == -1: - return self.reshape_and_cache(key, value, key_cache, value_cache, slot_mapping) - return self.reshape_and_cache(key, value, self.key_cache, self.value_cache, slot_mapping) - - def paged_attn(self, query, batch_valid_length, block_tables, attn_mask=None, q_seq_lens=None, - key_cache=None, value_cache=None): - if self.npu_mem_size == -1: - return self._paged_attn(query, batch_valid_length, block_tables, attn_mask, q_seq_lens, - key_cache, value_cache) - return 
self._paged_attn(query, batch_valid_length, block_tables, attn_mask, q_seq_lens, - self.key_cache, self.value_cache) - - def _paged_attn(self, query, batch_valid_length, block_tables, attn_mask=None, q_seq_lens=None, - key_cache=None, value_cache=None): - """The forward compute of Paged Attention.""" - return self.paged_attention(query, key_cache, value_cache, block_tables, batch_valid_length, - None, None, attn_mask, q_seq_lens) diff --git a/research/llama3_1/infer/random.py b/research/llama3_1/infer/random.py deleted file mode 100644 index 66b3f91595a653544d1e5b073194ddb16251778b..0000000000000000000000000000000000000000 --- a/research/llama3_1/infer/random.py +++ /dev/null @@ -1,100 +0,0 @@ -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# ============================================================================ -""" -DEPRECATED MODULE - -This module is deprecated and will be removed in future releases. 
-RNGStateTracer
-"""
-from contextlib import contextmanager
-try:
-    from mindspore import manual_seed, get_rng_state, set_rng_state
-except ImportError:
-    from mindspore.nn.generator import manual_seed, get_rng_state, set_rng_state
-
-
-DATA_PARALLEL_GENERATOR = "dp_rng_generator"
-TENSOR_PARALLEL_GENERATOR = "tp_rng_generator"
-EXPERT_PARALLEL_GENERATOR = "exp_rng_generator"
-IS_SEED_SET = False
-CANDIDATE_MODES = [DATA_PARALLEL_GENERATOR, TENSOR_PARALLEL_GENERATOR, EXPERT_PARALLEL_GENERATOR]
-
-
-class RNGStateTracer:
-    """
-    Examples:
-        >>> with rngstatetracer.rng_fork():
-        >>>     tensor = mint.normal(mean, std)
-        >>>     ...
-    """
-    def __init__(self):
-        self.reset()
-
-    def reset(self):
-        self._states = {}
-
-    def set_state(self, states):
-        self._states = states
-
-    def get_state(self):
-        states = {}
-        for mode in self._states:
-            states[mode] = self._states[mode]
-        return states
-
-    def init_mode(self, mode, seed):
-        "initialize a mode with a seed; the mode must not already exist, otherwise reset should be called first"
-        if mode in self._states:
-            # if the mode already exists, raise an exception
-            raise ValueError(f"cannot init generator with already existing mode {mode}")
-        # save current state, set and record target state, then restore old state
-        orig_rng_state = get_rng_state()
-        manual_seed(seed)
-        self._states[mode] = get_rng_state()
-        set_rng_state(orig_rng_state)
-
-    # pylint: disable=W0101
-    @contextmanager
-    def rng_fork(self, mode=TENSOR_PARALLEL_GENERATOR):
-        "fork the rng state if the seed is already set, otherwise keep the rng state unchanged"
-        if not IS_SEED_SET:
-            yield
-            return
-        # if the mode does not exist, raise an exception
-        if mode not in self._states:
-            raise ValueError(f"tracer is not initialized or the parallel mode {mode} does not exist")
-        # save current state, then set target state
-        orig_rng_state = get_rng_state()
-        set_rng_state(self._states[mode])
-        try:
-            # yield to do job
-            yield
-        finally:
-            # restore old state
-            self._states[mode] = get_rng_state()
-            set_rng_state(orig_rng_state)
-
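The fork/restore contract of `RNGStateTracer` above (seed a dedicated stream once, then temporarily swap it in and advance it on every `rng_fork`) is framework-agnostic. Below is a minimal sketch of the same pattern using Python's stdlib `random` module; `TinyRNGTracer` and the `"tp"` mode name are hypothetical illustrations, not the deleted module's API:

```python
import random
from contextlib import contextmanager


class TinyRNGTracer:
    """Minimal analog of RNGStateTracer, backed by Python's `random` module."""

    def __init__(self):
        self._states = {}

    def init_mode(self, mode, seed):
        if mode in self._states:
            raise ValueError(f"mode {mode!r} already initialized")
        orig = random.getstate()            # save the caller's stream
        random.seed(seed)                   # derive a dedicated stream
        self._states[mode] = random.getstate()
        random.setstate(orig)               # caller's stream is untouched

    @contextmanager
    def rng_fork(self, mode):
        if mode not in self._states:
            raise ValueError(f"mode {mode!r} not initialized")
        orig = random.getstate()
        random.setstate(self._states[mode])  # swap in the dedicated stream
        try:
            yield
        finally:
            self._states[mode] = random.getstate()  # advance the forked stream
            random.setstate(orig)                   # restore the caller's stream


tracer = TinyRNGTracer()
tracer.init_mode("tp", seed=123)
with tracer.rng_fork("tp"):
    a = random.random()   # first draw of the dedicated stream
with tracer.rng_fork("tp"):
    b = random.random()   # continues that stream, not the caller's
```

Each `rng_fork` block draws from the dedicated stream and leaves the caller's global stream exactly where it was, which is what lets tensor-parallel initialization share one seed without perturbing other randomness.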
-default_rng_tracer_ = None
-
-
-def _init_default_rng_tracer():
-    global default_rng_tracer_
-    default_rng_tracer_ = RNGStateTracer()
-
-
-def get_rng_tracer():
-    if default_rng_tracer_ is None:
-        _init_default_rng_tracer()
-    return default_rng_tracer_
diff --git a/research/llama3_1/infer/scale_mask_softmax.py b/research/llama3_1/infer/scale_mask_softmax.py
deleted file mode 100644
index 1bd2c7561c1470cba4b8d0a4f4627b0037bbc5ce..0000000000000000000000000000000000000000
--- a/research/llama3_1/infer/scale_mask_softmax.py
+++ /dev/null
@@ -1,68 +0,0 @@
-# Copyright 2025 Huawei Technologies Co., Ltd
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ============================================================================
-"""
-DEPRECATED MODULE
-
-This module is deprecated and will be removed in future releases.
-ScaleMaskSoftmax
-"""
-import mindspore.ops.functional as F
-
-from mindspore import mint, ops, nn
-from mindspore.common import dtype as mstype
-
-
-class ScaleMaskSoftmax(nn.Cell):
-    r"""
-    fused operation: scaling + mask + softmax
-
-    Args:
-        mask_func: mask function to be applied.
-        scale: scaling factor used in input tensor scaling.
-        softmax_compute_type: the precision in which the softmax is computed.
-
-    Inputs:
-        - **x** (Tensor) - The input tensor
-        - **mask** (Tensor) - The mask tensor
-
-    Outputs:
-        - The output tensor.
- """ - - def __init__(self, mask_func, scale=None, softmax_compute_type=mstype.float32): - super().__init__() - self.mask_func = mask_func - self.softmax_compute_type = softmax_compute_type - self.scale = scale - - if self.scale is not None and self.softmax_compute_type != mstype.float32: - raise ValueError("softmax should be in fp32 when scaled") - - def construct(self, x, mask): - """construct method""" - origin_dtype = F.dtype(x) - if self.softmax_compute_type != origin_dtype: - x = ops.cast(x, self.softmax_compute_type) - - if self.scale is not None: - x = x * self.scale - masked_input = self.mask_func(x, mask) if mask is not None else x - - probs = mint.nn.functional.softmax(masked_input, dim=-1) - - if self.softmax_compute_type != origin_dtype: - probs = ops.cast(probs, origin_dtype) - - return probs diff --git a/research/llama3_1/infer/transformer.py b/research/llama3_1/infer/transformer.py deleted file mode 100644 index 45a194e2ec54c4f41964b36127d7a45c0523abee..0000000000000000000000000000000000000000 --- a/research/llama3_1/infer/transformer.py +++ /dev/null @@ -1,888 +0,0 @@ -# Copyright 2025 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# ============================================================================ -""" -DEPRECATED MODULE - -This module is deprecated and will be removed in future releases. 
-For transformer -""" -import math -import os - -import numpy as np - -import mindspore.common.dtype as mstype -from mindspore import Parameter, Tensor, mint, nn, ops -from mindspore.common.initializer import initializer - -from mindformers.parallel_core.inference.utils import divide, get_attn_mask_func -from mindformers.parallel_core.inference.transformer.activation import get_act_func -from mindformers.parallel_core.process_group_config import default_model_comm_pgs -from mindformers.modules.flash_attention import FlashAttention -from mindformers.modules.infer_attention import InferRotaryEmbedding -from mindformers.modules.layers import FreqsMgr, RotaryEmbedding -from mindformers.modules.transformer import LowerTriangularMaskWithDynamic -from mindformers.version_control import need_nz - -from research.llama3_1.infer.norm import RMSNorm -from research.llama3_1.infer.parallel_paged_attention_mgr import ParallelPagedAttentionMgr -from research.llama3_1.infer.scale_mask_softmax import ScaleMaskSoftmax -from research.llama3_1.infer.layers import ColumnParallelLinear, RowParallelLinear, VocabParallelEmbedding - - -class VocabEmbedding(nn.Cell): - """ - Embedding Layer. - - Args: - - **num_embeddings** (int): Size of the dictionary of embeddings. - - **embedding_dim** (int): The size of each embedding vector. - - **param_init_type** (mstype): The param init type, default mstype.float32. - - **param_init** (Union[Tensor, str, Initializer, numbers.Number]): Initializer for the embedding_table. - Refer to class `initializer` for the values of string when a string - is specified. Default: 'normal'. - Inputs: - - **input_ids** (Tensor) - The tokenized inputs with datatype int32 with shape (batch_size, seq_length) - - Outputs: - - **output** (Tensor) - The embedding vector for the input with shape (batch_size, - seq_length, embedding_size). 
-    """
-
-    def __init__(self, num_embeddings, embedding_dim, param_init_type=mstype.float32, param_init='normal',
-                 parallel_optimizer=False):
-        super().__init__()
-        self.num_embeddings = num_embeddings
-        self.embedding_dim = embedding_dim
-        self.embedding_weight = Parameter(
-            initializer(param_init, [self.num_embeddings, self.embedding_dim], dtype=param_init_type),
-            name='embedding_weight', parallel_optimizer=parallel_optimizer)
-        self.gather = ops.Gather()
-
-    def construct(self, input_ids):
-        """Forward of vocab embedding."""
-        # 'embedding' has dynamic shape issue, use gather instead now.
-        output = self.gather(self.embedding_weight, input_ids, 0)
-        return output
-
-
-class ParallelMLP(nn.Cell):
-    r"""
-    Implementation of parallel feedforward block.
-
-    Args:
-        config (dict): Configuration.
-        is_expert (bool): Whether this block is an expert block. Default: False.
-        model_comm_pgs (ModelCommProcessGroups, optional): Model communication process group.
-            Default: default_model_comm_pgs.
-
-    Inputs:
-        - **hidden_states** (Tensor) - Tensor of shape :math:`(B, S, H)`.
-
-    Outputs:
-        - **output** (Tensor) - Output tensor of shape :math:`(B, S, H)`.
- - Supported Platforms: - ``Ascend`` - """ - - def __init__(self, config, is_expert=False, model_comm_pgs=default_model_comm_pgs): - super().__init__(config) - if is_expert: - raise NotImplementedError("For ParallelMLP, `is_expert` is not supported for now.") - self.config = config - self.has_bias = self.config.mlp_has_bias - self.hidden_size = self.config.hidden_size - self.ffn_hidden_size = self.config.ffn_hidden_size - self.mlp_has_gate = self.config.mlp_has_gate - self.ffn_concat = self.config.ffn_concat - - self.tp = model_comm_pgs.tp - tp_group_size = self.tp.size - self.ffn_hidden_size_per_partition = divide(self.ffn_hidden_size, tp_group_size) - - if self.mlp_has_gate: - if self.ffn_concat: - self.w_gate_hidden = ColumnParallelLinear( - self.hidden_size, - self.ffn_hidden_size * 2, - config=self.config.parallel_config, - bias=self.has_bias, - transpose_b=True, - gather_output=False, - is_expert=is_expert, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - tp_group=self.tp, - ) - else: - self.w1 = ColumnParallelLinear( - self.hidden_size, - self.ffn_hidden_size, - config=self.config.parallel_config, - bias=self.has_bias, - transpose_b=True, - gather_output=False, - is_expert=is_expert, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - tp_group=self.tp, - ) - self.w3 = ColumnParallelLinear( - self.hidden_size, - self.ffn_hidden_size, - config=self.config.parallel_config, - bias=self.has_bias, - transpose_b=True, - gather_output=False, - is_expert=is_expert, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - tp_group=self.tp, - ) - else: - self.w1 = ColumnParallelLinear( - self.hidden_size, - self.ffn_hidden_size, - config=self.config.parallel_config, - bias=self.has_bias, - transpose_b=True, - gather_output=False, - is_expert=is_expert, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - 
tp_group=self.tp,
-            )
-
-        self.act_type = self.config.hidden_act
-        self.act_func = get_act_func(self.act_type)
-
-        # Project back to h.
-        self.w2 = RowParallelLinear(
-            self.ffn_hidden_size,
-            self.hidden_size,
-            input_is_parallel=True,
-            config=self.config.parallel_config,
-            bias=self.has_bias,
-            transpose_b=True,
-            is_expert=is_expert,
-            param_init_type=self.config.param_init_dtype,
-            compute_dtype=self.config.compute_dtype,
-            tp_group=self.tp,
-        )
-        self.mul = ops.Mul()
-        self.reshape = ops.Reshape()
-
-    def construct(self, x):
-        """ Construct function of mlp block. """
-        # [B, S, H] -> [B, S, ffn_H]
-        if self.mlp_has_gate:
-            if self.ffn_concat:
-                gate_hidden_out = self.w_gate_hidden(x)  # dp,1 -> dp, mp
-                gate_hidden_out_shape = gate_hidden_out.shape
-                reshape_out = self.reshape(gate_hidden_out,
-                                           (*gate_hidden_out_shape[:-1], self.ffn_hidden_size_per_partition, 2))
-                gate, hidden = mint.split(reshape_out,
-                                          (1, 1), -1)
-                gate = self.reshape(gate, (*gate_hidden_out_shape[:-1], self.ffn_hidden_size_per_partition))
-                hidden = self.reshape(hidden, (*gate_hidden_out_shape[:-1], self.ffn_hidden_size_per_partition))
-            else:
-                gate = self.w1(x)  # dp,1 -> dp, mp
-                hidden = self.w3(x)  # dp,1 -> dp, mp
-            gate = self.act_func(gate)
-            hidden = mint.mul(hidden, gate)
-        else:
-            hidden = self.w1(x)
-            hidden = self.act_func(hidden)
-
-        # [B, S, ffn_H] -> [B, S, H]
-        output = self.w2(hidden)
-        return output
-
-
-class CoreAttention(nn.Cell):
-    r"""
-    Get the weighted score along the seq_length.
-
-    Args:
-        layer_number (int): Number which indicates the index of this transformer layer in the
-            whole transformer block.
-        config (dict): Configuration.
-        attn_mask_type: Attention mask type. Not supported for now. Default: None.
-
-    Inputs:
-        - **query** (Tensor) - Tensor of query matrix.
-        - **key** (Tensor) - Tensor of key matrix.
-        - **value** (Tensor) - Tensor of value matrix.
-        - **attention_mask** (Tensor) - Tensor of attention mask matrix.
- - Outputs: - - **attn_output** (Tensor) - Tensor of shape :math:`(B, S, H)`. - - Supported Platforms: - ``Ascend`` - """ - - def __init__(self, layer_number, config, attn_mask_type=None): - super(CoreAttention, self).__init__() - if attn_mask_type: - raise NotImplementedError("For CoreAttention, `attn_mask_type` is not supported for now.") - self.config = config - self.layer_index = max(1, layer_number) - self.compute_dtype = self.config.compute_dtype - self.softmax_compute_dtype = self.config.softmax_compute_dtype - self.sequence_parallel = self.config.parallel_config.use_sequence_parallel - self.apply_query_key_layer_scaling = self.config.apply_query_key_layer_scaling - self.num_heads = self.config.num_heads - self.hidden_size = self.config.hidden_size - self.head_dim = divide(self.hidden_size, self.num_heads) - - coeff = None - norm_factor = math.sqrt(self.head_dim) - if self.apply_query_key_layer_scaling: - coeff = self.layer_index - norm_factor *= coeff - self.inv_norm_factor = Tensor(1.0 / norm_factor, dtype=self.compute_dtype) - - self.mask_func = get_attn_mask_func(self.config.mask_func_type) - self.scale_mask_softmax = ScaleMaskSoftmax(self.mask_func, - softmax_compute_type=self.softmax_compute_dtype) - - self.attention_dropout = mint.nn.Dropout(p=self.config.attention_dropout_rate) - - def construct(self, query_layer, key_layer, value_layer, attention_mask): - """ - Computes the attention scores, applies the attention mask, and returns the weighted - sum of the value layer based on the attention probabilities. - - Inputs: - ---------- - query_layer : Tensor - The query tensor of shape [B, N, S_q, D]. - key_layer : Tensor - The key tensor of shape [B, N, S_k, D]. - value_layer : Tensor - The value tensor of shape [B, N, S_k, D]. - attention_mask : Tensor - The attention mask tensor of shape [B, N, S_q, S_k]. - - Returns: - ------- - Tensor - The attention output tensor of shape [B, N, S_q, D]. 
-        """
-        # score shape: [B, N, S_q, S_k]
-        score = ops.bmm(query_layer, key_layer.transpose(0, 1, 3, 2))
-        score = mint.mul(score, self.inv_norm_factor)
-
-        # attention scores and attention mask [B, N, S_q, S_k]
-        attention_probs = self.scale_mask_softmax(score, attention_mask)
-
-        attention_probs = self.attention_dropout(attention_probs)
-
-        # [B, N, S_q, S_k] * [B, N, S_v, D] -> [B, N, S_q, D]
-        weighted_values = ops.bmm(attention_probs, value_layer)
-
-        return weighted_values
-
-
-class ParallelAttention(nn.Cell):
-    r"""
-    Parallel attention block.
-
-    Args:
-        layer_number (int): Number which indicates the index of this transformer layer in the
-            whole transformer block.
-        config (dict): Configuration.
-        attention_type (str): Attention type. Support ['self_attn', 'cross_attn']. Default: 'self_attn'.
-        model_comm_pgs (ModelCommProcessGroups, optional): Model communication process group.
-            Default: default_model_comm_pgs.
-
-    Inputs:
-        - **hidden_states** (Tensor) - Tensor of shape :math:`(B, S, H)`.
-        - **attention_mask** (Tensor) - Tensor of attention mask.
-        - **encoder_output** (Tensor) - Tensor of encoder output used for cross attention. Default: None.
-        - **rotary_pos_emb** (Tensor) - Tensor of rotary position embedding. Default: None.
-
-    Outputs:
-        - **output** (Tensor) - Tensor of shape :math:`(B, S, H)`.
- - Supported Platforms: - ``Ascend`` - """ - - def __init__(self, config, layer_number, attention_type="self_attn", attn_mask_type=None, - model_comm_pgs=default_model_comm_pgs): - super().__init__(config) - if attn_mask_type: - raise NotImplementedError("For ParallelAttention, `attn_mask_type` is not supported for now.") - self.config = config - self.layer_index = max(1, layer_number) - self.param_init_dtype = self.config.param_init_dtype - self.compute_dtype = self.config.compute_dtype - self.is_first_iteration = True - self.use_past = self.config.use_past - self.qkv_concat = self.config.qkv_concat - - self.attn_type = attention_type - self.num_heads = self.config.num_heads - self.kv_num_heads = self.num_heads if config.n_kv_heads is None else config.n_kv_heads - self.hidden_size = self.config.hidden_size - self.head_dim = divide(self.hidden_size, self.num_heads) - self.kv_hidden_size = self.head_dim * self.kv_num_heads - self.n_rep = divide(self.num_heads, self.kv_num_heads) - - self.sequence_parallel = self.config.parallel_config.use_sequence_parallel - self.use_flash_attention = self.config.use_flash_attention - self.norm_factor = math.sqrt(self.head_dim) - - self.tp = model_comm_pgs.tp - self.tp_group_size = self.tp.size - self.num_heads_per_partition = divide(self.num_heads, self.tp_group_size) - - self.use_gqa = (self.num_heads != self.kv_num_heads) - - if self.use_gqa: - self._check_gqa_valid() - self.kv_num_heads_per_partition = divide(self.kv_num_heads, self.tp_group_size) - self.repeat_num = divide(self.num_heads, self.kv_num_heads) - else: - self.kv_num_heads_per_partition = self.num_heads_per_partition - - if self.attn_type == "self_attn": - self._init_self_attn() - elif self.attn_type == "cross_attn": - self._init_cross_attn() - else: - raise NotImplementedError( - f"attention_type(str) should be 'self_attn' or 'cross_attn', but got {self.attn_type}") - self.reshape = ops.Reshape() - self.cast = ops.Cast() - self.wo = RowParallelLinear( - 
self.hidden_size, - self.hidden_size, - input_is_parallel=True, - config=self.config.parallel_config, - bias=self.config.out_proj_has_bias, - transpose_b=True, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - tp_group=self.tp, - ) - - if self.use_flash_attention: - input_layout = "TH" if self.use_past else "BNSD" - self.flash_attention = FlashAttention(head_num=self.num_heads_per_partition, - scale_value=1.0 / self.norm_factor, - next_tokens=0, - input_layout=input_layout) - else: - self.core_attention = CoreAttention(self.layer_index, self.config) - - if self.use_past: - if need_nz(): - kv_shape = (self.config.num_blocks, self.config.block_size, - self.kv_num_heads_per_partition * self.head_dim) - else: - kv_shape = (self.config.num_blocks, self.config.block_size, - self.kv_num_heads_per_partition, self.head_dim) - self.npu_mem_size = config.npu_mem_size if hasattr(config, "npu_mem_size") else 2 - self.paged_attention_mgr = ParallelPagedAttentionMgr(self.num_heads_per_partition, - self.head_dim, - self.kv_num_heads_per_partition, - kv_shape, - config.seq_length, - compute_dtype=self.compute_dtype, - npu_mem_size=self.npu_mem_size) - self.rotary_embedding = InferRotaryEmbedding(rotary_cos_format=2) - else: - self.apply_rotary_emb = RotaryEmbedding(self.head_dim, config.rotary_dtype) - - def construct(self, x, batch_valid_length, block_tables, slot_mapping, freqs_cis=None, - attn_mask=None, alibi_mask=None, encoder_output=None, prefix_keys_values=None, - q_seq_lens=None, key_cache=None, value_cache=None): - """Construct function of attention block.""" - # hidden states: [B, S, H] - # apply query, key, value projection - if self.attn_type == "self_attn": - if self.qkv_concat: - qkv = self.cast(self.w_qkv(x), self.compute_dtype) - reshape_qkv = self.reshape(qkv, - (-1, - self.kv_num_heads_per_partition, - (self.n_rep + 2) * self.head_dim)) - query, key, value = mint.split(reshape_qkv, - (self.head_dim * self.n_rep, - 
self.head_dim, - self.head_dim), -1) - if self.use_past: - query = self.reshape(query, (-1, self.hidden_size_per_partition)) - key = self.reshape(key, (-1, self.kv_hidden_size_per_partition)) - value = self.reshape(value, (-1, self.kv_hidden_size_per_partition)) - else: - query = self.cast(self.wq(x), self.compute_dtype) - key = self.cast(self.wk(x), self.compute_dtype) - value = self.cast(self.wv(x), self.compute_dtype) - if not self.use_past: - # [B, S, H] --> [B, S, N, D] - bs, seq_len, _ = x.shape - query = self.reshape(query, (bs, seq_len, self.num_heads_per_partition, self.head_dim)) - key = self.reshape(key, (bs, seq_len, self.kv_num_heads_per_partition, self.head_dim)) - value = self.reshape(value, (bs, seq_len, self.kv_num_heads_per_partition, self.head_dim)) - else: - query = self.cast(self.wq(x), self.compute_dtype) - if self.qkv_concat: - kv = self.cast(self.w_kv(encoder_output), self.compute_dtype) - key, value = mint.split(kv, (self.kv_hidden_size_per_partition, self.kv_hidden_size_per_partition), -1) - else: - key = self.cast(self.wk(encoder_output), self.compute_dtype) - value = self.cast(self.wv(encoder_output), self.compute_dtype) - - # qkv shape: [B, S, H] - if self.use_past: - if freqs_cis is not None: - query, key = self.rotary_embedding(query, key, freqs_cis, batch_valid_length) - - if prefix_keys_values is not None: - prefix_len = prefix_keys_values.shape[2] - slot_mapping = slot_mapping + self.cast(mint.ne(slot_mapping, -1), mstype.int32) * prefix_len - if self.is_first_iteration: - key, value = self._cat_prefix(key, value, prefix_keys_values) - - key_out = self.paged_attention_mgr(key, value, slot_mapping, batch_valid_length, - key_cache=key_cache, value_cache=value_cache) - query = ops.depend(query, key_out) - - if self.is_first_iteration: - if self.use_flash_attention: - context_layer = self.flash_attention(query, key, value, attn_mask, alibi_mask, None, None, - q_seq_lens, batch_valid_length) - else: - bs, seq_len, _ = x.shape - # [B, S, 
H] --> [B, S, N, D] - query = query.reshape(bs, seq_len, -1, self.head_dim) - key = key.reshape(bs, seq_len, -1, self.head_dim) - value = value.reshape(bs, seq_len, -1, self.head_dim) - # [B, S, N_kv, D] --> [B, S, N, D] - if self.use_gqa: - key = mint.repeat_interleave(key, repeats=self.repeat_num, dim=2) - value = mint.repeat_interleave(value, repeats=self.repeat_num, dim=2) - # [B, S, N, D] --> [B, N, S, D] - query = query.transpose(0, 2, 1, 3) - key = key.transpose(0, 2, 1, 3) - value = value.transpose(0, 2, 1, 3) - context_layer = self.core_attention(query, key, value, attn_mask) - # [B, N, S, D] --> [B, S, H] - context_layer = context_layer.transpose(0, 2, 1, 3).reshape( - bs, seq_len, self.hidden_size_per_partition) - else: - context_layer = self.paged_attention_mgr.paged_attn(query, batch_valid_length, block_tables, - attn_mask, q_seq_lens, key_cache, value_cache) - - # qkv shape: [B, S, N, D] - else: - bs, seq_len, _ = x.shape - # [B, S, N, D] --> [B, N, S, D] - query = query.transpose(0, 2, 1, 3) - key = key.transpose(0, 2, 1, 3) - value = value.transpose(0, 2, 1, 3) - if freqs_cis is not None: - query, key = self.apply_rotary_emb(query, key, freqs_cis) - if self.use_flash_attention: - if os.getenv('RUN_MODE') == 'predict': - raise NotImplementedError( - "Conflict detected in predict mode: " - "Flash Attention is incompatible when use_past=False") - context_layer = self.flash_attention(query, key, value, attn_mask) - else: - # [B, N_kv, S, D] --> [B, N, S, D] - if self.use_gqa: - key = mint.repeat_interleave(key, repeats=self.repeat_num, axis=1) - value = mint.repeat_interleave(value, repeats=self.repeat_num, axis=1) - context_layer = self.core_attention(query, key, value, attn_mask) - # [B, N, S, D] --> [B, S, H] - context_layer = context_layer.transpose(0, 2, 1, 3).reshape( - bs, seq_len, self.hidden_size_per_partition) - - # apply output projection - output = self.wo(context_layer) - output = self.cast(output, x.dtype) - - return output - - def 
_cat_prefix(self, key, value, prefix_keys_values): - """ - concat prefix_keys_values to key and value - prefix_keys_values: shape(2, bs, pre_len, num_heads * kv_channels) - """ - if prefix_keys_values is not None: - past_key = prefix_keys_values[0] - past_value = prefix_keys_values[1] - past_key = self.cast(past_key, key.dtype) - past_value = self.cast(past_value, value.dtype) - key = ops.concat((past_key, key), 1) - value = ops.concat((past_value, value), 1) - return key, value - - def _check_gqa_valid(self): - """check whether the config is valid for grouped-query-attention""" - if self.num_heads % self.kv_num_heads != 0: - raise ValueError( - f"num_heads must be divisible by kv_num_heads, " - f"but got num_heads {self.num_heads} and kv_num_heads {self.kv_num_heads}" - ) - if self.kv_num_heads % self.tp_group_size != 0: - raise ValueError( - f"kv_num_heads must be divisible by tp_group_size, " - f"but got kv_num_heads {self.kv_num_heads} and tp_group_size {self.tp_group_size}" - ) - - def _init_self_attn(self): - """init qkv linears of self-attention""" - self.hidden_size_per_partition = divide(self.hidden_size, self.tp_group_size) - self.kv_hidden_size_per_partition = divide(self.kv_hidden_size, self.tp_group_size) - if self.qkv_concat: - self.w_qkv = ColumnParallelLinear( - self.hidden_size, - self.hidden_size + 2 * self.kv_hidden_size, - config=self.config.parallel_config, - bias=self.config.qkv_has_bias, - gather_output=False, - transpose_b=True, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - tp_group=self.tp, - ) - else: - self.wq = ColumnParallelLinear( - self.hidden_size, - self.hidden_size, - config=self.config.parallel_config, - bias=self.config.qkv_has_bias, - gather_output=False, - transpose_b=True, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - tp_group=self.tp, - ) - self.wk = ColumnParallelLinear( - self.hidden_size, - self.kv_hidden_size, -
config=self.config.parallel_config, - bias=self.config.qkv_has_bias, - gather_output=False, - transpose_b=True, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - tp_group=self.tp, - ) - self.wv = ColumnParallelLinear( - self.hidden_size, - self.kv_hidden_size, - config=self.config.parallel_config, - bias=self.config.qkv_has_bias, - gather_output=False, - transpose_b=True, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - tp_group=self.tp, - ) - - def _init_cross_attn(self): - """init qkv linears of cross-attention""" - if self.hidden_size != self.kv_hidden_size: - raise ValueError("hidden_size must be equal to kv_hidden_size!") - self.wq = ColumnParallelLinear( - self.hidden_size, - self.hidden_size, - config=self.config.parallel_config, - bias=self.config.qkv_has_bias, - gather_output=False, - transpose_b=True, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - ) - if self.qkv_concat: - self.w_kv = ColumnParallelLinear( - self.hidden_size, - 2 * self.kv_hidden_size, - config=self.config.parallel_config, - bias=self.config.qkv_has_bias, - gather_output=False, - transpose_b=True, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - ) - else: - self.wk = ColumnParallelLinear( - self.hidden_size, - self.kv_hidden_size, - config=self.config.parallel_config, - bias=self.config.qkv_has_bias, - gather_output=False, - transpose_b=True, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - ) - self.wv = ColumnParallelLinear( - self.hidden_size, - self.kv_hidden_size, - config=self.config.parallel_config, - bias=self.config.qkv_has_bias, - gather_output=False, - transpose_b=True, - param_init_type=self.config.param_init_dtype, - compute_dtype=self.config.compute_dtype, - ) - - -class ParallelTransformerLayer(nn.Cell): - r""" - Single parallel transformer layer. 
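The self-attention init above sizes the fused QKV projection as `hidden_size + 2 * kv_hidden_size` output rows when `qkv_concat` is enabled. A minimal pure-Python sketch (with a hypothetical `split_qkv` helper and toy sizes, not the actual MindSpore tensor ops) of how such a fused output is split back into query, key, and value:

```python
def split_qkv(qkv_row, hidden_size, kv_hidden_size):
    """Split one fused-QKV output row of length hidden_size + 2 * kv_hidden_size."""
    assert len(qkv_row) == hidden_size + 2 * kv_hidden_size
    q = qkv_row[:hidden_size]                                  # query slice
    k = qkv_row[hidden_size:hidden_size + kv_hidden_size]      # key slice
    v = qkv_row[hidden_size + kv_hidden_size:]                 # value slice
    return q, k, v

q, k, v = split_qkv(list(range(8)), hidden_size=4, kv_hidden_size=2)
# q = [0, 1, 2, 3], k = [4, 5], v = [6, 7]
```

With GQA, `kv_hidden_size` is smaller than `hidden_size`, which is why the split sizes are asymmetric.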
- - Args: - config (dict): Configuration. - layer_number (int): Number which indicates the index of this transformer layer in the - whole transformer block. - model_comm_pgs (ModelCommProcessGroups, optional): Model communication process group. - Default: default_model_comm_pgs. - - Inputs: - - **x** (Tensor) - Tensor of shape :math:`(B, S, H)`. - - **attention_mask** (Tensor) - Tensor of attention mask. - - **rotary_pos_emb** (Tensor) - Tensor of rotary position embedding. Default: None. - - Outputs: - - **output** (Tensor) - Tensor of shape :math:`(B, S, H)`. - - Supported Platforms: - ``Ascend`` - """ - - def __init__( - self, - config, - layer_number: int, - layer_type=None, - self_attn_mask_type=None, - drop_path_rate: float = 0.0, - model_comm_pgs=default_model_comm_pgs, - ): - super().__init__(config) - if layer_type: - raise NotImplementedError("For ParallelTransformerLayer, only a decoder-only structure is supported for now.") - if self_attn_mask_type: - raise NotImplementedError("For ParallelTransformerLayer, `self_attn_mask_type` is not supported for now.") - if drop_path_rate > 0.0: - raise NotImplementedError( - "For ParallelTransformerLayer, `drop_path_rate > 0` is not supported for now, " - "but got `drop_path_rate={}`".format(drop_path_rate) - ) - self.config = config - self.apply_residual_connection_post_norm = self.config.apply_residual_connection_post_norm - # Normalize the input data. - self.attention_norm = RMSNorm(dim=config.hidden_size, - eps=config.layernorm_epsilon, - compute_type=config.layernorm_compute_dtype) - # Attention.
- self.attention = ParallelAttention(config, layer_number, model_comm_pgs=model_comm_pgs) - # Normalize the attention output - self.ffn_norm = RMSNorm(dim=config.hidden_size, - eps=config.layernorm_epsilon, - compute_type=config.layernorm_compute_dtype) - # MLP - self.feed_forward = ParallelMLP(config, model_comm_pgs=model_comm_pgs) - - def construct(self, x, freqs_cis=None, mask=None, batch_valid_length=None, block_tables=None, - slot_mapping=None, prefix_keys_values=None, q_seq_lens=None, key_cache=None, value_cache=None): - """Construct function of transformer layer.""" - # hidden states: [B, S, H] - # norm at the beginning of the transformer layer. - norm_output = self.attention_norm(x) - # attention. - attention_output = self.attention(norm_output, batch_valid_length, block_tables, slot_mapping, freqs_cis, - mask, prefix_keys_values=prefix_keys_values, - q_seq_lens=q_seq_lens, key_cache=key_cache, value_cache=value_cache) - # residual-connection. - if self.apply_residual_connection_post_norm: - residual = norm_output - else: - residual = x - norm_input = ops.add(residual, attention_output) - # layernorm post attention. - norm_output = self.ffn_norm(norm_input) - # MLP. - mlp_output = self.feed_forward(norm_output) - # residual-connection. - if self.apply_residual_connection_post_norm: - residual = norm_output - else: - residual = norm_input - output = ops.add(residual, mlp_output) - return output - - -class ParallelTransformer(nn.Cell): - r""" - Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`ParallelTransformerLayer`] - Args: - config: the config of transformer - model_comm_pgs (ModelCommProcessGroups, optional): Model communication process group. - Default: default_model_comm_pgs. 
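The pre-norm residual dataflow implemented by `ParallelTransformerLayer.construct` above (norm, attention, residual add, norm, MLP, residual add) can be sketched with scalar stand-ins for the sub-modules; the `layer_forward` helper and the identity/constant lambdas below are illustrative only, not the MindSpore cells:

```python
def layer_forward(x, norm, attention, mlp, post_norm_residual=False):
    """Pre-norm transformer layer: residual taken before (default) or after the norm."""
    norm_out = norm(x)
    attn_out = attention(norm_out)
    residual = norm_out if post_norm_residual else x   # apply_residual_connection_post_norm
    norm_in = residual + attn_out
    norm_out2 = norm(norm_in)
    mlp_out = mlp(norm_out2)
    residual = norm_out2 if post_norm_residual else norm_in
    return residual + mlp_out

# Identity norm with constant attention/MLP outputs makes the dataflow easy to trace:
out = layer_forward(1.0, norm=lambda v: v, attention=lambda v: 10.0, mlp=lambda v: 100.0)
# x + attn = 11.0, then + mlp = 111.0
```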
- - Returns: - output: Tensor, the output of the transformer layer - """ - - def __init__( - self, - config, - model_type=None, - layer_type=None, - self_attn_mask_type=None, - post_norm: bool = True, - pre_process=False, - post_process=False, - drop_path_rate: float = 0.0, - model_comm_pgs=default_model_comm_pgs, - ): - super().__init__(config) - if model_type: - raise NotImplementedError("For ParallelTransformer, 'model_type' is not supported for now.") - if layer_type: - raise NotImplementedError("For ParallelTransformer, 'layer_type' is not supported for now.") - if self_attn_mask_type: - raise NotImplementedError("For ParallelTransformer, 'self_attn_mask_type' is not supported for now.") - if pre_process: - raise NotImplementedError("For ParallelTransformer, 'pre_process' is not supported for now.") - if post_process: - raise NotImplementedError("For ParallelTransformer, 'post_process' is not supported for now.") - if drop_path_rate: - raise NotImplementedError("For ParallelTransformer, 'drop_path_rate' is not supported for now.") - self.config = config - self.post_norm = post_norm - self.head_dim = config.hidden_size // config.num_heads - self.num_layers = config.num_layers - self.use_past = config.use_past - self.is_first_iteration = True - self.use_flash_attention = config.use_flash_attention - self.compute_dtype = config.compute_dtype - - self.cast = ops.Cast() - self.shape = ops.Shape() - self.concat = ops.Concat(axis=-1)  # used by construct to prepend the prefix mask - - self.freqs_mgr = FreqsMgr(head_dim=self.head_dim, - seq_length=config.seq_length, - max_position_embedding=config.max_position_embedding, - rotary_dtype=config.rotary_dtype, - theta=config.theta, - scaling_factor=config.scaling_factor, - extend_method=config.extend_method, - parallel_config=config.parallel_config, - is_dynamic=config.is_dynamic) - self.casual_mask = LowerTriangularMaskWithDynamic(seq_length=config.seq_length, - compute_type=config.compute_dtype, - is_dynamic=config.is_dynamic, - pad_token_id=config.pad_token_id, - use_flash_attention=config.use_flash_attention, -
use_attn_mask_compression=config.use_attn_mask_compression, - use_past=config.use_past) - - self.tp = model_comm_pgs.tp - self.tp_group_size = self.tp.size - if config.parallel_config.vocab_emb_dp or self.tp_group_size == 1: - self.tok_embeddings = VocabEmbedding( - num_embeddings=config.vocab_size, - embedding_dim=config.hidden_size, - param_init_type=config.param_init_dtype, - param_init="normal", - ) - else: - self.tok_embeddings = VocabParallelEmbedding(num_embeddings=config.vocab_size, - embedding_dim=config.hidden_size, - parallel_config=config.parallel_config, - init_method="normal", - init_type=config.param_init_dtype, - tp_group=self.tp) - - self.layers = nn.CellList() - for layer_id in range(config.num_layers): - layer = ParallelTransformerLayer( - config=self.config, - layer_number=layer_id + 1, - model_comm_pgs=model_comm_pgs - ) - self.layers.append(layer) - - if self.post_norm: - # final layernorm before output. - self.norm_out = RMSNorm(dim=config.hidden_size, - eps=config.layernorm_epsilon, - compute_type=config.layernorm_compute_dtype) - - def construct(self, tokens: Tensor, batch_valid_length=None, batch_index=None, zactivate_len=None, - block_tables=None, slot_mapping=None, prefix_keys_values=None, position_ids=None, attention_mask=None, - q_seq_lens=None, key_cache=None, value_cache=None): - """ - Forward of ParallelTransformer. - - Args: - tokens: the tokenized inputs with datatype int32 - batch_valid_length (Tensor): the number of already-computed valid tokens in each sequence, with datatype int32; used for incremental - prediction. Tensor of shape :math:`(batch_size,)`. Default None. - block_tables (Tensor[int64]): Store mapping tables for each sequence. - slot_mapping (Tensor[int32]): Store token cache physical slot index.
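With flattened (packed) inputs, `batch_valid_length` documented above also determines where each sequence ends: `cumsum(batch_valid_length) - 1` gives the index of the last valid token per sequence, which is what the causal-LM head later gathers before computing logits. A hypothetical stdlib-only sketch of that indexing:

```python
def last_token_indices(batch_valid_length):
    """Flattened layout: sequence i occupies positions [cum_{i-1}, cum_i)."""
    indices, total = [], 0
    for n in batch_valid_length:
        total += n
        indices.append(total - 1)  # cumsum - 1
    return indices

last_token_indices([5, 3, 4])  # -> [4, 7, 11]
```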
- Returns: - output: Tensor, the output of ParallelTransformer - """ - # preprocess - mask = attention_mask - if self.use_past: - if self.is_first_iteration: - freqs_cis = self.freqs_mgr.prefill() - - if prefix_keys_values is not None: - bs, seq_len = self.shape(tokens) - if mask is None: - mask = self.casual_mask(tokens) - prefix_length = prefix_keys_values[0].shape[2] - prefix_mask = Tensor(np.zeros((bs, 1, seq_len, prefix_length)), dtype=mask.dtype) - mask = self.concat((prefix_mask, mask)) - else: - freqs_cis = self.freqs_mgr.chunk_with_decode(position_ids) - else: - bs, seq_len = self.shape(tokens) - mask = self.casual_mask(tokens) - freqs_cis = self.freqs_mgr(seq_len) - if prefix_keys_values is not None: - prefix_length = prefix_keys_values[0].shape[2] - prefix_mask = Tensor(np.zeros((bs, 1, seq_len, prefix_length)), dtype=mask.dtype) - mask = self.concat((prefix_mask, mask)) - - # tokens shape: [bs, seq / 1] - hidden_states = self.cast(self.tok_embeddings(tokens), self.compute_dtype) - # hidden states shape: [bs, seq / 1, hidden_dim] - for i in range(self.num_layers): - prefix_kv = prefix_keys_values[i] if prefix_keys_values is not None else None - key_cache_i = key_cache[i] if key_cache is not None else None - value_cache_i = value_cache[i] if value_cache is not None else None - hidden_states = self.layers[i](hidden_states, freqs_cis, mask, batch_valid_length=batch_valid_length, - block_tables=block_tables, slot_mapping=slot_mapping, - prefix_keys_values=prefix_kv, q_seq_lens=q_seq_lens, - key_cache=key_cache_i, value_cache=value_cache_i) - - if self.post_norm: - hidden_states = self.norm_out(hidden_states) - return hidden_states diff --git a/research/llama3_1/llama.py b/research/llama3_1/llama.py deleted file mode 100644 index 42dee2d2bf210dd09073a04f6e9fc0843a3cb9ea..0000000000000000000000000000000000000000 --- a/research/llama3_1/llama.py +++ /dev/null @@ -1,434 +0,0 @@ -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache 
License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# ============================================================================ -""" -DEPRECATED MODULE - -This module is deprecated and will be removed in future releases. -LLaMA models' APIs. -""" -from multiprocessing.managers import DictProxy -from multiprocessing.synchronize import Condition - -from safetensors import safe_open -import numpy as np - -import mindspore as ms -import mindspore.common.dtype as mstype -from mindspore import Tensor, ops, mint, mutable -from mindspore.communication._comm_helper import _is_initialized as mindspore_comm_has_init - -from mindspore.communication import get_group_size -from mindformers.parallel_core.process_group_config import ModelCommProcessGroups -from mindformers.parallel_core.inference.parallel_state import initialize_model_parallel, is_initialized -from mindformers.models.llama.llama import LlamaPreTrainedModel -from mindformers.modules import Linear -from mindformers.utils import deprecated -from mindformers.tools.register.register import MindFormerModuleType, MindFormerRegister -from mindformers.tools.utils import get_predict_run_mode -from mindformers.tools.logger import logger -from mindformers.models.utils import jit -from mindformers.generation.utils import convert_pin -from research.llama3_1.infer.layers import ColumnParallelLinear -from research.llama3_1.infer.transformer import ParallelTransformer -from research.llama3_1.utils import convert_model_config - - -@deprecated(reason="This method is 
rotten.", version="1.6.0") -@MindFormerRegister.register(MindFormerModuleType.MODELS) -class ParallelLlamaForCausalLM(LlamaPreTrainedModel): - r""" - Provide llama training loss or logits through network. - - Args: - config (LlamaConfig): The config of llama model. - - Returns: - output: Tensor, the output of llama decoderlayer - - """ - - def __init__(self, config): - super().__init__(config, auto_prefix=True) - self.config = convert_model_config(config) - if not is_initialized() and mindspore_comm_has_init(): - initialize_model_parallel(get_group_size(), order='tp') - if is_initialized(): - model_comm_pgs = ModelCommProcessGroups.use_parallel_state_groups(required_groups=['tp']) - else: - model_comm_pgs = ModelCommProcessGroups.get_default_model_comm_pgs() - self.ignore_token_id = config.ignore_token_id - self.pad_token_id = config.pad_token_id - self.use_past = config.use_past - self.vocab_size = config.vocab_size - self.is_first_iteration = True - - self.shape = ops.Shape() - self.reshape = ops.Reshape() - self.cast = ops.Cast() - self.slice = ops.StridedSlice() - self.not_equal = ops.NotEqual() - self.mul = ops.Mul() - self.add = ops.Add() - self.ones = ops.Ones() - self.gather = ops.Gather() - self.sub_batch_valid_len = ops.Sub() - self.model = ParallelTransformer(config=config, model_comm_pgs=model_comm_pgs) - if config.parallel_config.vocab_emb_dp: - self.lm_head = Linear( - in_channels=config.hidden_size, - out_channels=config.vocab_size, - weight_init="normal", - has_bias=False, - param_init_type=config.param_init_type, - compute_dtype=config.compute_dtype - ) - else: - self.lm_head = ColumnParallelLinear( - input_size=config.hidden_size, - output_size=config.vocab_size, - config=config.parallel_config, - bias=False, - gather_output=True, - param_init_type=config.param_init_dtype, - compute_dtype=config.compute_dtype, - tp_group=model_comm_pgs.tp, - ) - - self.load_checkpoint(config) - self.predict_run_mode = get_predict_run_mode() - - self.use_past = 
config.use_past - self.npu_mem_size = config.npu_mem_size if hasattr(config, "npu_mem_size") else 2 - - def prepare_inputs_for_predict_layout(self, input_ids, **kwargs): - """Get Llama model input tuple for transform ckpt.""" - input_ids = Tensor(input_ids, mstype.int32) - labels = Tensor(kwargs["labels"]) if "labels" in kwargs else None - bs, seq = input_ids.shape[0], input_ids.shape[1] - slot_mapping = Tensor(np.ones(shape=tuple([bs * seq])), mstype.int32) - prefix_keys_values = Tensor(kwargs["prefix_keys_values"]) if "prefix_keys_values" in kwargs else None - return input_ids, labels, None, None, None, None, None, None, None, None, None, slot_mapping, prefix_keys_values - - def prepare_inputs_for_generation(self, input_ids, **kwargs): - """ - prepare inputs for generation. - A model class needs to define a `prepare_inputs_for_generation` method - in order to use `.generate()` - - """ - model_inputs = {"input_ids": Tensor.from_numpy(input_ids.astype(np.int32))} - batch_valid_length = kwargs.get("valid_length_each_example") - prefill = kwargs.get("prefill") - - if self.config.is_dynamic and prefill and "origin_inputs" in kwargs: - origin_inputs = kwargs["origin_inputs"] - slot_mapping = kwargs.get("slot_mapping") - model_inputs = self._prepare_inputs_for_prefill_flatten(origin_inputs, - batch_valid_length, - slot_mapping, - model_inputs,) - position_ids = batch_valid_length.astype(np.int32) - 1 - model_inputs["position_ids"] = ms.Tensor(position_ids, dtype=ms.int32).reshape(-1) - - if not prefill: - q_seq_lens = np.ones(batch_valid_length.shape, dtype=np.int32).reshape(-1) - else: - q_seq_lens = batch_valid_length.astype(np.int32).reshape(-1) - model_inputs["q_seq_lens"] = convert_pin(Tensor.from_numpy(q_seq_lens)) - - model_inputs["attention_mask"] = self.model.casual_mask.gen_attention_mask(prefill) - model_inputs["need_flatten"] = True - return model_inputs - - def set_dynamic_inputs(self, **kwargs): - """Prepare inputs for dynamic shape.""" - dynamic_input_ids 
= Tensor(shape=[None], dtype=mstype.int32) - dynamic_batch_valid_length = Tensor(shape=[None], dtype=mstype.int32) - dynamic_block_tables = Tensor(shape=[None, None], dtype=mstype.int32) - dynamic_slot_mapping = Tensor(shape=[None], dtype=mstype.int32) - dynamic_position_ids = Tensor(shape=[None], dtype=mstype.int32) - dynamic_q_seq_lens = Tensor(shape=[None], dtype=mstype.int32) - dynamic_attention_mask = Tensor(shape=[None, None], dtype=self.config.compute_dtype) - have_prefix_keys_values = kwargs.get("have_prefix_keys_values", False)  # kwargs is a dict, so use .get(), not getattr() - - def get_input(): - if self.npu_mem_size > 0: - return None - cache_list = [] - for _ in self.model.layers: - cache_list.append(Tensor(shape=[None, None, None, None], dtype=self.config.compute_dtype)) - return mutable(cache_list) - key_cache = get_input() - value_cache = get_input() - if have_prefix_keys_values: - dynamic_prefix_keys_values = Tensor(shape=[2, None, None, None, None], dtype=mstype.float16) - self.set_inputs(dynamic_input_ids, None, None, None, None, None, None, - dynamic_batch_valid_length, None, None, dynamic_block_tables, - dynamic_slot_mapping, dynamic_prefix_keys_values, None, key_cache, value_cache) - else: - self.set_inputs(dynamic_input_ids, None, None, dynamic_position_ids, dynamic_attention_mask, None, None, - dynamic_batch_valid_length, None, None, dynamic_block_tables, - dynamic_slot_mapping, None, None, key_cache, value_cache, dynamic_q_seq_lens) - logger.info("Set dynamic input for llama.") - - def add_flags_custom(self, is_first_iteration): - """Add customized attributes for specific cells in the model.""" - self.add_flags(is_first_iteration=is_first_iteration) - self.model.add_flags(is_first_iteration=is_first_iteration) - for layer in self.model.layers: - layer.add_flags(is_first_iteration=is_first_iteration) - layer.attention.add_flags(is_first_iteration=is_first_iteration) - layer.attention.paged_attention_mgr.add_flags(is_first_iteration=is_first_iteration) - - @jit - def construct(self,
input_ids, labels=None, input_position=None, position_ids=None, attention_mask=None, - input_embeds=None, init_reset=None, batch_valid_length=None, batch_index=None, zactivate_len=None, - block_tables=None, slot_mapping=None, prefix_keys_values=None, llm_boost_inputs=None, - key_cache=None, value_cache=None, q_seq_lens=None): - """ - Forward of llama model. - """ - output = self.model(input_ids, batch_valid_length, batch_index, zactivate_len, block_tables, - slot_mapping, prefix_keys_values, key_cache=key_cache, value_cache=value_cache, - position_ids=position_ids, attention_mask=attention_mask, q_seq_lens=q_seq_lens) - pre_gather = (not self.use_past or self.is_first_iteration) and batch_valid_length is not None - if pre_gather: - batch_valid_length = mint.cumsum(batch_valid_length, 0) - output = self.gather(output, self.sub_batch_valid_len(batch_valid_length, 1), 0) - logits = self.lm_head(output) - - logits = self.cast(logits, mstype.float32) - if self.predict_run_mode: - return self.reshape(logits, (-1, logits.shape[-1])) - input_mask = self.cast(self.not_equal(input_ids, self.pad_token_id), mstype.float32) - return logits, input_ids, input_mask - - def kvcache(self, layer_idx): - key_cache = self.model.layers[layer_idx].attention.paged_attention_mgr.key_cache - value_cache = self.model.layers[layer_idx].attention.paged_attention_mgr.value_cache - return key_cache, value_cache - - @classmethod - def convert_name(cls, weight_name): - """convert HuggingFace weight name to MindFormers weight name""" - origin_name = weight_name - weight_name = weight_name.replace('embed_tokens.', 'tok_embeddings.') - weight_name = weight_name.replace('.self_attn.q_proj.', '.attention.wq.') - weight_name = weight_name.replace('.self_attn.k_proj.', '.attention.wk.') - weight_name = weight_name.replace('.self_attn.v_proj.', '.attention.wv.') - weight_name = weight_name.replace('.self_attn.o_proj.', '.attention.wo.') - weight_name = weight_name.replace('.mlp.gate_proj.', 
'.feed_forward.w1.') - weight_name = weight_name.replace('.mlp.down_proj.', '.feed_forward.w2.') - weight_name = weight_name.replace('.mlp.up_proj.', '.feed_forward.w3.') - weight_name = weight_name.replace('.input_layernorm.', '.attention_norm.') - weight_name = weight_name.replace('.post_attention_layernorm.', '.ffn_norm.') - weight_name = weight_name.replace('.norm.', '.norm_out.') - weight_name = weight_name.replace('output.', 'lm_head.') - weight_name = weight_name.replace('.tok_embeddings.weight', '.tok_embeddings.embedding_weight') - if weight_name == origin_name: - logger.warning(f"weight name '{weight_name}' does not change after conversion. " - f"Please check if it is as expected.") - return weight_name - - @classmethod - def convert_weight_dict(cls, source_dict, **kwargs): - """convert HuggingFace weight dict to MindFormers weight dict""" - model_config = kwargs.get("model_config") - qkv_concat = model_config.qkv_concat - target_dict = {} - wq_keys = [] - wk_keys = [] - wv_keys = [] - w1_keys = [] - w3_keys = [] - - for k, v in source_dict.items(): - k = cls.convert_name(k) - target_dict.update({k: v}) - if qkv_concat: - part = k.split('.') - if part[-2] == 'wq': - wq_keys.append(k) - if part[-2] == 'wk': - wk_keys.append(k) - if part[-2] == 'wv': - wv_keys.append(k) - if part[-2] == 'w1': - w1_keys.append(k) - if part[-2] == 'w3': - w3_keys.append(k) - - if qkv_concat: - qkv_dict = kwargs.get('qkv_dict', None) - if not isinstance(qkv_dict, DictProxy): - raise ValueError(f'qkv_queue must be a queue, when qkv_concat is True, but got {qkv_dict}.') - condition = kwargs.get('condition', None) - if not isinstance(condition, Condition): - raise ValueError(f'condition must be a Condition, when qkv_concat is True, but got {condition}.') - _concat_qkv_weight(wq_keys, wk_keys, wv_keys, model_config, qkv_dict, condition, target_dict) - _concat_ffn_weight(w1_keys, w3_keys, model_config, qkv_dict, condition, target_dict) - - return target_dict - - @classmethod - def 
convert_map_dict(cls, source_dict, **kwargs): - """convert HuggingFace map dict to MindFormers map dict""" - qkv_concat = kwargs.pop("qkv_concat", False) - target_dict = {} - wq_keys = [] - w1_keys = [] - - for k, v in source_dict.items(): - k = cls.convert_name(k) - target_dict.update({k: v}) - if qkv_concat: - part = k.split('.') - if part[-2] == 'wq': - wq_keys.append(k) - if part[-2] == 'w1': - w1_keys.append(k) - - if qkv_concat: - for wq_key in wq_keys: - wk_key = wq_key.replace('wq', 'wk') - wv_key = wq_key.replace('wq', 'wv') - wq_value = target_dict.pop(wq_key) - target_dict.pop(wk_key) - target_dict.pop(wv_key) - - w_qkv_key = wq_key.replace('wq', 'w_qkv') - w_qkv_value = wq_value - target_dict.update({w_qkv_key: w_qkv_value}) - for w1_key in w1_keys: - w3_key = w1_key.replace('w1', 'w3') - w1_value = target_dict.pop(w1_key) - target_dict.pop(w3_key) - - w_gate_hidden_key = w1_key.replace('w1', 'w_gate_hidden') - w_gate_hidden_value = w1_value - target_dict.update({w_gate_hidden_key: w_gate_hidden_value}) - - return target_dict - - @classmethod - def obtain_qkv_ffn_concat_keys(cls): - qkv_key = "w_qkv" - ffn_key = "w_gate_hidden" - concat_keys = [qkv_key, ffn_key] - logger.info(f"{cls.__name__} qkv/ffn concat keys are {concat_keys}") - return concat_keys - - @classmethod - def obtain_name_map(cls, load_checkpoint_files): - name_map = dict() - for checkpoint_file in load_checkpoint_files: - with safe_open(checkpoint_file, framework="np") as f: - for k in f.keys(): - name_map.update({cls.convert_name(k): k}) - return name_map - - def clear_kv_cache(self): - return self.model.clear_kv_cache() - - -def _concat_qkv_weight(wq_keys, wk_keys, wv_keys, model_config, qkv_dict, condition, target_dict): - """concat qkv weight from dicts""" - from mindformers.utils.convert_utils import qkv_concat_hf2mg - - num_heads = model_config.num_heads - n_kv_heads = model_config.n_kv_heads or num_heads - hidden_size = model_config.hidden_size - - # pop extra weight to shared 
dict if there is no corresponding weight for concat in the target dict - for wk_key in wk_keys: - wq_key = wk_key.replace('wk', 'wq') - if wq_key not in wq_keys: - with condition: - qkv_dict[wk_key] = target_dict.pop(wk_key) # add extra weight to shared dict - condition.notify_all() - for wv_key in wv_keys: - wq_key = wv_key.replace('wv', 'wq') - if wq_key not in wq_keys: - with condition: - qkv_dict[wv_key] = target_dict.pop(wv_key) # add extra weight to shared dict - condition.notify_all() - - # concat qkv - for wq_key in wq_keys: - wk_key = wq_key.replace('wq', 'wk') - wv_key = wq_key.replace('wq', 'wv') - wq_value = target_dict.pop(wq_key) - wk_value = target_dict.pop(wk_key, None) - wv_value = target_dict.pop(wv_key, None) - - # get missing weight from shared dict - if wk_value is None: - with condition: - condition.wait_for(lambda: wk_key in qkv_dict.keys()) - wk_value = qkv_dict.pop(wk_key) - if wv_value is None: - with condition: - condition.wait_for(lambda: wv_key in qkv_dict.keys()) - wv_value = qkv_dict.pop(wv_key) - - w_qkv_key = wq_key.replace('wq', 'w_qkv') - w_qkv_value = np.concatenate((wq_value, wk_value, wv_value), 0) - # qkv weight format: hf -> mg - w_qkv_value_mg = qkv_concat_hf2mg(w_qkv_value, num_heads, n_kv_heads, hidden_size) - target_dict.update({w_qkv_key: w_qkv_value_mg}) - - -def _concat_ffn_weight(w1_keys, w3_keys, model_config, qkv_dict, condition, target_dict): - """concat ffn weight from dicts""" - from mindformers.utils.convert_utils import ffn_concat_hf2mg - - intermediate_size = model_config.intermediate_size - ffn_dim_multiplier = model_config.ffn_dim_multiplier - multiple_of = model_config.multiple_of or 256 - ffn_hidden_size = model_config.hidden_size * 4 - if intermediate_size is not None: - ffn_hidden_size = intermediate_size - else: - if ffn_dim_multiplier is not None: - ffn_hidden_size = int((ffn_dim_multiplier + 0.01) * ffn_hidden_size) - ffn_hidden_size = int(2 * ffn_hidden_size / 3) - ffn_hidden_size = multiple_of * \ - 
((ffn_hidden_size + multiple_of - 1) // multiple_of) - - # pop extra weight to shared dict if there is no corresponding weight for concat in the target dict - for w3_key in w3_keys: - w1_key = w3_key.replace('w3', 'w1') - if w1_key not in w1_keys: - with condition: - qkv_dict[w3_key] = target_dict.pop(w3_key) # add extra weight to shared dict - condition.notify_all() - - # concat ffn - for w1_key in w1_keys: - w3_key = w1_key.replace('w1', 'w3') - w1_value = target_dict.pop(w1_key) - w3_value = target_dict.pop(w3_key, None) - - # get missing weight from shared dict - if w3_value is None: - with condition: - condition.wait_for(lambda: w3_key in qkv_dict.keys()) - w3_value = qkv_dict.pop(w3_key) - - w_gate_hidden_key = w1_key.replace('w1', 'w_gate_hidden') - w_gate_hidden_value = np.concatenate((w1_value, w3_value), 0) - # ffn weight format: hf -> mg - w_gate_hidden_value_mg = ffn_concat_hf2mg(w_gate_hidden_value, ffn_hidden_size) - target_dict.update({w_gate_hidden_key: w_gate_hidden_value_mg}) diff --git a/research/llama3_1/llama3_1_70b/finetune_llama3_1_70b.yaml b/research/llama3_1/llama3_1_70b/finetune_llama3_1_70b.yaml deleted file mode 100644 index 71fba283777a8f4c6f17e431809c40409ae10885..0000000000000000000000000000000000000000 --- a/research/llama3_1/llama3_1_70b/finetune_llama3_1_70b.yaml +++ /dev/null @@ -1,163 +0,0 @@ -seed: 0 -output_dir: './output' # path to save checkpoint/strategy -load_checkpoint: '' -src_strategy_path_or_dir: '' -auto_trans_ckpt: False # If true, auto transform load_checkpoint to load in distributed model -only_save_strategy: False -resume_training: False -run_mode: 'train' -load_ckpt_format: 'ckpt' # recommend use 'safetensors' - -# trainer config -trainer: - type: CausalLanguageModelingTrainer - model_name: 'llama3_1_70b' - -# runner config -runner_config: - epochs: 2 - batch_size: 1 - sink_mode: True - sink_size: 1 - -# optimizer -optimizer: - type: AdamW - betas: [0.9, 0.999] - eps: 1.e-8 - -# lr schedule -lr_schedule: - type: 
CosineWithWarmUpLR - learning_rate: 1.e-5 - lr_end: 0.0 - warmup_ratio: 0.03 - total_steps: -1 # -1 means it will load the total steps of the dataset - -# dataset -train_dataset: &train_dataset - data_loader: - type: MindDataset - dataset_dir: "" - shuffle: True - input_columns: ["input_ids", "labels"] # "input_ids", "labels" , labels are used in instruction finetune. - num_parallel_workers: 8 - python_multiprocessing: False - drop_remainder: True - numa_enable: False - prefetch_size: 1 -train_dataset_task: - type: CausalLanguageModelDataset - dataset_config: *train_dataset - -use_parallel: True -# parallel context config -parallel: - parallel_mode: 1 # 0-data parallel, 1-semi-auto parallel, 2-auto parallel, 3-hybrid parallel - gradients_mean: False - enable_alltoall: False - full_batch: True - search_mode: "sharding_propagation" - enable_parallel_optimizer: True - strategy_ckpt_save_file: "./ckpt_strategy.ckpt" - parallel_optimizer_config: - gradient_accumulation_shard: False - parallel_optimizer_threshold: 64 -# default parallel of device num = 8 for Atlas 800T A2 -parallel_config: - data_parallel: 1 - model_parallel: 8 - pipeline_stage: 8 - use_seq_parallel: True - micro_batch_num: 256 - vocab_emb_dp: False - gradient_aggregation_group: 4 -# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process. 
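As a sanity check on the parallel settings above (this follows the usual MindFormers convention for semi-auto parallel with gradient accumulation; the formula is an assumption, not stated in this YAML): the effective global batch size is roughly `batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num`.

```python
def global_batch_size(batch_size, data_parallel, micro_batch_num,
                      micro_batch_interleave_num=1):
    """Effective global batch size under dp + micro-batch accumulation (assumed formula)."""
    return batch_size * data_parallel * micro_batch_num * micro_batch_interleave_num

global_batch_size(batch_size=1, data_parallel=1, micro_batch_num=256)
# -> 256 for the 70B finetune config in this file
```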
-micro_batch_interleave_num: 1 - -# recompute config -recompute_config: - recompute: False - select_recompute: [10,8,6,4,2,0,0,0] - select_comm_recompute: [10,8,6,4,2,0,0,0] - parallel_optimizer_comm_recompute: False - mp_comm_recompute: True - recompute_slice_activation: True - -# callbacks -callbacks: - - type: MFLossMonitor - - type: CheckpointMonitor - prefix: "llama3_1_70b" - checkpoint_format: "ckpt" # recommend use 'safetensors' - save_checkpoint_steps: 10000 - integrated_save: False - -# mindspore context init config -context: - mode: 0 #0--Graph Mode; 1--Pynative Mode - device_target: "Ascend" - max_call_depth: 10000 - max_device_memory: "52.5GB" - mempool_block_size: "52.5GB" - save_graphs: False - save_graphs_path: "./graph" - device_id: 0 - jit_config: - jit_level: "O1" - memory_optimize_level: "O0" - -# model config -model: - model_config: - type: LlamaConfig - batch_size: 1 # add for increase predict - seq_length: 8192 - hidden_size: 8192 - num_layers: 80 - num_heads: 64 - n_kv_heads: 8 - ffn_dim_multiplier: 1.3 - multiple_of: 256 - vocab_size: 128256 - rms_norm_eps: 1.0e-5 - bos_token_id: 128000 - eos_token_id: 128001 - pad_token_id: 128002 - ignore_token_id: -100 - compute_dtype: "bfloat16" - layernorm_compute_type: "float32" - softmax_compute_type: "float32" - rotary_dtype: "float32" - param_init_type: "float32" - use_past: False - scaling_factor: 1.0 - theta: 500000 - extend_method: "None" # support "None", "PI", "NTK" - use_flash_attention: True # FA can accelerate training or finetune - offset: 0 - fine_grain_interleave: 2 - checkpoint_name_or_path: "" - repetition_penalty: 1 - max_decode_length: 512 - top_k: 3 - top_p: 1 - do_sample: False - arch: - type: LlamaForCausalLM - -# wrapper cell config -runner_wrapper: - type: MFTrainOneStepCell - scale_sense: 1.0 - use_clip_grad: True - -profile: False -profile_start_step: 4 -profile_stop_step: 8 -init_start_profile: False -profile_communication: False -profile_memory: True -layer_scale: False 
-layer_decay: 0.65 -lr_scale_factor: 256 diff --git a/research/llama3_1/llama3_1_70b/predict_llama3_1_70b.yaml b/research/llama3_1/llama3_1_70b/predict_llama3_1_70b.yaml deleted file mode 100644 index 58cffa961446efbee9fd24e12b16dc288d67e8d4..0000000000000000000000000000000000000000 --- a/research/llama3_1/llama3_1_70b/predict_llama3_1_70b.yaml +++ /dev/null @@ -1,135 +0,0 @@ -seed: 0 -output_dir: './output' # path to save checkpoint/strategy -load_checkpoint: '' -src_strategy_path_or_dir: '' -auto_trans_ckpt: False # If true, auto transform load_checkpoint to load in distributed model -only_save_strategy: False -resume_training: False -run_mode: 'predict' - -# trainer config -trainer: - type: CausalLanguageModelingTrainer - model_name: 'llama3_1_70b' - -# runner config -runner_config: - epochs: 2 - batch_size: 1 - sink_mode: True - sink_size: 1 - -use_parallel: True -# parallel context config -parallel: - parallel_mode: 1 # 0-data parallel, 1-semi-auto parallel, 2-auto parallel, 3-hybrid parallel - gradients_mean: False - enable_alltoall: False - full_batch: True - search_mode: "sharding_propagation" - enable_parallel_optimizer: False - strategy_ckpt_save_file: "./ckpt_strategy.ckpt" - parallel_optimizer_config: - gradient_accumulation_shard: False - parallel_optimizer_threshold: 64 -# default parallel of device num = 8 for Atlas 800T A2 -parallel_config: - data_parallel: 1 - model_parallel: 4 - pipeline_stage: 1 - use_seq_parallel: False - micro_batch_num: 1 - vocab_emb_dp: True - gradient_aggregation_group: 4 -# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process. 
-micro_batch_interleave_num: 1 - -# mindspore context init config -context: - mode: 0 #0--Graph Mode; 1--Pynative Mode - device_target: "Ascend" - max_call_depth: 10000 - max_device_memory: "58GB" - save_graphs: False - save_graphs_path: "./graph" - device_id: 0 - -# model config -model: - model_config: - type: LlamaConfig - batch_size: 1 # add for increase predict - seq_length: 8192 - hidden_size: 8192 - num_layers: 80 - num_heads: 64 - n_kv_heads: 8 - ffn_dim_multiplier: 1.3 - multiple_of: 256 - vocab_size: 128256 - rms_norm_eps: 1.0e-5 - bos_token_id: 128000 - eos_token_id: 128001 - pad_token_id: 128002 - ignore_token_id: -100 - compute_dtype: "float16" - layernorm_compute_type: "float32" - softmax_compute_type: "float32" - rotary_dtype: "float32" - param_init_type: "bfloat16" - is_dynamic: True - theta: 500000 - max_position_embedding: 131072 - extend_method: "LLAMA3" # support "None", "PI", "NTK", "LLAMA3" - scaling_factor: - factor: 8.0 - low_freq_factor: 1.0 - high_freq_factor: 4.0 - original_max_position_embeddings: 8192 - use_past: True - use_flash_attention: True # FA can accelerate training or finetune - offset: 0 - checkpoint_name_or_path: "" - repetition_penalty: 1 - max_decode_length: 512 - block_size: 16 - num_blocks: 512 - top_k: 3 - top_p: 1 - do_sample: False - auto_map: - AutoTokenizer: [ llama3_1_tokenizer.Llama3Tokenizer, null ] - arch: - type: LlamaForCausalLM - -processor: - return_tensors: ms - tokenizer: - model_max_length: 8192 - vocab_file: "/path/tokenizer.model" - pad_token: "<|reserved_special_token_0|>" - type: Llama3Tokenizer - auto_register: llama3_1_tokenizer.Llama3Tokenizer - type: LlamaProcessor - -# metric -metric: - type: PerplexityMetric - - -auto_tune: False -filepath_prefix: './autotune' -autotune_per_step: 10 - -profile: False -profile_start_step: 4 -profile_stop_step: 8 -init_start_profile: False -profile_communication: False -profile_memory: True -layer_scale: False -layer_decay: 0.65 -lr_scale_factor: 256 - -# aicc 
-remote_save_url: "Please input obs url on AICC platform." diff --git a/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml b/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml deleted file mode 100644 index d3ee286e8a9a8853f2ffeac34c694db2b5902eca..0000000000000000000000000000000000000000 --- a/research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml +++ /dev/null @@ -1,163 +0,0 @@ -seed: 0 -output_dir: './output' # path to save checkpoint/strategy -load_checkpoint: '' -src_strategy_path_or_dir: '' -auto_trans_ckpt: False # If true, auto transform load_checkpoint to load in distributed model -only_save_strategy: False -resume_training: False -run_mode: 'train' -load_ckpt_format: 'ckpt' # recommend use 'safetensors' - -# trainer config -trainer: - type: CausalLanguageModelingTrainer - model_name: 'llama3_1_8b' - -# runner config -runner_config: - epochs: 2 - batch_size: 1 - sink_mode: True - sink_size: 1 - -# optimizer -optimizer: - type: AdamW - betas: [0.9, 0.95] - eps: 1.e-8 - -# lr schedule -lr_schedule: - type: CosineWithWarmUpLR - learning_rate: 1.e-5 - lr_end: 0.0 - warmup_ratio: 0.03 - total_steps: -1 # -1 means it will load the total steps of the dataset - -# dataset -train_dataset: &train_dataset - data_loader: - type: MindDataset - dataset_dir: "" - shuffle: True - input_columns: ["input_ids", "labels"] # "input_ids", "labels" , labels are used in instruction finetune. - num_parallel_workers: 8 - python_multiprocessing: False - drop_remainder: True - numa_enable: False - prefetch_size: 1 -train_dataset_task: - type: CausalLanguageModelDataset - dataset_config: *train_dataset -# if True, do evaluate during the training process. if false, do nothing. -# note that the task trainer should support _evaluate_in_training function. 
- -use_parallel: True -# parallel context config -parallel: - parallel_mode: 1 # 0-data parallel, 1-semi-auto parallel, 2-auto parallel, 3-hybrid parallel - gradients_mean: False - enable_alltoall: False - full_batch: True - search_mode: "sharding_propagation" - enable_parallel_optimizer: True - strategy_ckpt_save_file: "./ckpt_strategy.ckpt" - parallel_optimizer_config: - gradient_accumulation_shard: False - parallel_optimizer_threshold: 64 -# default parallel of device num = 8 for Atlas 800T A2 -parallel_config: - data_parallel: 8 - model_parallel: 1 - pipeline_stage: 1 - use_seq_parallel: False - micro_batch_num: 1 - vocab_emb_dp: True - gradient_aggregation_group: 4 -# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process. -micro_batch_interleave_num: 1 - -# recompute config -recompute_config: - recompute: True - select_recompute: False - parallel_optimizer_comm_recompute: False - mp_comm_recompute: True - recompute_slice_activation: True - -# callbacks -callbacks: - - type: MFLossMonitor - - type: CheckpointMonitor - checkpoint_format: "ckpt" # recommend use 'safetensors' - prefix: "llama3_1_8b" - save_checkpoint_steps: 10000 - integrated_save: False - -# mindspore context init config -context: - mode: 0 #0--Graph Mode; 1--Pynative Mode - device_target: "Ascend" - max_call_depth: 10000 - max_device_memory: "58GB" - save_graphs: False - save_graphs_path: "./graph" - device_id: 0 - jit_config: - jit_level: "O1" - memory_optimize_level: "O0" - -# model config -model: - model_config: - type: LlamaConfig - batch_size: 1 # add for increase predict - seq_length: 8192 - hidden_size: 4096 - num_layers: 32 - num_heads: 32 - n_kv_heads: 8 - vocab_size: 128256 - intermediate_size: 14336 - rms_norm_eps: 1.0e-5 - bos_token_id: 128000 - eos_token_id: 128001 - pad_token_id: 128002 - ignore_token_id: -100 - compute_dtype: "bfloat16" - layernorm_compute_type: "float32" - softmax_compute_type: "float32" - 
rotary_dtype: "float32" - param_init_type: "float16" # for stable training, suggest configuring float32 - embedding_init_type: "bfloat16" - use_past: False - scaling_factor: 1.0 - theta: 500000 - extend_method: "None" # support "None", "PI", "NTK" - use_flash_attention: True # FA can accelerate training or finetune - offset: 0 - fine_grain_interleave: 1 - checkpoint_name_or_path: "" - repetition_penalty: 1 - max_decode_length: 512 - top_k: 3 - top_p: 1 - do_sample: False - arch: - type: LlamaForCausalLM - -# wrapper cell config -runner_wrapper: - type: MFTrainOneStepCell - scale_sense: 1.0 - use_clip_grad: True - -profile: False -profile_start_step: 4 -profile_stop_step: 8 -init_start_profile: False -profile_communication: False -profile_memory: True -layer_scale: False -layer_decay: 0.65 -lr_scale_factor: 256 diff --git a/research/llama3_1/llama3_1_8b/predict_llama3_1_8b.yaml b/research/llama3_1/llama3_1_8b/predict_llama3_1_8b.yaml deleted file mode 100644 index 4b8d1c2b13949e6217a2def983bb8013ebed3dc0..0000000000000000000000000000000000000000 --- a/research/llama3_1/llama3_1_8b/predict_llama3_1_8b.yaml +++ /dev/null @@ -1,135 +0,0 @@ -seed: 0 -output_dir: './output' # path to save checkpoint/strategy -load_checkpoint: '' -src_strategy_path_or_dir: '' -auto_trans_ckpt: False # If true, auto transform load_checkpoint to load in distributed model -only_save_strategy: False -resume_training: False -run_mode: 'predict' - -# trainer config -trainer: - type: CausalLanguageModelingTrainer - model_name: 'llama3_1_8b' - -# runner config -runner_config: - epochs: 2 - batch_size: 1 - sink_mode: True - sink_size: 1 - -use_parallel: False -# parallel context config -parallel: - parallel_mode: 1 # 0-data parallel, 1-semi-auto parallel, 2-auto parallel, 3-hybrid parallel - gradients_mean: False - enable_alltoall: False - full_batch: True - search_mode: "sharding_propagation" - enable_parallel_optimizer: False - strategy_ckpt_save_file: "./ckpt_strategy.ckpt" - 
parallel_optimizer_config: - gradient_accumulation_shard: False - parallel_optimizer_threshold: 64 -# default parallel of device num = 8 for Atlas 800T A2 -parallel_config: - data_parallel: 1 - model_parallel: 1 - pipeline_stage: 1 - use_seq_parallel: False - micro_batch_num: 1 - vocab_emb_dp: True - gradient_aggregation_group: 4 -# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process. -micro_batch_interleave_num: 1 - -# mindspore context init config -context: - mode: 0 #0--Graph Mode; 1--Pynative Mode - device_target: "Ascend" - max_call_depth: 10000 - max_device_memory: "58GB" - save_graphs: False - save_graphs_path: "./graph" - device_id: 0 - -# model config -model: - model_config: - type: LlamaConfig - batch_size: 1 # add for increase predict - seq_length: 512 - hidden_size: 4096 - num_layers: 32 - num_heads: 32 - n_kv_heads: 8 - vocab_size: 128256 - intermediate_size: 14336 - rms_norm_eps: 1.0e-5 - bos_token_id: 128000 - eos_token_id: 128001 - pad_token_id: 128002 - ignore_token_id: -100 - max_position_embedding: 131072 - compute_dtype: "float16" - layernorm_compute_type: "float32" - softmax_compute_type: "float32" - rotary_dtype: "float32" - param_init_type: "bfloat16" - use_past: True - is_dynamic: True - theta: 500000 - extend_method: "LLAMA3" # support "None", "PI", "NTK", "LLAMA3" - scaling_factor: - factor: 8.0 - low_freq_factor: 1.0 - high_freq_factor: 4.0 - original_max_position_embeddings: 8192 - use_flash_attention: True # FA can accelerate training or finetune - offset: 0 - fine_grain_interleave: 1 - checkpoint_name_or_path: "" - repetition_penalty: 1 - max_decode_length: 512 - block_size: 16 - num_blocks: 512 - top_k: 3 - top_p: 1 - do_sample: False - auto_map: - AutoTokenizer: [ llama3_1_tokenizer.Llama3Tokenizer, null ] - arch: - type: LlamaForCausalLM - -processor: - return_tensors: ms - tokenizer: - model_max_length: 8192 - vocab_file: "/path/tokenizer.model" - pad_token: 
"<|reserved_special_token_0|>" - type: Llama3Tokenizer - auto_register: llama3_1_tokenizer.Llama3Tokenizer - type: LlamaProcessor - -# metric -metric: - type: PerplexityMetric - - -auto_tune: False -filepath_prefix: './autotune' -autotune_per_step: 10 - -profile: False -profile_start_step: 4 -profile_stop_step: 8 -init_start_profile: False -profile_communication: False -profile_memory: True -layer_scale: False -layer_decay: 0.65 -lr_scale_factor: 256 - -# aicc -remote_save_url: "Please input obs url on AICC platform." diff --git a/research/llama3_1/llama3_1_conversation.py b/research/llama3_1/llama3_1_conversation.py deleted file mode 100644 index c4e466a411a664c2eac4f47b411ea52de0dec96d..0000000000000000000000000000000000000000 --- a/research/llama3_1/llama3_1_conversation.py +++ /dev/null @@ -1,184 +0,0 @@ -# Adapted from lm-sys@FastChat. Below is the original copyright: -# Copyright 2023 Wei-Lin Chiang, Lianmin Zheng, Ying Sheng -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -"""conversation prompt templates""" - -import dataclasses -from enum import auto, Enum -from typing import List, Any - - -class SeparatorStyle(Enum): - """Different separator style.""" - - ADD_COLON_SINGLE = auto() - ADD_COLON_TWO = auto() - NO_COLON_SINGLE = auto() - BAIZE = auto() - DOLLY = auto() - RWKV = auto() - - -@dataclasses.dataclass -class Conversation: - """A class that keeps all conversation history.""" - - # System prompts - system: str - # Two roles - roles: List[str] - # All messages - messages: List[List[str]] - # Offset of few shot examples - offset: int - # Separator - sep_style: SeparatorStyle - sep: str - sep2: str = None - # Stop criteria (the default one is EOS token) - stop_str: str = None - # Stops generation if meeting any token in this list - stop_token_ids: List[int] = None - - # Used for the state in the gradio servers. - conv_id: Any = None - skip_next: bool = False - model_name: str = None - - def get_prompt(self): - """Get the prompt for generation.""" - if self.sep_style == SeparatorStyle.ADD_COLON_SINGLE: - ret = self.system + self.sep - for role, message in self.messages: - if message: - ret += role + ": " + message + self.sep - else: - ret += role + ":" - return ret - if self.sep_style == SeparatorStyle.ADD_COLON_TWO: - seps = [self.sep, self.sep2] - ret = self.system + seps[0] - for i, (role, message) in enumerate(self.messages): - if message: - ret += role + ": " + message + seps[i % 2] - else: - ret += role + ":" - return ret - raise ValueError(f"Invalid style: {self.sep_style}") - - def append_message(self, role, message): - """Append a new message.""" - self.messages.append([role, message]) - - def to_openai_api_messages(self): - """Convert the conversation to OpenAI chat completion format.""" - ret = [{"role": "system", "content": self.system}] - - for i, (_, msg) in enumerate(self.messages[self.offset:]): - if i % 2 == 0: - ret.append({"role": "user", "content": msg}) - else: - if msg is not None: - ret.append({"role": 
"assistant", "content": msg}) - return ret - - def copy(self): - return Conversation( - system=self.system, - roles=self.roles, - messages=[[x, y] for x, y in self.messages], - offset=self.offset, - sep_style=self.sep_style, - sep=self.sep, - sep2=self.sep2, - stop_str=self.stop_str, - stop_token_ids=self.stop_token_ids, - conv_id=self.conv_id, - model_name=self.model_name, - ) - - def dict(self): - return { - "system": self.system, - "roles": self.roles, - "messages": self.messages, - "offset": self.offset, - "conv_id": self.conv_id, - "model_name": self.model_name, - } - - -# A template with one conversation example -conv_one_shot = Conversation( - system="A chat between a curious human and an artificial intelligence assistant. " - "The assistant gives helpful, detailed, and polite answers to the human's questions.", - roles=("Human", "Assistant"), - messages=( - ( - "Human", - "What are the key differences between renewable and non-renewable energy sources?", - ), - ( - "Assistant", - "Renewable energy sources are those that can be replenished naturally in a relatively " - "short amount of time, such as solar, wind, hydro, geothermal, and biomass. " - "Non-renewable energy sources, on the other hand, are finite and will eventually be " - "depleted, such as coal, oil, and natural gas. Here are some key differences between " - "renewable and non-renewable energy sources:\n" - "1. Availability: Renewable energy sources are virtually inexhaustible, while non-renewable " - "energy sources are finite and will eventually run out.\n" - "2. Environmental impact: Renewable energy sources have a much lower environmental impact " - "than non-renewable sources, which can lead to air and water pollution, greenhouse gas emissions, " - "and other negative effects.\n" - "3. Cost: Renewable energy sources can be more expensive to initially set up, but they typically " - "have lower operational costs than non-renewable sources.\n" - "4. 
Reliability: Renewable energy sources are often more reliable and can be used in more remote " - "locations than non-renewable sources.\n" - "5. Flexibility: Renewable energy sources are often more flexible and can be adapted to different " - "situations and needs, while non-renewable sources are more rigid and inflexible.\n" - "6. Sustainability: Renewable energy sources are more sustainable over the long term, while " - "non-renewable sources are not, and their depletion can lead to economic and social instability.", - ), - ), - offset=2, - sep_style=SeparatorStyle.ADD_COLON_SINGLE, - sep="\n### ", - stop_str="###", -) - - -# Vicuna v1.1 template -conv_vicuna_v1_1 = Conversation( - system="A chat between a curious user and an artificial intelligence assistant. " - "The assistant gives helpful, detailed, and polite answers to the user's questions.", - roles=("USER", "ASSISTANT"), - messages=(), - offset=0, - sep_style=SeparatorStyle.ADD_COLON_TWO, - sep=" ", - sep2="", -) - -conv_templates = { - "conv_one_shot": conv_one_shot, - "vicuna_v1.1": conv_vicuna_v1_1, -} - - -def get_default_conv_template(model_name): - model_name = model_name.lower() - if "vicuna" in model_name or "output" in model_name: - return conv_vicuna_v1_1 - return conv_one_shot diff --git a/research/llama3_1/llama3_1_preprocess.py b/research/llama3_1/llama3_1_preprocess.py deleted file mode 100644 index d201f46ed8b7b9055954dad9bb11537127ed151e..0000000000000000000000000000000000000000 --- a/research/llama3_1/llama3_1_preprocess.py +++ /dev/null @@ -1,228 +0,0 @@ -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# ============================================================================ - -""" -Transform the wikitext-2, wikitext-103, lambada, and openwebtext datasets to MindRecord. -""" -import argparse -import json -import os -import numpy as np -from mindspore.mindrecord import FileWriter -from mindformers.tools import logger -from mindformers.dataset.dataloader.datareaders import wikitext_clean -from llama3_1_tokenizer import Llama3Tokenizer -from llama3_1_conversation import get_default_conv_template - - -IGNORE_TOKEN_ID = -100 - - -def chunks(lst, n): - """Yield n-sized chunks from a list.""" - for i in range(0, len(lst), n): - yield lst[i:i + n] - - -def preprocess(sources, tokenizer, seq_length): - """conversation preprocess.""" - conv = get_default_conv_template("vicuna").copy() - roles = {"human": conv.roles[0], "gpt": conv.roles[1]} - - # Apply prompt templates - conversations = [] - for i, source in enumerate(sources): - if roles.get(source[0].get("from")) != conv.roles[0]: - # Skip the first message if it is not from a human - source = source[1:] - - conv.messages = [] - for j, sentence in enumerate(source): - role = roles.get(sentence.get("from")) - if role != conv.roles[j % 2]: - raise ValueError(f"sources[{i}] has an unexpected role order.") - conv.append_message(role, sentence["value"]) - conversations.append(conv.get_prompt()) - - sep = conv.sep + conv.roles[1] + ": " - # Tokenize conversations - input_ids = [] - targets = [] - for conversation in conversations: - rounds = conversation.split(conv.sep2) - ids = [tokenizer.bos_token_id] - mask = [1] - for _, rou in
enumerate(rounds): - if rou == "": - break - conv_out = tokenizer(rou) - ids.extend(conv_out['input_ids'][1:]) - mask.extend(conv_out['attention_mask'][1:]) - d = {'input_ids': ids, 'attention_mask': mask} - # pylint: disable=W0212 - d = tokenizer._pad(d, max_length=seq_length, padding_strategy='max_length') - input_ids.append(d['input_ids'][:seq_length]) - - target = np.array(d['input_ids']) - total_len = int(np.not_equal(target, tokenizer.pad_token_id).sum()) - cur_len = 1 - target[:cur_len] = IGNORE_TOKEN_ID - for _, rou in enumerate(rounds): - if rou == "": - break - parts = rou.split(sep) - if len(parts) != 2: - break - parts[0] += sep - round_len = len(tokenizer(rou)['input_ids']) - 1 - instruction_len = len(tokenizer(parts[0])['input_ids']) - 3 - - target[cur_len: cur_len + instruction_len] = IGNORE_TOKEN_ID - - cur_len += round_len - target[cur_len:] = IGNORE_TOKEN_ID - - if cur_len < seq_length: - if cur_len != total_len: - target[:] = IGNORE_TOKEN_ID - else: - target = target[:seq_length] - targets.append(target.tolist()) - - input_ids = np.array(input_ids, dtype=np.int32) - targets = np.array(targets, dtype=np.int32) - - return dict( - input_ids=input_ids, - labels=targets, - ) - - -class SupervisedDataset: - """Dataset for supervised fine-tuning.""" - - def __init__(self, raw_data, tokenizer, seq_length): - super(SupervisedDataset, self).__init__() - - sources = [example["conversations"] for example in raw_data] - data_dict = preprocess(sources, tokenizer, seq_length) - - self.input_ids = data_dict.get("input_ids") - self.labels = data_dict.get("labels") - - def __len__(self): - return len(self.input_ids) - - def __getitem__(self, i): - return dict( - input_ids=self.input_ids[i], - labels=self.labels[i] - ) - - -def tokenize_wiki(tokenizer, file_path, seq_length, repeat): - """tokenize wikitext-2/wikitext-103 dataset""" - content = [] - with open(file_path, 'r', encoding='utf-8') as f: - for para in wikitext_clean(f.read()).split("\n\n"): - if para and 
para.strip().startswith('=') is False: - content += tokenizer(para)['input_ids'] - content_out = [] - for _ in range(repeat): - content_out.extend(content) - content = content_out - for chunk in chunks(content, seq_length): - sample = {} - if len(chunk) == seq_length: - sample['input_ids'] = np.array(chunk, dtype=np.int32) - yield sample - - -# pylint: disable=C0111 -# pylint: disable=W0703 -def tokenize_qa(tokenizer, file_path, seq_length): - file = None - raw_data = None - try: - file = open(file_path, "r") - raw_data = json.load(file) - except FileNotFoundError as file_not_found_error: - logger.error(file_not_found_error) - except UnicodeDecodeError as decode_error: - logger.error(decode_error) - except IOError as io_error: - logger.error(io_error) - except Exception as exception: - logger.error(exception) - finally: - if file is not None: - file.close() - dataset_cls = SupervisedDataset(raw_data, tokenizer, seq_length) - for i, _ in enumerate(dataset_cls): - yield dataset_cls[i] - - -if __name__ == '__main__': - parser = argparse.ArgumentParser() - parser.add_argument('--dataset_type', type=str, default='wiki', choices=['wiki', 'qa']) - parser.add_argument('--input_glob', type=str, default='./dataset/wikitext-2/wiki.train.tokens') - parser.add_argument('--output_file', type=str, default='./dataset/wiki8192/wiki8192') - parser.add_argument('--tokenizer', type=str, default='llama3', choices=['llama3']) - parser.add_argument('--model_file', type=str, default='./ckpt/llama3/tokenizer.model') - parser.add_argument('--file_partition', type=int, default=1) - parser.add_argument('--repeat', type=int, default=1) - parser.add_argument('--seq_length', type=int, default=8192) - args = parser.parse_args() - # pylint: disable=C0326 - out_dir, out_file = os.path.split(os.path.abspath(args.output_file)) - if not os.path.exists(out_dir): - os.mkdir(out_dir) - if args.dataset_type == 'wiki': - schema = {'input_ids': {"type": "int32", "shape": [-1]}, } - elif args.dataset_type == 
'qa': - schema = {'input_ids': {"type": "int32", "shape": [-1]}, 'labels': {"type": "int32", "shape": [-1]}} - writer = FileWriter(file_name=args.output_file, - shard_num=args.file_partition) - writer.add_schema(schema, args.dataset_type) - - # Load the tokenizer - if not os.path.exists(args.model_file): - raise FileNotFoundError(f"file {args.model_file} does not exist.") - - transforms_count = 0 - word_tokenizer = Llama3Tokenizer(vocab_file=args.model_file) - if hasattr(word_tokenizer, 'add_bos_token'): - word_tokenizer.add_bos_token = True - if hasattr(word_tokenizer, 'add_eos_token'): - word_tokenizer.add_eos_token = True - if args.dataset_type == 'wiki': - for x in tokenize_wiki(word_tokenizer, args.input_glob, args.seq_length + 1, args.repeat): - transforms_count += 1 - writer.write_raw_data([x]) - print("Transformed {} records.".format(transforms_count)) - elif args.dataset_type == 'qa': - for x in tokenize_qa(word_tokenizer, args.input_glob, args.seq_length + 1): - transforms_count += 1 - writer.write_raw_data([x]) - print("Transformed {} records.".format(transforms_count)) - else: - raise ValueError( - "Unsupported dataset type: {}".format(args.dataset_type)) - - writer.commit() - out_file = args.output_file - if args.file_partition > 1: - out_file += '0' - print("Transform finished, output file: {}".format(out_file)) diff --git a/research/llama3_1/llama3_1_tokenizer.py b/research/llama3_1/llama3_1_tokenizer.py deleted file mode 100644 index bcc083c87bfd444cdb42a1d5947491c831e0b186..0000000000000000000000000000000000000000 --- a/research/llama3_1/llama3_1_tokenizer.py +++ /dev/null @@ -1,244 +0,0 @@ -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# ============================================================================ -"""llama3 tokenizer APIs.""" - -import base64 -from typing import Collection, Dict, List, Set, Union -import json -import unicodedata - -from mindformers.models.tokenization_utils import AddedToken, PreTrainedTokenizer -from mindformers.tools.register import MindFormerRegister, MindFormerModuleType -from mindformers.tools.utils import check_file - -try: - import tiktoken -except ImportError as e: - raise ImportError("Package 'tiktoken' is required to run Llama3. Please install it with pip.") from e - -PAT_STR = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| " \ - r"?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+" - - -def _load_tiktoken_bpe(tiktoken_bpe_file: str) -> Dict[bytes, int]: - with open(tiktoken_bpe_file, "rb") as f: - contents = f.read() - return { - base64.b64decode(token): int(rank) - for token, rank in (line.split() for line in contents.splitlines() if line) - } - - -def _load_tokenizer_json(json_file): - with open(json_file, "rb") as f: - contents = json.loads(f.read()) - return { - bytes(token, encoding='utf8'): int(rank) - for token, rank in contents['model']['vocab'].items() - } - - -@MindFormerRegister.register(MindFormerModuleType.TOKENIZER) -class Llama3Tokenizer(PreTrainedTokenizer): - """Llama3 Tokenizer""" - VOCAB_FILES = {'vocab_file': 'tokenizer.json'} - FILE_LIST = [] - special_tokens: Dict[str, int] - - def __init__(self, - vocab_file, - bos_token="<|begin_of_text|>", - eos_token="<|end_of_text|>", -
pad_token="<|reserved_special_token_0|>", - add_bos_token=False, - add_eos_token=False, - errors="replace", - num_reserved_special_tokens=256, - **kwargs): - pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token - - self.errors = errors - self.vocab_file = vocab_file - check_file(vocab_file, "tokenizer") - self.add_bos_token = add_bos_token - self.add_eos_token = add_eos_token - if vocab_file.split('.')[-1] == 'json': - self.mergeable_ranks = _load_tokenizer_json(vocab_file) - else: - self.mergeable_ranks = _load_tiktoken_bpe(vocab_file) # type: dict[bytes, int] - num_base_tokens = len(self.mergeable_ranks) - special_tokens = [ - "<|begin_of_text|>", - "<|end_of_text|>", - "<|reserved_special_token_0|>", - "<|reserved_special_token_1|>", - "<|reserved_special_token_2|>", - "<|reserved_special_token_3|>", - "<|start_header_id|>", - "<|end_header_id|>", - "<|reserved_special_token_4|>", - "<|eot_id|>", # end of turn - ] + [ - f"<|reserved_special_token_{i}|>" - for i in range(5, num_reserved_special_tokens - 5) - ] - self.special_tokens = { - token: num_base_tokens + i - for i, token in enumerate(special_tokens) - } - - self.tokenizer = tiktoken.Encoding( - "Llama3", - pat_str=PAT_STR, - mergeable_ranks=self.mergeable_ranks, - special_tokens=self.special_tokens, - ) - - self.decoder = { - v: k - for k, v in self.mergeable_ranks.items() - } # type: dict[int, bytes|str] - self.decoder.update({ - v: k - for k, v in self.special_tokens.items() - }) - - bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token - eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token - pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token - - super().__init__(bos_token=bos_token, - eos_token=eos_token, - pad_token=pad_token, - **kwargs) - - @property - def vocab_size(self): - return 
self.tokenizer.n_vocab - - def get_vocab(self): - """Returns vocab as a dict""" - vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)} - vocab.update(self.added_tokens_encoder) - return vocab - - # override Tokenizer.convert_tokens_to_string() - def convert_tokens_to_string(self, tokens: List[Union[bytes, str]]) -> str: - """ - Converts a sequence of tokens into a single string. - """ - text = "" - temp = b"" - for t in tokens: - if isinstance(t, str): - if temp: - text += temp.decode("utf-8", errors=self.errors) - temp = b"" - text += t - elif isinstance(t, bytes): - temp += t - else: - raise TypeError("token should only be of type bytes or str") - if temp: - text += temp.decode("utf-8", errors=self.errors) - return text - - # called by Tokenizer.convert_tokens_to_ids() & SpecialTokensMixin - def _convert_tokens_to_ids( - self, tokens: Union[bytes, str, List[Union[bytes, str]]] - ) -> Union[int, List[int]]: - """Convert the tokens to ids using vocab mapping""" - if isinstance(tokens, (str, bytes)): - return self._convert_token_to_id(tokens) - - ids = [] - for token in tokens: - ids.append(self._convert_token_to_id(token)) - return ids - - def _convert_token_to_id(self, token: Union[bytes, str]) -> int: - """Converts a token to an id using the vocab, special tokens included""" - if token in self.special_tokens: - return self.special_tokens[token] - if token in self.mergeable_ranks: - return self.mergeable_ranks[token] - raise ValueError("unknown token") - - # required by Tokenizer.convert_ids_to_tokens() of mindformers<=0.6 - def _convert_ids_to_tokens(self, input_id: int): - return self._convert_id_to_token(input_id) - - # called by Tokenizer.convert_ids_to_tokens() - def _convert_id_to_token(self, index: int) -> Union[bytes, str]: - """Converts an id to a token, special tokens included""" - if index in self.decoder: - return self.decoder[index] - raise ValueError("unknown ids") - - # pylint: disable=W0613 - def tokenize( - self, - text: str, - 
allowed_special: Union[Set, str] = "all", - disallowed_special: Union[Collection, str] = (), - **kwargs, - ) -> List[Union[bytes, str]]: - """ - Converts a string into a sequence of tokens. - - Args: - text (`str`): - The sequence to be encoded. - allowed_special (`Literal["all"]` or `set`): - The surface forms of the tokens to be encoded as special tokens in regular texts. - Defaults to "all". - disallowed_special (`Literal["all"]` or `Collection`): - The surface forms of the tokens that should not be in regular texts and trigger errors. - Defaults to an empty tuple. - - kwargs (additional keyword arguments, *optional*): - Will be passed to the underlying model specific encode method. - - Returns: - `List[bytes|str]`: The list of tokens. - """ - tokens = [] - text = unicodedata.normalize("NFC", text) - if self.add_bos_token: - tokens.insert(0, self.decoder[self.bos_token_id]) - - # this implementation takes a detour: text -> token id -> token surface forms - for t in self.tokenizer.encode( - text, allowed_special=allowed_special, disallowed_special=disallowed_special - ): - tokens.append(self.decoder[t]) - if self.add_eos_token: - tokens.append(self.decoder[self.eos_token_id]) - return tokens - - # pylint: disable=W0613 - def _decode( - self, - token_ids: Union[int, List[int]], - skip_special_tokens: bool = False, - errors: str = None, - **kwargs, - ) -> str: - """override Tokenizer._decode(), called by PreTrainedTokenizerBase.decode()""" - if isinstance(token_ids, int): - token_ids = [token_ids] - if skip_special_tokens: - token_ids = [i for i in token_ids if i != self.pad_token_id and i not in self.special_tokens.values()] - return self.tokenizer.decode(token_ids, errors=errors or self.errors) diff --git a/research/llama3_1/utils.py b/research/llama3_1/utils.py deleted file mode 100644 index 075d04ba1431beb6f8e662936c5f2d8d977d1789..0000000000000000000000000000000000000000 --- a/research/llama3_1/utils.py +++ /dev/null @@ -1,67 +0,0 @@ -# Copyright 2024 Huawei 
Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# ============================================================================ -""" -DEPRECATED MODULE - -This module is deprecated and will be removed in future releases. -LLaMA models' utils. -""" - - -def convert_model_config(configs): - """convert model config to dynamic-infer style""" - ffn_hidden_size = configs.hidden_size * 4 - if configs.intermediate_size is not None: - ffn_hidden_size = configs.intermediate_size - else: - if configs.ffn_dim_multiplier is not None: - ffn_hidden_size = int((configs.ffn_dim_multiplier + 0.01) * ffn_hidden_size) - ffn_hidden_size = int(2 * ffn_hidden_size / 3) - ffn_hidden_size = configs.multiple_of * ((ffn_hidden_size + configs.multiple_of - 1) // configs.multiple_of) - - configs.apply_query_key_layer_scaling = False - configs.apply_residual_connection_post_norm = False - configs.attention_dropout_rate = 0.0 - configs.attention_type = 'self_attn' - configs.ffn_hidden_size = ffn_hidden_size - configs.hidden_act = "silu" - configs.hidden_dropout_rate = 0.0 - configs.kv_num_heads = configs.num_heads if configs.n_kv_heads is None else configs.n_kv_heads - configs.layernorm_epsilon = configs.rms_norm_eps - configs.mask_func_type = "attn_mask_add" - configs.mlp_has_bias = False - configs.normalization = "RMSNorm" - configs.num_experts = None - configs.out_proj_has_bias = False - configs.param_init_dtype = configs.param_init_type - configs.layernorm_compute_dtype = 
configs.layernorm_compute_type - configs.residual_connection_dtype = configs.softmax_compute_type - configs.share_embedding_weight = False - configs.softmax_compute_dtype = configs.softmax_compute_type - configs.use_gqa = False - configs.mlp_has_gate = True - configs.post_norm = True - configs.recompute_granularity = None - configs.ffn_concat = configs.qkv_concat - configs.is_dynamic = True - - parallel_config = configs.parallel_config - parallel_config.tensor_parallel = parallel_config.model_parallel - parallel_config.expert_parallel = 1 - parallel_config.use_sequence_parallel = False - parallel_config.use_zero3 = False - configs.parallel_config = parallel_config - - return configs diff --git a/run_mindformer.py b/run_mindformer.py index 5d4476257c786bb90cef2365b2477e1630d6d7dd..3071803c3223d89bec49f455b34853ea13bcd1a7 100644 --- a/run_mindformer.py +++ b/run_mindformer.py @@ -66,7 +66,7 @@ def main(config): build_context(config) trainer = Trainer(config) - if config.run_mode == 'train' or config.run_mode == 'finetune': + if config.run_mode in ('train', 'finetune'): trainer.train() elif config.run_mode == 'eval': trainer.evaluate(eval_checkpoint=config.load_checkpoint) @@ -132,8 +132,6 @@ if __name__ == "__main__": parser.add_argument( '--load_checkpoint', default=None, type=str, help="load model checkpoint to train/finetune/eval/predict, " - "it is also support input model name, such as 'llama3_1_8b', " - "please refer to https://gitee.com/mindspore/mindformers#%E4%BB%8B%E7%BB%8D." "Default: None") parser.add_argument( '--src_strategy_path_or_dir', default=None, type=str, @@ -217,7 +215,7 @@ if __name__ == "__main__": for item in rest_args_ for i in item.split("=")] if len(rest_args_) % 2 != 0: - raise ValueError(f"input arg key-values are not in pair, please check input args. ") + raise ValueError("input arg key-values are not in pair, please check input args. 
") if args_.config is not None and not os.path.isabs(args_.config): args_.config = os.path.join(work_path, args_.config) diff --git a/tests/st/test_grace_exit_save_ckpt/__init__.py b/tests/st/test_grace_exit_save_ckpt/__init__.py deleted file mode 100644 index 39250f7a209c43909f413f55827e4bf534a72d25..0000000000000000000000000000000000000000 --- a/tests/st/test_grace_exit_save_ckpt/__init__.py +++ /dev/null @@ -1,15 +0,0 @@ -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# ============================================================================ -"""test resume.""" diff --git a/tests/st/test_grace_exit_save_ckpt/grace_exit_save_ckpt.py b/tests/st/test_grace_exit_save_ckpt/grace_exit_save_ckpt.py deleted file mode 100644 index 441dc470ae6d557a0c0ce60e584b48a97c33e02c..0000000000000000000000000000000000000000 --- a/tests/st/test_grace_exit_save_ckpt/grace_exit_save_ckpt.py +++ /dev/null @@ -1,83 +0,0 @@ -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and -# limitations under the License. -# ============================================================================ -""" -Test module for testing resume training from specified checkpoint. -How to run this: -pytest tests/st/test_grace_exit_save_ckpt/test_parallel_grace_exit_save_ckpt.py -""" -import os -from glob import glob -import numpy as np - -from mindspore.dataset import GeneratorDataset - -from mindformers import build_context -from mindformers.tools.utils import ( - get_epoch_and_step_from_ckpt_name -) -from mindformers.trainer import Trainer -from mindformers.models.llama import LlamaForCausalLM, LlamaConfig -from mindformers.tools.register import MindFormerConfig - - -SEED = 42 -NUM_LAYERS = 2 -NUM_HEADS = 4 -HIDDEN_SIZE = 512 -SEQ_LENGTH = 1024 -DATA_SIZE = 1024 - - -def generator(): - """dataset generator""" - for i in range(DATA_SIZE): - np.random.seed(SEED + i) - input_ids = np.random.randint(low=0, high=DATA_SIZE, size=(SEQ_LENGTH + 1,)).astype(np.int32) - yield input_ids - - -def get_checkpoints_path(checkpoint_dir): - """get checkpoints path""" - checkpoints_path = glob(os.path.join(checkpoint_dir, "*.ckpt")) - checkpoints_path.sort(key=get_epoch_and_step_from_ckpt_name) - return checkpoints_path - - -def llama_trainer_train_from_instance(): - """ - Feature: Create Trainer From Instance - Description: Test Trainer API to train from self-define instance API. - Expectation: TypeError - """ - # Config definition - - config = MindFormerConfig("./test_grace_exit_save_ckpt.yaml") - build_context(config) - - model_config = LlamaConfig(num_layers=NUM_LAYERS, seq_length=SEQ_LENGTH, - num_heads=NUM_HEADS, hidden_size=HIDDEN_SIZE, - parallel_config=config.parallel_config) - model = LlamaForCausalLM(model_config) - - # Training using first dataset. 
- dataset = GeneratorDataset(generator, column_names=["input_ids"]) - dataset = dataset.batch(batch_size=8) - - trainer = Trainer(model=model, args=config, train_dataset=dataset) - - trainer.train(train_checkpoint=False) - - -llama_trainer_train_from_instance() diff --git a/tests/st/test_grace_exit_save_ckpt/graceful_exit.json b/tests/st/test_grace_exit_save_ckpt/graceful_exit.json deleted file mode 100644 index 65610ba3421930f83d086999aaa78df0c1bbeaf9..0000000000000000000000000000000000000000 --- a/tests/st/test_grace_exit_save_ckpt/graceful_exit.json +++ /dev/null @@ -1,3 +0,0 @@ -{ - "GracefulExit": 1 -} \ No newline at end of file diff --git a/tests/st/test_grace_exit_save_ckpt/msrun_launch.sh b/tests/st/test_grace_exit_save_ckpt/msrun_launch.sh deleted file mode 100644 index 1358c6368f9d4f033e086daad614a771f1788edc..0000000000000000000000000000000000000000 --- a/tests/st/test_grace_exit_save_ckpt/msrun_launch.sh +++ /dev/null @@ -1,26 +0,0 @@ -#!/bin/bash -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-# ============================================================================ -set -e -BASE_PATH=$(cd "$(dirname $0)"; pwd) -USE_DEVICE_NUM=$1 - -msrun --worker_num=${USE_DEVICE_NUM} \ - --local_worker_num=${USE_DEVICE_NUM} \ - --master_port=8118 \ - --log_dir=msrun_log \ - --join=True \ - --cluster_time_out=300 \ - ${BASE_PATH}/grace_exit_save_ckpt.py > grace_exit_save_ckpt.log 2>&1 diff --git a/tests/st/test_grace_exit_save_ckpt/test_grace_exit_save_ckpt.yaml b/tests/st/test_grace_exit_save_ckpt/test_grace_exit_save_ckpt.yaml deleted file mode 100644 index db9222f64da500816c5f8e30486259c4e73defa8..0000000000000000000000000000000000000000 --- a/tests/st/test_grace_exit_save_ckpt/test_grace_exit_save_ckpt.yaml +++ /dev/null @@ -1,160 +0,0 @@ -seed: 0 -output_dir: '' # path to save checkpoint/strategy -load_checkpoint: "" -src_strategy_path_or_dir: '' -auto_trans_ckpt: False # If true, auto transform load_checkpoint to load in distributed model -only_save_strategy: False -resume_training: False -use_graceful_exit: True -run_mode: 'train' - -# trainer config -trainer: - type: CausalLanguageModelingTrainer - model_name: 'llama3_1_8b' - -# runner config -runner_config: - epochs: 2 - batch_size: 1 - sink_mode: True - sink_size: 1 - -# optimizer -optimizer: - type: AdamW - betas: [0.9, 0.95] - eps: 1.e-8 - -# lr schedule -lr_schedule: - type: CosineWithWarmUpLR - learning_rate: 1.e-5 - lr_end: 0.0 - warmup_ratio: 0.03 - total_steps: -1 # -1 means it will load the total steps of the dataset - -# dataset -train_dataset: &train_dataset - data_loader: - type: MindDataset - dataset_dir: "" - shuffle: True - input_columns: ["input_ids", "labels"] # "input_ids", "labels" , labels are used in instruction finetune. 
- num_parallel_workers: 8 - python_multiprocessing: False - drop_remainder: True - batch_size: 6 - repeat: 1 - numa_enable: False - prefetch_size: 1 -train_dataset_task: - type: CausalLanguageModelDataset - dataset_config: *train_dataset -# if True, do evaluate during the training process. if false, do nothing. -# note that the task trainer should support _evaluate_in_training function. -do_eval: False - -use_parallel: True -# parallel context config -parallel: - parallel_mode: 1 # 0-data parallel, 1-semi-auto parallel, 2-auto parallel, 3-hybrid parallel - gradients_mean: False - enable_alltoall: False - full_batch: True - search_mode: "sharding_propagation" - enable_parallel_optimizer: False - strategy_ckpt_save_file: "./ckpt_strategy.ckpt" - parallel_optimizer_config: - gradient_accumulation_shard: False - parallel_optimizer_threshold: 64 -# default parallel of device num = 8 for Atlas 800T A2 -parallel_config: - data_parallel: 2 - model_parallel: 4 - pipeline_stage: 1 - use_seq_parallel: False - micro_batch_num: 1 - vocab_emb_dp: True - gradient_aggregation_group: 4 -# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process. 
-micro_batch_interleave_num: 1 - -# recompute config -recompute_config: - recompute: True - select_recompute: False - parallel_optimizer_comm_recompute: False - mp_comm_recompute: True - recompute_slice_activation: True - -# callbacks -callbacks: - - type: MFLossMonitor - - type: OnRequestExit - save_ckpt: True - save_mindir: False - file_name: Llama - directory: "./grace_ckpt/" - config_file: "./graceful_exit.json" - - -# mindspore context init config -context: - mode: 0 #0--Graph Mode; 1--Pynative Mode - device_target: "Ascend" - max_call_depth: 10000 - max_device_memory: "58GB" - save_graphs: False - save_graphs_path: "./graph" - device_id: 0 - jit_config: - jit_level: "O1" - memory_optimize_level: "O0" - -# model config -model: - model_config: - type: LlamaConfig - batch_size: 1 # add for increase predict - seq_length: 1024 - hidden_size: 512 - num_layers: 2 - num_heads: 4 - n_kv_heads: 8 - vocab_size: 128256 - intermediate_size: 14336 - rms_norm_eps: 1.0e-5 - bos_token_id: 128000 - eos_token_id: 128001 - pad_token_id: 128002 - ignore_token_id: -100 - compute_dtype: "bfloat16" - layernorm_compute_type: "float32" - softmax_compute_type: "float32" - rotary_dtype: "float32" - param_init_type: "float16" - embedding_init_type: "bfloat16" - use_past: False - scaling_factor: 1.0 - theta: 500000 - extend_method: "None" # support "None", "PI", "NTK" - use_flash_attention: True # FA can accelerate training or finetune - offset: 0 - fine_grain_interleave: 1 - checkpoint_name_or_path: "" - repetition_penalty: 1 - max_decode_length: 512 - top_k: 3 - top_p: 1 - do_sample: False - arch: - type: LlamaForCausalLM - - -# wrapper cell config -runner_wrapper: - type: MFTrainOneStepCell - scale_sense: 1.0 - use_clip_grad: True - diff --git a/tests/st/test_grace_exit_save_ckpt/test_parallel_grace_exit_save_ckpt.py b/tests/st/test_grace_exit_save_ckpt/test_parallel_grace_exit_save_ckpt.py deleted file mode 100644 index 
0d172c7856ff33112e52054da1d30b7f9dde9eb8..0000000000000000000000000000000000000000 --- a/tests/st/test_grace_exit_save_ckpt/test_parallel_grace_exit_save_ckpt.py +++ /dev/null @@ -1,49 +0,0 @@ -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# ============================================================================ -""" -Test module for testing resume training from specified checkpoint. -How to run this: -pytest tests/st/test_grace_exit_save_ckpt/test_parallel_grace_exit_save_ckpt.py -""" -import os - -class TestSaveTtpCkpt: - """A test class for testing save_ttp_ckpt.""" - - def test_train(self): - """ - Feature: Trainer.train() - Description: Test parallel grace_exit_save_ckpt for train. 
- Expectation: AssertionError - """ - sh_path = os.path.split(os.path.realpath(__file__))[0] - ret = os.system(f"bash {sh_path}/msrun_launch.sh 8") - assert ret == 0 - checkpoint_dir = "./grace_ckpt" - path = "./msrun_log/worker_0.log" - flag = False - assert os.path.exists(path) - with open(path, 'r') as file: - content = file.read() - if "Graceful exit is triggered, stop training" in content: - flag = True - assert flag - for _, _, filenames in os.walk(checkpoint_dir): - for filename in filenames: - assert filename.endswith('.ckpt') - for _, _, filenames in os.walk(checkpoint_dir): - for filename in filenames: - if os.path.exists(filename): - os.remove(filename) diff --git a/tests/st/test_ut/test_models/test_build_config.py b/tests/st/test_ut/test_models/test_build_config.py deleted file mode 100644 index 4a2d3481ad82545739bcbf8008e652643c14dac6..0000000000000000000000000000000000000000 --- a/tests/st/test_ut/test_models/test_build_config.py +++ /dev/null @@ -1,28 +0,0 @@ -# Copyright 2024 Huawei Technologies Co., Ltd -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-# ============================================================================ -"""test build config.""" -from mindformers import MindFormerConfig -from mindformers.models.build_config import build_model_config -from mindformers.models.llama import LlamaConfig - - -class TestBuildModelConfig: - """A test class for testing build_model_config() method.""" - - def test_build_llama_config(self): - """test build llama config from yaml.""" - config = MindFormerConfig("research/llama3_1/llama3_1_8b/finetune_llama3_1_8b.yaml") - model_config = build_model_config(config.model.model_config) - assert isinstance(model_config, LlamaConfig)
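
Review note: the deleted `_load_tiktoken_bpe` helper in `research/llama3_1/llama3_tokenizer.py` parses a tiktoken vocabulary file in which each line is a base64-encoded token followed by its integer rank. A minimal standalone sketch of that parsing logic, with a hypothetical function name and a fake two-entry vocabulary (not a real tokenizer file):

```python
import base64
from typing import Dict


def load_tiktoken_bpe_bytes(contents: bytes) -> Dict[bytes, int]:
    """Parse tiktoken BPE contents: one '<base64 token> <rank>' pair per line."""
    return {
        base64.b64decode(token): int(rank)
        for token, rank in (line.split() for line in contents.splitlines() if line)
    }


# Fake two-entry vocabulary for illustration only.
sample = b"\n".join([
    base64.b64encode(b"hello") + b" 0",
    base64.b64encode(b" world") + b" 1",
])
ranks = load_tiktoken_bpe_bytes(sample)
```

The resulting `bytes -> rank` mapping is what the deleted tokenizer handed to `tiktoken.Encoding` as `mergeable_ranks`, with the special tokens appended after the base vocabulary.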
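
Review note: the deleted `convert_model_config` in `research/llama3_1/utils.py` derives the FFN width the Llama way: start from 4x the hidden size, optionally scale by `ffn_dim_multiplier`, cut to 2/3, then round up to a multiple of `multiple_of`, with an explicit `intermediate_size` overriding everything. A standalone sketch of that arithmetic (hypothetical function name; the original reads these fields from a config object):

```python
def compute_ffn_hidden_size(hidden_size, intermediate_size=None,
                            ffn_dim_multiplier=None, multiple_of=256):
    """Replicate the Llama-style FFN width derivation from convert_model_config."""
    if intermediate_size is not None:
        # An explicit intermediate_size wins outright.
        return intermediate_size
    ffn = hidden_size * 4
    if ffn_dim_multiplier is not None:
        ffn = int((ffn_dim_multiplier + 0.01) * ffn)
    ffn = int(2 * ffn / 3)
    # Round up to the nearest multiple of `multiple_of`.
    return multiple_of * ((ffn + multiple_of - 1) // multiple_of)
```

With Llama-3-style values (`hidden_size=4096`, `ffn_dim_multiplier=1.3`, `multiple_of=1024`) this yields 14336, matching the `intermediate_size` in the test YAML above.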