diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index 7e21412161ea72cca01a5cc48021389307f09a2e..06e0eebb0311627eb32585e6524aa8cb1d1066b4 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -11,17 +11,21 @@ This document will introduce the [Version Matching](#version-compatibility) of v - OS: Linux-aarch64 - Python: 3.9 / 3.10 / 3.11 -- Software version compatibility +- Dependent software version compatibility | Software | Version And Links | | ----- | ----- | - | CANN | [8.1.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/softwareinst/instg/instg_0000.html?Mode=PmIns&InstallType=local&OS=Debian&Software=cannToolKit) | - | MindSpore | [2.7.0](https://repo.mindspore.cn/mindspore/mindspore/version/202508/20250814/master_20250814091143_7548abc43af03319bfa528fc96d0ccd3917fcc9c_newest/unified/) | - | MSAdapter | [0.5.0](https://repo.mindspore.cn/mindspore/msadapter/version/202508/20250814/master_20250814010018_4615051c43eef898b6bbdc69768656493b5932f8_newest/any/) | - | MindSpore Transformers | [1.6.0](https://gitee.com/mindspore/mindformers) | - | Golden Stick | [1.2.0](https://repo.mindspore.cn/mindspore/golden-stick/version/202508/20250814/master_20250814010017_2713821db982330b3bcd6d84d85a3b337d555f27_newest/any/) | - | vLLM | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202507/20250715/v0.9.1/any/) | - | vLLM-MindSpore Plugin | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | + | CANN | [8.3.RC1](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/83RC1/index/index.html) | + | MindSpore | [2.7.1](https://www.mindspore.cn/versions#2.7.1) | + | MSAdapter | [0.5.0](https://repo.mindspore.cn/mindspore/msadapter/version/202510/20251011/r0.3.0_20251011095813_951a8218d4c29785e48f304e720212b57056573e_newest/) | + | MindSpore Transformers | [1.7.0](https://www.mindspore.cn/mindformers/docs/en/r1.7.0) | + | vLLM | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202507/20250715/v0.9.1/) | + +- Source code and download links of vLLM-MindSpore Plugin + + | Source Code Link | Package Link | + | ----- | ----- | + | [0.4.0](https://gitee.com/mindspore/vllm-mindspore/tree/r0.4.0/) | [Python3.9](https://ms-release.obs.cn-north-4.myhuaweicloud.com/2.7.1/VllmMindSpore/ascend/aarch64/vllm_mindspore-0.4.0-cp39-cp39-linux_aarch64.whl), [Python3.10](https://ms-release.obs.cn-north-4.myhuaweicloud.com/2.7.1/VllmMindSpore/ascend/aarch64/vllm_mindspore-0.4.0-cp310-cp310-linux_aarch64.whl), [Python3.11](https://ms-release.obs.cn-north-4.myhuaweicloud.com/2.7.1/VllmMindSpore/ascend/aarch64/vllm_mindspore-0.4.0-cp311-cp311-linux_aarch64.whl) | ## Docker Installation @@ -116,7 +120,7 @@ docker exec -it $DOCKER_NAME bash ### CANN Installation -For CANN installation methods and environment configuration, please refer to [CANN Community Edition Installation Guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit). If you encounter any issues during CANN installation, please consult the [Ascend FAQ](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html) for troubleshooting.
+For CANN installation methods and environment configuration, please refer to [CANN Community Edition Installation Guide](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/83RC1/softwareinst/instg/instg_quick.html?Mode=PmIns&InstallType=local&OS=openEuler&Software=cannToolKit). If you encounter any issues during CANN installation, please consult the [Ascend FAQ](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html) for troubleshooting. The default installation path for CANN is `/usr/local/Ascend`. After completing CANN installation, configure the environment variables with the following commands: @@ -128,11 +132,7 @@ export ASCEND_CUSTOM_PATH=${LOCAL_ASCEND}/ascend-toolkit ### vLLM Prerequisites Installation -For vLLM environment configuration and installation methods, please refer to the [vLLM Installation Guide](https://docs.vllm.ai/en/v0.9.1/getting_started/installation/cpu.html). In vLLM installation, `gcc/g++ >= 12.3.0` is required, and it can be installed by the following command: - -```bash -yum install -y gcc gcc-c++ -``` +For vLLM environment configuration and installation methods, please refer to the [vLLM Installation Guide](https://docs.vllm.ai/en/v0.9.1/getting_started/installation/cpu.html). ### vLLM-MindSpore Plugin Installation @@ -148,9 +148,17 @@ vLLM-MindSpore Plugin can be installed in the following two ways. **vLLM-MindSpo bash install_depend_pkgs.sh ``` + Compile and install vLLM-MindSpore Plugin: + + ```bash + pip install . + ``` + + Users can also refer to [Version Compatibility](#version-compatibility), check their Python version, download the matching vLLM-MindSpore Plugin whl package, and install it with pip (a short sketch is given below). + - **vLLM-MindSpore Plugin Manual Installation** - If users require custom modifications to dependent components such as vLLM, MindSpore, Golden Stick, or MSAdapter, they can prepare the modified installation packages locally and perform manual installation in a specific sequence. The installation sequence requirements are as follows: + If users require custom modifications to dependent components such as vLLM, MindSpore, or MSAdapter, they can prepare the modified installation packages locally and perform manual installation in a specific sequence. The installation sequence requirements are as follows: 1. Install vLLM ```bash pip install /path/to/vllm-*.whl ``` - 2. Uninstall Torch-related components - - ```bash - pip uninstall torch torch-npu torchvision torchaudio -y - ``` - - 3. Install MindSpore + 2. Install MindSpore ```bash pip install /path/to/mindspore-*.whl ``` - 4. Install MindSpore Transformers + 3. Install MindSpore Transformers ```bash pip install /path/to/mindformers-*.whl ``` - 5. Install Golden Stick - - ```bash - pip install /path/to/mindspore_gs-*.whl - ``` - - 6. Install MSAdapter + 4. Install MSAdapter ```bash pip install /path/to/msadapter-*.whl ``` - 7. Install vLLM-MindSpore Plugin + 5. Install vLLM-MindSpore Plugin User needs to pull source of vLLM-MindSpore Plugin, and run installation.
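For the prebuilt-wheel route mentioned under the quick installation notes above, a minimal sketch is shown below. It uses the Python 3.9 wheel URL from the version compatibility table; pick the wheel matching your interpreter, and treat the exact file name as illustrative:

```bash
# Confirm the interpreter version so the matching cpXX wheel is chosen
python --version

# Download and install the vLLM-MindSpore Plugin wheel (Python 3.9 shown; see the table above for 3.10/3.11)
wget https://ms-release.obs.cn-north-4.myhuaweicloud.com/2.7.1/VllmMindSpore/ascend/aarch64/vllm_mindspore-0.4.0-cp39-cp39-linux_aarch64.whl
pip install vllm_mindspore-0.4.0-cp39-cp39-linux_aarch64.whl
```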
diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 26739499c0c99f2f29bf766f425ee7edacebb125..2cb0068da71a6d49f2df6a5f329ad2281bcdd251 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -178,7 +178,7 @@ vllm-mindspore serve --block-size [Block Size, recommended 128] --gpu-memory-utilization [Memory utilization rate, recommended 0.9] --tensor-parallel-size [TP parallelism degree] - --headless # Only needed for worker nodes, indicates no server-side related content is needed + --headless # Run in headless mode, used for multi-node data parallelism --data-parallel-size [DP parallelism degree] --data-parallel-size-local [Number of DP workers on the current service node. The sum across all nodes equals data-parallel-size] --data-parallel-start-rank [The offset of the first DP worker responsible for the current service node, used when using the multiprocess startup method] @@ -196,7 +196,7 @@ The following is the Ray startup command: ```bash # Master Node: -vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --addition-config '{"expert_parallel": 4}' --data-parallel-backend=ray +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' --data-parallel-backend ray --quantization ascend ``` For the multiprocess startup command, please refer to the [Multiprocess Startup Method](../../../user_guide/supported_features/parallel/parallel.md#starting-the-service). @@ -224,19 +224,19 @@ pyACL (Python Ascend Computing Language) wraps the corresponding API interfaces In the target environment, after obtaining the appropriate version of the Ascend-cann-nnrt installation package, extract the pyACL dependency package and install it separately. Then add the installation path to the environment variables: ```bash -./Ascend-cann-nnrt_8.0.RC1_linux-aarch64.run --noexec --extract=./ +./Ascend-cann-nnrt_*_linux-aarch64.run --noexec --extract=./ cd ./run_package -./Ascend-pyACL_8.0.RC1_linux-aarch64.run --full --install-path= +./Ascend-pyACL_*_linux-aarch64.run --full --install-path= export PYTHONPATH=/CANN-/python/site-packages/:$PYTHONPATH ``` If there are permission issues during installation, use the following command to add permissions: ```bash -chmod -R 777 ./Ascend-pyACL_8.0.RC1_linux-aarch64.run +chmod -R 777 ./Ascend-pyACL_*_linux-aarch64.run ``` -The Ascend runtime package can be downloaded from the Ascend homepage. For example, you can download the runtime package for version [8.0.RC1.beta1](https://www.hiascend.cn/developer/download/community/result?module=cann&version=8.0.RC1.beta1). +The Ascend runtime package can be downloaded from the Ascend homepage.
For example, you can refer to [installation](../../installation/installation.md) and download the runtime package. #### Multi-Node Cluster diff --git a/docs/vllm_mindspore/docs/source_en/index.rst b/docs/vllm_mindspore/docs/source_en/index.rst index 92c20151150f8cfe9d2be6283fc4c77ba69d1bfb..f86314487f92f2195fb19a44b483f3f26db04159 100644 --- a/docs/vllm_mindspore/docs/source_en/index.rst +++ b/docs/vllm_mindspore/docs/source_en/index.rst @@ -80,11 +80,14 @@ The following are the version branches: - Unmaintained - Only doc fixed is allowed * - r0.2 - - Maintained - - Compatible with vLLM v0.7.3, and CI commitment for MindSpore 2.6.0 + - Unmaintained + - Compatible with vLLM v0.7.3, and CI commitment for MindSpore 2.6.0. Only doc fixed is allowed * - r0.3.0 - - Maintained + - Unmaintained - Compatible with vLLM v0.8.3, and CI commitment for MindSpore 2.7.0 + * - r0.4.0 + - Maintained + - Compatible with vLLM v0.9.1, and CI commitment for MindSpore 2.7.1 SIG ----------------------------------------------------- diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/parallel/parallel.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/parallel/parallel.md index ea6aeb84953b84fbecc1f64523b587793d950796..8d9b682cfd1a7c7fd649765a04512e98d2d54e93 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/parallel/parallel.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/parallel/parallel.md @@ -107,7 +107,7 @@ To use Expert Parallelism (EP), configure the following options in the launch co The following command is an example of launching Expert Parallelism for Qwen-3 MOE on a single node with eight cards: ```bash -vllm-mindspore serve /path/to/Qwen3-MOE --trust-remote-code --enable-expert-parallel --addition-config '{"expert_parallel": 8} +vllm-mindspore serve /path/to/Qwen3-MOE --trust-remote-code --enable-expert-parallel --additional-config '{"expert_parallel": 8}' ``` #### Multi-Node Example @@ -115,7 +115,7 @@ vllm-mindspore serve /path/to/Qwen3-MOE --trust-remote-code --enable-expert-para Multi-node Expert Parallelism relies on Ray for launch. Please refer to [Ray Multi-Node Cluster Management](#ray-multi-node-cluster-management) for Ray environment configuration. The following command is an example of launching Expert Parallelism for Qwen-3 MOE across two nodes with eight cards total using Ray: ```bash -vllm-mindspore serve /path/to/Qwen3-MOE --trust-remote-code --enable-expert-parallel --addition-config '{"expert_parallel": 8} --data-parallel-backend=ray +vllm-mindspore serve /path/to/Qwen3-MOE --trust-remote-code --enable-expert-parallel --additional-config '{"expert_parallel": 8}' --data-parallel-backend ray ``` ## Hybrid Parallelism @@ -129,7 +129,7 @@ Users can flexibly combine and adjust parallel strategies based on the model use Based on the introductions above, the configurations for the three parallel strategies can be combined and enabled in the `vllm-mindspore serve` launch command. Multi-node Hybrid Parallelism relies on Ray for launch. Please refer to [Ray Multi-Node Cluster Management](#ray-multi-node-cluster-management) for Ray environment configuration.
The combined Ray launch command for Hybrid Parallelism is as follows: ```bash -vllm-mindspore serve /path/to/DeepSeek-R1 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --addition-config '{"expert_parallel": 4}' --data-parallel-backend=ray +vllm-mindspore serve /path/to/DeepSeek-R1 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' --data-parallel-backend ray ``` ## Appendix @@ -145,19 +145,19 @@ pyACL (Python Ascend Computing Language) wraps the corresponding API interfaces In the target environment, after obtaining the appropriate version of the Ascend-cann-nnrt installation package, extract the pyACL dependency package and install it separately. Then add the installation path to the environment variables: ```bash -./Ascend-cann-nnrt_8.0.RC1_linux-aarch64.run --noexec --extract=./ +./Ascend-cann-nnrt_*_linux-aarch64.run --noexec --extract=./ cd ./run_package -./Ascend-pyACL_8.0.RC1_linux-aarch64.run --full --install-path= +./Ascend-pyACL_*_linux-aarch64.run --full --install-path= export PYTHONPATH=/CANN-/python/site-packages/:$PYTHONPATH ``` If there are permission issues during installation, use the following command to add permissions: ```bash -chmod -R 777 ./Ascend-pyACL_8.0.RC1_linux-aarch64.run +chmod -R 777 ./Ascend-pyACL_*_linux-aarch64.run ``` -The Ascend runtime package can be downloaded from the Ascend homepage. For example, you can download the runtime package for version [8.0.RC1.beta1](https://www.hiascend.cn/developer/download/community/result?module=cann&version=8.0.RC1.beta1). +The Ascend runtime package can be downloaded from the Ascend homepage. For example, you can refer to [installation](../../installation/installation.md) and download the runtime package. #### Multi-Node Cluster @@ -284,22 +284,17 @@ vllm-mindspore serve ``` - Users can specify the local path where the model is saved as the model tag. -- Users can configure parallelism and other features using the `--additional-config` parameter. 
Parallelism can be configured as follows, corresponding to a DP4-EP4-TP4 scenario: - ```bash - --additional-config '{"data_parallel": 4, "model_parallel": 4, "expert_parallel": 4}' - ``` - -The following are execution examples for the multiprocess and Ray startup methods respectively: +The following are execution examples for the multiprocess and Ray startup methods respectively, taking DP4-EP4-TP4 as an example: **Multiprocess Startup Method** ```bash # Master Node: -vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 127.0.0.1 --data-parallel-rpc-port 29550 --enable-expert-parallel --addition-config '{"data_parallel": 4, "model_parallel": 4, "expert_parallel": 4}' +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 127.0.0.1 --data-parallel-rpc-port 29550 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' --quantization ascend # Worker Node: -vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --headless --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 127.0.0.1 --data-parallel-rpc-port 29550 --enable-expert-parallel --addition-config '{"data_parallel": 4, "model_parallel": 4, "expert_parallel": 4}' +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --headless --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 127.0.0.1 --data-parallel-rpc-port 29550 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' --quantization ascend ``` Specifically, `data-parallel-address` and `--data-parallel-rpc-port` must be configured with the actual environment information for the running instance.
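As a small illustration of the note above, one way to fill in these fields is sketched below; the IP address is hypothetical, and any address reachable from the worker nodes works:

```bash
# On the master node, list its locally configured IP addresses
hostname -I

# If it reports, say, 192.168.1.10, replace the 127.0.0.1 placeholder in both commands with:
#   --data-parallel-address 192.168.1.10 --data-parallel-rpc-port 29550
```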
@@ -308,7 +303,7 @@ Specifically, `data-parallel-address` and `--data-parallel-rpc-port` must be con ```bash # Master Node: -vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --addition-config '{"data_parallel": 4, "model_parallel": 4, "expert_parallel": 4}' --data-parallel-backend=ray +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' --data-parallel-backend ray --quantization ascend ``` #### Sending Requests diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md index 3e8c9afa944d1bc569ae8a70211cafd160d59595..dffc241ca45d509a45a64713e125837a17fb6a41 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md @@ -8,7 +8,7 @@ In this document, the [Creating Quantized Models](#creating-quantized-models) se ## Creating Quantized Models -We use the [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) network as an example to introduce W8A8 quantization with the OutlierSuppressionLite algorithm. +We use the [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) network as an example to introduce W8A8 quantization with the OutlierSuppressionLite algorithm. This chapter requires the MindSpore Golden Stick module. Please refer to [here](https://www.mindspore.cn/golden_stick/docs/en/master/index.html) for details about this module. 
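Since the installation changes above drop Golden Stick from the plugin's default dependency list, this quantization walkthrough presumably needs the module installed separately. A minimal sketch, assuming a locally downloaded wheel (the path is a placeholder):

```bash
# Install the MindSpore Golden Stick wheel obtained from its release page
pip install /path/to/mindspore_gs-*.whl

# Quick sanity check that the module is importable before running the quantization steps
python -c "import mindspore_gs"
```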
### Quantizing Networks with MindSpore Golden Stick diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index b6bb57f3d8cdc2c803ffb6fbd50ca180a35903e6..2ae612a32149e77f6fc3c85036c4a39e5f3f3897 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -11,17 +11,21 @@ - OS:Linux-aarch64 - Python:3.9 / 3.10 / 3.11 -- 软件版本配套 +- 依赖软件版本配套 | 软件 | 配套版本与下载链接 | | ----- | ----- | - | CANN | [8.1.RC1](https://www.hiascend.com/document/detail/zh/canncommercial/81RC1/softwareinst/instg/instg_0000.html?Mode=PmIns&InstallType=local&OS=Debian&Software=cannToolKit) | - | MindSpore | [2.7.0](https://repo.mindspore.cn/mindspore/mindspore/version/202508/20250814/master_20250814091143_7548abc43af03319bfa528fc96d0ccd3917fcc9c_newest/unified/) | - | MSAdapter| [0.5.0](https://repo.mindspore.cn/mindspore/msadapter/version/202508/20250814/master_20250814010018_4615051c43eef898b6bbdc69768656493b5932f8_newest/any/) | - | MindSpore Transformers | [1.6.0](https://gitee.com/mindspore/mindformers) | - | Golden Stick | [1.2.0](https://repo.mindspore.cn/mindspore/golden-stick/version/202508/20250814/master_20250814010017_2713821db982330b3bcd6d84d85a3b337d555f27_newest/any/) | - | vLLM | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202507/20250715/v0.9.1/any/) | - | vLLM-MindSpore插件 | [0.3.0](https://gitee.com/mindspore/vllm-mindspore/) | + | CANN | [8.3.RC1](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/83RC1/index/index.html) | + | MindSpore | [2.7.1](https://www.mindspore.cn/versions#2.7.1) | + | MSAdapter| [0.5.0](https://repo.mindspore.cn/mindspore/msadapter/version/202510/20251011/r0.3.0_20251011095813_951a8218d4c29785e48f304e720212b57056573e_newest/) | + | MindSpore Transformers | [1.7.0](https://www.mindspore.cn/mindformers/docs/zh-CN/r1.7.0) | + | vLLM | [0.9.1](https://repo.mindspore.cn/mirrors/vllm/version/202507/20250715/v0.9.1/) | + +- vLLM-MindSpore插件代码仓与下载链接 + + |代码仓链接 | 插件包下载链接 | + | ----- | ----- | + | [0.4.0](https://gitee.com/mindspore/vllm-mindspore/tree/r0.4.0/) | [Python3.9](https://ms-release.obs.cn-north-4.myhuaweicloud.com/2.7.1/VllmMindSpore/ascend/aarch64/vllm_mindspore-0.4.0-cp39-cp39-linux_aarch64.whl),[Python3.10](https://ms-release.obs.cn-north-4.myhuaweicloud.com/2.7.1/VllmMindSpore/ascend/aarch64/vllm_mindspore-0.4.0-cp310-cp310-linux_aarch64.whl),[Python3.11](https://ms-release.obs.cn-north-4.myhuaweicloud.com/2.7.1/VllmMindSpore/ascend/aarch64/vllm_mindspore-0.4.0-cp311-cp311-linux_aarch64.whl) | ## docker安装 @@ -116,7 +120,7 @@ docker exec -it $DOCKER_NAME bash ### CANN安装 -CANN安装方法与环境配套,请参考[CANN社区版软件安装](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/softwareinst/instg/instg_0001.html?Mode=PmIns&OS=openEuler&Software=cannToolKit)。若用户在安装CANN过程中遇到问题,可参考[昇腾常见问题](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html)进行解决。 +CANN安装方法与环境配套,请参考[CANN社区版软件安装](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/83RC1/softwareinst/instg/instg_quick.html?Mode=PmIns&InstallType=local&OS=openEuler&Software=cannToolKit)。若用户在安装CANN过程中遇到问题,可参考[昇腾常见问题](https://www.hiascend.com/document/detail/zh/AscendFAQ/ProduTech/CANNFAQ/cannfaq_000.html)进行解决。 CANN默认安装路径为`/usr/local/Ascend`。用户在安装CANN完毕后,使用如下命令,为CANN配置环境变量: @@ -128,11 +132,7 @@ export 
ASCEND_CUSTOM_PATH=${LOCAL_ASCEND}/ascend-toolkit ### vLLM前置依赖安装 -vLLM的环境配置与安装方法,请参考[vLLM安装教程](https://docs.vllm.ai/en/v0.9.1/getting_started/installation/cpu.html)。其依赖`gcc/g++ >= 12.3.0`版本,可通过以下命令完成安装: - -```bash -yum install -y gcc gcc-c++ -``` +vLLM的环境配置与安装方法,请参考[vLLM安装教程](https://docs.vllm.ai/en/v0.9.1/getting_started/installation/cpu.html)。 ### vLLM-MindSpore插件安装 @@ -154,9 +154,11 @@ vLLM-MindSpore插件有以下两种安装方式。**vLLM-MindSpore插件快速 pip install . ``` + 也可以参考[软件配套](#版本配套),根据对应的Python版本,下载vLLM-MindSpore插件包,进行pip安装。 + - **vLLM-MindSpore插件手动安装** - 若用户对依赖的vLLM、MindSpore、Golden Stick、MSAdapter等组件有自定义修改的需求,可以在本地准备好修改后的安装包,按照特定的顺序进行手动安装。安装顺序要求如下: + 若用户对依赖的vLLM、MindSpore、MSAdapter等组件有自定义修改的需求,可以在本地准备好修改后的安装包,按照特定的顺序进行手动安装。安装顺序要求如下: 1. 安装vLLM @@ -164,37 +166,25 @@ vLLM-MindSpore插件有以下两种安装方式。**vLLM-MindSpore插件快速 pip install /path/to/vllm-*.whl ``` - 2. 卸载torch相关组件 - - ```bash - pip uninstall torch torch-npu torchvision torchaudio -y - ``` - - 3. 安装MindSpore + 2. 安装MindSpore ```bash pip install /path/to/mindspore-*.whl ``` - 4. 安装MindSpore Transformers + 3. 安装MindSpore Transformers ```bash pip install /path/to/mindformers-*.whl ``` - 5. 安装Golden Stick - - ```bash - pip install /path/to/mindspore_gs-*.whl - ``` - - 6. 安装MSAdapter + 4. 安装MSAdapter ```bash pip install /path/to/msadapter-*.whl ``` - 7. 安装vLLM-MindSpore插件 + 5. 安装vLLM-MindSpore插件 需要先拉取vLLM-MindSpore插件源码,再执行安装: @@ -212,7 +202,7 @@ vLLM-MindSpore插件有以下两种安装方式。**vLLM-MindSpore插件快速 export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. ``` -关于环境变量的具体含义,可参考[这里](../quick_start/quick_start.md#设置环境变量)。 +关于环境变量的具体含义,可参考[环境变量](../quick_start/quick_start.md#设置环境变量)。 用户可以使用如下Python脚本,进行模型的离线推理: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 1bda8b8f4ff9f1d4aafb49b5808434a0d0f5e12e..45d37386d8383ec52a6ba3d7ef5c0e484ea35f4e 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -178,7 +178,7 @@ vllm-mindspore serve --block-size [Block Size 大小,推荐128] --gpu-memory-utilization [显存利用率,推荐0.9] --tensor-parallel-size [TP 并行数] - --headless # 仅从节点需要配置,表示不需要服务侧相关内容 + --headless # 启用headless模式,多节点数据并行时使用 --data-parallel-size [DP 并行数] --data-parallel-size-local [当前服务节点中的DP数,所有节点求和等于data-parallel-size] --data-parallel-start-rank [当前服务节点中负责的首个DP的偏移量,当使用multiprocess启动方式时使用] @@ -186,17 +186,17 @@ vllm-mindspore serve --data-parallel-rpc-port [主节点的通讯端口,当使用multiprocess启动方式时使用] --enable-expert-parallel # 使能专家并行 --data-parallel-backend [ray,mp] # 指定 dp 部署方式为 ray 或是 mp(即multiprocess) - --addition-config # 并行功能与额外配置 + --additional-config # 并行功能与额外配置 ``` - 用户可以通过指定模型保存的本地路径为模型标签; -- 用户可以通过`--addition-config`参数,配置并行与其他功能。 +- 用户可以通过`--additional-config`参数,配置并行与其他功能。 以下为Ray启动命令: ```bash # 主节点: -vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --addition-config '{"expert_parallel": 4}' --data-parallel-backend=ray +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code 
--max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' --data-parallel-backend ray --quantization ascend ``` 关于multiprocess启动命令,可以参考[multiprocess启动方式](../../../user_guide/supported_features/parallel/parallel.md#启动服务)。 @@ -224,19 +224,19 @@ pyACL(Python Ascend Computing Language)通过 CPython 封装了 AscendCL 对 在对应环境中,获取相应版本的 Ascend-cann-nnrt 安装包后,解压出 pyACL 依赖包并单独安装,并将安装路径添加到环境变量中: ```bash -./Ascend-cann-nnrt_8.0.RC1_linux-aarch64.run --noexec --extract=./ +./Ascend-cann-nnrt_*_linux-aarch64.run --noexec --extract=./ cd ./run_package -./Ascend-pyACL_8.0.RC1_linux-aarch64.run --full --install-path= +./Ascend-pyACL_*_linux-aarch64.run --full --install-path= export PYTHONPATH=/CANN-/python/site-packages/:$PYTHONPATH ``` 若安装过程有权限问题,可以使用以下命令加权限: ```bash -chmod -R 777 ./Ascend-pyACL_8.0.RC1_linux-aarch64.run +chmod -R 777 ./Ascend-pyACL_*_linux-aarch64.run ``` -在 Ascend 的首页中可以下载 Ascend 运行包。例如,可以下载 [8.0.RC1.beta1](https://www.hiascend.cn/developer/download/community/result?module=cann&version=8.0.RC1.beta1) 对应版本的运行包。 +在 Ascend 的首页中可以下载 Ascend 运行包。例如,可以参考[安装指南](../../installation/installation.md)下载运行包。 #### 多节点间集群 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/index.rst b/docs/vllm_mindspore/docs/source_zh_cn/index.rst index 72df51f2b7291a15d2a54795e02f81f3cb5c0d0a..370f66110f8935bcf8efb816f68abeaafd5ea640 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/index.rst +++ b/docs/vllm_mindspore/docs/source_zh_cn/index.rst @@ -80,11 +80,14 @@ vLLM-MindSpore插件代码仓包含主干分支、开发分支、版本分支: - Unmaintained - 仅允许文档修复 * - r0.2 - - Maintained - - 基于vLLM v0.7.3版本和MindSpore 2.6.0版本CI看护 + - Unmaintained + - 基于vLLM v0.7.3版本和MindSpore 2.6.0版本CI看护。仅允许文档修复 * - r0.3.0 - Maintained - - 基于vLLM v0.7.3版本和MindSpore 2.7.0版本CI看护 + - 基于vLLM v0.8.3版本和MindSpore 2.7.0版本CI看护 + * - r0.4.0 + - Maintained + - 基于vLLM v0.9.1版本和MindSpore 2.7.1版本CI看护 SIG组织 ----------------------------------------------------- diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/parallel/parallel.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/parallel/parallel.md index 5e84bb709b823c46c0e1c7c1fafe3f0b66dac6fd..f7e2a0e66aff7cc09c82be75ceebf8e68751dd7c 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/parallel/parallel.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/parallel/parallel.md @@ -93,7 +93,7 @@ vllm-mindspore serve /path/to/Qwen2.5/model --headless --trust-remote-code --dat - `--additional-config`:配置`expert_parallel`字段为EP并行数。例如配置EP为4,则 ```bash - --addition-config '{"expert_parallel": 4} + --additional-config '{"expert_parallel": 4}' ``` > - 如果不配置`--enable-expert-parallel`则不使能EP,配置 `--additional-config '{"expert_parallel": 4}'`不会生效; @@ -107,7 +107,7 @@ vllm-mindspore serve /path/to/Qwen2.5/model --headless --trust-remote-code --dat 以下命令为单机八卡,启动Qwen-3 MOE的专家并行示例: ```bash -vllm-mindspore serve /path/to/Qwen3-MOE --trust-remote-code --enable-expert-parallel --addition-config '{"expert_parallel": 8} +vllm-mindspore serve /path/to/Qwen3-MOE --trust-remote-code --enable-expert-parallel --additional-config '{"expert_parallel": 8}' ``` #### 多机示例 @@ -115,7 +115,7 @@ vllm-mindspore serve /path/to/Qwen3-MOE --trust-remote-code --enable-expert-para 多机专家并行依赖Ray进行启动。请参考[Ray多节点集群管理](#ray多节点集群管理)进行Ray环境配置。以下命令为双机四卡,Ray启动Qwen-3 MOE的专家并行示例: ```bash
-vllm-mindspore serve /path/to/Qwen3-MOE --trust-remote-code --enable-expert-parallel --addition-config '{"expert_parallel": 8} --data-parallel-backend=ray +vllm-mindspore serve /path/to/Qwen3-MOE --trust-remote-code --enable-expert-parallel --additional-config '{"expert_parallel": 8}' --data-parallel-backend ray ``` ## 混合并行 @@ -129,7 +129,7 @@ vllm-mindspore serve /path/to/Qwen3-MOE --trust-remote-code --enable-expert-para 可根据上述介绍,分别将三种并行策略的配置叠加,在启动命令`vllm-mindspore serve`中使能。多机混合并行依赖Ray进行启动。请参考[Ray多节点集群管理](#ray多节点集群管理)进行Ray环境配置。其叠加后混合并行的Ray启动命令如下: ```bash -vllm-mindspore serve /path/to/DeepSeek-R1 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --addition-config '{"expert_parallel": 4}' --data-parallel-backend=ray +vllm-mindspore serve /path/to/DeepSeek-R1 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' --data-parallel-backend ray ``` ## 附录 @@ -145,19 +145,19 @@ pyACL (Python Ascend Computing Language) 通过 CPython 封装了 AscendCL 对 在对应环境中,获取相应版本的 Ascend-cann-nnrt 安装包后,解压出 pyACL 依赖包并单独安装,并将安装路径添加到环境变量中: ```bash -./Ascend-cann-nnrt_8.0.RC1_linux-aarch64.run --noexec --extract=./ +./Ascend-cann-nnrt_*_linux-aarch64.run --noexec --extract=./ cd ./run_package -./Ascend-pyACL_8.0.RC1_linux-aarch64.run --full --install-path= +./Ascend-pyACL_*_linux-aarch64.run --full --install-path= export PYTHONPATH=/CANN-/python/site-packages/:$PYTHONPATH ``` 若安装过程有权限问题,可以使用以下命令加权限: ```bash -chmod -R 777 ./Ascend-pyACL_8.0.RC1_linux-aarch64.run +chmod -R 777 ./Ascend-pyACL_*_linux-aarch64.run ``` -在 Ascend 的首页中可以下载 Ascend 运行包。例如,可以下载 [8.0.RC1.beta1](https://www.hiascend.cn/developer/download/community/result?module=cann&version=8.0.RC1.beta1) 对应版本的运行包。 +在 Ascend 的首页中可以下载 Ascend 运行包。例如,可以参考[安装指南](../../installation/installation.md)下载运行包。 #### 多节点间集群 @@ -280,27 +280,21 @@ vllm-mindspore serve --data-parallel-rpc-port [主节点的通讯端口,当使用multiprocess启动方式时使用] --enable-expert-parallel # 使能专家并行 --data-parallel-backend [ray,mp] # 指定 dp 部署方式为 Ray 或是 mp(即multiprocess) - --addition-config # 并行功能与额外配置 + --additional-config # 并行功能与额外配置 ``` - 用户可以通过指定模型保存的本地路径为模型标签; -- 用户可以通过`--addition-config`参数,配置并行与其他功能。其中并行可进行如下配置,对应的是DP4-EP4-TP4场景: - - ```bash - --addition-config '{"data_parallel": 4, "model_parallel": 4, "expert_parallel": 4}' - ``` - -以下分别为multiprocess与Ray两种启动方式的执行示例: +以DP4-EP4-TP4场景为例,以下分别为multiprocess与Ray两种启动方式的执行示例: **multiprocess启动方式** ```bash # 主节点: -vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 127.0.0.1 --data-parallel-rpc-port 29550 --enable-expert-parallel --addition-config '{"data_parallel": 4, "model_parallel": 4, "expert_parallel": 4}' +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2
--data-parallel-start-rank 0 --data-parallel-address 127.0.0.1 --data-parallel-rpc-port 29550 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' --quantization ascend # 从节点: -vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --headless --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 127.0.0.1 --data-parallel-rpc-port 29550 --enable-expert-parallel --addition-config '{"data_parallel": 4, "model_parallel": 4, "expert_parallel": 4}' +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --headless --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 127.0.0.1 --data-parallel-rpc-port 29550 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' --quantization ascend ``` 其中,`data-parallel-address`和`--data-parallel-rpc-port`需要设置成实际运行的环境信息。 @@ -309,7 +303,7 @@ vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --headless --trust-remo ```bash # 主节点: -vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --addition-config '{"data_parallel": 4, "model_parallel": 4, "expert_parallel": 4}' --data-parallel-backend=ray +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' --data-parallel-backend ray --quantization ascend ``` #### 发送请求 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md index 42428355a13d46c3df016ea73dfab646011df456..794b383c0a31788c050dcfeec201cc61e538897e 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md @@ -8,11 +8,11 @@ ## 创建量化模型 -以[DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)网络为例,使用OutlierSuppressionLite算法对其进行W8A8量化。 +以[DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)网络为例,使用OutlierSuppressionLite算法对其进行W8A8量化。该章节需依赖MindSpore金箍棒模块,请参考[这里](https://www.mindspore.cn/golden_stick/docs/zh-CN/master/index.html)了解该模块。 ### 使用MindSpore金箍棒量化网络 -我们将使用[MindSpore 金箍棒的PTQ算法](https://gitee.com/mindspore/golden-stick/blob/master/mindspore_gs/ptq/ptq/README_CN.md)对DeepSeek-R1网络进行量化,详细方法参考[DeepSeekR1-OutlierSuppressionLite量化样例](https://gitee.com/mindspore/golden-stick/blob/master/example/deepseekv3/a8w8-osl/readme.md) +我们将使用[MindSpore 
金箍棒的PTQ算法](https://gitee.com/mindspore/golden-stick/blob/master/mindspore_gs/ptq/ptq/README_CN.md)对DeepSeek-R1网络进行量化,详细方法参考[DeepSeekR1-OutlierSuppressionLite量化样例](https://gitee.com/mindspore/golden-stick/blob/master/example/deepseekv3/a8w8-osl/readme.md)。 ### 直接下载量化权重