diff --git a/docs/vllm_mindspore/docs/source_en/faqs/faqs.md b/docs/vllm_mindspore/docs/source_en/faqs/faqs.md index 8e993fbaa379c44e8aa53fa866e5badf13e335bb..2f0b25d3a55a20b91a61e7420446030248e88d8a 100644 --- a/docs/vllm_mindspore/docs/source_en/faqs/faqs.md +++ b/docs/vllm_mindspore/docs/source_en/faqs/faqs.md @@ -27,29 +27,6 @@ ## Deployment-related Issues -### Model Fails to Load During Offline/Online Inference - -- Key error message: - - ```text - raise ValueError(f"{config.load_checkpoint} is not a valid path to load checkpoint ") - ``` - -- Solution: - 1. Check if the model path exists and is valid; - 2. If the model path exists and the model files are in `safetensors` format, confirm whether the YAML file contains the `load_ckpt_format: "safetensors"` field: - 1. Print the path of the YAML file used by the model: - - ```bash - echo $MINDFORMERS_MODEL_CONFIG - ``` - - 2. Check the YAML file. If the `load_ckpt_format` field is missing, add it: - - ```text - load_ckpt_format: "safetensors" - ``` - ### `aclnnNonzeroV2` Related Error When Starting Online Inference - Key error message: diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index 30ed00c205a9599bdfcd4a82019da7a82834c071..30a0bd0e77d50147d419710e5b93629e185f65ec 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -135,18 +135,6 @@ vLLM-MindSpore Plugin can be installed in the following two ways. **vLLM-MindSpo bash install_depend_pkgs.sh ``` - Compile and install vLLM-MindSpore Plugin: - - ```bash - pip install . - ``` - - After executing the above commands, `mindformers` folder will be generated in the `vllm-mindspore/install_depend_pkgs` directory. Add this folder to the environment variables: - - ```bash - export PYTHONPATH=$MF_PATH:$PYTHONPATH - ``` - - **vLLM-MindSpore Plugin Manual Installation** If users require custom modifications to dependent components such as vLLM, MindSpore, Golden Stick, or MSAdapter, they can prepare the modified installation packages locally and perform manual installation in a specific sequence. The installation sequence requirements are as follows: @@ -169,11 +157,10 @@ vLLM-MindSpore Plugin can be installed in the following two ways. **vLLM-MindSpo pip install /path/to/mindspore-*.whl ``` - 4. Clone the MindSpore Transformers repository and add it to `PYTHONPATH` + 4. Install MindSpore Transformers ```bash - git clone https://gitee.com/mindspore/mindformers.git - export PYTHONPATH=$MF_PATH:$PYTHONPATH + pip install /path/to/mindformers-*.whl ``` 5. Install Golden Stick @@ -204,7 +191,6 @@ User can verify the installation with a simple offline inference test. First, us ```bash export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` About environment variables above, user can also refer to [environment variables section](../quick_start/quick_start.md#setting-environment-variables) for more details. 
diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md index a5e23d9cf2a34bcaef5fa566adde00dc22185da1..3d445a1c54e903c155874682635434263c3615b5 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md @@ -132,19 +132,11 @@ Before launching the model, user needs to set the following environment variable ```bash export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` Here is an explanation of these environment variables: - `VLLM_MS_MODEL_BACKEND`: The backend of the model to run. User could find supported models and backends for vLLM-MindSpore Plugin in the [Model Support List](../../user_guide/supported_models/models_list/models_list.md). -- `MINDFORMERS_MODEL_CONFIG`: The model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5). For Qwen2.5-7B, the YAML file is [predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml). - -Additionally, users need to ensure that MindSpore Transformers is installed. Users can introduce MindSpore Transformers through the following methods: - -```bash -export PYTHONPATH=/path/to/mindformers:$PYTHONPATH -``` ### Offline Inference @@ -193,10 +185,10 @@ vLLM-MindSpore Plugin supports online inference deployment with the OpenAI API p Use the model `Qwen/Qwen2.5-7B-Instruct` and start the vLLM service with the following command: ```bash -python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" +vllm-mindspore serve Qwen/Qwen2.5-7B-Instruct ``` -User can also set the local model path by `--model` argument. If the service starts successfully, similar output will be obtained: +User can also pass the local model path to `vllm-mindspore serve` as model tag. If the service starts successfully, similar output will be obtained: ```text INFO: Started server process [6363] @@ -218,7 +210,7 @@ Use the following command to send a request, where `prompt` is the model input: curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 15, "temperature": 0}' ``` -User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. +User needs to ensure that the `"model"` field matches the model tag in the service startup, and the request can successfully match the model. 
If the request is processed successfully, the following inference result will be returned: diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index fc4154fb44932c5ea64fbe86d45fb7a50fc4811c..c3f7933c329c4ee167145b604f167039074e6f4c 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -115,12 +115,6 @@ Environment variable descriptions: - `ASCEND_RT_VISIBLE_DEVICES`: Configure the available device IDs for each node. Use the `npu-smi info` command to check. - `VLLM_MS_MODEL_BACKEND`: The backend of the model to run. Currently supported models and backends for vLLM-MindSpore Plugin can be found in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). -Additionally, users need to ensure that MindSpore Transformers is installed. Users can introduce MindSpore Transformers through the following methods: - -```bash -export PYTHONPATH=/path/to/mindformers:$PYTHONPATH -``` - ### Starting Ray for Multi-Node Cluster Management On Ascend, the pyACL package must be installed to adapt Ray. Additionally, the CANN dependency versions on all nodes must be consistent. @@ -212,7 +206,7 @@ vLLM-MindSpore Plugin can deploy online inference using the OpenAI API protocol. ```bash # Service launch parameter explanation vllm-mindspore serve - --model=[Model Config/Weights Path] + [Model Tag: Config/Weights Path] --quantization [Source of weight quantification] # golden-stick/ascend are optional, respectively indicating that the quantified weights come from the golden-stick or modelslim quantification tools --trust-remote-code # Use locally downloaded model files --max-num-seqs [Maximum Batch Size] @@ -227,10 +221,10 @@ Execution example: ```bash # Master node: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --quantization ascend --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --quantization ascend --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray ``` -In tensor parallel scenarios, the `--tensor-parallel-size` parameter overrides the `model_parallel` configuration in the model YAML file. User can also set the local model path by `--model` argument. +In tensor parallel scenarios, the `--tensor-parallel-size` parameter overrides the `model_parallel` configuration in the model YAML file. User can pass the local model path as the model tag. #### Sending Requests @@ -279,9 +273,9 @@ Environment variable descriptions: `vllm-mindspore` can deploy online inference using the OpenAI API protocol. 
Below is the workflow for launching the service: ```bash -# Parameter explanations for service launch +# Parameter explanations for service launch vllm-mindspore serve - --model=[Model Config/Weights Path] + [Model Tag: Config/Weights Path] --quantization [Source of weight quantification] # golden-stick/ascend are optional, respectively indicating that the quantified weights come from the golden-stick or modelslim quantification tools --trust-remote-code # Use locally downloaded model files --max-num-seqs [Maximum Batch Size] @@ -302,14 +296,14 @@ vllm-mindspore serve `data-parallel-size` and `tensor-parallel-size` specify the parallel policies for the attn and ffn-dense parts, and `expert_parallel` specifies the parallel policies for the routing experts in the MOE part. And it must satisfy that `data-parallel-size * tensor-parallel-size` is divisible by `expert_parallel`. -User can also set the local model path by `--model` argument. The following is an execution example: +User can also set the local model path as model tag. The following is an execution example: ```bash # Master node: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --quantization ascend --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --quantization ascend --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' # Worker node: -vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --quantization ascend --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --headless --quantization ascend --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' ``` #### Sending Requests diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 96cc8815155e55269060ae3d0f41945a063f8762..c66ea062c8940a4069ec550b761d11dda098ef76 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ 
b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -128,13 +128,11 @@ For [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct), the followi ```bash #set environment variables export VLLM_MS_MODEL_BACKEND=MindFormers # Use MindSpore TransFormers as the model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model YAML file. ``` Here is an explanation of these environment variables: - `VLLM_MS_MODEL_BACKEND`: The model backend. Currently supported models and backends are listed in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). -- `MINDFORMERS_MODEL_CONFIG`: Model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5). For Qwen2.5-32B, the YAML file is [predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml). Users can check memory usage with `npu-smi info` and set the NPU cards for inference using the following example (assuming cards 4,5,6,7 are used): @@ -153,10 +151,10 @@ Use the model `Qwen/Qwen2.5-32B-Instruct` and start the vLLM service with the fo ```bash export TENSOR_PARALLEL_SIZE=4 export MAX_MODEL_LEN=1024 -python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-32B-Instruct" --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN +vllm-mindspore serve Qwen/Qwen2.5-32B-Instruct --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN ``` -Here, `TENSOR_PARALLEL_SIZE` specifies the number of NPU cards, and `MAX_MODEL_LEN` sets the maximum output token length. User can also set the local model path by `--model` argument. +Here, `TENSOR_PARALLEL_SIZE` specifies the number of NPU cards, and `MAX_MODEL_LEN` sets the maximum output token length. User can also set the local model path as model tag. If the service starts successfully, similar output will be obtained: diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index 56aefd5c88c99dc4009e693871face9632ba01c0..906a501be56fab89ffaaa986855ff726c3270f03 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -128,19 +128,16 @@ For [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), the following ```bash #set environment variables export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore TransFormers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` Here is an explanation of these variables: - `VLLM_MS_MODEL_BACKEND`: The model backend. Currently supported models and backends are listed in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). -- `MINDFORMERS_MODEL_CONFIG`: Model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5). 
For Qwen2.5-7B, the YAML file is [predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml). User can check memory usage with `npu-smi info` and set the compute card for inference using: ```bash -export NPU_VISIBLE_DEVICES=0 -export ASCEND_RT_VISIBLE_DEVICES=$NPU_VISIBLE_DEVICES +export ASCEND_RT_VISIBLE_DEVICES=0 ``` ## Offline Inference @@ -189,10 +186,10 @@ vLLM-MindSpore Plugin supports online inference deployment with the OpenAI API p Use the model `Qwen/Qwen2.5-7B-Instruct` and start the vLLM service with the following command: ```bash -python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" +vllm-mindspore serve Qwen/Qwen2.5-7B-Instruct ``` -User can also set the local model path by `--model` argument. If the service starts successfully, similar output will be obtained: +User can also set the local model path as model tag. If the service starts successfully, similar output will be obtained: ```text INFO: Started server process [6363] @@ -214,7 +211,7 @@ Use the following command to send a request, where `prompt` is the model input: curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 15, "temperature": 0}' ``` -User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. +User needs to ensure that the `"model"` field matches the model tag in the service startup, and the request can successfully match the model. If the request is processed successfully, the following inference result will be returned: diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md b/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md index b0ac4423ea87a405564af064e752c07f01b8d9d6..6b41bb4cf5b9859f6f0eafb3cb58cd5299f2a1e4 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md @@ -4,8 +4,7 @@ | Environment Variable | Function | Type | Values | Description | |----------------------|----------|------|--------|-------------| -| `VLLM_MS_MODEL_BACKEND` | Used to specify the model backend. If this variable is not set, the backend will be automatically selected in the priority order: MindFormers > Native > MindONE; if set, the specified backend will be used. | String | `MindFormers`: Model backend is MindSpore Transformers. `Native`: Model backend is Native. `MindONE`: Model backend is MindONE. | The native model backend currently supports the Qwen2.5, Qwen2.5VL, Qwen3 and Llama series; the MindSpore Transformers backend supports Qwen, DeepSeek and TeleChat models. When using MindSpore Transformers, set the environment variable: `export PYTHONPATH=/path/to/mindformers/:$PYTHONPATH`. | -| `MINDFORMERS_MODEL_CONFIG` | Configuration file for MindSpore Transformers models. Required for Qwen2.5 series or DeepSeek series models. | String | Path to the model configuration file | **This environment variable will be removed in future versions.** Example: `export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml`. | +| `VLLM_MS_MODEL_BACKEND` | Used to specify the model backend. 
If this variable is not set, the backend will be automatically selected in the priority order: MindFormers > Native > MindONE; if set, the specified backend will be used. | String | `MindFormers`: Model backend is MindSpore Transformers. `Native`: Model backend is Native. `MindONE`: Model backend is MindONE. | The native model backend currently supports the Qwen2.5, Qwen2.5VL, Qwen3 and Llama series; the MindSpore Transformers backend supports Qwen, DeepSeek and TeleChat models. | | `GLOO_SOCKET_IFNAME` | Specifies the network interface name for inter-machine communication using gloo. | String | Interface name (e.g., `enp189s0f0`). | Used in multi-machine scenarios. The interface name can be found via `ifconfig` by matching the IP address. | | `TP_SOCKET_IFNAME` | Specifies the network interface name for inter-machine communication using TP. | String | Interface name (e.g., `enp189s0f0`). | Used in multi-machine scenarios. The interface name can be found via `ifconfig` by matching the IP address. | | `HCCL_SOCKET_IFNAME` | Specifies the network interface name for inter-machine communication using HCCL. | String | Interface name (e.g., `enp189s0f0`). | Used in multi-machine scenarios. The interface name can be found via `ifconfig` by matching the IP address. | diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md index b5fb97401a5008346e66cf72fb905aeb2bd4379a..a3de87028fb0331f790c4a1dad1d04fcf75674ed 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md @@ -10,13 +10,12 @@ For single-card inference, we take [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen ```bash export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` then start the online inference with the following command: ```bash -vllm-mindspore serve Qwen/Qwen2.5-7B-Instruct --device auto --disable-log-requests +vllm-mindspore serve Qwen/Qwen2.5-7B-Instruct --device auto --disable-log-requests ``` For multi-card inference, we take [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) as an example. You can prepare the environment by following the guide [Multi-Card Inference (Qwen2.5-32B)](../../../getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md#online-inference), then start the online inference with the following command: @@ -24,7 +23,7 @@ For multi-card inference, we take [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen ```bash export TENSOR_PARALLEL_SIZE=4 export MAX_MODEL_LEN=1024 -python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-32B-Instruct" --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN +vllm-mindspore serve Qwen/Qwen2.5-32B-Instruct --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN ``` If the service is successfully started, the following inference result will be returned: @@ -103,7 +102,6 @@ For offline performance benchmark, take [Qwen2.5-7B](https://huggingface.co/Qwen ```bash export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. 
-export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` Clone the vLLM repository and import the vLLM-MindSpore plugin to reuse the benchmark tools: diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md index b3624353142ed96abf156a9891a275851bac7eec..5499334eae14af19bc649b9090ae63199e690dd1 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md @@ -17,7 +17,7 @@ After setting the variable, run the following command to launch the vLLM-MindSpo ```bash export TENSOR_PARALLEL_SIZE=4 export MAX_MODEL_LEN=1024 -python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-32B-Instruct" --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN +vllm-mindspore serve Qwen/Qwen2.5-32B-Instruct --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN ``` If the service starts successfully, you will see output similar to the following, indicating that the `start_profile` and `stop_profile` requests are being monitored: diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md index 22b668f57504d185ae21b8824968b7a68c4b1899..3e8c9afa944d1bc569ae8a70211cafd160d59595 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md @@ -28,11 +28,8 @@ Refer to the [Installation Guide](../../../getting_started/installation/installa ```bash export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` -For the yaml for the DeepSeek-R1 W8A8 quantization inference, user can use [predict_deepseek_r1_671b_w8a8.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml). 
- Once ready, use the following Python code for offline inference: ```python diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md index 02158ab389b166a4b298e38ab517e06382a4fb3d..5765955861b6970ee61a174b3ad290c875ceff22 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md @@ -15,6 +15,5 @@ | QwQ-32B | Testing | [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) | | Llama3.1 | Testing | [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct), [Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) | | Llama3.2 | Testing | [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | -| DeepSeek-V2 | Testing | [DeepSeek-V2](https://huggingface.co/deepseek-ai/DeepSeek-V2) | Note: refer to [Environment Variable List](../../environment_variables/environment_variables.md), and set the model backend by environment variable `VLLM_MS_MODEL_BACKEND`. diff --git a/docs/vllm_mindspore/docs/source_zh_cn/faqs/faqs.md b/docs/vllm_mindspore/docs/source_zh_cn/faqs/faqs.md index dc1095c451fdf2f2daebabc278ae1934fa8ff546..35ab04969c653fb95d60dd14881c033a1de4ef3b 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/faqs/faqs.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/faqs/faqs.md @@ -27,29 +27,6 @@ ## 部署相关问题 -### 离线或在线推理时,报模型无法加载 - -- 错误关键信息: - - ```text - raise ValueError(f"{config.load_checkpoint} is not a valid path to load checkpoint ") - ``` - -- 解决思路: - 1. 检查模型路径是否存在且合法; - 2. 若模型路径存在,且其中的模型文件为`safetensors`格式,则需要确认YAML文件中,是否已含有`load_ckpt_format: "safetensors"`字段; - 1. 打印模型所使用的YAML文件路径: - - ```bash - echo $MINDFORMERS_MODEL_CONFIG - ``` - - 2. 查看该YAML文件,若不存在`load_ckpt_format`字段,则添加该字段: - - ```text - load_ckpt_format: "safetensors" - ``` - ### 拉起在线推理时,报`aclnnNonzeroV2`相关错误 - 错误关键信息: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index fad57bcffca35892649f5f574cd93b9cee1b53b1..425933888b3120eb81d77b66af30ce60c23bee8a 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -141,12 +141,6 @@ vLLM-MindSpore插件有以下两种安装方式。**vLLM-MindSpore插件快速 pip install . ``` - 上述命令执行完毕之后,将在`vllm-mindspore/install_depend_pkgs`目录下生成`mindformers`文件夹,将其加入到环境变量中: - - ```bash - export PYTHONPATH=$MF_PATH:$PYTHONPATH - ``` - - **vLLM-MindSpore插件手动安装** 若用户对依赖的vLLM、MindSpore、Golden Stick、MSAdapter等组件有自定义修改的需求,可以在本地准备好修改后的安装包,按照特定的顺序进行手动安装。安装顺序要求如下: @@ -169,11 +163,10 @@ vLLM-MindSpore插件有以下两种安装方式。**vLLM-MindSpore插件快速 pip install /path/to/mindspore-*.whl ``` - 4. 引入MindSpore Transformers仓库,加入到`PYTHONPATH`中 + 4. 安装MindSpore Transformers ```bash - git clone https://gitee.com/mindspore/mindformers.git - export PYTHONPATH=$MF_PATH:$PYTHONPATH + pip install /path/to/mindformers-*.whl ``` 5. 安装Golden Stick @@ -204,7 +197,6 @@ vLLM-MindSpore插件有以下两种安装方式。**vLLM-MindSpore插件快速 ```bash export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. 
-export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` 关于环境变量的具体含义,可参考[这里](../quick_start/quick_start.md#设置环境变量)。 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md index aaea0cbcf30805114896cbce394587004e886d41..502796a55d8cfdf5afe9cb07417b7773282f65e3 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md @@ -132,19 +132,11 @@ git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct ```bash export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` 以下是对上述环境变量的解释: -- `VLLM_MS_MODEL_BACKEND`:所运行的模型后端。目前vLLM-MindSpore插件所支持的模型与模型后端,可在[模型支持列表](../../user_guide/supported_models/models_list/models_list.md)中进行查询; -- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的YAML文件。以Qwen2.5-7B为例,其YAML文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml)。 - -另外,用户需要确保MindSpore Transformers已安装。用户可通过以下方式引入MindSpore Transformers: - -```bash -export PYTHONPATH=/path/to/mindformers:$PYTHONPATH -``` +- `VLLM_MS_MODEL_BACKEND`:所运行的模型后端。目前vLLM-MindSpore插件所支持的模型与模型后端,可在[模型支持列表](../../user_guide/supported_models/models_list/models_list.md)中进行查询。 ### 离线推理 @@ -193,10 +185,10 @@ vLLM-MindSpore插件可使用OpenAI的API协议,进行在线推理部署。以 使用模型`Qwen/Qwen2.5-7B-Instruct`,执行如下命令启动vLLM服务: ```bash -python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" +vllm-mindspore serve Qwen/Qwen2.5-7B-Instruct ``` -用户可以通过`--model`参数,指定模型保存的本地路径。若服务成功启动,则可以获得类似的执行结果: +用户可以通过指定模型保存的本地路径作为模型标签。若服务成功启动,则可以获得类似的执行结果: ```text INFO: Started server process [6363] @@ -218,7 +210,7 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -其中,用户需确认`"model"`字段与启动服务中的`--model`一致,请求才能成功匹配到模型。若请求处理成功,将获得以下推理结果: +其中,用户需确认`"model"`字段与启动服务中的模型标签一致,请求才能成功匹配到模型。若请求处理成功,将获得以下推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 0c5101341220475c1677b3446a0b67bb921aa97c..7010bbfb27c35e4541b4e597ba261db28725240f 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -143,7 +143,6 @@ export HCCL_OP_EXPANSION_MODE=AIV export MS_ALLOC_CONF=enable_vmm:true export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 export VLLM_MS_MODEL_BACKEND=MindFormers -export PYTHONPATH=/path/to/mindformers:$PYTHONPATH export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python export GLOO_SOCKET_IFNAME=enp189s0f0 export HCCL_SOCKET_IFNAME=enp189s0f0 @@ -157,7 +156,6 @@ export TP_SOCKET_IFNAME=enp189s0f0 - 
`MS_ALLOC_CONF`:设置内存策略。可参考[MindSpore官网文档](https://www.mindspore.cn/docs/zh-CN/master/api_python/env_var_list.html)。 - `ASCEND_RT_VISIBLE_DEVICES`:配置每个节点可用的设备ID。用户可使用`npu-smi info`命令进行查询。 - `VLLM_MS_MODEL_BACKEND`:所运行的模型后端。目前vLLM-MindSpore插件所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 -- `PYTHONPATH`:将MindSpore Transformers路径,加入到`PYTHONPATH`。当`VLLM_MS_MODEL_BACKEND`设置为`MindFormers`需要配置。 - `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION`:当版本不兼容时使用。 - `GLOO_SOCKET_IFNAME`:GLOO后端端口,用于多机之间使用gloo通信时的网口名称。可通过`ifconfig`查找IP对应网卡的网卡名。 - `HCCL_SOCKET_IFNAME`:配置HCCL端口,用于多机之间使用HCCL通信时的网口名称。可通过`ifconfig`查找IP对应网卡的网卡名。 @@ -172,7 +170,7 @@ vLLM-MindSpore插件可使用OpenAI的API协议,部署在线推理。以下是 ```bash # 启动配置参数说明 vllm-mindspore serve - --model=[模型Config/权重路径] + [模型标签:模型Config/权重路径] --trust-remote-code # 使用本地下载的model文件 --max-num-seqs [最大Batch数] --max-model-len [模型上下文长度] @@ -191,14 +189,14 @@ vllm-mindspore serve --addition-config # 并行功能与额外配置 ``` -- 用户可以通过`--model`参数,指定模型保存的本地路径; +- 用户可以通过指定模型保存的本地路径为模型标签; - 用户可以通过`--addition-config`参数,配置并行与其他功能。 以下为Ray启动命令: ```bash # 主节点: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --addition-config '{"expert_parallel": 4}' --data-parallel-backend=ray +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --addition-config '{"expert_parallel": 4}' --data-parallel-backend=ray ``` 关于multiprocess启动命令,可以参考[multiprocess启动方式](../../../user_guide/supported_features/parallel/parallel.md#启动服务)。 @@ -211,7 +209,7 @@ vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remot curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 120, "temperature": 0}' ``` -用户需确认`"model"`字段与启动服务中的`--model`一致,请求才能成功匹配到模型。 +用户需确认`"model"`字段与启动服务中的模型标签一致,请求才能成功匹配到模型。 ## 附录 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 72d474c1c260794fe9a3f8a52172cbc5ea1981de..9a4f28ba706c0e9a2921e3a90679de7ae9a7cbd9 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -128,13 +128,11 @@ git clone https://huggingface.co/Qwen/Qwen2.5-32B-Instruct ```bash #set environment variables export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore TransFormers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. 
``` 以下是对上述环境变量的解释: - `VLLM_MS_MODEL_BACKEND`:所运行的模型后端。目前vLLM-MindSpore插件所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 -- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的YAML文件。以Qwen2.5-32B为例,其YAML文件为[predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml)。 用户可通过`npu-smi info`查看显存占用情况,并可以使用如下环境变量,设置用于推理的计算卡。以下例子假设用户使用4、5、6、7卡进行推理: @@ -153,12 +151,12 @@ vLLM-MindSpore插件可使用OpenAI的API协议,部署在线推理。以下以 ```bash export TENSOR_PARALLEL_SIZE=4 export MAX_MODEL_LEN=1024 -python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-32B-Instruct" --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN +vllm-mindspore serve Qwen/Qwen2.5-32B-Instruct --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN ``` 其中,`TENSOR_PARALLEL_SIZE`为用户指定的卡数,`MAX_MODEL_LEN`为模型最大输出token数。 -用户可以通过`--model`参数,指定模型保存的本地路径。若服务成功启动,则可以获得类似的执行结果: +用户可以通过指定模型保存的本地路径作为模型标签。若服务成功启动,则可以获得类似的执行结果: ```text INFO: Started server process [6363] @@ -180,7 +178,7 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -其中,用户需确认`"model"`字段与启动服务中的`--model`一致,请求才能成功匹配到模型。若请求处理成功,将获得以下推理结果: +其中,用户需确认`"model"`字段与启动服务中的模型标签一致,请求才能成功匹配到模型。若请求处理成功,将获得以下推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index c629a9302617862e9e1d4c567228d00c642fdc1e..eb1d75640c69452dd49789be4e081d687b7df66f 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -128,19 +128,16 @@ git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct ```bash #set environment variables export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore TransFormers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. 
``` 以下是对上述环境变量的解释: -- `VLLM_MS_MODEL_BACKEND`:所运行的模型后端。目前vLLM-MindSpore插件所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询; -- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的YAML文件。以Qwen2.5-7B为例,其YAML文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml)。 +- `VLLM_MS_MODEL_BACKEND`:所运行的模型后端。目前vLLM-MindSpore插件所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 用户可通过`npu-smi info`查看显存占用情况,并可以使用如下环境变量,设置用于推理的计算卡: ```bash -export NPU_VISIBLE_DEVICES=0 -export ASCEND_RT_VISIBLE_DEVICES=$NPU_VISIBLE_DEVICES +export ASCEND_RT_VISIBLE_DEVICES=0 ``` ## 离线推理 @@ -190,10 +187,10 @@ vLLM-MindSpore插件可使用OpenAI的API协议,部署在线推理。以下以 使用如下命令启动vLLM服务: ```bash -python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" +vllm-mindspore serve Qwen/Qwen2.5-7B-Instruct ``` -用户可以通过`--model`参数,指定模型保存的本地路径。若服务成功启动,则可以获得类似的执行结果: +用户可以通过指定模型保存的本地路径作为模型标签。若服务成功启动,则可以获得类似的执行结果: ```text INFO: Started server process [6363] diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md index a08a042409d505152e700f7f44e107bd54acdd75..b5d0a2b93418600f9d0519f42937b67df93fa4d8 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md @@ -4,8 +4,7 @@ | 环境变量 | 功能 | 类型 | 取值 | 说明 | | ------ | ------- | ------ | ------ | ------ | -| `VLLM_MS_MODEL_BACKEND` | 用于指定模型后端。如果不配置变量,会按照 MindFormers > 原生模型 > MindONE 的优先级自动寻找支持的后端。配置之后则按指定后端执行。 | String | `MindFormers`: 模型后端为MindSpore Transformers。 `Native`: 模型后端为原生模型。 `MindONE`: 模型后端为MindONE | 原生模型后端当前支持Qwen2.5、Qwen2.5VL、Qwen3、Llama系列;MindSpore Transformers模型后端支持Qwen系列、DeepSeek、TeleChat系列模型,使用时需配置环境变量:`export PYTHONPATH=/path/to/mindformers/:$PYTHONPATH`。 | -| `MINDFORMERS_MODEL_CONFIG` | MindSpore Transformers模型的配置文件。使用Qwen2.5系列、DeepSeek系列模型时,需要配置文件路径。 | String | 模型配置文件路径。 | **该环境变量在后续版本会被移除。** 样例:`export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml`。 | +| `VLLM_MS_MODEL_BACKEND` | 用于指定模型后端。如果不配置变量,会按照 MindFormers > 原生模型 > MindONE 的优先级自动寻找支持的后端。配置之后则按指定后端执行。 | String | `MindFormers`: 模型后端为MindSpore Transformers。 `Native`: 模型后端为原生模型。 `MindONE`: 模型后端为MindONE | 原生模型后端当前支持Qwen2.5、Qwen2.5VL、Qwen3、Llama系列;MindSpore Transformers模型后端支持Qwen系列、DeepSeek、TeleChat系列模型。 | | `GLOO_SOCKET_IFNAME` | 用于多机之间使用gloo通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找IP对应网卡的网卡名。 | | `TP_SOCKET_IFNAME` | 用于多机之间使用TP通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找IP对应网卡的网卡名。 | | `HCCL_SOCKET_IFNAME` | 用于多机之间使用HCCL通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找IP对应网卡的网卡名。 | diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md index ac99837d47c5ddd5673d9636144d387309b65403..00de8a17ec5261a1f451d0d83306bb0c2f4f62d5 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md +++ 
b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md @@ -10,7 +10,6 @@ vLLM-MindSpore插件的性能测试能力,继承自vLLM所提供的性能测 ```bash export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` 使用以下命令启动在线推理: @@ -24,7 +23,7 @@ vllm-mindspore serve Qwen/Qwen2.5-7B-Instruct --device auto --disable-log-reques ```bash export TENSOR_PARALLEL_SIZE=4 export MAX_MODEL_LEN=1024 -python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-32B-Instruct" --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN +vllm-mindspore serve Qwen/Qwen2.5-32B-Instruct --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN ``` 当返回以下日志时,则服务已成功拉起: @@ -103,7 +102,6 @@ P99 ITL (ms): .... ```bash export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` 并拉取vLLM代码仓库,导入vLLM-MindSpore插件,复用其中的benchmark功能: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/parallel/parallel.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/parallel/parallel.md index 12dee2759bd7e3b6434cd5687d2ca1cb55107fdd..794304e170897e75ef97ca169ad110e3f1298de9 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/parallel/parallel.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/parallel/parallel.md @@ -21,7 +21,7 @@ vLLM-MindSpore插件支持张量并行(TP)、数据并行(DP)、专家 ```bash TENSOR_PARALLEL_SIZE=4 # TP 并行数 -vllm-mindspore serve --model=/path/to/Qwen2.5/model --trust-remote-code --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} +vllm-mindspore serve /path/to/Qwen2.5/model --trust-remote-code --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} ``` ### 多机示例 @@ -34,7 +34,7 @@ vllm-mindspore serve --model=/path/to/Qwen2.5/model --trust-remote-code --tensor # 主节点: TENSOR_PARALLEL_SIZE=4 # TP 并行数 -vllm-mindspore serve --model=/path/to/Qwen2.5/model --trust-remote-code --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} +vllm-mindspore serve /path/to/Qwen2.5/model --trust-remote-code --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} ``` ## 数据并行 @@ -65,7 +65,7 @@ Ray在多机场景中可以简化启动,是推荐的启动方式,请参考[R DATA_PARALLEL_SIZE=4 # DP 并行数 DATA_PARALLEL_SIZE_LOCAL=2 # 当前服务节点中的DP数,所有节点求和等于`--data-parallel-size` -vllm-mindspore serve --model=/path/to/Qwen2.5/model --trust-remote-code --data-parallel-size ${DATA_PARALLEL_SIZE} --data-parallel-size-local ${DATA_PARALLEL_SIZE_LOCAL} --data-parallel-backend=ray +vllm-mindspore serve /path/to/Qwen2.5/model --trust-remote-code --data-parallel-size ${DATA_PARALLEL_SIZE} --data-parallel-size-local ${DATA_PARALLEL_SIZE_LOCAL} --data-parallel-backend=ray ``` #### multiprocess启动 @@ -74,10 +74,10 @@ vllm-mindspore serve --model=/path/to/Qwen2.5/model --trust-remote-code --data-p ```bash # 主节点: -vllm-mindspore serve --model=/path/to/Qwen2.5/model --trust-remote-code --data-parallel-size ${DATA_PARALLEL_SIZE} --data-parallel-size-local ${DATA_PARALLEL_SIZE_LOCAL} +vllm-mindspore serve /path/to/Qwen2.5/model --trust-remote-code --data-parallel-size ${DATA_PARALLEL_SIZE} --data-parallel-size-local ${DATA_PARALLEL_SIZE_LOCAL} # 从节点: -vllm-mindspore serve --headless --model=/path/to/Qwen2.5/model --trust-remote-code --data-parallel-size 
${DATA_PARALLEL_SIZE} --data-parallel-size-local ${DATA_PARALLEL_SIZE_LOCAL} +vllm-mindspore serve /path/to/Qwen2.5/model --headless --trust-remote-code --data-parallel-size ${DATA_PARALLEL_SIZE} --data-parallel-size-local ${DATA_PARALLEL_SIZE_LOCAL} ``` ## 专家并行 @@ -107,7 +107,7 @@ vllm-mindspore serve --headless --model=/path/to/Qwen2.5/model --trust-remote-co 以下命令为单机八卡,启动Qwen-3 MOE的专家并行示例: ```bash -vllm-mindspore serve --model=/path/to/Qwen3-MOE --trust-remote-code --enable-expert-parallel --addition-config '{"expert_parallel": 8} +vllm-mindspore serve /path/to/Qwen3-MOE --trust-remote-code --enable-expert-parallel --addition-config '{"expert_parallel": 8}' ``` #### 多机示例 @@ -115,7 +115,7 @@ vllm-mindspore serve --model=/path/to/Qwen3-MOE --trust-remote-code --enable-exp 多机专家并行依赖Ray进行启动。请参考[Ray多节点集群管理](#ray多节点集群管理)进行Ray环境配置。以下命令为双机四卡,Ray启动Qwen-3 MOE的专家并行示例: ```bash -vllm-mindspore serve --model=/path/to/Qwen3-MOE --trust-remote-code --enable-expert-parallel --addition-config '{"expert_parallel": 8} --data-parallel-backend=ray +vllm-mindspore serve /path/to/Qwen3-MOE --trust-remote-code --enable-expert-parallel --addition-config '{"expert_parallel": 8}' --data-parallel-backend=ray ``` ## 混合并行 @@ -129,7 +129,7 @@ vllm-mindspore serve --model=/path/to/Qwen3-MOE --trust-remote-code --enable-exp 可根据上述介绍,分别将三种并行策略的配置叠加,在启动命令`vllm-mindspore serve`中使能。多机混合并行依赖Ray进行启动。请参考[Ray多节点集群管理](#ray多节点集群管理)进行Ray环境配置。其叠加后混合并行的Ray启动命令如下: ```bash -vllm-mindspore serve --model="/path/to/DeepSeek-R1" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --addition-config '{"expert_parallel": 4}' --data-parallel-backend=ray +vllm-mindspore serve /path/to/DeepSeek-R1 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --addition-config '{"expert_parallel": 4}' --data-parallel-backend=ray ``` ## 附录 @@ -264,7 +264,7 @@ vLLM-MindSpore插件可使用OpenAI的API协议部署在线推理。以下是在 ```bash # 启动配置参数说明 vllm-mindspore serve - --model=[模型Config/权重路径] + [模型标签:模型Config/权重路径] --trust-remote-code # 使用本地下载的model文件 --max-num-seqs [最大Batch数] --max-model-len [模型上下文长度] @@ -283,7 +283,7 @@ vllm-mindspore serve --addition-config # 并行功能与额外配置 ``` -- 用户可以通过`--model`参数,指定模型保存的本地路径。 +- 用户可以通过指定模型保存的本地路径为模型标签; - 用户可以通过`--addition-config`参数,配置并行与其他功能。其中并行可进行如下配置,对应的是DP4-EP4-TP4场景: @@ -297,17 +297,17 @@ vllm-mindspore serve ```bash # 主节点: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel --addition-config '{"data_parallel": 4, "model_parallel": 4, "expert_parallel": 4}' +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 
--enable-expert-parallel --addition-config '{"data_parallel": 4, "model_parallel": 4, "expert_parallel": 4}' # 从节点: -vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel --addition-config '{"data_parallel": 4, "model_parallel": 4, "expert_parallel": 4}' +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --headless --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel --addition-config '{"data_parallel": 4, "model_parallel": 4, "expert_parallel": 4}' ``` **Ray启动方式** ```bash # 主节点: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --addition-config '{"data_parallel": 4, "model_parallel": 4, "expert_parallel": 4}' --data-parallel-backend=ray +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --addition-config '{"data_parallel": 4, "model_parallel": 4, "expert_parallel": 4}' --data-parallel-backend=ray ``` #### 发送请求 @@ -318,4 +318,4 @@ vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remot curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am, "max_tokens": 120, "temperature": 0}' ``` -用户需确认`"model"`字段与启动服务中的`--model`一致,请求才能成功匹配到模型。 +用户需确认`"model"`字段与启动服务中的模型标签一致,请求才能成功匹配到模型。 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md index 06421892a7a467ce5c43467dcb52279b31db1539..99b4e673fa80292b6d1fba9b4e09611470b26771 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md @@ -17,7 +17,7 @@ export VLLM_TORCH_PROFILER_DIR=/path/to/save/vllm_profile ```bash export TENSOR_PARALLEL_SIZE=4 export MAX_MODEL_LEN=1024 -python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-32B-Instruct" --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN +vllm-mindspore serve Qwen/Qwen2.5-32B-Instruct --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN ``` 若服务成功拉起,则可以获得类似的执行结果,且可以从中看到监听`start_profile`和`stop_profile`请求: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md 
b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md index 4398b3e67b9a1105bde200e9ea7684baa32924d1..42428355a13d46c3df016ea73dfab646011df456 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md @@ -28,11 +28,8 @@ ```bash export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` -关于DeepSeek-R1 W8A8量化推理的YAML文件,可以使用[predict_deepseek_r1_671b_w8a8.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml)。 - 环境准备完成后,用户可以使用如下Python代码,进行离线推理服务: ```python diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md index fd85db6ea63dbe2bfcf0ba3bf09601a8c0d67ac1..4a2281865b4eff200a33884389bb3b27a319aa93 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md @@ -15,6 +15,5 @@ | QwQ-32B | 测试中 | [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) | | Llama3.1 | 测试中 | [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)、[Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct)、[Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) | | Llama3.2 | 测试中 | [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)、[Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | -| DeepSeek-V2 | 测试中 | [DeepSeek-V2](https://huggingface.co/deepseek-ai/DeepSeek-V2) | 注:用户可参考[环境变量章节](../../environment_variables/environment_variables.md),通过环境变量`VLLM_MS_MODEL_BACKEND`,指定模型后端。