diff --git a/docs/vllm_mindspore/docs/source_en/faqs/faqs.md b/docs/vllm_mindspore/docs/source_en/faqs/faqs.md index 8e993fbaa379c44e8aa53fa866e5badf13e335bb..2f0b25d3a55a20b91a61e7420446030248e88d8a 100644 --- a/docs/vllm_mindspore/docs/source_en/faqs/faqs.md +++ b/docs/vllm_mindspore/docs/source_en/faqs/faqs.md @@ -27,29 +27,6 @@ ## Deployment-related Issues -### Model Fails to Load During Offline/Online Inference - -- Key error message: - - ```text - raise ValueError(f"{config.load_checkpoint} is not a valid path to load checkpoint ") - ``` - -- Solution: - 1. Check if the model path exists and is valid; - 2. If the model path exists and the model files are in `safetensors` format, confirm whether the YAML file contains the `load_ckpt_format: "safetensors"` field: - 1. Print the path of the YAML file used by the model: - - ```bash - echo $MINDFORMERS_MODEL_CONFIG - ``` - - 2. Check the YAML file. If the `load_ckpt_format` field is missing, add it: - - ```text - load_ckpt_format: "safetensors" - ``` - ### `aclnnNonzeroV2` Related Error When Starting Online Inference - Key error message: diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md index 30ed00c205a9599bdfcd4a82019da7a82834c071..30a0bd0e77d50147d419710e5b93629e185f65ec 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/installation/installation.md @@ -135,18 +135,6 @@ vLLM-MindSpore Plugin can be installed in the following two ways. **vLLM-MindSpo bash install_depend_pkgs.sh ``` - Compile and install vLLM-MindSpore Plugin: - - ```bash - pip install . - ``` - - After executing the above commands, `mindformers` folder will be generated in the `vllm-mindspore/install_depend_pkgs` directory. Add this folder to the environment variables: - - ```bash - export PYTHONPATH=$MF_PATH:$PYTHONPATH - ``` - - **vLLM-MindSpore Plugin Manual Installation** If users require custom modifications to dependent components such as vLLM, MindSpore, Golden Stick, or MSAdapter, they can prepare the modified installation packages locally and perform manual installation in a specific sequence. The installation sequence requirements are as follows: @@ -169,11 +157,10 @@ vLLM-MindSpore Plugin can be installed in the following two ways. **vLLM-MindSpo pip install /path/to/mindspore-*.whl ``` - 4. Clone the MindSpore Transformers repository and add it to `PYTHONPATH` + 4. Install MindSpore Transformers ```bash - git clone https://gitee.com/mindspore/mindformers.git - export PYTHONPATH=$MF_PATH:$PYTHONPATH + pip install /path/to/mindformers-*.whl ``` 5. Install Golden Stick @@ -204,7 +191,6 @@ User can verify the installation with a simple offline inference test. First, us ```bash export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` About environment variables above, user can also refer to [environment variables section](../quick_start/quick_start.md#setting-environment-variables) for more details. 
diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md index a5e23d9cf2a34bcaef5fa566adde00dc22185da1..3d445a1c54e903c155874682635434263c3615b5 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/quick_start/quick_start.md @@ -132,19 +132,11 @@ Before launching the model, user needs to set the following environment variable ```bash export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` Here is an explanation of these environment variables: - `VLLM_MS_MODEL_BACKEND`: The backend of the model to run. User could find supported models and backends for vLLM-MindSpore Plugin in the [Model Support List](../../user_guide/supported_models/models_list/models_list.md). -- `MINDFORMERS_MODEL_CONFIG`: The model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5). For Qwen2.5-7B, the YAML file is [predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml). - -Additionally, users need to ensure that MindSpore Transformers is installed. Users can introduce MindSpore Transformers through the following methods: - -```bash -export PYTHONPATH=/path/to/mindformers:$PYTHONPATH -``` ### Offline Inference @@ -193,10 +185,10 @@ vLLM-MindSpore Plugin supports online inference deployment with the OpenAI API p Use the model `Qwen/Qwen2.5-7B-Instruct` and start the vLLM service with the following command: ```bash -python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" +vllm-mindspore serve Qwen/Qwen2.5-7B-Instruct ``` -User can also set the local model path by `--model` argument. If the service starts successfully, similar output will be obtained: +User can also pass the local model path to `vllm-mindspore serve` as model tag. If the service starts successfully, similar output will be obtained: ```text INFO: Started server process [6363] @@ -218,7 +210,7 @@ Use the following command to send a request, where `prompt` is the model input: curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 15, "temperature": 0}' ``` -User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. +User needs to ensure that the `"model"` field matches the model tag in the service startup, and the request can successfully match the model. 
If the request is processed successfully, the following inference result will be returned: diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index fc4154fb44932c5ea64fbe86d45fb7a50fc4811c..c3f7933c329c4ee167145b604f167039074e6f4c 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -115,12 +115,6 @@ Environment variable descriptions: - `ASCEND_RT_VISIBLE_DEVICES`: Configure the available device IDs for each node. Use the `npu-smi info` command to check. - `VLLM_MS_MODEL_BACKEND`: The backend of the model to run. Currently supported models and backends for vLLM-MindSpore Plugin can be found in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). -Additionally, users need to ensure that MindSpore Transformers is installed. Users can introduce MindSpore Transformers through the following methods: - -```bash -export PYTHONPATH=/path/to/mindformers:$PYTHONPATH -``` - ### Starting Ray for Multi-Node Cluster Management On Ascend, the pyACL package must be installed to adapt Ray. Additionally, the CANN dependency versions on all nodes must be consistent. @@ -212,7 +206,7 @@ vLLM-MindSpore Plugin can deploy online inference using the OpenAI API protocol. ```bash # Service launch parameter explanation vllm-mindspore serve - --model=[Model Config/Weights Path] + [Model Tag: Config/Weights Path] --quantization [Source of weight quantification] # golden-stick/ascend are optional, respectively indicating that the quantified weights come from the golden-stick or modelslim quantification tools --trust-remote-code # Use locally downloaded model files --max-num-seqs [Maximum Batch Size] @@ -227,10 +221,10 @@ Execution example: ```bash # Master node: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --quantization ascend --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --quantization ascend --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 16 --distributed-executor-backend=ray ``` -In tensor parallel scenarios, the `--tensor-parallel-size` parameter overrides the `model_parallel` configuration in the model YAML file. User can also set the local model path by `--model` argument. +In tensor parallel scenarios, the `--tensor-parallel-size` parameter overrides the `model_parallel` configuration in the model YAML file. User can pass the local model path as the model tag. #### Sending Requests @@ -279,9 +273,9 @@ Environment variable descriptions: `vllm-mindspore` can deploy online inference using the OpenAI API protocol. 
Below is the workflow for launching the service: ```bash -# Parameter explanations for service launch +# Parameter explanations for service launch vllm-mindspore serve - --model=[Model Config/Weights Path] + [Model Tag: Config/Weights Path] --quantization [Source of weight quantification] # golden-stick/ascend are optional, respectively indicating that the quantified weights come from the golden-stick or modelslim quantification tools --trust-remote-code # Use locally downloaded model files --max-num-seqs [Maximum Batch Size] @@ -302,14 +296,14 @@ vllm-mindspore serve `data-parallel-size` and `tensor-parallel-size` specify the parallel policies for the attn and ffn-dense parts, and `expert_parallel` specifies the parallel policies for the routing experts in the MOE part. And it must satisfy that `data-parallel-size * tensor-parallel-size` is divisible by `expert_parallel`. -User can also set the local model path by `--model` argument. The following is an execution example: +User can also set the local model path as model tag. The following is an execution example: ```bash # Master node: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --quantization ascend --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --quantization ascend --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' # Worker node: -vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --quantization ascend --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --headless --quantization ascend --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel --additional-config '{"expert_parallel": 4}' ``` #### Sending Requests diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 96cc8815155e55269060ae3d0f41945a063f8762..c66ea062c8940a4069ec550b761d11dda098ef76 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ 
b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -128,13 +128,11 @@ For [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct), the followi ```bash #set environment variables export VLLM_MS_MODEL_BACKEND=MindFormers # Use MindSpore TransFormers as the model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model YAML file. ``` Here is an explanation of these environment variables: - `VLLM_MS_MODEL_BACKEND`: The model backend. Currently supported models and backends are listed in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). -- `MINDFORMERS_MODEL_CONFIG`: Model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5). For Qwen2.5-32B, the YAML file is [predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml). Users can check memory usage with `npu-smi info` and set the NPU cards for inference using the following example (assuming cards 4,5,6,7 are used): @@ -153,10 +151,10 @@ Use the model `Qwen/Qwen2.5-32B-Instruct` and start the vLLM service with the fo ```bash export TENSOR_PARALLEL_SIZE=4 export MAX_MODEL_LEN=1024 -python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-32B-Instruct" --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN +vllm-mindspore serve Qwen/Qwen2.5-32B-Instruct --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN ``` -Here, `TENSOR_PARALLEL_SIZE` specifies the number of NPU cards, and `MAX_MODEL_LEN` sets the maximum output token length. User can also set the local model path by `--model` argument. +Here, `TENSOR_PARALLEL_SIZE` specifies the number of NPU cards, and `MAX_MODEL_LEN` sets the maximum output token length. User can also set the local model path as model tag. If the service starts successfully, similar output will be obtained: diff --git a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index 56aefd5c88c99dc4009e693871face9632ba01c0..906a501be56fab89ffaaa986855ff726c3270f03 100644 --- a/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_en/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -128,19 +128,16 @@ For [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), the following ```bash #set environment variables export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore TransFormers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` Here is an explanation of these variables: - `VLLM_MS_MODEL_BACKEND`: The model backend. Currently supported models and backends are listed in the [Model Support List](../../../user_guide/supported_models/models_list/models_list.md). -- `MINDFORMERS_MODEL_CONFIG`: Model configuration file. User can find the corresponding YAML file in the [MindSpore Transformers repository](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5). 
For Qwen2.5-7B, the YAML file is [predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml). User can check memory usage with `npu-smi info` and set the compute card for inference using: ```bash -export NPU_VISIBLE_DEVICES=0 -export ASCEND_RT_VISIBLE_DEVICES=$NPU_VISIBLE_DEVICES +export ASCEND_RT_VISIBLE_DEVICES=0 ``` ## Offline Inference @@ -189,10 +186,10 @@ vLLM-MindSpore Plugin supports online inference deployment with the OpenAI API p Use the model `Qwen/Qwen2.5-7B-Instruct` and start the vLLM service with the following command: ```bash -python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" +vllm-mindspore serve Qwen/Qwen2.5-7B-Instruct ``` -User can also set the local model path by `--model` argument. If the service starts successfully, similar output will be obtained: +User can also set the local model path as model tag. If the service starts successfully, similar output will be obtained: ```text INFO: Started server process [6363] @@ -214,7 +211,7 @@ Use the following command to send a request, where `prompt` is the model input: curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 15, "temperature": 0}' ``` -User needs to ensure that the `"model"` field matches the `--model` in the service startup, and the request can successfully match the model. +User needs to ensure that the `"model"` field matches the model tag in the service startup, and the request can successfully match the model. If the request is processed successfully, the following inference result will be returned: diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md b/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md index b0ac4423ea87a405564af064e752c07f01b8d9d6..6b41bb4cf5b9859f6f0eafb3cb58cd5299f2a1e4 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/environment_variables/environment_variables.md @@ -4,8 +4,7 @@ | Environment Variable | Function | Type | Values | Description | |----------------------|----------|------|--------|-------------| -| `VLLM_MS_MODEL_BACKEND` | Used to specify the model backend. If this variable is not set, the backend will be automatically selected in the priority order: MindFormers > Native > MindONE; if set, the specified backend will be used. | String | `MindFormers`: Model backend is MindSpore Transformers. `Native`: Model backend is Native. `MindONE`: Model backend is MindONE. | The native model backend currently supports the Qwen2.5, Qwen2.5VL, Qwen3 and Llama series; the MindSpore Transformers backend supports Qwen, DeepSeek and TeleChat models. When using MindSpore Transformers, set the environment variable: `export PYTHONPATH=/path/to/mindformers/:$PYTHONPATH`. | -| `MINDFORMERS_MODEL_CONFIG` | Configuration file for MindSpore Transformers models. Required for Qwen2.5 series or DeepSeek series models. | String | Path to the model configuration file | **This environment variable will be removed in future versions.** Example: `export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml`. | +| `VLLM_MS_MODEL_BACKEND` | Used to specify the model backend. 
If this variable is not set, the backend will be automatically selected in the priority order: MindFormers > Native > MindONE; if set, the specified backend will be used. | String | `MindFormers`: Model backend is MindSpore Transformers. `Native`: Model backend is Native. `MindONE`: Model backend is MindONE. | The native model backend currently supports the Qwen2.5, Qwen2.5VL, Qwen3 and Llama series; the MindSpore Transformers backend supports Qwen, DeepSeek and TeleChat models. | | `GLOO_SOCKET_IFNAME` | Specifies the network interface name for inter-machine communication using gloo. | String | Interface name (e.g., `enp189s0f0`). | Used in multi-machine scenarios. The interface name can be found via `ifconfig` by matching the IP address. | | `TP_SOCKET_IFNAME` | Specifies the network interface name for inter-machine communication using TP. | String | Interface name (e.g., `enp189s0f0`). | Used in multi-machine scenarios. The interface name can be found via `ifconfig` by matching the IP address. | | `HCCL_SOCKET_IFNAME` | Specifies the network interface name for inter-machine communication using HCCL. | String | Interface name (e.g., `enp189s0f0`). | Used in multi-machine scenarios. The interface name can be found via `ifconfig` by matching the IP address. | diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md index b5fb97401a5008346e66cf72fb905aeb2bd4379a..a3de87028fb0331f790c4a1dad1d04fcf75674ed 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/benchmark/benchmark.md @@ -10,13 +10,12 @@ For single-card inference, we take [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen ```bash export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` then start the online inference with the following command: ```bash -vllm-mindspore serve Qwen/Qwen2.5-7B-Instruct --device auto --disable-log-requests +vllm-mindspore serve Qwen/Qwen2.5-7B-Instruct --device auto --disable-log-requests ``` For multi-card inference, we take [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) as an example. You can prepare the environment by following the guide [Multi-Card Inference (Qwen2.5-32B)](../../../getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md#online-inference), then start the online inference with the following command: @@ -24,7 +23,7 @@ For multi-card inference, we take [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen ```bash export TENSOR_PARALLEL_SIZE=4 export MAX_MODEL_LEN=1024 -python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-32B-Instruct" --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN +vllm-mindspore serve Qwen/Qwen2.5-32B-Instruct --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN ``` If the service is successfully started, the following inference result will be returned: @@ -103,7 +102,6 @@ For offline performance benchmark, take [Qwen2.5-7B](https://huggingface.co/Qwen ```bash export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. 
-export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` Clone the vLLM repository and import the vLLM-MindSpore plugin to reuse the benchmark tools: diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md index b3624353142ed96abf156a9891a275851bac7eec..5499334eae14af19bc649b9090ae63199e690dd1 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/profiling/profiling.md @@ -17,7 +17,7 @@ After setting the variable, run the following command to launch the vLLM-MindSpo ```bash export TENSOR_PARALLEL_SIZE=4 export MAX_MODEL_LEN=1024 -python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-32B-Instruct" --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN +vllm-mindspore serve Qwen/Qwen2.5-32B-Instruct --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN ``` If the service starts successfully, you will see output similar to the following, indicating that the `start_profile` and `stop_profile` requests are being monitored: diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md index 22b668f57504d185ae21b8824968b7a68c4b1899..3e8c9afa944d1bc569ae8a70211cafd160d59595 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_features/quantization/quantization.md @@ -28,11 +28,8 @@ Refer to the [Installation Guide](../../../getting_started/installation/installa ```bash export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` -For the yaml for the DeepSeek-R1 W8A8 quantization inference, user can use [predict_deepseek_r1_671b_w8a8.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml). 
- Once ready, use the following Python code for offline inference: ```python diff --git a/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md b/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md index 02158ab389b166a4b298e38ab517e06382a4fb3d..5765955861b6970ee61a174b3ad290c875ceff22 100644 --- a/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md +++ b/docs/vllm_mindspore/docs/source_en/user_guide/supported_models/models_list/models_list.md @@ -15,6 +15,5 @@ | QwQ-32B | Testing | [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) | | Llama3.1 | Testing | [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct), [Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) | | Llama3.2 | Testing | [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | -| DeepSeek-V2 | Testing | [DeepSeek-V2](https://huggingface.co/deepseek-ai/DeepSeek-V2) | Note: refer to [Environment Variable List](../../environment_variables/environment_variables.md), and set the model backend by environment variable `VLLM_MS_MODEL_BACKEND`. diff --git a/docs/vllm_mindspore/docs/source_zh_cn/faqs/faqs.md b/docs/vllm_mindspore/docs/source_zh_cn/faqs/faqs.md index dc1095c451fdf2f2daebabc278ae1934fa8ff546..35ab04969c653fb95d60dd14881c033a1de4ef3b 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/faqs/faqs.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/faqs/faqs.md @@ -27,29 +27,6 @@ ## 部署相关问题 -### 离线或在线推理时,报模型无法加载 - -- 错误关键信息: - - ```text - raise ValueError(f"{config.load_checkpoint} is not a valid path to load checkpoint ") - ``` - -- 解决思路: - 1. 检查模型路径是否存在且合法; - 2. 若模型路径存在,且其中的模型文件为`safetensors`格式,则需要确认YAML文件中,是否已含有`load_ckpt_format: "safetensors"`字段; - 1. 打印模型所使用的YAML文件路径: - - ```bash - echo $MINDFORMERS_MODEL_CONFIG - ``` - - 2. 查看该YAML文件,若不存在`load_ckpt_format`字段,则添加该字段: - - ```text - load_ckpt_format: "safetensors" - ``` - ### 拉起在线推理时,报`aclnnNonzeroV2`相关错误 - 错误关键信息: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md index fad57bcffca35892649f5f574cd93b9cee1b53b1..425933888b3120eb81d77b66af30ce60c23bee8a 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/installation/installation.md @@ -141,12 +141,6 @@ vLLM-MindSpore插件有以下两种安装方式。**vLLM-MindSpore插件快速 pip install . ``` - 上述命令执行完毕之后,将在`vllm-mindspore/install_depend_pkgs`目录下生成`mindformers`文件夹,将其加入到环境变量中: - - ```bash - export PYTHONPATH=$MF_PATH:$PYTHONPATH - ``` - - **vLLM-MindSpore插件手动安装** 若用户对依赖的vLLM、MindSpore、Golden Stick、MSAdapter等组件有自定义修改的需求,可以在本地准备好修改后的安装包,按照特定的顺序进行手动安装。安装顺序要求如下: @@ -169,11 +163,10 @@ vLLM-MindSpore插件有以下两种安装方式。**vLLM-MindSpore插件快速 pip install /path/to/mindspore-*.whl ``` - 4. 引入MindSpore Transformers仓库,加入到`PYTHONPATH`中 + 4. 安装MindSpore Transformers ```bash - git clone https://gitee.com/mindspore/mindformers.git - export PYTHONPATH=$MF_PATH:$PYTHONPATH + pip install /path/to/mindformers-*.whl ``` 5. 安装Golden Stick @@ -204,7 +197,6 @@ vLLM-MindSpore插件有以下两种安装方式。**vLLM-MindSpore插件快速 ```bash export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. 
-export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` 关于环境变量的具体含义,可参考[这里](../quick_start/quick_start.md#设置环境变量)。 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md index aaea0cbcf30805114896cbce394587004e886d41..502796a55d8cfdf5afe9cb07417b7773282f65e3 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/quick_start/quick_start.md @@ -132,19 +132,11 @@ git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct ```bash export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` 以下是对上述环境变量的解释: -- `VLLM_MS_MODEL_BACKEND`:所运行的模型后端。目前vLLM-MindSpore插件所支持的模型与模型后端,可在[模型支持列表](../../user_guide/supported_models/models_list/models_list.md)中进行查询; -- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的YAML文件。以Qwen2.5-7B为例,其YAML文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml)。 - -另外,用户需要确保MindSpore Transformers已安装。用户可通过以下方式引入MindSpore Transformers: - -```bash -export PYTHONPATH=/path/to/mindformers:$PYTHONPATH -``` +- `VLLM_MS_MODEL_BACKEND`:所运行的模型后端。目前vLLM-MindSpore插件所支持的模型与模型后端,可在[模型支持列表](../../user_guide/supported_models/models_list/models_list.md)中进行查询。 ### 离线推理 @@ -193,10 +185,10 @@ vLLM-MindSpore插件可使用OpenAI的API协议,进行在线推理部署。以 使用模型`Qwen/Qwen2.5-7B-Instruct`,执行如下命令启动vLLM服务: ```bash -python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" +vllm-mindspore serve Qwen/Qwen2.5-7B-Instruct ``` -用户可以通过`--model`参数,指定模型保存的本地路径。若服务成功启动,则可以获得类似的执行结果: +用户可以通过指定模型保存的本地路径作为模型标签。若服务成功启动,则可以获得类似的执行结果: ```text INFO: Started server process [6363] @@ -218,7 +210,7 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -其中,用户需确认`"model"`字段与启动服务中的`--model`一致,请求才能成功匹配到模型。若请求处理成功,将获得以下推理结果: +其中,用户需确认`"model"`字段与启动服务中的模型标签一致,请求才能成功匹配到模型。若请求处理成功,将获得以下推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md index 0c5101341220475c1677b3446a0b67bb921aa97c..7010bbfb27c35e4541b4e597ba261db28725240f 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/deepseek_parallel/deepseek_r1_671b_w8a8_dp4_tp4_ep4.md @@ -143,7 +143,6 @@ export HCCL_OP_EXPANSION_MODE=AIV export MS_ALLOC_CONF=enable_vmm:true export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 export VLLM_MS_MODEL_BACKEND=MindFormers -export PYTHONPATH=/path/to/mindformers:$PYTHONPATH export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python export GLOO_SOCKET_IFNAME=enp189s0f0 export HCCL_SOCKET_IFNAME=enp189s0f0 @@ -157,7 +156,6 @@ export TP_SOCKET_IFNAME=enp189s0f0 - 
`MS_ALLOC_CONF`:设置内存策略。可参考[MindSpore官网文档](https://www.mindspore.cn/docs/zh-CN/master/api_python/env_var_list.html)。 - `ASCEND_RT_VISIBLE_DEVICES`:配置每个节点可用的设备ID。用户可使用`npu-smi info`命令进行查询。 - `VLLM_MS_MODEL_BACKEND`:所运行的模型后端。目前vLLM-MindSpore插件所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 -- `PYTHONPATH`:将MindSpore Transformers路径,加入到`PYTHONPATH`。当`VLLM_MS_MODEL_BACKEND`设置为`MindFormers`需要配置。 - `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION`:当版本不兼容时使用。 - `GLOO_SOCKET_IFNAME`:GLOO后端端口,用于多机之间使用gloo通信时的网口名称。可通过`ifconfig`查找IP对应网卡的网卡名。 - `HCCL_SOCKET_IFNAME`:配置HCCL端口,用于多机之间使用HCCL通信时的网口名称。可通过`ifconfig`查找IP对应网卡的网卡名。 @@ -172,7 +170,7 @@ vLLM-MindSpore插件可使用OpenAI的API协议,部署在线推理。以下是 ```bash # 启动配置参数说明 vllm-mindspore serve - --model=[模型Config/权重路径] + [模型标签:模型Config/权重路径] --trust-remote-code # 使用本地下载的model文件 --max-num-seqs [最大Batch数] --max-model-len [模型上下文长度] @@ -191,14 +189,14 @@ vllm-mindspore serve --addition-config # 并行功能与额外配置 ``` -- 用户可以通过`--model`参数,指定模型保存的本地路径; +- 用户可以通过指定模型保存的本地路径为模型标签; - 用户可以通过`--addition-config`参数,配置并行与其他功能。 以下为Ray启动命令: ```bash # 主节点: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --addition-config '{"expert_parallel": 4}' --data-parallel-backend=ray +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --addition-config '{"expert_parallel": 4}' --data-parallel-backend=ray ``` 关于multiprocess启动命令,可以参考[multiprocess启动方式](../../../user_guide/supported_features/parallel/parallel.md#启动服务)。 @@ -211,7 +209,7 @@ vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remot curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am", "max_tokens": 120, "temperature": 0}' ``` -用户需确认`"model"`字段与启动服务中的`--model`一致,请求才能成功匹配到模型。 +用户需确认`"model"`字段与启动服务中的模型标签一致,请求才能成功匹配到模型。 ## 附录 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md index 72d474c1c260794fe9a3f8a52172cbc5ea1981de..9a4f28ba706c0e9a2921e3a90679de7ae9a7cbd9 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_32b_multiNPU/qwen2.5_32b_multiNPU.md @@ -128,13 +128,11 @@ git clone https://huggingface.co/Qwen/Qwen2.5-32B-Instruct ```bash #set environment variables export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore TransFormers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. 
``` 以下是对上述环境变量的解释: - `VLLM_MS_MODEL_BACKEND`:所运行的模型后端。目前vLLM-MindSpore插件所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 -- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的YAML文件。以Qwen2.5-32B为例,其YAML文件为[predict_qwen2_5_32b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_32b_instruct.yaml)。 用户可通过`npu-smi info`查看显存占用情况,并可以使用如下环境变量,设置用于推理的计算卡。以下例子假设用户使用4、5、6、7卡进行推理: @@ -153,12 +151,12 @@ vLLM-MindSpore插件可使用OpenAI的API协议,部署在线推理。以下以 ```bash export TENSOR_PARALLEL_SIZE=4 export MAX_MODEL_LEN=1024 -python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-32B-Instruct" --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN +vllm-mindspore serve Qwen/Qwen2.5-32B-Instruct --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN ``` 其中,`TENSOR_PARALLEL_SIZE`为用户指定的卡数,`MAX_MODEL_LEN`为模型最大输出token数。 -用户可以通过`--model`参数,指定模型保存的本地路径。若服务成功启动,则可以获得类似的执行结果: +用户可以通过指定模型保存的本地路径作为模型标签。若服务成功启动,则可以获得类似的执行结果: ```text INFO: Started server process [6363] @@ -180,7 +178,7 @@ Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-32B-Instruct", "prompt": "I am", "max_tokens": 20, "temperature": 0}' ``` -其中,用户需确认`"model"`字段与启动服务中的`--model`一致,请求才能成功匹配到模型。若请求处理成功,将获得以下推理结果: +其中,用户需确认`"model"`字段与启动服务中的模型标签一致,请求才能成功匹配到模型。若请求处理成功,将获得以下推理结果: ```text { diff --git a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md index c629a9302617862e9e1d4c567228d00c642fdc1e..eb1d75640c69452dd49789be4e081d687b7df66f 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/getting_started/tutorials/qwen2.5_7b_singleNPU/qwen2.5_7b_singleNPU.md @@ -128,19 +128,16 @@ git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct ```bash #set environment variables export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore TransFormers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. 
``` 以下是对上述环境变量的解释: -- `VLLM_MS_MODEL_BACKEND`:所运行的模型后端。目前vLLM-MindSpore插件所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询; -- `MINDFORMERS_MODEL_CONFIG`:模型配置文件。用户可以在[MindSpore Transformers工程](https://gitee.com/mindspore/mindformers/tree/master/research/qwen2_5)中,找到对应模型的YAML文件。以Qwen2.5-7B为例,其YAML文件为[predict_qwen2_5_7b_instruct.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/qwen2_5/predict_qwen2_5_7b_instruct.yaml)。 +- `VLLM_MS_MODEL_BACKEND`:所运行的模型后端。目前vLLM-MindSpore插件所支持的模型与模型后端,可在[模型支持列表](../../../user_guide/supported_models/models_list/models_list.md)中进行查询。 用户可通过`npu-smi info`查看显存占用情况,并可以使用如下环境变量,设置用于推理的计算卡: ```bash -export NPU_VISIBLE_DEVICES=0 -export ASCEND_RT_VISIBLE_DEVICES=$NPU_VISIBLE_DEVICES +export ASCEND_RT_VISIBLE_DEVICES=0 ``` ## 离线推理 @@ -190,10 +187,10 @@ vLLM-MindSpore插件可使用OpenAI的API协议,部署在线推理。以下以 使用如下命令启动vLLM服务: ```bash -python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-7B-Instruct" +vllm-mindspore serve Qwen/Qwen2.5-7B-Instruct ``` -用户可以通过`--model`参数,指定模型保存的本地路径。若服务成功启动,则可以获得类似的执行结果: +用户可以通过指定模型保存的本地路径作为模型标签。若服务成功启动,则可以获得类似的执行结果: ```text INFO: Started server process [6363] diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md index a08a042409d505152e700f7f44e107bd54acdd75..b5d0a2b93418600f9d0519f42937b67df93fa4d8 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/environment_variables/environment_variables.md @@ -4,8 +4,7 @@ | 环境变量 | 功能 | 类型 | 取值 | 说明 | | ------ | ------- | ------ | ------ | ------ | -| `VLLM_MS_MODEL_BACKEND` | 用于指定模型后端。如果不配置变量,会按照 MindFormers > 原生模型 > MindONE 的优先级自动寻找支持的后端。配置之后则按指定后端执行。 | String | `MindFormers`: 模型后端为MindSpore Transformers。 `Native`: 模型后端为原生模型。 `MindONE`: 模型后端为MindONE | 原生模型后端当前支持Qwen2.5、Qwen2.5VL、Qwen3、Llama系列;MindSpore Transformers模型后端支持Qwen系列、DeepSeek、TeleChat系列模型,使用时需配置环境变量:`export PYTHONPATH=/path/to/mindformers/:$PYTHONPATH`。 | -| `MINDFORMERS_MODEL_CONFIG` | MindSpore Transformers模型的配置文件。使用Qwen2.5系列、DeepSeek系列模型时,需要配置文件路径。 | String | 模型配置文件路径。 | **该环境变量在后续版本会被移除。** 样例:`export MINDFORMERS_MODEL_CONFIG=/path/to/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml`。 | +| `VLLM_MS_MODEL_BACKEND` | 用于指定模型后端。如果不配置变量,会按照 MindFormers > 原生模型 > MindONE 的优先级自动寻找支持的后端。配置之后则按指定后端执行。 | String | `MindFormers`: 模型后端为MindSpore Transformers。 `Native`: 模型后端为原生模型。 `MindONE`: 模型后端为MindONE | 原生模型后端当前支持Qwen2.5、Qwen2.5VL、Qwen3、Llama系列;MindSpore Transformers模型后端支持Qwen系列、DeepSeek、TeleChat系列模型。 | | `GLOO_SOCKET_IFNAME` | 用于多机之间使用gloo通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找IP对应网卡的网卡名。 | | `TP_SOCKET_IFNAME` | 用于多机之间使用TP通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找IP对应网卡的网卡名。 | | `HCCL_SOCKET_IFNAME` | 用于多机之间使用HCCL通信时的网口名称。 | String | 网口名称,例如enp189s0f0。 | 多机场景使用,可通过`ifconfig`查找IP对应网卡的网卡名。 | diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md index ac99837d47c5ddd5673d9636144d387309b65403..00de8a17ec5261a1f451d0d83306bb0c2f4f62d5 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md +++ 
b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/benchmark/benchmark.md @@ -10,7 +10,6 @@ vLLM-MindSpore插件的性能测试能力,继承自vLLM所提供的性能测 ```bash export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` 使用以下命令启动在线推理: @@ -24,7 +23,7 @@ vllm-mindspore serve Qwen/Qwen2.5-7B-Instruct --device auto --disable-log-reques ```bash export TENSOR_PARALLEL_SIZE=4 export MAX_MODEL_LEN=1024 -python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-32B-Instruct" --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN +vllm-mindspore serve Qwen/Qwen2.5-32B-Instruct --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN ``` 当返回以下日志时,则服务已成功拉起: @@ -103,7 +102,6 @@ P99 ITL (ms): .... ```bash export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` 并拉取vLLM代码仓库,导入vLLM-MindSpore插件,复用其中的benchmark功能: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/parallel/parallel.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/parallel/parallel.md index 12dee2759bd7e3b6434cd5687d2ca1cb55107fdd..794304e170897e75ef97ca169ad110e3f1298de9 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/parallel/parallel.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/parallel/parallel.md @@ -21,7 +21,7 @@ vLLM-MindSpore插件支持张量并行(TP)、数据并行(DP)、专家 ```bash TENSOR_PARALLEL_SIZE=4 # TP 并行数 -vllm-mindspore serve --model=/path/to/Qwen2.5/model --trust-remote-code --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} +vllm-mindspore serve /path/to/Qwen2.5/model --trust-remote-code --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} ``` ### 多机示例 @@ -34,7 +34,7 @@ vllm-mindspore serve --model=/path/to/Qwen2.5/model --trust-remote-code --tensor # 主节点: TENSOR_PARALLEL_SIZE=4 # TP 并行数 -vllm-mindspore serve --model=/path/to/Qwen2.5/model --trust-remote-code --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} +vllm-mindspore serve /path/to/Qwen2.5/model --trust-remote-code --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} ``` ## 数据并行 @@ -65,7 +65,7 @@ Ray在多机场景中可以简化启动,是推荐的启动方式,请参考[R DATA_PARALLEL_SIZE=4 # DP 并行数 DATA_PARALLEL_SIZE_LOCAL=2 # 当前服务节点中的DP数,所有节点求和等于`--data-parallel-size` -vllm-mindspore serve --model=/path/to/Qwen2.5/model --trust-remote-code --data-parallel-size ${DATA_PARALLEL_SIZE} --data-parallel-size-local ${DATA_PARALLEL_SIZE_LOCAL} --data-parallel-backend=ray +vllm-mindspore serve /path/to/Qwen2.5/model --trust-remote-code --data-parallel-size ${DATA_PARALLEL_SIZE} --data-parallel-size-local ${DATA_PARALLEL_SIZE_LOCAL} --data-parallel-backend=ray ``` #### multiprocess启动 @@ -74,10 +74,10 @@ vllm-mindspore serve --model=/path/to/Qwen2.5/model --trust-remote-code --data-p ```bash # 主节点: -vllm-mindspore serve --model=/path/to/Qwen2.5/model --trust-remote-code --data-parallel-size ${DATA_PARALLEL_SIZE} --data-parallel-size-local ${DATA_PARALLEL_SIZE_LOCAL} +vllm-mindspore serve /path/to/Qwen2.5/model --trust-remote-code --data-parallel-size ${DATA_PARALLEL_SIZE} --data-parallel-size-local ${DATA_PARALLEL_SIZE_LOCAL} # 从节点: -vllm-mindspore serve --headless --model=/path/to/Qwen2.5/model --trust-remote-code --data-parallel-size 
${DATA_PARALLEL_SIZE} --data-parallel-size-local ${DATA_PARALLEL_SIZE_LOCAL} +vllm-mindspore serve /path/to/Qwen2.5/model --headless --trust-remote-code --data-parallel-size ${DATA_PARALLEL_SIZE} --data-parallel-size-local ${DATA_PARALLEL_SIZE_LOCAL} ``` ## 专家并行 @@ -107,7 +107,7 @@ vllm-mindspore serve --headless --model=/path/to/Qwen2.5/model --trust-remote-co 以下命令为单机八卡,启动Qwen-3 MOE的专家并行示例: ```bash -vllm-mindspore serve --model=/path/to/Qwen3-MOE --trust-remote-code --enable-expert-parallel --addition-config '{"expert_parallel": 8} +vllm-mindspore serve /path/to/Qwen3-MOE --trust-remote-code --enable-expert-parallel --addition-config '{"expert_parallel": 8}' ``` #### 多机示例 @@ -115,7 +115,7 @@ vllm-mindspore serve --model=/path/to/Qwen3-MOE --trust-remote-code --enable-exp 多机专家并行依赖Ray进行启动。请参考[Ray多节点集群管理](#ray多节点集群管理)进行Ray环境配置。以下命令为双机四卡,Ray启动Qwen-3 MOE的专家并行示例: ```bash -vllm-mindspore serve --model=/path/to/Qwen3-MOE --trust-remote-code --enable-expert-parallel --addition-config '{"expert_parallel": 8} --data-parallel-backend=ray +vllm-mindspore serve /path/to/Qwen3-MOE --trust-remote-code --enable-expert-parallel --addition-config '{"expert_parallel": 8}' --data-parallel-backend=ray ``` ## 混合并行 @@ -129,7 +129,7 @@ vllm-mindspore serve --model=/path/to/Qwen3-MOE --trust-remote-code --enable-exp 可根据上述介绍,分别将三种并行策略的配置叠加,在启动命令`vllm-mindspore serve`中使能。多机混合并行依赖Ray进行启动。请参考[Ray多节点集群管理](#ray多节点集群管理)进行Ray环境配置。其叠加后混合并行的Ray启动命令如下: ```bash -vllm-mindspore serve --model="/path/to/DeepSeek-R1" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --addition-config '{"expert_parallel": 4}' --data-parallel-backend=ray +vllm-mindspore serve /path/to/DeepSeek-R1 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --addition-config '{"expert_parallel": 4}' --data-parallel-backend=ray ``` ## 附录 @@ -264,7 +264,7 @@ vLLM-MindSpore插件可使用OpenAI的API协议部署在线推理。以下是在 ```bash # 启动配置参数说明 vllm-mindspore serve - --model=[模型Config/权重路径] + [模型标签:模型Config/权重路径] --trust-remote-code # 使用本地下载的model文件 --max-num-seqs [最大Batch数] --max-model-len [模型上下文长度] @@ -283,7 +283,7 @@ vllm-mindspore serve --addition-config # 并行功能与额外配置 ``` -- 用户可以通过`--model`参数,指定模型保存的本地路径。 +- 用户可以通过指定模型保存的本地路径为模型标签; - 用户可以通过`--addition-config`参数,配置并行与其他功能。其中并行可进行如下配置,对应的是DP4-EP4-TP4场景: @@ -297,17 +297,17 @@ vllm-mindspore serve ```bash # 主节点: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel --addition-config '{"data_parallel": 4, "model_parallel": 4, "expert_parallel": 4}' +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 0 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 
--enable-expert-parallel --addition-config '{"data_parallel": 4, "model_parallel": 4, "expert_parallel": 4}' # 从节点: -vllm-mindspore serve --headless --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel --addition-config '{"data_parallel": 4, "model_parallel": 4, "expert_parallel": 4}' +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --headless --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --data-parallel-start-rank 2 --data-parallel-address 192.10.10.10 --data-parallel-rpc-port 12370 --enable-expert-parallel --addition-config '{"data_parallel": 4, "model_parallel": 4, "expert_parallel": 4}' ``` **Ray启动方式** ```bash # 主节点: -vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --addition-config '{"data_parallel": 4, "model_parallel": 4, "expert_parallel": 4}' --data-parallel-backend=ray +vllm-mindspore serve MindSpore-Lab/DeepSeek-R1-0528-A8W8 --trust-remote-code --max-num-seqs=256 --max-model-len=32768 --max-num-batched-tokens=4096 --block-size=128 --gpu-memory-utilization=0.9 --tensor-parallel-size 4 --data-parallel-size 4 --data-parallel-size-local 2 --enable-expert-parallel --addition-config '{"data_parallel": 4, "model_parallel": 4, "expert_parallel": 4}' --data-parallel-backend=ray ``` #### 发送请求 @@ -318,4 +318,4 @@ vllm-mindspore serve --model="MindSpore-Lab/DeepSeek-R1-0528-A8W8" --trust-remot curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "MindSpore-Lab/DeepSeek-R1-0528-A8W8", "prompt": "I am, "max_tokens": 120, "temperature": 0}' ``` -用户需确认`"model"`字段与启动服务中的`--model`一致,请求才能成功匹配到模型。 +用户需确认`"model"`字段与启动服务中的模型标签一致,请求才能成功匹配到模型。 diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md index 06421892a7a467ce5c43467dcb52279b31db1539..99b4e673fa80292b6d1fba9b4e09611470b26771 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/profiling/profiling.md @@ -17,7 +17,7 @@ export VLLM_TORCH_PROFILER_DIR=/path/to/save/vllm_profile ```bash export TENSOR_PARALLEL_SIZE=4 export MAX_MODEL_LEN=1024 -python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server --model "Qwen/Qwen2.5-32B-Instruct" --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN +vllm-mindspore serve Qwen/Qwen2.5-32B-Instruct --trust_remote_code --tensor-parallel-size $TENSOR_PARALLEL_SIZE --max-model-len $MAX_MODEL_LEN ``` 若服务成功拉起,则可以获得类似的执行结果,且可以从中看到监听`start_profile`和`stop_profile`请求: diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md 
b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md index 4398b3e67b9a1105bde200e9ea7684baa32924d1..42428355a13d46c3df016ea73dfab646011df456 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_features/quantization/quantization.md @@ -28,11 +28,8 @@ ```bash export VLLM_MS_MODEL_BACKEND=MindFormers # use MindSpore Transformers as model backend. -export MINDFORMERS_MODEL_CONFIG=$YAML_PATH # Set the corresponding MindSpore Transformers model's YAML file. ``` -关于DeepSeek-R1 W8A8量化推理的YAML文件,可以使用[predict_deepseek_r1_671b_w8a8.yaml](https://gitee.com/mindspore/mindformers/blob/master/research/deepseek3/deepseek_r1_671b/predict_deepseek_r1_671b_w8a8.yaml)。 - 环境准备完成后,用户可以使用如下Python代码,进行离线推理服务: ```python diff --git a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md index fd85db6ea63dbe2bfcf0ba3bf09601a8c0d67ac1..4a2281865b4eff200a33884389bb3b27a319aa93 100644 --- a/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md +++ b/docs/vllm_mindspore/docs/source_zh_cn/user_guide/supported_models/models_list/models_list.md @@ -15,6 +15,5 @@ | QwQ-32B | 测试中 | [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) | | Llama3.1 | 测试中 | [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)、[Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct)、[Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) | | Llama3.2 | 测试中 | [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)、[Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | -| DeepSeek-V2 | 测试中 | [DeepSeek-V2](https://huggingface.co/deepseek-ai/DeepSeek-V2) | 注:用户可参考[环境变量章节](../../environment_variables/environment_variables.md),通过环境变量`VLLM_MS_MODEL_BACKEND`,指定模型后端。