diff --git a/dllm-feature-introduce.md b/dllm-feature-introduce.md
new file mode 100644
index 0000000000000000000000000000000000000000..ee70a1be9a93f82331e2f6d48dcf390ecc9de85f
--- /dev/null
+++ b/dllm-feature-introduce.md
@@ -0,0 +1,105 @@
+# dllm
+
+`dllm` stands for "distributed LLM" and aims to provide better tooling for distributed vLLM serving.
+
+## Build guide
+
+> **TL;DR**
+>
+> ```bash
+> yum install python3-pip gcc g++ make cmake spdlog-devel -y
+> pip install --upgrade pip
+> pip install --upgrade wheel setuptools ninja pybind11 chariot-ds
+>
+> python3 setup.py bdist_wheel
+> ```
+
+### Build requirements
+
+**Build tools**
+
+* `gcc/g++/make/cmake`: can be installed with `yum install gcc g++ make cmake -y`
+* `ninja`: can be installed with `pip install ninja`
+* `python/pip`: can be installed with `yum install python3-pip -y; pip install --upgrade pip`
+* `wheel/setuptools`: can be installed with `pip install --upgrade wheel setuptools`
+
+> NOTE: Upgrading setuptools is necessary on most operating systems.
+
+**Dependencies**
+
+* `spdlog`: can be installed with `yum install spdlog-devel -y`
+* `pybind11`: can be installed with `pip install pybind11`
+* `chariot-ds`: can be installed with `pip install chariot-ds`
+* `Ascend CANN`: see https://www.hiascend.com/software/cann for installation instructions
+
+### Build command
+
+```bash
+bash build.sh
+# or: python3 setup.py bdist_wheel
+```
+
+## Install guide
+
+```bash
+pip install dist/dllm-*.whl
+```
+
+## Use guide
+
+### Deploy dependencies
+
+> NOTE: After deploying chariot-ds, set the environment variable `DS_WORKER_ADDR="{IP}:{PORT}"` on each node before starting Ray.
+
+1. chariot-ds: follow https://pypi.org/project/chariot-ds/
+2. Ray: follow https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/on-premises.html#on-prem
+
+### Deploy dllm
+
+Taking vllm-mindspore as an example, suppose you deploy:
+
+* 1 prefill instance with parallel config [TP: 4, DP: 4, EP: 16]
+* 1 decode instance with parallel config [TP: 4, DP: 4, EP: 16]
+
+The deploy command looks like this:
+
+```bash
+dllm deploy \
+    --prefill-instances-num=1 \
+    --decode-instances-num=1 \
+    -ptp=4 -dtp=4 -pdp=4 -ddp=4 -pep=16 -dep=16 \
+    --prefill-startup-params="vllm-mindspore serve --model=/workspace/models/qwen2.5_7B --trust_remote_code --max-num-seqs=256 --max_model_len=1024 --max-num-batched-tokens=1024 --block-size=128 --gpu-memory-utilization=0.93" \
+    --decode-startup-params="vllm-mindspore serve --model=/workspace/models/qwen2.5_7B --trust_remote_code --max-num-seqs=256 --max_model_len=1024 --max-num-batched-tokens=1024 --block-size=128 --gpu-memory-utilization=0.93"
+```
+
+After a successful deployment, `localhost:8000` serves a fully OpenAI-compatible API endpoint:
+
+```bash
+curl -X POST "http://127.0.0.1:8000/v1/completions" -H "Content-Type: application/json" -H "Authorization: Bearer YOUR_API_KEY" -d '{
+  "model": "/workspace/models/qwen2.5_7B",
+  "prompt": "Alice is ",
+  "max_tokens": 50,
+  "temperature": 0
+}'
+```
+
+### Enable KV cache protection
+
+To prevent private data leakage, dllm supports KV cache protection by encrypting KV cache data while it is transmitted between prefill and decode instances in a PD-disaggregated deployment.
+
+KV cache data is encrypted by sec-mask in parallel with inference to improve encryption performance.
+
+To enable KV cache protection, set the environment variable `ENABLE_KVC_PROTECT=True` **before starting Ray**, as in the sketch below.
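+
+A minimal sketch of the per-node setup (the IP and port are placeholders; the `ray start` commands follow the Ray on-premises guide linked above):
+
+```bash
+# Run on every node before starting Ray.
+export DS_WORKER_ADDR="10.0.0.1:9000"    # chariot-ds worker address (placeholder IP:PORT)
+export ENABLE_KVC_PROTECT=True           # encrypt KV cache in transit between prefill and decode
+
+# On the head node:
+ray start --head --port=6379
+# On each worker node, point at the head node's address:
+ray start --address=10.0.0.1:6379
+```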