# 1Cat-vLLM **Repository Path**: zhutao1221/1Cat-vLLM ## Basic Information - **Project Name**: 1Cat-vLLM - **Description**: No description available - **Primary Language**: Python - **License**: Apache-2.0 - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2026-04-30 - **Last Updated**: 2026-05-26 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # 1Cat-vLLM 1.0.0 > 一猫之下始终相信,V100 不该在今天的大模型浪潮里被轻易宣判“过时”。 > `1Cat-vLLM 1.0.0` 不是一次简单的适配更新,而是一次面向 > **SM70 / Tesla V100** 的系统性工程重构。我们围绕 AWQ、注意力后端、 > 长上下文稳定性、运行时默认值和部署路径做了成体系的打磨,极大提升了 > V100 的模型使用上限,让更多原本“难以跑起来、难以跑稳定、难以跑得快” > 的现代模型场景,真正变得可用、好用、能持续部署。 > > 在我们聚焦和验证过的 V100 场景里,这个版本不仅显著抬升了上下文能力与 > 部署稳定性,也带来了业界领先的推理速度表现。对还在使用 V100 的个人开发者、 > 工作室和团队来说,这意味着老卡依然有很强的生命力,依然值得被继续挖掘。 > 我们真心希望 V100 开源社区越来越好,也希望把一猫之下自己的工程经验、 > 优化成果和热情,实实在在地贡献给社区。感谢每一位关注、使用、反馈和支持 > 一猫之下的朋友。你们的支持,是我们继续把这件事做深、做久、做好的动力。 `1Cat-vLLM 1.0.0` is the recommended public release of the **Tesla V100 / SM70** vLLM fork for **AWQ 4-bit inference on Volta GPUs,and FlashAttn-2!!**. Upstream vLLM AWQ kernels normally require **SM75+** in the default path. This branch integrates **lmdeploy TurboMind SM70 WMMA kernels**, **FLASH_ATTN_V100**, and a set of SM70-specific runtime fixes so that V100 can serve modern AWQ models, especially **Qwen3.5 / Qwen3.6 dense and MoE models**. Compared with the earlier `0.0.x` line, `1.0.0` focuses on the new V100 attention backend, Qwen3.5/Qwen3.6 model coverage, FP8 KV cache support, MTP serving, output-quality stability fixes, and a cleaner public wheel installation path. The validated default path now centers on the prebuilt `v1.0.0` wheels and `FLASH_ATTN_V100` instead of source builds or the older Triton attention fallback. ## Recommended model providers - `tclf90/Qwen3.6-27B-AWQ` - `tclf90/Qwen3.6-35B-A3B-AWQ` - `tclf90/Qwen3.5-122B-A10B-AWQ` for larger 4-GPU setups The launch commands below use short model names such as `Qwen3.5-27B-AWQ` and `Qwen3.6-35B-A3B-AWQ`. This assumes one of the following is true: - you have local model directories with exactly these names - you replace `--model` with your real local path - you replace `--model` with the full Hugging Face repo id ## What this branch adds - AWQ 4-bit support for **SM70 / Tesla V100** - Dense and MoE AWQ execution paths on V100 - Reuse of SM70 AWQ kernels for selected compressed-tensors MoE paths - `FLASH_ATTN_V100` decode and prefill backend for Volta GPUs - Qwen3.5 / Qwen3.6 model and config support, including MoE and MTP paths - SM70-specific MLA/GDN runtime fixes - Compatibility with `torch.compile` and CUDA graphs - OpenAI-compatible API serving through standard vLLM entrypoints ## What is new in 1.0.0 - A release step forward over `0.0.3` for **V100-flash-attention**, Qwen3.5/Qwen3.6 coverage, public packaging, and output-quality stability - A **two-wheel** installation path for `Python 3.12 + CUDA 12.8` (`flash_attn_v100` plus `vllm`) - FP8 KV cache support for the V100 FA path, with `fp8_e5m2` documented as the current experimental V100 option - MTP speculative decoding support for Qwen3.6-class models - Tool-calling and OpenAI-compatible API fixes for Cherry Studio, OpenClaw, and similar OpenAI API clients - DFlash is included as an experimental path for continued validation - Public runtime defaults now center on: - `--attention-backend FLASH_ATTN_V100` - `--max-model-len 262144` - explicit low-concurrency serving limits such as `--max-num-seqs` and `--max-num-batched-tokens` - V100 `32 GB` reference configs for 4-card systems: - `Qwen3.5-27B-AWQ` - `Qwen3.6-35B-A3B-AWQ` - `Qwen3.5-122B-A10B-AWQ` - Long-prompt chunk budget for `FLASH_ATTN_V100` on 32 GB V100 defaults to `max_num_batched_tokens=16384` - Direct paged prefill remains experimental and is not the public default ## Reference hardware platforms `1.0.0` is validated primarily on 4-card V100 systems. The recommended public commands below assume **4 x V100 32 GB** and text-generation workloads. | Public reference host | Notes | | --- | --- | | 4 x Tesla PG503 / V100 32 GB | Recommended target for Qwen3.5/Qwen3.6 AWQ serving | - `Qwen3.5-27B-AWQ`: supported on TP1/TP2/TP4, with TP4 recommended for this README - `Qwen3.6-35B-A3B-AWQ`: TP4 recommended for the public command - `Qwen3.5-122B-A10B-AWQ`: TP4 only in the public command ## Benchmarks / Effort figures The following local `1.0.0` regression charts were generated on a 4-card V100 32 GB system. First-request warmup is not included as steady-state throughput. ### Local test charts | `Qwen3.5-27B-AWQ` | `Qwen3.6-35B-A3B-AWQ` | `Qwen3.5-122B-A10B-AWQ` | | --- | --- | --- | | [![Qwen3.5-27B-AWQ](docs/test-table/Tesla_PG503-32G_x4,1Cat-vLLM-0.0.3,Qwen3.5-27B-AWQ,20260501_001139.png)](docs/test-table/Tesla_PG503-32G_x4,1Cat-vLLM-0.0.3,Qwen3.5-27B-AWQ,20260501_001139.png) | [![Qwen3.6-35B-A3B-AWQ](docs/test-table/Tesla_PG503-32G_x4,1Cat-vLLM-0.0.3,Qwen3.6-35B-A3B-AWQ,20260501_002046.png)](docs/test-table/Tesla_PG503-32G_x4,1Cat-vLLM-0.0.3,Qwen3.6-35B-A3B-AWQ,20260501_002046.png) | [![Qwen3.5-122B-A10B-AWQ](docs/test-table/Tesla_PG503-32G_x4,1Cat-vLLM-0.0.3,Qwen3.5-122B-A10B-AWQ,20260501_004543.png)](docs/test-table/Tesla_PG503-32G_x4,1Cat-vLLM-0.0.3,Qwen3.5-122B-A10B-AWQ,20260501_004543.png) | - first-request warmup on V100 is slow and is not representative - long-context throughput depends strongly on `TP`, `max_num_seqs`, and the attention backend - the public runtime defaults in this README prioritize stable serving over peak single-case benchmark numbers ### Reproducible 27B decode baseline The `1.0.0` 27B speed baseline is measured as **incremental decode TPS**: ```text incremental_decode_tps = (decode64_output_tokens - decode1_output_tokens) / (decode64_median_latency - decode1_median_latency) ``` This removes prefill/TTFT from the measurement. It is stricter than API streaming throughput and should not be compared directly with browser-side OpenAI streaming numbers. Reference result on `4 x Tesla PG503 / V100 32 GB`: | Model | Backend | TP | Custom all-reduce | Short-context incremental decode | 8K-context incremental decode | | --- | --- | ---: | --- | ---: | ---: | | `Qwen3.5-27B-AWQ` | `FLASH_ATTN_V100` | 4 | enabled | `86.31 tok/s` | `79.04 tok/s` | Strict reproduction command. This speed-only harness intentionally keeps `max_model_len=12288` to match the historical model-side regression test. The public serving commands below default to 256K context with `max_model_len=262144`. ```bash export ONECAT_VLLM_REPO=/path/to/1Cat-vLLM/vllm cd /tmp CUDA_VISIBLE_DEVICES=0,1,2,3 \ HF_HUB_OFFLINE=1 \ TRANSFORMERS_OFFLINE=1 \ python "$ONECAT_VLLM_REPO/tools/vllm_v100_backend_regression.py" \ --child \ --backend FLASH_ATTN_V100 \ --model /path/to/Qwen3.5-27B-AWQ \ --dtype float16 \ --kv-cache-dtype auto \ --max-model-len 12288 \ --max-num-seqs 8 \ --max-num-batched-tokens 16384 \ --gpu-memory-utilization 0.88 \ --tensor-parallel-size 4 \ --prompt-style qwen35-chat \ --disable-thinking \ --disable-mm \ --quality-max-tokens 1 \ --long-prompt-tokens 8202 \ --speed-warmup 3 \ --speed-iters 5 \ --skip-quality \ --child-output /tmp/qwen35_27b_fa2_baseline.json ``` Expected key latencies: - `batch1_prefill512_decode1`: about `0.179 s` - `batch1_prefill512_decode64`: about `0.909 s` - incremental decode: about `86 tok/s` Do **not** add `--disable-custom-all-reduce` for the 27B TP4 baseline. On the same hardware this drops short-context incremental decode from about `86.31 tok/s` to about `75.91 tok/s`. ## 微信交流群 **群聊:** 1Cat-vLLM 开源交流群3 请使用微信扫描下方二维码加入群组: ![1Cat-vLLM 微信交流群二维码](docs/assets/wechat-group-qr.png) > 提示:微信群二维码通常 7 天内有效。若扫描失败或提示过期,请重新打开本页查看最新图片,或关注仓库更新。 ## Validated stack The commands in this README were validated on the following setup: - OS: `Ubuntu 24.04.4 LTS` - Python: `3.12.13` - CUDA toolkit: `12.8` - PyTorch: `2.9.1+cu128` - Triton: `3.5.1` - Driver: `570.211.01` - GPU: `4 x Tesla V100 32 GB` public reference profile The public launch commands below are written for 4-card V100 32 GB systems. ## Runtime notes you should read first - The **first real request is not representative** of steady-state speed. On V100, the first request may spend **1 to 3 minutes** compiling kernels, building graphs, and warming up execution paths. - The public commands in this README are text-generation profiles. Vision or multimodal workloads should be tuned separately. - For Qwen3.5/Qwen3.6 text-only serving on V100 32 GB, the recommended public commands explicitly set only the serving choices that change behavior: - `--attention-backend FLASH_ATTN_V100` - `--max-model-len 262144` - `--max-num-seqs` and `--max-num-batched-tokens` - `--enable-prefix-caching` for the MTP + prefix-cache profile - `--gpu-memory-utilization` is an upper bound for the model executor. By default, 1Cat-vLLM trims the final KV cache allocation to about `1.05 * max_model_len * max_num_seqs`, so single-request 256K serving does not preallocate KV capacity for many extra full-length requests. Set `--kv-cache-auto-trim-ratio 0` to keep upstream vLLM's "use all requested memory for KV cache" behavior, or use `--kv-cache-memory-bytes` for an exact per-GPU KV cache size. - `VLLM_SM70_ENABLE_DENSE_F16_FASTPATH=1` is experimental. Keep it disabled for the public 35B/122B MoE commands. - Direct paged prefill can be forced with `VLLM_FLASH_V100_ENABLE_PAGED_PREFILL=1`, but it is not the quality-safe default. ## Quick start ### 1. Install CUDA 12.8 Use the official NVIDIA repository on Ubuntu 24.04: ```bash wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb sudo dpkg -i cuda-keyring_1.1-1_all.deb sudo apt update sudo apt install -y cuda-toolkit-12-8 ``` If the machine also has CUDA 13.x installed, force build-time and runtime CUDA to 12.8: ```bash export CUDA_HOME=/usr/local/cuda-12.8 export PATH=$CUDA_HOME/bin:$PATH export LD_LIBRARY_PATH=$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-} hash -r nvcc -V ``` ### 2. Create the conda environment ```bash source /path/to/miniconda3/etc/profile.d/conda.sh conda create -y -n 1Cat-vLLM-1.0.0 python=3.12 conda activate 1Cat-vLLM-1.0.0 python -m pip install --upgrade pip setuptools wheel ``` ### 3. Recommended install path: prebuilt wheel Use the release wheel if you only want to run the project. This is the recommended installation path. Source builds are for kernel development and are not recommended for normal deployment. The wheel install pulls the matching `torch==2.9.1+cu128` runtime from the PyTorch CUDA 12.8 index. `--no-cache-dir` is recommended because the CUDA runtime wheels are large. Install from a local wheel file: ```bash python -m pip install --prefer-binary --no-cache-dir \ --extra-index-url https://download.pytorch.org/whl/cu128 \ ./dist-cu128-sm70-1.0.0/flash_attn_v100-*.whl \ ./dist-cu128-sm70-1.0.0/vllm-*.whl ``` Or install from a GitHub release asset: ```bash python -m pip install --prefer-binary --no-cache-dir \ --extra-index-url https://download.pytorch.org/whl/cu128 \ "https://github.com/1CatAI/1Cat-vLLM/releases/download/v1.0.0/flash_attn_v100-1.0.0-cp312-cp312-linux_x86_64.whl" \ "https://github.com/1CatAI/1Cat-vLLM/releases/download/v1.0.0/vllm-1.0.0-cp312-cp312-linux_x86_64.whl" ``` Notes: - This is the **recommended first installation path** for public users. - `flash_attn_v100` is a separate wheel and should be installed together with the vLLM wheel. - Runtime installation from the wheels does not require the `lmdeploy` source tree. - Use `Python 3.12` and `CUDA 12.8`. - If your shell has a broken local proxy configured, unset it before installing: `env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY -u ALL_PROXY -u all_proxy ...`. - After installing from wheels, run `python -m vllm...` from a directory outside this source checkout, such as `cd ~` or `cd /tmp`. Running inside the cloned repository makes Python import the local `vllm/` source tree, which does not contain the wheel-installed CUDA extension files such as `vllm/_C.abi3.so`. ### 4. Verify the environment ```bash python - <<'PY' import torch, triton, vllm, sys import flash_attn_v100_cuda, paged_kv_utils print("python", sys.version.split()[0]) print("torch", torch.__version__) print("torch_cuda", torch.version.cuda) print("triton", triton.__version__) print("vllm", vllm.__version__) print("flash_attn_v100", "ok") PY ``` ## Docker deployment Docker deployment follows the same wheel-first approach. This release does not include a dedicated `1.0.0` wheel-runtime Dockerfile yet, so use the conda wheel path above for final local validation. ### 1. Build the recommended SM70 runtime image ```bash # No dedicated 1.0.0 wheel-runtime Dockerfile is included in this tree yet. # Use the conda wheel install path above, or adapt docker/Dockerfile for source build. ``` The first Docker build will download several gigabytes of PyTorch and CUDA runtime layers. The build context for this repository is already trimmed, but the Docker image store still lives under the host Docker root directory unless you have moved it yourself. This Dockerfile intentionally uses `python:3.12-slim-trixie`. The current SM70 wheel needs `glibc >= 2.38`, and the runtime image also keeps `gcc/g++` installed because Triton compiles a small helper module on first startup. This image is pinned to: - `Python 3.12` - `Debian trixie / glibc 2.41` - `torch 2.9.1` - `torchvision 0.24.1` - `torchaudio 2.9.1` - `gcc/g++` for Triton first-run compilation - the current `v1.0.0` release wheel The runtime entrypoint should include these public defaults: - `FLASH_ATTN_V100` as the V100 attention backend - `--max-model-len 262144` - explicit `max_num_seqs` and `max_num_batched_tokens` limits for the target model If you want runtime caches to stay on a large disk, add these options to the `docker run` commands below: - `-v /path/to/1t-cache/hf:/cache/hf -e HF_HOME=/cache/hf` - `-v /path/to/1t-cache/triton:/cache/triton -e TRITON_CACHE_DIR=/cache/triton` - `-v /path/to/1t-cache/torchinductor:/cache/torchinductor -e TORCHINDUCTOR_CACHE_DIR=/cache/torchinductor` - `-v /path/to/1t-cache/tmp:/cache/tmp -e TMPDIR=/cache/tmp` Final Docker validation data will be added after the wheel-runtime image is rebuilt for `1.0.0`. ### 2. Run on four `32 GB` V100 with `Qwen3.5-27B-AWQ` ```bash docker run --rm \ --gpus '"device=0,1,2,3"' \ --ipc=host \ -p 8000:8000 \ -v /path/to/models:/models:ro \ -e VLLM_ATTENTION_BACKEND=FLASH_ATTN_V100 \ -e VLLM_MODEL=/models/Qwen3.5-27B-AWQ \ -e VLLM_SERVED_MODEL_NAME=Qwen3.5-27B-AWQ \ -e VLLM_TENSOR_PARALLEL_SIZE=4 \ -e VLLM_GPU_MEMORY_UTILIZATION=0.88 \ -e VLLM_MAX_MODEL_LEN=262144 \ -e VLLM_MAX_NUM_SEQS=1 \ -e VLLM_MAX_NUM_BATCHED_TOKENS=16384 \ 1cat-vllm-sm70:1.0.0 ``` ### 3. Run on four `32 GB` V100 with `Qwen3.6-35B-A3B-AWQ` ```bash docker run --rm \ --gpus '"device=0,1,2,3"' \ --ipc=host \ -p 8000:8000 \ -v /path/to/models:/models:ro \ -e VLLM_ATTENTION_BACKEND=FLASH_ATTN_V100 \ -e VLLM_MODEL=/models/Qwen3.6-35B-A3B-AWQ \ -e VLLM_SERVED_MODEL_NAME=Qwen3.6-35B-A3B-AWQ \ -e VLLM_TENSOR_PARALLEL_SIZE=4 \ -e VLLM_GPU_MEMORY_UTILIZATION=0.88 \ -e VLLM_MAX_MODEL_LEN=262144 \ -e VLLM_MAX_NUM_SEQS=1 \ -e VLLM_MAX_NUM_BATCHED_TOKENS=8192 \ 1cat-vllm-sm70:1.0.0 ``` ### 4. Run on four `32 GB` V100 with `Qwen3.5-122B-A10B-AWQ` ```bash docker run --rm \ --gpus '"device=0,1,2,3"' \ --ipc=host \ -p 8000:8000 \ -v /path/to/models:/models:ro \ -e VLLM_ATTENTION_BACKEND=FLASH_ATTN_V100 \ -e VLLM_MODEL=/models/Qwen3.5-122B-A10B-AWQ \ -e VLLM_SERVED_MODEL_NAME=Qwen3.5-122B-A10B-AWQ \ -e VLLM_TENSOR_PARALLEL_SIZE=4 \ -e VLLM_GPU_MEMORY_UTILIZATION=0.88 \ -e VLLM_MAX_MODEL_LEN=262144 \ -e VLLM_MAX_NUM_SEQS=1 \ -e VLLM_MAX_NUM_BATCHED_TOKENS=8096 \ 1cat-vllm-sm70:1.0.0 ``` ### 5. Quick API check ```bash curl http://127.0.0.1:8000/v1/chat/completions \ -H 'Content-Type: application/json' \ -d '{ "model": "Qwen3.5-27B-AWQ", "messages": [{"role": "user", "content": "只回答最终结果:2+2等于几?"}], "temperature": 0, "max_completion_tokens": 16, "chat_template_kwargs": {"enable_thinking": false} }' ``` ### 6. Container source build Container source build is still available through the upstream-style multi-stage [`docker/Dockerfile`](docker/Dockerfile), but it is not the recommended first path for public users. For this fork, the recommended public Docker path is still the released wheel image above. ## Source build Source build is still supported, but it is **not recommended** for public runtime deployment. Install the release wheels first unless you are changing CUDA/C++/Triton code. Only use it if: - you want to modify CUDA or Triton code - you want to rebuild your own wheel - you are doing development on this fork ### 1. Bundled `lmdeploy` source dependency This repository already includes the validated `lmdeploy` source tree needed for the SM70 AWQ build path. ```bash cd /path/to/vllm test -d lmdeploy ``` ### 2. Install build dependencies ```bash cd /path/to/vllm source /path/to/miniconda3/etc/profile.d/conda.sh conda activate 1Cat-vLLM-1.0.0 python -m pip install -r requirements/build.txt python -m pip install -r requirements/cuda.txt python -m pip install -r requirements/common.txt python -m pip install cmake build ``` ### 3. Build from source The current validated `1.0.0` source build uses `CUDA 12.8`, `SM70`, and `MAX_JOBS=12`. ```bash cd /path/to/vllm source /path/to/miniconda3/etc/profile.d/conda.sh conda activate 1Cat-vLLM-1.0.0 export CUDA_HOME=/usr/local/cuda-12.8 export PATH=$CUDA_HOME/bin:$PATH export LD_LIBRARY_PATH=$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-} export TORCH_CUDA_ARCH_LIST="7.0" export MAX_JOBS=12 export NVCC_THREADS=1 rm -rf build vllm.egg-info rm -rf .deps/*-build .deps/*-subbuild pushd flash-attention-v100 python -m build --wheel --no-isolation --outdir ../dist-cu128-sm70-1.0.0 popd export VLLM_VERSION_OVERRIDE="1.0.0" python -m build --wheel --no-isolation --outdir dist-cu128-sm70-1.0.0 ``` If you want an editable source install instead of a wheel build: ```bash python -m pip install -e . --no-build-isolation ``` ## Public runtime defaults for V100 32 GB reference systems These are the public `1.0.0` reference configs we recommend writing into deployment docs. | Host | Model | TP | `max_model_len` | `max_num_seqs` | `max_num_batched_tokens` | Use case | | --- | --- | ---: | ---: | ---: | ---: | --- | | 4-card `32 GB` V100 | `Qwen3.5-27B-AWQ` | 4 | `262144` | `1` | `16384` | stable public default | | 4-card `32 GB` V100 | `Qwen3.6-27B-AWQ` + MTP | 4 | `262144` | `4` | `8192` | MTP + prefix-cache API serving | | 2-card `32 GB` V100 | `Qwen3.6-27B-AWQ` + MTP | 2 | `262144` | `1` | `8192` | memory-constrained MTP serving | | 4-card `32 GB` V100 | `Qwen3.6-35B-A3B-AWQ` | 4 | `262144` | `1` | `8192` | stable public default for MoE | | 4-card `32 GB` V100 | `Qwen3.5-122B-A10B-AWQ` | 4 | `262144` | `1` | `8096` | long-context large-model default | Important wording: - `FLASH_ATTN_V100` is the recommended attention backend for V100 in `1.0.0`. - Public baseline launch commands in this README default to 256K context (`max_model_len=262144`). If you publish or compare a new baseline, add its exact launch command to this README. - Keep `max_num_seqs=1` for the baseline public commands until your workload has been profiled locally. The MTP + prefix-cache profile intentionally uses `max_num_seqs=4`. - On V100, Qwen3.6/Qwen3.5 AWQ checkpoints with bundled MTP layers automatically use the public MTP4 server profile unless the related flags are overridden. - On 2 x 32 GB V100, keep the 27B MTP profile at `max_num_seqs=1`. The TP4 MTP setting `max_num_seqs=4` does not fit in 64 GB at 256K context. - On 32 GB V100 with `FLASH_ATTN_V100`, the baseline API server default is also capped at `max_num_seqs=1` to avoid upstream's high-concurrency default preallocating unnecessary KV cache and sampler/CUDAGraph buffers. - Do not pass `--disable-custom-all-reduce` for the 27B TP4 decode baseline. - `122B` uses a small prefill chunk budget to leave room for SM70 MoE temporary workspace during long-context serving. - `VLLM_SM70_ENABLE_DENSE_F16_FASTPATH=1` is not recommended for the 35B/122B MoE public commands. ## Launch examples All commands below are written as full runnable commands. When using the prebuilt wheels, run them outside the source checkout, for example after `cd ~`, so Python loads the installed wheel package and its CUDA extensions. The commands assume the `1Cat-vLLM` wheel is already installed in your active Python environment. Use `CUDA_VISIBLE_DEVICES=0,1,2,3` only when you need to select a specific four-card V100 set. ### Qwen3.5-27B-AWQ, TP4, public 4-card default ```bash python -m vllm.entrypoints.openai.api_server \ --model /path/to/Qwen3.5-27B-AWQ \ --served-model-name Qwen3.5-27B-AWQ \ --attention-backend FLASH_ATTN_V100 \ --tensor-parallel-size 4 \ --gpu-memory-utilization 0.88 \ --max-model-len 262144 \ --max-num-seqs 1 \ --max-num-batched-tokens 16384 \ --host 0.0.0.0 \ --port 8000 ``` ### Qwen3.6-27B-AWQ, TP4, MTP + prefix cache, public 4-card default This is the recommended MTP serving profile for the `Qwen3.6-27B-AWQ` model on 4 x V100 32 GB. It keeps the public 256K context default, enables prefix cache for repeated prompts, and defaults to `num_speculative_tokens=4`. MTP8 can be faster on stable coding prompts, but MTP4 is the public default because it is more balanced on divergent prompts. ```bash python -m vllm.entrypoints.openai.api_server \ --model /path/to/Qwen3.6-27B-AWQ \ --served-model-name qwen3.6-27b-awq-mtp \ --trust-remote-code \ --tensor-parallel-size 4 \ --enable-auto-tool-choice \ --tool-call-parser qwen3_coder \ --host 0.0.0.0 \ --port 8000 ``` On V100, 1Cat-vLLM automatically applies the public Qwen3.6 MTP profile for this model family: - MTP4 speculative decoding. - 256K context from the model config. - `max_num_seqs=4` and `max_num_batched_tokens=8192` for TP4. - Prefix cache with `mamba_cache_mode=align`. - Text-only multimodal defaults to avoid unnecessary vision cache/profiling. - `gpu_memory_utilization=0.88`; KV auto-trim remains enabled unless you explicitly pass `--kv-cache-auto-trim-ratio 0`. - MTP4 CUDA graph capture sizes used by the validated TP4/TP2 profiles. If you need a non-default experiment, override the relevant flag explicitly. For example, use `--speculative-config '{"method":"mtp","num_speculative_tokens":8}'` to benchmark MTP8. Do not set `VLLM_SM70_ENABLE_DENSE_F16_FASTPATH=1` for this public MTP profile. That dense fast path is experimental and should be benchmarked separately from the stable serving command. If decode throughput is much lower than expected, check `/metrics` for the MTP acceptance length first. This profile should keep acceptance around 4 on the tested coding prompts; falling to about 1.5-2 usually means the prompt is too divergent for speculative decoding or a tuning flag was overridden. For speed-only experiments without prefix cache or tool calling, use `--max-num-seqs 1`, remove `--enable-prefix-caching`, `--enable-auto-tool-choice`, and `--tool-call-parser`, and benchmark `num_speculative_tokens` in `{2,4,6,8}` locally. ### Qwen3.6-27B-AWQ, TP2, MTP + prefix cache, 2-card 64 GB profile Use this profile for two 32 GB V100 cards. It keeps the 256K context limit, but uses `max_num_seqs=1` because the TP4 MTP concurrency setting does not fit on 64 GB. ```bash python -m vllm.entrypoints.openai.api_server \ --model /path/to/Qwen3.6-27B-AWQ \ --served-model-name qwen3.6-27b-awq-mtp-tp2 \ --trust-remote-code \ --tensor-parallel-size 2 \ --enable-auto-tool-choice \ --tool-call-parser qwen3_coder \ --host 0.0.0.0 \ --port 8000 ``` The same automatic profile is used, except TP2 defaults to `max_num_seqs=1` and `gpu_memory_utilization=0.849`. Do not copy the TP4 `max_num_seqs=4` value into this TP2 profile; it does not fit in 64 GB at 256K context. ### Qwen3.6-35B-A3B-AWQ, TP4, public 4-card default ```bash python -m vllm.entrypoints.openai.api_server \ --model /path/to/Qwen3.6-35B-A3B-AWQ \ --served-model-name Qwen3.6-35B-A3B-AWQ \ --attention-backend FLASH_ATTN_V100 \ --tensor-parallel-size 4 \ --gpu-memory-utilization 0.88 \ --max-model-len 262144 \ --max-num-seqs 1 \ --max-num-batched-tokens 8192 \ --host 0.0.0.0 \ --port 8000 ``` ### Qwen3.5-122B-A10B-AWQ, TP4, long-context 4-card default ```bash python -m vllm.entrypoints.openai.api_server \ --model /path/to/Qwen3.5-122B-A10B-AWQ \ --served-model-name Qwen3.5-122B-A10B-AWQ \ --attention-backend FLASH_ATTN_V100 \ --tensor-parallel-size 4 \ --gpu-memory-utilization 0.88 \ --max-model-len 262144 \ --max-num-seqs 1 \ --max-num-batched-tokens 8096 \ --host 0.0.0.0 \ --port 8000 ``` ## OpenAI-compatible request example ```bash curl http://127.0.0.1:8000/v1/chat/completions \ -H 'Content-Type: application/json' \ -H 'Authorization: Bearer EMPTY' \ -d '{ "model": "Qwen3.5-27B-AWQ", "messages": [{"role": "user", "content": "用一句话回答,2+2等于几?"}], "temperature": 0, "max_completion_tokens": 32, "chat_template_kwargs": {"enable_thinking": false} }' ``` If the first request returns `2+2 等于 4。`, the service is basically healthy. ## Optional experimental feature: FP8 KV cache This is not the default public recommendation, but it is worth documenting. - `fp8_e4m3` is not usable on V100 in the current Triton path - `fp8_e5m2` can be used experimentally - do **not** add `--calculate-kv-scales` Example: ```bash --kv-cache-dtype fp8_e5m2 ``` ## Known limits - This branch is optimized for **SM70 / Tesla V100**, not for all hardware. - Public launch commands default to 256K context with `max_model_len=262144`. - The public 27B command keeps `max_num_batched_tokens=16384`. - The public 35B and 122B commands use smaller prefill chunk budgets to leave room for MoE and long-context workspace. - Multimodal and vision workloads are not the default public profile for this release. - If you want guaranteed headroom for very long prompts, keep `--max-num-seqs 1` before increasing any other knob. ## Repository notes - The upstream project is **vLLM** - This fork focuses on **SM70 AWQ support and V100-oriented runtime tuning** - The public `1.0.0` README prioritizes: - prebuilt wheel installation - short model names in commands - `FLASH_ATTN_V100` as the recommended V100 attention backend - full runnable `python -m vllm.entrypoints.openai.api_server` commands ## Acknowledgements - [vLLM](https://github.com/vllm-project/vllm) - [lmdeploy / TurboMind](https://github.com/InternLM/lmdeploy) - [flash-attention-v100](https://github.com/ai-bond/flash-attention-v100) ## License This repository follows the upstream vLLM license model. See [LICENSE](LICENSE).