Stable Diffusion XL (Base/Turbo) and Stable Diffusion 1.5 Quantization with Model Optimizer

This example shows how to use Model Optimizer to calibrate and quantize the backbone part of diffusion models. The backbone part typically consumes >95% of the e2e Stable Diffusion latency.

We also provide instructions for deploying and running E2E Stable Diffusion pipelines with Model Optimizer quantized INT8 and FP8 UNets to generate images and measure latency on target GPUs. Note that Jetson devices are not supported so far.

Get Started

You can run this example with Docker or install the required software yourself.

Docker

Follow the instructions in ../../README.md to build the docker image and run the docker container.

By yourself

We assume you have already installed NVIDIA TensorRT Model Optimizer (modelopt). Now install the required dependencies for Stable Diffusion:

pip install -r requirements.txt

8-bit ONNX Export Quick Start

You can run the following script to build the INT8 or FP8 UNet ONNX model with default settings for SDXL, and then go directly to the Build the TRT engine for the Quantized ONNX UNet section to run the E2E pipeline and generate images.

bash build_sdxl_8bit_engine.sh --format {FORMAT} # FORMAT can be int8 or fp8

If you prefer to customize parameters in calibration or run other models, please follow the instructions below.

Calibration with Model Optimizer

We support calibration for both INT8 and FP8 precision and for both weights and activations.

Note: Model calibration requires relatively more GPU compute, and it does not need to run on the same GPUs as the deployment target GPUs. The command lines below execute both calibration and ONNX export.

SD3-Medium|SDXL|SD1.5|SDXL-Turbo INT8

python quantize.py \
  --model {sdxl-1.0|sdxl-turbo|sd1.5|sd3-medium} \
  --format int8 --batch-size 2 \
  --calib-size 32 --collect-method min-mean \
  --percentile 1.0 --alpha 0.8 \
  --quant-level 3.0 --n-steps 20 \
  --exp-name {EXP_NAME} --onnx-dir {ONNX_DIR}

SDXL|SD1.5|SDXL-Turbo FP8

python quantize.py \
  --model {sdxl-1.0|sdxl-turbo|sd1.5} \
  --format fp8 --batch-size 2 --calib-size 128 --quant-level 4.0 \
  --n-steps 20 --exp-name {EXP_NAME} --collect-method default \
  --onnx-dir {ONNX_DIR}

We recommend using a device with a minimum of 48GB of combined CPU and GPU memory for exporting ONNX models. Quant-level 4.0 requires additional memory.

Important Parameters

  • percentile: Controls the collection range for the quantization scaling factors (amax): the chosen amax is collected over (n_steps * percentile) steps. Recommendation: 1.0

  • alpha: A parameter in SmoothQuant, used for linear layers only. Recommendation: 0.8 for SDXL, 1.0 for SD 1.5

  • quant-level: Which layers to quantize. 1: CNNs; 2: CNN + FFN; 2.5: CNN + FFN + QKV; 3: CNN + almost all Linear layers (including FFN, QKV, Proj, and others); 4: CNN + almost all Linear + fMHA. Recommendation: 2, 2.5, or 3; level 4 is only for FP8. The choice depends on your requirements for image quality & speedup. You might notice a slight difference between FP8 quant levels 3.0 and 4.0, as we are currently working to enhance the performance of FP8 fMHA.

  • calib-size: For SDXL INT8, we recommend 32 or 64; for SDXL FP8, 128 is recommended. For SD 1.5, set it to 512 or 1024.

  • n_steps: Recommendation: SD/SDXL 20 or 30, SDXL-Turbo 4.

Then, we can load the generated checkpoint and export the INT8/FP8 quantized model in the next step.
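
As an illustration, here is a minimal sketch of loading a calibrated checkpoint back onto a freshly loaded pipeline. It assumes the checkpoint was saved with mto.save (as in the LoRA example below); the checkpoint filename is a placeholder.

import torch
from diffusers import DiffusionPipeline

import modelopt.torch.opt as mto

# Reload the original FP16 pipeline, then restore the calibrated quantizer
# state onto its UNet. The checkpoint path is hypothetical.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")
mto.restore(pipe.unet, "./unet.quant.pt")  # placeholder checkpoint filename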

For FP8, we only support TRT deployment on Ada/Hopper GPUs, with or without plugins.

Build the TRT engine for the Quantized ONNX UNet

We assume you already have a TensorRT environment set up. INT8 requires TensorRT version >= 9.2.0. If you prefer to use the FP8 TensorRT OOTB path instead of the plugin path, ensure you have TensorRT version 10.2.0 or higher. You can download the latest version of TensorRT here.

Before you build the TRT FP8 engine, please run this command line:

python onnx_utils/sdxl_fp8_onnx_graphsurgeon.py --onnx-path {YOUR_FP8_ONNX/backbone.onnx} --output-onnx {NEW_ONNX_FILE_PATH}

If you prefer using FP8 with plugins, we support TRT deployment on Ada and Hopper GPUs. Please refer to SDXL_FP8_README.md for more information.

Then generate the INT8/FP8 UNet engine:

trtexec --builderOptimizationLevel=4 --stronglyTyped --onnx=./backbone.onnx \
  --minShapes=sample:2x4x128x128,timestep:1,encoder_hidden_states:2x77x2048,text_embeds:2x1280,time_ids:2x6 \
  --optShapes=sample:16x4x128x128,timestep:1,encoder_hidden_states:16x77x2048,text_embeds:16x1280,time_ids:16x6 \
  --maxShapes=sample:16x4x128x128,timestep:1,encoder_hidden_states:16x77x2048,text_embeds:16x1280,time_ids:16x6 \
  --saveEngine=backbone.plan

Run End-to-end Stable Diffusion Pipeline with Model Optimizer Quantized ONNX Model and demoDiffusion

If you want to run the end-to-end SD/SDXL pipeline with the Model Optimizer quantized UNet to generate images and measure latency on target GPUs, here are the steps:

  • Clone a copy of the demo/Diffusion repo.

  • Follow the README from demoDiffusion to set up the pipeline, and run a baseline txt2img example (fp16):

# SDXL
python demo_txt2img_xl.py "enchanted winter forest, soft diffuse light on a snow-filled day, serene nature scene, the forest is illuminated by the snow" --negative-prompt "normal quality, low quality, worst quality, low res, blurry, nsfw, nude" --version xl-1.0 --scheduler Euler --denoising-steps 30 --seed 2946901
# Please refer to the examples provided in the demoDiffusion SD/SDXL pipeline.

Note that it will take some time to build the TRT engines the first time.

  • Replace the corresponding FP16 backbone engine in the demoDiffusion engine directory with the quantized engine you built above, e.g.:

cp -r {YOUR_UNETXL}.plan ./engine/

Note that the engines must be built on the same GPU, and make sure the INT8 engine name matches the FP16 engine name so that it works with the demoDiffusion pipeline.

  • Run the txt2img example command above again. You can compare the generated images and latency for FP16 vs. INT8. Similarly, you can run the end-to-end SD1.5 or SDXL-Turbo pipeline with a Model Optimizer quantized UNet using the corresponding examples in demoDiffusion.

Demo Images

[Side-by-side demo images: SDXL FP16 vs. SDXL INT8]

LoRA

For optimal performance of INT8/FP8 quantized models, we highly recommend fusing the LoRA weights prior to quantization. Failing to do so can disrupt TensorRT kernel fusion when integrating the LoRA layer with INT8/FP8 Quantize-Dequantize (QDQ) nodes, potentially leading to performance losses.

Recommended Workflow:

Start by fusing the LoRA weights in your model. This helps ensure that the model is optimized for quantization. Detailed guidance on how to fuse LoRA weights can be found in the Hugging Face PEFT documentation.

After fusing the weights, proceed with calibration; you can follow our code below to do the quantization.

import torch
from diffusers import DiffusionPipeline

import modelopt.torch.opt as mto
import modelopt.torch.quantization as mtq

# Load the FP16 SDXL pipeline and fuse the LoRA weights before calibration.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")
pipe.load_lora_weights(
    "CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy"
)
pipe.fuse_lora(lora_scale=0.9)
...
# All the LoRA layers should be fused
check_lora(pipe.unet)

# Calibrate and quantize the fused UNet, then save the quantized state.
# quant_config and forward_loop follow the calibration setup in quantize.py
# (see the sketch below).
mtq.quantize(pipe.unet, quant_config, forward_loop)
mto.save(pipe.unet, ...)
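
The quant_config and forward_loop above are set up by this example's calibration code (quantize.py). As a minimal sketch only, assuming modelopt's stock INT8_DEFAULT_CFG instead of the tuned settings in quantize.py and a hypothetical list of calibration prompts, they could look like:

import modelopt.torch.quantization as mtq

# Sketch only: fall back to modelopt's stock INT8 config for illustration;
# quantize.py builds a tuned config (percentile, alpha, quant level) instead.
quant_config = mtq.INT8_DEFAULT_CFG

# Hypothetical calibration prompts; the example scripts use a captions file.
calib_prompts = ["a photo of an astronaut riding a horse on mars"]

def forward_loop(unet):
    # mtq.quantize calls this with the module being quantized. Running the
    # full pipeline drives real activations through the UNet so the
    # quantizers can collect their amax statistics.
    for prompt in calib_prompts:
        pipe(prompt, num_inference_steps=20)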

When it's time to export the model to ONNX format, ensure that you load the PEFT-modified LoRA model first.

import torch
from diffusers import DiffusionPipeline

import modelopt.torch.opt as mto

# Reload the pipeline and fuse the same LoRA weights, then restore the
# quantized state before exporting to ONNX.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
pipe.load_lora_weights(
    "CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy"
)
pipe.fuse_lora(lora_scale=0.9)
mto.restore(pipe.unet, your_quantized_ckpt)
...
# Export the onnx model

By following these steps, your PEFT LoRA model can be efficiently quantized with ModelOpt and is ready for deployment while maximizing performance.

Notes About Randomness

Stable Diffusion pipelines rely heavily on random sampling operations, which include creating Gaussian noise tensors for denoising and adding noise in the scheduling step. In the quantization recipe, we don't fix the random seed. As a result, every time you run the calibration pipeline, you could get different quantizer amax values. This may lead to generated images that differ from the ones generated with the original model. We suggest running calibration a few more times and choosing the best result.
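
If you need repeatable calibration results, a minimal sketch of pinning the relevant seeds before running the pipeline (not part of our default recipe; pipe is the pipeline from the snippets above) is:

import random

import numpy as np
import torch

# Sketch only: fix the RNGs that drive noise sampling so repeated calibration
# runs see the same activations and therefore collect the same amax values.
seed = 2946901  # any fixed value
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

# Passing an explicit generator also makes the initial latents deterministic.
generator = torch.Generator(device="cuda").manual_seed(seed)
prompt = "enchanted winter forest, soft diffuse light on a snow-filled day"
images = pipe(prompt, num_inference_steps=20, generator=generator).images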
