This example shows how to use Model Optimizer to calibrate and quantize the backbone part of diffusion models. The backbone typically accounts for more than 95% of the end-to-end Stable Diffusion latency.
We also provide instructions for deploying and running E2E Stable Diffusion pipelines with Model Optimizer quantized INT8 and FP8 UNets to generate images and measure latency on target GPUs. Note: Jetson devices are not currently supported.
You may choose to run this example with docker or by installing the required software yourself.
Follow the instructions in ../../README.md
to build the docker image and run the docker container.
We assume you have already installed NVIDIA TensorRT Model Optimizer (modelopt). Now install the required dependencies for Stable Diffusion:
pip install -r requirements.txt
You can run the following script to build the INT8 or FP8 UNet ONNX model with default settings for SDXL, and then go directly to the Build the TRT engine for the Quantized ONNX UNet section to run the E2E pipeline and generate images.
bash build_sdxl_8bit_engine.sh --format {FORMAT} # FORMAT can be int8 or fp8
If you prefer to customize parameters in calibration or run other models, please follow the instructions below.
We support calibration for both INT8 and FP8 precision and for both weights and activations.
Note: Model calibration requires relatively more GPU compute, and it does not need to run on the same GPUs as the deployment target GPUs. Running the command lines below executes both calibration and ONNX export.
python quantize.py \
--model {sdxl-1.0|sdxl-turbo|sd1.5|sd3-medium} \
--format int8 --batch-size 2 \
--calib-size 32 --collect-method min-mean \
--percentile 1.0 --alpha 0.8 \
--quant-level 3.0 --n-steps 20 \
--exp-name {EXP_NAME} --onnx-dir {ONNX_DIR}
python quantize.py \
--model {sdxl-1.0|sdxl-turbo|sd1.5} \
--format fp8 --batch-size 2 --calib-size 128 --quant-level 4.0 \
--n-steps 20 --exp-name {EXP_NAME} --collect-method default \
--onnx-dir {ONNX_DIR}
We recommend using a device with a minimum of 48GB of combined CPU and GPU memory for exporting ONNX models. Quant-level 4.0 requires additional memory.
percentile
: Controls the range over which the quantization scaling factors (amax) are collected, meaning that the chosen amax is collected over (n_steps * percentile) steps. Recommendation: 1.0
alpha
: A parameter in SmoothQuant, used for linear layers only. Recommendation: 0.8 for SDXL, 1.0 for SD 1.5
quant-level
: Which layers to quantize. 1: CNN only; 2: CNN + FFN; 2.5: CNN + FFN + QKV; 3: CNN + almost all Linear layers (including FFN, QKV, Proj, and others); 4: CNN + almost all Linear layers + fMHA. Recommendation: 2, 2.5, or 3, depending on the requirements for image quality and speedup; 4 is only for FP8. You might notice a slight difference between FP8 quant levels 3.0 and 4.0, as we are currently working to enhance the performance of FP8 fMHA.
calib-size
: For SDXL INT8, we recommend 32 or 64; for SDXL FP8, we recommend 128; for SD 1.5, set it to 512 or 1024.
n_steps
: Recommendation: SD/SDXL 20 or 30, SDXL-Turbo 4.
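For reference, the recommendations above combine into a command like this for INT8 calibration of SDXL-Turbo (a sketch; the experiment name and ONNX output directory are placeholders, and the SDXL values for alpha and calib-size are assumed to carry over to SDXL-Turbo):

python quantize.py \
    --model sdxl-turbo \
    --format int8 --batch-size 2 \
    --calib-size 32 --collect-method min-mean \
    --percentile 1.0 --alpha 0.8 \
    --quant-level 3.0 --n-steps 4 \
    --exp-name sdxl-turbo-int8 --onnx-dir ./onnx_sdxl_turbo_int8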
Then, we can load the generated checkpoint and export the INT8/FP8 quantized model in the next step.
For FP8, we only support TRT deployment on Ada/Hopper GPUs, with or without plugins.
We assume you already have a TensorRT environment set up. INT8 requires TensorRT version >= 9.2.0. If you prefer to use the FP8 TensorRT OOTB path instead of the plugin path, ensure you have TensorRT version 10.2.0 or higher. You can download the latest version of TensorRT here.
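To check which TensorRT version is installed in your environment, you can query the Python package (assuming the TensorRT Python bindings are installed):

python -c "import tensorrt; print(tensorrt.__version__)"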
Before you build the TRT FP8 engine, please run this command line:
python onnx_utils/sdxl_fp8_onnx_graphsurgeon.py --onnx-path {YOUR_FP8_ONNX/backbone.onnx} --output-onnx {NEW_ONNX_FILE_PATH}
If you prefer using FP8 with plugins, we support TRT deployment on Ada and Hopper GPUs. Please refer to SDXL_FP8_README.md
for more information.
Then generate the INT8/FP8 UNet engine:
trtexec --builderOptimizationLevel=4 --stronglyTyped --onnx=./backbone.onnx \
--minShapes=sample:2x4x128x128,timestep:1,encoder_hidden_states:2x77x2048,text_embeds:2x1280,time_ids:2x6 \
--optShapes=sample:16x4x128x128,timestep:1,encoder_hidden_states:16x77x2048,text_embeds:16x1280,time_ids:16x6 \
--maxShapes=sample:16x4x128x128,timestep:1,encoder_hidden_states:16x77x2048,text_embeds:16x1280,time_ids:16x6 \
--saveEngine=backbone.plan
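If you want a quick latency number for the backbone alone before running the full pipeline, the saved engine can also be benchmarked directly with trtexec (a sketch; the shapes below match the minShapes used above):

trtexec --loadEngine=backbone.plan \
    --shapes=sample:2x4x128x128,timestep:1,encoder_hidden_states:2x77x2048,text_embeds:2x1280,time_ids:2x6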
If you want to run the end-to-end SD/SDXL pipeline with a Model Optimizer quantized UNet to generate images and measure latency on target GPUs, here are the steps:
Clone a copy of the demo/Diffusion repo.
Follow the README from demoDiffusion to set up the pipeline, and run a baseline txt2img example (FP16):
# SDXL
python demo_txt2img_xl.py "enchanted winter forest, soft diffuse light on a snow-filled day, serene nature scene, the forest is illuminated by the snow" --negative-prompt "normal quality, low quality, worst quality, low res, blurry, nsfw, nude" --version xl-1.0 --scheduler Euler --denoising-steps 30 --seed 2946901
# Please refer to the examples provided in the demoDiffusion SD/SDXL pipeline.
Note: it will take some time to build the TRT engines the first time.
cp -r {YOUR_UNETXL}.plan ./engine/
Note: the engines must be built on the same GPU. Also ensure that the INT8 engine name matches the names of the FP16 engines so that it is picked up by the demoDiffusion pipeline.
[Example generated images: SDXL FP16 (left) vs. SDXL INT8 (right).]
For optimal performance of INT8/FP8 quantized models, we highly recommend fusing the LoRA weights prior to quantization. Failing to do so can disrupt TensorRT kernel fusion when integrating the LoRA layer with INT8/FP8 Quantize-Dequantize (QDQ) nodes, potentially leading to performance losses.
Start by fusing the LoRA weights in your model. This process can help ensure that the model is optimized for quantization. Detailed guidance on how to fuse LoRA weights can be found in the Hugging Face PEFT documentation:
After fusing the weights, proceed with calibration; you can follow our code to do the quantization.
import torch
from diffusers import DiffusionPipeline

import modelopt.torch.opt as mto
import modelopt.torch.quantization as mtq

pipe = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
variant="fp16",
use_safetensors=True,
).to("cuda")
pipe.load_lora_weights(
"CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy"
)
pipe.fuse_lora(lora_scale=0.9)
...
# All the LoRA layers should be fused
check_lora(pipe.unet)
mtq.quantize(pipe.unet, quant_config, forward_loop)
mto.save(pipe.unet, ...)
When it's time to export the model to ONNX format, ensure that you load the PEFT-modified LoRA model first.
pipe = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16,
variant="fp16",
use_safetensors=True,
)
pipe.load_lora_weights(
"CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy"
)
pipe.fuse_lora(lora_scale=0.9)
mto.restore(pipe.unet, your_quantized_ckpt)
...
# Export the onnx model
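The actual export logic lives in quantize.py and onnx_utils in this example; purely for illustration, a hand-rolled export might look like the minimal sketch below, using dummy SDXL inputs whose shapes mirror the minShapes in the trtexec command above (the tensor names, opset, and output name are illustrative assumptions, not the exact settings this example uses):

import torch

device, dtype = "cuda", torch.float16
unet = pipe.unet.to(device)

# Dummy SDXL UNet inputs; shapes mirror the minShapes in the trtexec command above.
sample = torch.randn(2, 4, 128, 128, device=device, dtype=dtype)
timestep = torch.ones(1, device=device, dtype=dtype)
encoder_hidden_states = torch.randn(2, 77, 2048, device=device, dtype=dtype)
added_cond_kwargs = {
    "text_embeds": torch.randn(2, 1280, device=device, dtype=dtype),
    "time_ids": torch.randn(2, 6, device=device, dtype=dtype),
}

torch.onnx.export(
    unet,
    (sample, timestep, encoder_hidden_states, {"added_cond_kwargs": added_cond_kwargs, "return_dict": False}),
    "backbone.onnx",
    input_names=["sample", "timestep", "encoder_hidden_states", "text_embeds", "time_ids"],
    output_names=["latent"],
    opset_version=17,
)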
By following these steps, your PEFT LoRA model should be efficiently quantized using ModelOpt and ready for deployment with maximum performance.
Stable Diffusion pipelines rely heavily on random sampling operations, which include creating Gaussian noise tensors for denoising and adding noise in the scheduling step. In the quantization recipe, we don't fix the random seed. As a result, every time you run the calibration pipeline, you could get different quantizer amax values. This may lead to the generated images being different from the ones generated with the original model. We suggest running the calibration a few more times and choosing the best result.
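If you prefer reproducible calibration runs instead, one option is to fix the random seeds yourself before running calibration (a sketch; seeding is not part of the provided recipe, and the seed value is arbitrary):

import random

import numpy as np
import torch

# Fix all relevant RNGs so repeated calibration runs collect the same amax values.
seed = 1234  # arbitrary choice
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)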