# hqq
**Repository Path**: mirrors_dropbox/hqq
## Basic Information
- **Project Name**: hqq
- **Description**: Official implementation of Half-Quadratic Quantization (HQQ)
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-10-24
- **Last Updated**: 2026-02-14
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
## Half-Quadratic Quantization (HQQ)
This repository contains the official implementation of Half-Quadratic Quantization (HQQ) presented in our articles:
* HQQ: https://dropbox.github.io/hqq_blog/
* HQQ+: https://dropbox.github.io/1bit_blog/
### What is HQQ?
HQQ is a fast and accurate model quantizer that skips the need for calibration data. Even the largest models can be quantized in just a few minutes at most 🚀.
#### FAQ
**Why should I use HQQ instead of other quantization methods?**
- HQQ is very fast to quantize models.
- It supports 8, 4, 3, 2, and 1-bit quantization.
- You can use it on any model (LLMs, vision models, etc.).
- The dequantization step is a linear operation, which means HQQ is compatible with various optimized CUDA/Triton kernels.
- HQQ is compatible with PEFT training.
- We try to make HQQ fully compatible with `torch.compile` for faster inference and training.
**What is the quality of the quantized models?**
We have detailed benchmarks on both language and vision models. Please refer to our blog posts: HQQ, HQQ+.
**What is the speed of the quantized models?**
4-bit models with `axis=1` can use optimized fused inference kernels. Moreover, we focus on making hqq fully compatible with `torch.compile`, which speeds up both training and inference. For more details, please refer to the backend section below.
**What quantization settings should I use?**
You should start with `nbits=4, group_size=64, axis=1`. These settings offer a good balance between quality, VRAM usage, and speed. If you want better results with the same VRAM usage, switch to `axis=0` and use the ATEN backend; note that this setting is not supported for fast inference.
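As a minimal sketch of these two starting points (parameter names taken from the settings above; adjust to your model):
```Python
from hqq.core.quantize import BaseQuantizeConfig

#Recommended default: good balance of quality, VRAM usage, and speed; works with the fast inference kernels
quant_config_fast = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)

#Better quality at the same VRAM usage, but requires the ATEN backend and is not supported for fast inference
quant_config_quality = BaseQuantizeConfig(nbits=4, group_size=64, axis=0)
```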
**What does the `axis` parameter mean?**
The `axis` parameter is the axis along which grouping is performed. In general `axis=0` gives better results than `axis=1`, especially at lower bits. However, the optimized inference runtime only supports `axis=1` for the moment.
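To make the grouping concrete, here is a rough, hedged illustration of how a weight matrix is split into groups of size `group_size` along each axis (an illustration only, not hqq's internal code):
```Python
import torch

W = torch.randn(128, 256)   #toy weight matrix [out_features, in_features]
group_size = 64

#axis=1: groups are contiguous chunks of `group_size` values along the last dimension
W_axis1 = W.reshape(-1, group_size)                  #[512, 64], one group per row
scale_1 = W_axis1.abs().amax(dim=1, keepdim=True)    #one scale per group (illustrative)

#axis=0: the grouping (and the per-group statistics) run along the first dimension instead
W_axis0 = W.reshape(group_size, -1)                  #[64, 512], one group per column
scale_0 = W_axis0.abs().amax(dim=0, keepdim=True)    #one scale per group (illustrative)
```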
**What is the difference between HQQ and HQQ+?**
HQQ+ is HQQ with trainable low-rank adapters to improve the quantization quality at lower bits.
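Conceptually, an HQQ+ layer keeps the quantized weights frozen and adds a trainable LoRA-style low-rank correction on top; a minimal sketch with illustrative names (not the library's actual implementation):
```Python
import torch
import torch.nn as nn

class HQQPlusSketch(nn.Module):
    """Frozen HQQ-quantized linear layer + trainable low-rank adapter (illustrative only)."""
    def __init__(self, hqq_layer, in_features, out_features, r=32, alpha=64):
        super().__init__()
        self.hqq_layer = hqq_layer                      #frozen quantized layer
        self.lora_A = nn.Linear(in_features, r, bias=False)
        self.lora_B = nn.Linear(r, out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)              #correction starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        #quantized path (frozen) + trainable low-rank correction
        return self.hqq_layer(x) + self.lora_B(self.lora_A(x)) * self.scaling
```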
### Installation
First, make sure you have a PyTorch 2 version that matches your CUDA version: https://pytorch.org/
You can install hqq via
```
#Latest stable version
pip install hqq;
#Latest updates - recommended
pip install git+https://github.com/dropbox/hqq.git;
#Disable building the CUDA kernels for the aten backend
DISABLE_CUDA=1 pip install ...
```
Alternatively, clone the repo and run ```pip install .``` from the repository root.
### Basic Usage
To perform quantization with HQQ, you simply need to replace the linear layers (```torch.nn.Linear```) as follows:
```Python
import torch
from hqq.core.quantize import *

#Quantization settings
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

#Replace your linear layer (your_linear_layer is an existing torch.nn.Linear instance)
hqq_layer = HQQLinear(your_linear_layer,              #torch.nn.Linear or None
                      quant_config=quant_config,      #quantization configuration
                      compute_dtype=torch.float16,    #compute dtype
                      device='cuda',                  #cuda device
                      initialize=True,                #Use False to quantize later
                      del_orig=True)                  #if True, delete the original layer

W_r = hqq_layer.dequantize()              #dequantized weights
W_q = hqq_layer.unpack(dtype=torch.uint8) #unpacked quantized weights
y   = hqq_layer(x)                        #forward pass (x: input activations)
```
The quantization parameters are set as follows:
- ```nbits``` (int): supports 8, 4, 3, 2, and 1 bits.
- ```group_size``` (int): no restrictions as long as ```weight.numel()``` is divisible by the ```group_size``` (see the sketch below).
- ```view_as_float``` (bool): if True, the quantized parameter is viewed as a float instead of an int type.
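A minimal sketch that puts these parameters together, assuming `view_as_float` is passed through `BaseQuantizeConfig` like the other settings (the linear layer below is a toy stand-in):
```Python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

layer = torch.nn.Linear(4096, 4096, bias=False)   #toy layer: weight.numel() = 16,777,216
group_size = 64
assert layer.weight.numel() % group_size == 0     #group_size must divide weight.numel()

quant_config = BaseQuantizeConfig(nbits=3, group_size=group_size, view_as_float=False)
hqq_layer    = HQQLinear(layer, quant_config=quant_config, compute_dtype=torch.float16, device='cuda')
```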
### Usage with Models
#### Transformers 🤗
For usage with HF's transformers, see the example below from the documentation:
```Python
import torch
from transformers import AutoModelForCausalLM, HqqConfig

# All linear layers will use the same quantization config
quant_config = HqqConfig(nbits=4, group_size=64)

# Load and quantize (model_id is the Hub id or local path of the model)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,
)
```
You can save/load quantized models as regular transformers models via `save_pretrained` / `from_pretrained`.
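For example (a minimal sketch using the standard transformers API; `save_dir` is a placeholder path):
```Python
#Save the quantized model like any other transformers model
model.save_pretrained(save_dir)

#Reload later: the saved quantization config is picked up automatically
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(save_dir, device_map="cuda")
```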
#### HQQ Lib
You can also utilize the HQQ library to quantize transformers models:
```Python
import torch
from transformers import AutoModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel

compute_dtype = torch.float16  #example values; adjust to your setup
device        = 'cuda'

#Load the model on CPU (model_id: Hub id or local path)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype)

#Quantize
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=compute_dtype, device=device)
```
You can save/load quantized models as follows:
```Python
from hqq.models.hf.base import AutoHQQHFModel
#Save: Make sure to save the model BEFORE any patching
AutoHQQHFModel.save_quantized(model, save_dir)
#Save as safetensors (to be loaded via transformers or vLLM)
AutoHQQHFModel.save_to_safetensors(model, save_dir)
#Load
model = AutoHQQHFModel.from_quantized(save_dir)
```
❗ Note that models saved via the hqq lib are not compatible with transformers' `.from_pretrained()`.
### Backends
#### Native Backends
The following native dequantization backends can be used by the `HQQLinear` module:
```Python
from hqq.core.quantize import HQQLinear, HQQBackend

HQQLinear.set_backend(HQQBackend.PYTORCH)         #PyTorch backend - Default
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE) #Compiled PyTorch
HQQLinear.set_backend(HQQBackend.ATEN)            #ATEN/CUDA backend - only axis=0 supported
```
❗ Note that ```HQQBackend.ATEN``` only supports `axis=0`.
#### Optimized Inference
We support external backends for faster inference with fused kernels. You can enable one of these backends after the model has been quantized, as follows:
```Python
from hqq.utils.patching import prepare_for_inference
#Pytorch backend that makes the model compatible with fullgraph torch.compile: works with any settings
#prepare_for_inference(model)
#Gemlite backend: nbits=4/2/1, compute_dtype=float16, axis=1
prepare_for_inference(model, backend="gemlite")
#Torchao's tiny_gemm backend (fast for batch-size<4): nbits=4, compute_dtype=bfloat16, axis=1
#prepare_for_inference(model, backend="torchao_int4")
```
Note that these backends only work with `axis=1`. Additional restrictions on the group size apply depending on the backend. You should expect ~158 tokens/sec with a 4-bit Llama3-8B model on an RTX 4090.
When a quantization config is not supported by the specified inference backend, hqq will fall back to the native backend.
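To squeeze out additional speed you can also compile the forward pass of the quantized, patched model; a hedged sketch using standard `torch.compile` (not an hqq-specific API):
```Python
import torch

#Compile the forward pass of the quantized, patched model
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

#The first call triggers compilation; later calls run the optimized graph
with torch.no_grad():
    _ = model(torch.randint(0, 100, (1, 32), device="cuda"))
```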
### Custom Quantization Configurations ⚙️
You can set up various quantization configurations for different layers by specifying the settings for each layer name:
#### Transformers 🤗
```Python
from transformers import HqqConfig

# Each linear layer with the same tag will use a dedicated quantization config
q4_config = {'nbits':4, 'group_size':64}
q3_config = {'nbits':3, 'group_size':32}
quant_config = HqqConfig(dynamic_config={
'self_attn.q_proj':q4_config,
'self_attn.k_proj':q4_config,
'self_attn.v_proj':q4_config,
'self_attn.o_proj':q4_config,
'mlp.gate_proj':q3_config,
'mlp.up_proj' :q3_config,
'mlp.down_proj':q3_config,
})
```
#### HQQ lib
```Python
from hqq.core.quantize import *
q4_config = BaseQuantizeConfig(nbits=4, group_size=64)
q3_config = BaseQuantizeConfig(nbits=3, group_size=32)
quant_config = {'self_attn.q_proj':q4_config,
'self_attn.k_proj':q4_config,
'self_attn.v_proj':q4_config,
'self_attn.o_proj':q4_config,
'mlp.gate_proj':q3_config,
'mlp.up_proj' :q3_config,
'mlp.down_proj':q3_config,
}
```
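The per-layer dictionary is then passed wherever a single config would go; a minimal sketch, assuming `quantize_model` accepts it like the single-config example above:
```Python
import torch
from hqq.models.hf.base import AutoHQQHFModel

AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=torch.float16, device='cuda')
```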
### vLLM
You can use HQQ with vLLM. Make sure to install GemLite before using this backend.
```Python
import torch
from vllm import LLM
from hqq.utils.vllm import set_vllm_onthefly_hqq_quant

#Quantize on-the-fly
skip_modules = ['lm_head', 'visual', 'vision']

#Select one of the following modes:
#INT/FP format
set_vllm_onthefly_hqq_quant(weight_bits=8, group_size=None, quant_mode='int8_weightonly', skip_modules=skip_modules) #A16W8 - INT8 weight only
set_vllm_onthefly_hqq_quant(weight_bits=4, group_size=128, quant_mode='int4_weightonly', skip_modules=skip_modules)  #A16W4 - HQQ weight only
set_vllm_onthefly_hqq_quant(weight_bits=8, quant_mode='int8_dynamic', skip_modules=skip_modules) #A8W8 - INT8 x INT8 dynamic
set_vllm_onthefly_hqq_quant(weight_bits=8, quant_mode='fp8_dynamic', skip_modules=skip_modules)  #A8W8 - FP8 x FP8 dynamic

#MXFP format
set_vllm_onthefly_hqq_quant(weight_bits=8, group_size=None, quant_mode='mxfp8_dynamic', skip_modules=skip_modules) #A8W8 - MXFP8 x MXFP8 - post_scale=True
set_vllm_onthefly_hqq_quant(weight_bits=8, group_size=32, quant_mode='mxfp8_dynamic', skip_modules=skip_modules)   #A8W8 - MXFP8 x MXFP8 - post_scale=False
set_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp4_weightonly', skip_modules=skip_modules) #A16W4 - MXFP4 weight-only
set_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp8_dynamic', skip_modules=skip_modules)    #A8W4 - MXFP8 x MXFP4 dynamic
set_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp4_dynamic', skip_modules=skip_modules)    #A4W4 - MXFP4 x MXFP4 dynamic
set_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='nvfp4_dynamic', skip_modules=skip_modules)    #A4W4 - NVFP4 x NVFP4 dynamic

llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct", max_model_len=4096, gpu_memory_utilization=0.80, dtype=torch.float16)
```
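Once the `LLM` object is built, generation works as usual through vLLM's standard API:
```Python
from vllm import SamplingParams

sampling_params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain half-quadratic quantization in one sentence."], sampling_params)
print(outputs[0].outputs[0].text)
```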
### Peft Training
PEFT training is directly supported in Hugging Face's peft library. If you still want to use hqq-lib's peft utilities, here's how:
```Python
import torch
from hqq.core.quantize import HQQLinear, HQQBackend
from hqq.core.peft import PeftUtils

#First, quantize/load a quantized HQQ model, then attach the adapters:
base_lora_params = {'lora_type':'default', 'r':32, 'lora_alpha':64, 'dropout':0.05, 'train_dtype':torch.float32}
lora_params      = {'self_attn.q_proj': base_lora_params,
                    'self_attn.k_proj': base_lora_params,
                    'self_attn.v_proj': base_lora_params,
                    'self_attn.o_proj': base_lora_params,
                    'mlp.gate_proj'   : None,
                    'mlp.up_proj'     : None,
                    'mlp.down_proj'   : None}

#Add LoRA to linear/HQQ modules
PeftUtils.add_lora(model, lora_params)

#Optional: set your backend (axis refers to the axis used in your quantization config)
HQQLinear.set_backend(HQQBackend.ATEN if axis==0 else HQQBackend.PYTORCH_COMPILE)

#Train ....

#Convert LoRA weights to the same model dtype for faster inference
model.eval()
PeftUtils.cast_lora_weights(model, dtype=compute_dtype)

#Save LoRA weights
PeftUtils.save_lora_weights(model, filename)

#Load LoRA weights: automatically calls add_lora
PeftUtils.load_lora_weights(model, filename)
```
We provide a complete example to train a model with HQQ/LoRA that you can find in ```examples/hqq_plus.py```.
If you want to use multi-GPU training via FSDP, check out this awesome repo by Answer.AI: https://github.com/AnswerDotAI/fsdp_qlora
### Examples
We provide a variety of examples demonstrating model quantization across different backends within the ```examples``` directory.
### Citation 📜
```
@misc{badri2023hqq,
  title  = {Half-Quadratic Quantization of Large Machine Learning Models},
  url    = {https://dropbox.github.io/hqq_blog/},
  author = {Hicham Badri and Appu Shaji},
  month  = {November},
  year   = {2023}
}
```