# AutoGPTQ
🚨 AutoGPTQ development has stopped. Please switch to GPTQModel as a drop-in replacement. 🚨
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm (weight-only quantization).
## News or Update

- 2024-02-15 - (News) - AutoGPTQ 0.7.0 is released, with [Marlin](https://github.com/IST-DASLab/marlin) int4*fp16 matrix multiplication kernel support, enabled with the argument `use_marlin=True` when loading models.
- 2023-08-23 - (News) - 🤗 Transformers, optimum and peft have integrated `auto-gptq`, so running and training GPTQ models is now more accessible to everyone! See [this blog](https://huggingface.co/blog/gptq-integration) and its resources for more details!

*For the full history, please turn to [here](docs/NEWS_OR_UPDATE.md)*

## Performance Comparison

### Inference Speed

> The results were generated using [this script](examples/benchmark/generation_speed.py) with an input batch size of 1, beam-search decoding, and the model forced to generate 512 tokens; the speed metric is tokens/s (the larger, the better).
>
> The quantized model is loaded using the setup that yields the fastest inference speed.

| model         | GPU           | num_beams | fp16  | gptq-int4 |
|---------------|---------------|-----------|-------|-----------|
| llama-7b      | 1xA100-40G    | 1         | 18.87 | 25.53     |
| llama-7b      | 1xA100-40G    | 4         | 68.79 | 91.30     |
| moss-moon 16b | 1xA100-40G    | 1         | 12.48 | 15.25     |
| moss-moon 16b | 1xA100-40G    | 4         | OOM   | 42.67     |
| moss-moon 16b | 2xA100-40G    | 1         | 06.83 | 06.78     |
| moss-moon 16b | 2xA100-40G    | 4         | 13.10 | 10.80     |
| gpt-j 6b      | 1xRTX3060-12G | 1         | OOM   | 29.55     |
| gpt-j 6b      | 1xRTX3060-12G | 4         | OOM   | 47.36     |

### Perplexity

For perplexity comparison, you can turn to [here](https://github.com/qwopqwop200/GPTQ-for-LLaMa#result) and [here](https://github.com/qwopqwop200/GPTQ-for-LLaMa#gptq-vs-bitsandbytes).

## Installation

AutoGPTQ is available on Linux and Windows only. You can install the latest stable release of AutoGPTQ from pip with pre-built wheels:

| Platform version | Installation | Built against PyTorch |
|------------------|--------------|-----------------------|
| CUDA 11.8 | `pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/` | 2.2.1+cu118 |
| CUDA 12.1 | `pip install auto-gptq --no-build-isolation` | 2.2.1+cu121 |
| ROCm 5.7 | `pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/` | 2.2.1+rocm5.7 |

AutoGPTQ can be installed with the Triton dependency with `pip install auto-gptq[triton] --no-build-isolation` in order to be able to use the Triton backend (currently Linux only, no 3-bit quantization).

For older AutoGPTQ versions, please refer to [the previous releases installation table](docs/INSTALLATION.md).

On NVIDIA systems, AutoGPTQ does not support [Maxwell or lower](https://qiita.com/uyuni/items/733a93b975b524f89f46) GPUs.

### Install from source

Clone the source code:

```bash
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
```

A few packages are required in order to build from source: `pip install numpy gekko pandas`.

Then, install locally from source:

```bash
pip install -vvv --no-build-isolation -e .
```

You can set `BUILD_CUDA_EXT=0` to disable building the PyTorch extension, but this is **strongly discouraged** as AutoGPTQ then falls back on a slow Python implementation.

As a last resort, if the above command fails, you can try `python setup.py install`.
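Once the package is installed (from a wheel or from source), the Marlin kernel mentioned in the News section is selected at load time rather than at install time. The snippet below is a minimal, illustrative sketch, not part of the official instructions: it loads an already-quantized 4-bit checkpoint with `use_marlin=True`, where `opt-125m-4bit` stands in for the directory produced by the Quick Tour further down.

```python
from auto_gptq import AutoGPTQForCausalLM

# "opt-125m-4bit" is a placeholder: the directory produced by the Quick Tour's quantization example.
# use_marlin=True asks AutoGPTQ >= 0.7.0 to run the int4*fp16 matrix multiplications with the Marlin kernel.
model = AutoGPTQForCausalLM.from_quantized("opt-125m-4bit", device="cuda:0", use_marlin=True)
```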
#### On ROCm systems

To install from source for AMD GPUs supporting ROCm, please specify the `ROCM_VERSION` environment variable. Example:

```bash
ROCM_VERSION=5.6 pip install -vvv --no-build-isolation -e .
```

The compilation can be sped up by specifying the `PYTORCH_ROCM_ARCH` variable ([reference](https://github.com/pytorch/pytorch/blob/7b73b1e8a73a1777ebe8d2cd4487eb13da55b3ba/setup.py#L132)) in order to build for a single target device, for example `gfx90a` for MI200 series devices.

For ROCm systems, the packages `rocsparse-dev`, `hipsparse-dev`, `rocthrust-dev`, `rocblas-dev` and `hipblas-dev` are required to build.

#### On Intel® Gaudi® 2 systems

> Notice: make sure you are on commit 65c2e15 or later.

To install from source for Intel Gaudi 2 HPUs, set the `BUILD_CUDA_EXT=0` environment variable to disable building the CUDA PyTorch extension. Example:

```bash
BUILD_CUDA_EXT=0 pip install -vvv --no-build-isolation -e .
```

> Note that Intel Gaudi 2 uses an optimized kernel for inference, and requires `BUILD_CUDA_EXT=0` on non-CUDA machines.

## Quick Tour

### Quantization and Inference

> Warning: this is just a showcase of the basic APIs in AutoGPTQ, which uses only one sample to quantize a very small model; the quality of a model quantized with so few samples may not be good.

Below is an example of the simplest use of `auto_gptq` to quantize a model and run inference after quantization:

```python
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # set to False to significantly speed up inference, at the cost of slightly worse perplexity
)

# load un-quantized model; by default, the model is always loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize model; the examples should be a list of dicts whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)

# save quantized model
model.save_quantized(quantized_model_dir)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

# push quantized model to Hugging Face Hub.
# to use use_auth_token=True, login first via huggingface-cli login.
# or pass an explicit token with: use_auth_token="hf_xxxxxxx"
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)

# alternatively you can save and push at the same time
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)

# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

# download quantized model from Hugging Face Hub and load to the first GPU
# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

# or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])
```

For more advanced features of model quantization, please refer to [this script](examples/quantization/quant_with_alpaca.py).

### Customize Model
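Below is a minimal, hedged sketch of what a custom model definition typically looks like: subclassing the base GPTQ causal-LM class and declaring which sub-modules live inside and outside the transformer blocks, using OPT as the example architecture. The attribute names (`layers_block_name`, `outside_layer_modules`, `inside_layer_modules`) reflect how built-in models appear to be defined in `auto_gptq.modeling` and should be checked against the version of the library you are using.

```python
from auto_gptq.modeling import BaseGPTQForCausalLM


class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
    # chained attribute name of the transformer layer block
    layers_block_name = "model.decoder.layers"
    # chained attribute names of other nn modules that sit outside the transformer layer blocks
    outside_layer_modules = [
        "model.decoder.embed_tokens",
        "model.decoder.embed_positions",
        "model.decoder.project_out",
        "model.decoder.project_in",
        "model.decoder.final_layer_norm",
    ]
    # chained attribute names of linear layers inside one transformer block, grouped in execution order:
    # attention q/k/v projection, attention output projection, MLP input projection, MLP output projection
    inside_layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.out_proj"],
        ["fc1"],
        ["fc2"],
    ]
```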