# falcontune
**Repository Path**: RapidAI/falcontune
## Basic Information
- **Project Name**: falcontune
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-06-03
- **Last Updated**: 2023-06-03
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# falcontune: 4-Bit Finetuning of FALCONs on a Consumer GPU
**falcontune** allows finetuning FALCON models (e.g., falcon-40b-4bit) on as little as a single A100 40GB.
It features a tiny, easy-to-use codebase.
One benefit of being able to finetune a large LLM on a single GPU is that data parallelism across GPUs becomes straightforward.
Under the hood, **falcontune** implements the LoRA algorithm on top of an LLM compressed with the GPTQ algorithm, which requires implementing a backward pass through the quantized layers.
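As a rough illustration of that idea, here is a minimal PyTorch sketch (illustrative only; falcontune's actual quantized-linear and adapter classes differ): the base projection stays frozen, standing in for the GPTQ-quantized layer, while two small low-rank matrices are trained and their scaled product is added to its output.
```
# Minimal LoRA-over-a-frozen-layer sketch (illustrative, not falcontune's code).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Module, in_features: int, out_features: int,
                 r: int = 8, alpha: int = 16, dropout: float = 0.05):
        super().__init__()
        self.base = base  # stands in for the GPTQ-quantized linear; kept frozen
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_A = nn.Linear(in_features, r, bias=False)   # trainable, rank r
        self.lora_B = nn.Linear(r, out_features, bias=False)  # trainable, rank r
        nn.init.zeros_(self.lora_B.weight)                    # adapters start as a no-op
        self.dropout = nn.Dropout(dropout)
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The backward pass flows through base(x); only A and B receive gradients.
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(self.dropout(x)))
```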
**falcontune** can generate a 50-token recipe on an A100 40GB in ~10 seconds (roughly 5 tokens/second) using the Triton backend:
```
$ falcontune generate --interactive --model falcon-40b-instruct-4bit --weights gptq_model-4bit--1g.safetensors --max_new_tokens=50 --use_cache --do_sample --prompt "How to prepare pasta?"
How to prepare pasta?
Here's a simple recipe to prepare pasta:
Ingredients:
- 1 pound of dry pasta
- 4-6 cups of water
- Salt (optional)
Instructions:
1. Boil the water
Took 10.042 s
```
This example is based on the model [TheBloke/falcon-40b-instruct-GPTQ](https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ).
Here is a [Google Colab](https://colab.research.google.com/drive/1Pv7Dn60u_ANgkhRojAIX-VOkU3J-2cYh?usp=sharing).
You will need an A100 40GB to finetune the model (the 4-bit weights of the 40B model alone take roughly 20 GB).
## Installation
### Setup
```
pip install -r requirements.txt
python setup.py install
```
The default backend is Triton, which is the fastest. For CUDA support, also install the CUDA kernels:
```
python setup_cuda.py install
```
## Running falcontune
The above process installs a `falcontune` command in your environment.
### Download Models
First, download the weights of a FALCON model:
```
$ wget https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ/resolve/main/gptq_model-4bit--1g.safetensors
```
### Generate Text
You can generate text directly from the command line. This generates text from the base model:
```
$ falcontune generate \
--interactive \
--model falcon-40b-instruct-4bit \
--weights gptq_model-4bit--1g.safetensors \
--max_new_tokens=50 \
--use_cache \
--do_sample \
--instruction "Who was the first person on the moon?"
```
### Finetune A Base Model
You may also finetune a base model yourself. First, you need to download a dataset:
```
$ wget https://github.com/gururise/AlpacaDataCleaned/raw/main/alpaca_data_cleaned.json
```
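The file is a JSON array of instruction records; assuming the standard Alpaca schema, each record has `instruction`, `input`, and `output` fields. A quick way to inspect it:
```
# Peek at the dataset (field names assume the standard Alpaca schema).
import json

with open("alpaca_data_cleaned.json") as f:
    records = json.load(f)

print(len(records))   # number of training examples
print(records[0])     # e.g. {'instruction': ..., 'input': ..., 'output': ...}
```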
You can finetune any model of the FALCON family:
#### FALCON-7B
```
$ falcontune finetune \
--model=falcon-7b \
--weights=tiiuae/falcon-7b \
--dataset=./alpaca_data_cleaned.json \
--data_type=alpaca \
--lora_out_dir=./falcon-7b-alpaca/ \
--mbatch_size=1 \
--batch_size=2 \
--epochs=3 \
--lr=3e-4 \
--cutoff_len=256 \
--lora_r=8 \
--lora_alpha=16 \
--lora_dropout=0.05 \
--warmup_steps=5 \
--save_steps=50 \
--save_total_limit=3 \
--logging_steps=5 \
--target_modules='["query_key_value"]'
```
The above command downloads the model and uses LoRA to finetune it. The final adapters and checkpoints are saved in `falcon-7b-alpaca` and can be used for generation as follows:
```
$ falcontune generate \
--interactive \
--model falcon-7b \
--weights tiiuae/falcon-7b \
--lora_apply_dir falcon-7b-alpaca \
--max_new_tokens 50 \
--use_cache \
--do_sample \
--instruction "How to prepare pasta?"
```
#### FALCON-7B-INSTRUCT
```
$ falcontune finetune \
--model=falcon-7b-instruct \
--weights=tiiuae/falcon-7b-instruct \
--dataset=./alpaca_data_cleaned.json \
--data_type=alpaca \
--lora_out_dir=./falcon-7b-instruct-alpaca/ \
--mbatch_size=1 \
--batch_size=2 \
--epochs=3 \
--lr=3e-4 \
--cutoff_len=256 \
--lora_r=8 \
--lora_alpha=16 \
--lora_dropout=0.05 \
--warmup_steps=5 \
--save_steps=50 \
--save_total_limit=3 \
--logging_steps=5 \
--target_modules='["query_key_value"]'
```
The above command downloads the model and uses LoRA to finetune it. The final adapters and checkpoints are saved in `falcon-7b-instruct-alpaca` and can be used for generation as follows:
```
$ falcontune generate \
--interactive \
--model falcon-7b-instruct \
--weights tiiuae/falcon-7b-instruct \
--lora_apply_dir falcon-7b-instruct-alpaca \
--max_new_tokens 50 \
--use_cache \
--do_sample \
--instruction "How to prepare pasta?"
```
#### FALCON-40B
```
$ falcontune finetune \
--model=falcon-40b \
--weights=tiiuae/falcon-40b \
--dataset=./alpaca_data_cleaned.json \
--data_type=alpaca \
--lora_out_dir=./falcon-40b-alpaca/ \
--mbatch_size=1 \
--batch_size=2 \
--epochs=3 \
--lr=3e-4 \
--cutoff_len=256 \
--lora_r=8 \
--lora_alpha=16 \
--lora_dropout=0.05 \
--warmup_steps=5 \
--save_steps=50 \
--save_total_limit=3 \
--logging_steps=5 \
--target_modules='["query_key_value"]'
```
The above command downloads the model and uses LoRA to finetune it. The final adapters and checkpoints are saved in `falcon-40b-alpaca` and can be used for generation as follows:
```
$ falcontune generate \
--interactive \
--model falcon-40b \
--weights tiiuae/falcon-40b \
--lora_apply_dir falcon-40b-alpaca \
--max_new_tokens 50 \
--use_cache \
--do_sample \
--instruction "How to prepare pasta?"
```
#### FALCON-40B-INSTRUCT
```
$ falcontune finetune \
--model=falcon-40b-instruct \
--weights=tiiuae/falcon-40b-instruct \
--dataset=./alpaca_data_cleaned.json \
--data_type=alpaca \
--lora_out_dir=./falcon-40b-instruct-alpaca/ \
--mbatch_size=1 \
--batch_size=2 \
--epochs=3 \
--lr=3e-4 \
--cutoff_len=256 \
--lora_r=8 \
--lora_alpha=16 \
--lora_dropout=0.05 \
--warmup_steps=5 \
--save_steps=50 \
--save_total_limit=3 \
--logging_steps=5 \
--target_modules='["query_key_value"]'
```
The above command downloads the model and uses LoRA to finetune it. The final adapters and checkpoints are saved in `falcon-40b-instruct-alpaca` and can be used for generation as follows:
```
$ falcontune generate \
--interactive \
--model falcon-40b-instruct \
--weights tiiuae/falcon-40b-instruct \
--lora_apply_dir falcon-40b-instruct-alpaca \
--max_new_tokens 50 \
--use_cache \
--do_sample \
--instruction "How to prepare pasta?"
```
#### FALCON-7B-INSTRUCT-4BIT
```
$ wget https://huggingface.co/TheBloke/falcon-7b-instruct-GPTQ/resolve/main/gptq_model-4bit-64g.safetensors
$ falcontune finetune \
--model=falcon-7b-instruct-4bit \
--weights=gptq_model-4bit-64g.safetensors \
--dataset=./alpaca_data_cleaned.json \
--data_type=alpaca \
--lora_out_dir=./falcon-7b-instruct-4bit-alpaca/ \
--mbatch_size=1 \
--batch_size=2 \
--epochs=3 \
--lr=3e-4 \
--cutoff_len=256 \
--lora_r=8 \
--lora_alpha=16 \
--lora_dropout=0.05 \
--warmup_steps=5 \
--save_steps=50 \
--save_total_limit=3 \
--logging_steps=5 \
--target_modules='["query_key_value"]'
```
The above commands download the quantized model and use LoRA to finetune it. The final adapters and checkpoints are saved in `falcon-7b-instruct-4bit-alpaca` and can be used for generation as follows:
```
$ falcontune generate \
--interactive \
--model falcon-7b-instruct-4bit \
--weights gptq_model-4bit-64g.safetensors \
--lora_apply_dir falcon-7b-instruct-4bit-alpaca \
--max_new_tokens 50 \
--use_cache \
--do_sample \
--instruction "How to prepare pasta?"
```
#### FALCON-40B-INSTRUCT-4BIT
```
$ wget https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ/resolve/main/gptq_model-4bit--1g.safetensors
$ falcontune finetune \
--model=falcon-40b-instruct-4bit \
--weights=gptq_model-4bit--1g.safetensors \
--dataset=./alpaca_data_cleaned.json \
--data_type=alpaca \
--lora_out_dir=./falcon-40b-instruct-4bit-alpaca/ \
--mbatch_size=1 \
--batch_size=2 \
--epochs=3 \
--lr=3e-4 \
--cutoff_len=256 \
--lora_r=8 \
--lora_alpha=16 \
--lora_dropout=0.05 \
--warmup_steps=5 \
--save_steps=50 \
--save_total_limit=3 \
--logging_steps=5 \
--target_modules='["query_key_value"]'
```
The above commands download the quantized model and use LoRA to finetune it. The final adapters and checkpoints are saved in `falcon-40b-instruct-4bit-alpaca` and can be used for generation as follows:
```
$ falcontune generate \
--interactive \
--model falcon-40b-instruct-4bit \
--weights gptq_model-4bit--1g.safetensors \
--lora_apply_dir falcon-40b-instruct-4bit-alpaca \
--max_new_tokens 50 \
--use_cache \
--do_sample \
--instruction "How to prepare pasta?"
```
## Acknowledgements
**falcontune** is based on the following projects:
* The GPTQ algorithm and codebase by the [IST-DASLAB](https://github.com/IST-DASLab/gptq) with modifications by [@qwopqwop200](https://github.com/qwopqwop200/)
* The `alpaca_lora_4bit` repo by [johnsmith0031](https://github.com/johnsmith0031)
* The PEFT repo and its implementation of LoRA
* The LLAMA, OPT, and BLOOM models by META FAIR and the BigScience consortium
* The `llmtune` repo by [kuleshov-group](https://github.com/kuleshov-group/llmtune)
## Consultations
Need a custom solution? Let me know: `r.m.mihaylov@gmail.com`