Calculate how much GPU memory you need & breakdown of where it goes for training/inference of any LLM model with quantization (GGML/bitsandbytes), inference frameworks (vLLM/llama.cpp/HF) & QLoRA.
Link: https://rahulschand.github.io/gpu_poor/
I made this to check whether you can run a particular LLM on your GPU. The output is the total vRAM required & a breakdown of where that vRAM goes (in MB). It looks like this:
{
"Total": 4000,
"KV Cache": 1000,
"Model Size": 2000,
"Activation Memory": 500,
"Grad & Optimizer memory": 0,
"cuda + other overhead": 500
}
Finding which LLMs your GPU can handle isn't as easy as looking at the model size, because during inference the KV cache takes a substantial amount of extra memory. For example, at sequence length 1000 the KV cache of llama-2-7b takes 1GB of extra memory with huggingface's LlamaForCausalLM (with exLlama & vLLM it is 500MB). And during training, activations, gradients & quantization overhead take a lot of memory. For example, llama-7b with bitsandbytes int8 quant is only ~7.5GB in size, yet it isn't possible to finetune it with LoRA at a context length of 1000 even on an RTX 4090 with 24GB, which means more than 16GB of additional memory goes into quant overheads, activations & grad memory.
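For intuition, here is a minimal back-of-the-envelope sketch of where the KV-cache numbers above come from (this is not the website's code; the fp16 / 2-bytes-per-element assumption, the hidden size & layer count for llama-2-7b, and the extra huggingface factor are all assumptions chosen to reproduce the figures quoted above):

```python
# Back-of-the-envelope KV-cache size for llama-2-7b (hidden_size=4096, 32 layers).
# Assumptions: K and V are stored in fp16 (2 bytes per element); the extra 2x
# factor for huggingface follows the "2 x 2 x seq x hidden per layer" rule
# described in the breakdown further below.

def kv_cache_mb(seq_len, hidden_size=4096, num_layers=32,
                bytes_per_elem=2, hf_factor=1):
    per_layer_elems = 2 * seq_len * hidden_size            # K + V vectors
    total_bytes = per_layer_elems * num_layers * bytes_per_elem * hf_factor
    return total_bytes / 1024 ** 2

print(kv_cache_mb(1000))               # ~500 MB (exLlama / vLLM)
print(kv_cache_mb(1000, hf_factor=2))  # ~1000 MB (huggingface LlamaForCausalLM)
```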
The results can vary depending on your model, input data, cuda version & which quantization you are using, so it is impossible to predict exact values. I have tried to take these into account & keep the results within 500MB of the real numbers. I cross-checked the 3b, 7b & 13b model memories given by the website against what I actually measured on my RTX 4090 & 2060 GPUs; all values were within 500MB.
Total memory = model size + kv-cache + activation memory + optimizer/grad memory + cuda etc. overhead
- Model size = the size of the model's .bin file (divide it by 2 for Q8 quant & by 4 for Q4 quant).
- KV-Cache = (2 x sequence length x hidden size) per layer. For huggingface this is (2 x 2 x sequence length x hidden size) per layer.
- Activation memory = the intermediate memory needed for .backward(). In training the whole sequence is processed at once (therefore KV cache memory = 0). For example, if you do output = Q * input where Q = (dim, dim) and input = (batch, seq, dim), then output of shape (batch, seq, dim) needs to be stored (in fp16) for the backward pass. This consumes the most memory in LoRA/QLoRA. In LLMs there are many such intermediate steps (after Q, K, V, after attention, after norm, after FFN1, FFN2, FFN3, after the skip connection, ...); around 15 intermediate representations are saved per layer.
- Grad & optimizer memory = the .grad tensors & the tensors associated with the optimizer (running averages etc.).

A rough sketch combining these components is given below. Sometimes the answers might be very wrong, in which case please open an issue here & I will try to fix it.
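To make the breakdown concrete, here is a simplified sketch that combines the components into the same JSON-style output as above. This is an illustration only, not the website's implementation: the fp16 activations, the ~15 saved intermediates per layer, the flat 500MB overhead and the example .bin size are assumptions, and quantization overheads (e.g. bitsandbytes buffers), which the text above notes can be large, are not modelled.

```python
# Rough total-memory sketch following
# "Total memory = model size + kv-cache + activation memory + optimizer/grad memory + cuda etc. overhead".
# All constants here are illustrative assumptions; bitsandbytes quantization
# overhead is NOT modelled, so real training numbers can be much higher.

def estimate_total_mb(model_bin_mb, seq_len, batch=1, hidden_size=4096,
                      num_layers=32, quant_divisor=1, training=False,
                      grad_optimizer_mb=0.0, overhead_mb=500.0):
    model_mb = model_bin_mb / quant_divisor               # divide by 2 for Q8, 4 for Q4

    # KV cache (huggingface rule: 2 x 2 x seq x hidden elements per layer, fp16);
    # counted as 0 in training since the whole sequence is processed at once.
    kv_mb = 0.0 if training else (2 * 2 * seq_len * hidden_size * num_layers * 2) / 1024 ** 2

    # ~15 fp16 intermediates of shape (batch, seq, hidden) saved per layer
    # for .backward(); only needed when training.
    act_mb = (15 * batch * seq_len * hidden_size * 2 * num_layers) / 1024 ** 2 if training else 0.0

    return {
        "Total": round(model_mb + kv_mb + act_mb + grad_optimizer_mb + overhead_mb),
        "KV Cache": round(kv_mb),
        "Model Size": round(model_mb),
        "Activation Memory": round(act_mb),
        "Grad & Optimizer memory": round(grad_optimizer_mb),
        "cuda + other overhead": round(overhead_mb),
    }

# e.g. llama-2-7b inference in fp16 (~13500 MB of .bin shards) at context length 1000:
print(estimate_total_mb(model_bin_mb=13500, seq_len=1000))
```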