This folder includes popular third-party benchmarks for evaluating LLM accuracy.
The following instructions show how to evaluate a Model Optimizer quantized LLM with these benchmarks, including evaluation of a TensorRT-LLM deployment.
Massive Multitask Language Understanding. A score (0-1, higher is better) will be printed at the end of the benchmark.
Download data
mkdir -p data
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar -O data/mmlu.tar
tar -xf data/mmlu.tar -C data && mv data/data data/mmlu
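After extraction, a quick sanity check (the layout below assumes the standard MMLU data release linked above):

ls data/mmlu
# Expect subfolders such as dev/, val/, and test/, each holding per-subject CSV files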
python mmlu.py --model_name causal --model_path <HF model folder or model card>
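For example, to evaluate an unquantized checkpoint (the model card below is only illustrative; substitute your own local folder or Hugging Face model card):

python mmlu.py --model_name causal --model_path meta-llama/Llama-2-7b-hf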
# MODELOPT_QUANT_CFG: Choose from [INT8_SMOOTHQUANT_CFG|FP8_DEFAULT_CFG|INT4_AWQ_CFG|W4A8_AWQ_BETA_CFG]
python mmlu.py --model_name causal --model_path <HF model folder or model card> --quant_cfg MODELOPT_QUANT_CFG
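A concrete invocation might look like this (model card and quantization format are illustrative; any format listed in the comment above works):

python mmlu.py --model_name causal --model_path meta-llama/Llama-2-7b-hf --quant_cfg FP8_DEFAULT_CFG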
python mmlu.py --model_name causal --model_path <HF model folder or model card> --engine_dir <built TensorRT-LLM folder>
HumanEval. A score (0-1, higher is better) will be printed at the end of the benchmark.
Due to differences in prompting and generation post-processing, the final score may differ from the numbers published by the model developer.
Clone instruct-eval and add a softlink in this folder to the human_eval folder from instruct_eval/ (sketched below).
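A minimal version of that setup, assuming human_eval sits at the top level of the declare-lab/instruct-eval repository; verify the URL and layout before running:

git clone https://github.com/declare-lab/instruct-eval.git instruct_eval
# Link the HumanEval helper code so humaneval.py can import it from this folder
ln -s instruct_eval/human_eval human_eval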
python humaneval.py --model_name causal --model_path <HF model folder or model card> --n_sample 1
# MODELOPT_QUANT_CFG: Choose from [INT8_SMOOTHQUANT_CFG|FP8_DEFAULT_CFG|INT4_AWQ_CFG|W4A8_AWQ_BETA_CFG]
python humaneval.py --model_name causal --model_path <HF model folder or model card> --n_sample 1 --quant_cfg MODELOPT_QUANT_CFG
python humaneval.py --model_name causal --model_path <HF model folder or model card> --engine_dir <built TensorRT-LLM folder> --n_sample 1
MT-Bench. The responses are generated using FastChat.
bash run_fastchat.sh <HF model folder or model card>
bash run_fastchat.sh <HF model folder or model card> <built TensorRT-LLM folder>
The responses to the MT-Bench questions will be stored under data/mt_bench/model_answer.
The quality of the responses can be judged using llm_judge from the FastChat repository; refer to the llm_judge documentation to compute the final MT-Bench score.
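A rough sketch of the judging step, assuming the command names and flags from FastChat's llm_judge README at the time of writing (the default judge is GPT-4, so an OpenAI API key is needed, and the generated answer files must be placed where llm_judge expects them; check the FastChat repository for the current interface):

# Run from FastChat/fastchat/llm_judge after copying data/mt_bench/model_answer over
export OPENAI_API_KEY=<your key>
python gen_judgment.py --model-list <model id used when generating the answers>
python show_result.py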