This folder includes popular third-party benchmarks for evaluating LLM accuracy.
The following instructions show how to evaluate a Model Optimizer quantized LLM with these benchmarks, including evaluation of a TensorRT-LLM deployment.
Massive Multitask Language Understanding. A score (0-1, higher is better) will be printed at the end of the benchmark.
Download data
mkdir -p data
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar -O data/mmlu.tar
tar -xf data/mmlu.tar -C data && mv data/data data/mmlu
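After extraction, a quick sanity check (the layout below assumes the standard MMLU data release linked above):

ls data/mmlu
# Expect subfolders such as dev/, val/, and test/, each holding per-subject CSV files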
python mmlu.py --model_name causal --model_path <HF model folder or model card>
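For example, to evaluate an unquantized checkpoint (the model card below is only illustrative; substitute your own local folder or Hugging Face model card):

python mmlu.py --model_name causal --model_path meta-llama/Llama-2-7b-hf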
# MODELOPT_QUANT_CFG: Choose from [INT8_SMOOTHQUANT_CFG|FP8_DEFAULT_CFG|INT4_AWQ_CFG|W4A8_AWQ_BETA_CFG]
python mmlu.py --model_name causal --model_path <HF model folder or model card> --quant_cfg MODELOPT_QUANT_CFG
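A concrete invocation might look like this (model card and quantization format are illustrative; any format listed in the comment above works):

python mmlu.py --model_name causal --model_path meta-llama/Llama-2-7b-hf --quant_cfg FP8_DEFAULT_CFG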
python mmlu.py --model_name causal --model_path <HF model folder or model card> --engine_dir <built TensorRT-LLM folder>
HumanEval. A score (0-1, higher is better) will be printed at the end of the benchmark.
Due to differences in prompting and generation post-processing, the final score may differ from the numbers published by the model developer.
Clone instruct-eval and add a softlink in this folder to the human_eval folder from instruct_eval/ (sketched below).
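A minimal version of that setup, assuming human_eval sits at the top level of the declare-lab/instruct-eval repository; verify the URL and layout before running:

git clone https://github.com/declare-lab/instruct-eval.git instruct_eval
# Link the HumanEval helper code so humaneval.py can import it from this folder
ln -s instruct_eval/human_eval human_eval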
python humaneval.py --model_name causal --model_path <HF model folder or model card> --n_sample 1
# MODELOPT_QUANT_CFG: Choose from [INT8_SMOOTHQUANT_CFG|FP8_DEFAULT_CFG|INT4_AWQ_CFG|W4A8_AWQ_BETA_CFG]
python humaneval.py --model_name causal --model_path <HF model folder or model card> --n_sample 1 --quant_cfg MODELOPT_QUANT_CFG
python humaneval.py --model_name causal --model_path <HF model folder or model card> --engine_dir <built TensorRT-LLM folder> --n_sample 1
MT-Bench. The responses are generated using FastChat.
bash run_fastchat.sh <HF model folder or model card>
bash run_fastchat.sh <HF model folder or model card> <built TensorRT-LLM folder>
The responses to the MT-Bench questions will be stored under data/mt_bench/model_answer.
The quality of the responses can be judged using llm_judge from the FastChat repository; refer to the llm_judge documentation to compute the final MT-Bench score.
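A rough sketch of the judging step, assuming the command names and flags from FastChat's llm_judge README at the time of writing (the default judge is GPT-4, so an OpenAI API key is needed, and the generated answer files must be placed where llm_judge expects them; check the FastChat repository for the current interface):

# Run from FastChat/fastchat/llm_judge after copying data/mt_bench/model_answer over
export OPENAI_API_KEY=<your key>
python gen_judgment.py --model-list <model id used when generating the answers>
python show_result.py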