Ascend/MindSpeed-LLM

models_evaluation.md 13.68 KB
shenjiarun committed 2 months ago · !2775 Refine Evaluation README

MindSpeed-LLM supports accuracy evaluation of large models on public benchmark datasets. The currently supported benchmarks are listed below:

| Benchmark | Download Link | Eval Split | MindSpeed-LLM | OpenCompass |
|---|---|---|---|---|
| MMLU | GitHub | test | 45.73% | 45.3% |
| CEval | HuggingFace | val | 33.87% | 32.5% |
| BoolQ | GitHub | dev | 75.44% | 74.9% |
| BBH | GitHub | test | 34.4% | 32.5% |
| AGIEval | GitHub | test | 20.6% | 20.6% |
| HumanEval | GitHub | test | 12.8% | 12.2% |
| CMMLU | Kaggle | test | -- | -- |
| GSM8k | GitHub | -- | -- | -- |
| Hellaswag | GitHub | -- | -- | -- |
| Needlebench | HuggingFace | -- | -- | -- |
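
Most of the benchmarks above (MMLU, CEval, CMMLU, AGIEval; BoolQ with yes/no labels) are scored as plain multiple-choice accuracy: extract the model's predicted choice from its output and compare it with the gold label. A minimal sketch, not MindSpeed-LLM's actual harness, with made-up data and a hypothetical `extract_choice` helper:

```python
import re

# Hypothetical toy data: generated answer strings paired with gold choice
# letters. Real harnesses read these from the benchmark files listed above.
predictions = ["A", "The answer is C", "B", "D"]
references = ["A", "C", "C", "D"]

def extract_choice(text: str) -> str:
    """Return the first standalone A-D letter found in the model output."""
    match = re.search(r"\b([ABCD])\b", text)
    return match.group(1) if match else ""

def accuracy(preds, refs) -> float:
    # Exact match between extracted choice and reference label.
    correct = sum(extract_choice(p) == r for p, r in zip(preds, refs))
    return correct / len(refs)

print(f"{accuracy(predictions, references):.2%}")  # 3 of 4 correct -> 75.00%
```

The percentages in the table are exactly this kind of accuracy over the listed split, which is why MindSpeed-LLM and OpenCompass numbers are directly comparable.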

Evaluation results for the large models already supported by MindSpeed-LLM are summarized below:

| Model | Task | MindSpeed-LLM | Community | Model | Task | MindSpeed-LLM | Community |
|---|---|---|---|---|---|---|---|
| Aquila-7B | BoolQ | 77.3% | -- | Aquila2-7B | BoolQ | 77.8% | -- |
| Aquila2-34B | BoolQ | 88.0% | -- | Baichuan-7B | BoolQ | 69.0% | 67.0% |
| Baichuan-13B | BoolQ | 74.7% | 73.6% | Baichuan2-7B | BoolQ | 70.0% | 63.2% |
| Baichuan2-13B | BoolQ | 78.0% | 67.0% | Bloom-7B | MMLU | 25.1% | -- |
| Bloom-176B | BoolQ | 64.5% | -- | ChatGLM3-6B | MMLU | 61.5% | -- |
| GLM4-9B | MMLU | 74.5% | 74.7% | CodeQwen1.5-7B | HumanEval | 54.8% | 51.8% |
| CodeLLaMA-34B | HumanEval | 48.8% | 48.8% | Gemma-2B | MMLU | 39.6% | -- |
| Gemma-7B | MMLU | 52.2% | -- | InternLM-7B | MMLU | 48.7% | 51.0% |
| Gemma2-9B | MMLU | 70.7% | 71.3% | Gemma2-27B | MMLU | 75.5% | 75.2% |
| LLaMA-7B | BoolQ | 74.6% | 75.4% | LLaMA-13B | BoolQ | 79.6% | 78.7% |
| LLaMA-33B | BoolQ | 83.2% | 83.1% | LLaMA-65B | BoolQ | 85.7% | 86.6% |
| LLaMA2-7B | MMLU | 45.7% | -- | LLaMA2-13B | BoolQ | 82.2% | 81.7% |
| LLaMA2-34B | BoolQ | 82.0% | -- | LLaMA2-70B | BoolQ | 86.4% | -- |
| LLaMA3-8B | MMLU | 65.2% | -- | LLaMA3-70B | BoolQ | 78.4% | -- |
| LLaMA3.1-8B | MMLU | 65.3% | -- | LLaMA3.1-70B | MMLU | 81.8% | -- |
| LLaMA3.2-1B | MMLU | 31.8% | 32.2% | LLaMA3.2-3B | MMLU | 56.3% | 58.0% |
| Mistral-7B | MMLU | 56.3% | -- | Mixtral-8x7B | MMLU | 70.6% | 70.6% |
| Mistral-8x22B | MMLU | 77% | 77.8% | MiniCPM-MoE-8x2B | BoolQ | 83.9% | -- |
| QWen-7B | MMLU | 58.1% | 58.2% | Qwen-14B | MMLU | 65.3% | 66.3% |
| QWen-72B | MMLU | 74.6% | 77.4% | QWen1.5-0.5B | MMLU | 39.1% | -- |
| QWen1.5-1.8b | MMLU | 46.2% | 46.8% | QWen1.5-4B | MMLU | 59.0% | 56.1% |
| QWen1.5-7B | MMLU | 60.3% | 61.0% | QWen1.5-14B | MMLU | 67.3% | 67.6% |
| QWen1.5-32B | MMLU | 72.5% | 73.4% | QWen1.5-72B | MMLU | 76.4% | 77.5% |
| Qwen1.5-110B | MMLU | 80.4% | 80.4% | Yi-34B | MMLU | 76.3% | 75.8% |
| QWen2-0.5B | MMLU | 44.6% | 45.4% | QWen2-1.5B | MMLU | 54.7% | 56.5% |
| QWen2-7B | MMLU | 70.3% | 70.3% | QWen2-57B-A14B | MMLU | 75.6% | 76.5% |
| QWen2-72B | MMLU | 83.6% | 84.2% | MiniCPM-2B | MMLU | 51.6% | 53.4% |
| DeepSeek-V2-Lite-16B | MMLU | 58.1% | 58.3% | QWen2.5-0.5B | MMLU | 47.67% | 47.5% |
| QWen2.5-1.5B | MMLU | 59.4% | 60.9% | QWen2.5-3B | MMLU | 65.6% | 65.6% |
| QWen2.5-7B | MMLU | 73.8% | 74.2% | QWen2.5-14B | MMLU | 79.4% | 79.7% |
| QWen2.5-32B | MMLU | 83.3% | 83.3% | QWen2.5-72B | MMLU | 85.59% | 86.1% |
| InternLM2.5-1.8b | MMLU | 51.3% | 53.5% | InternLM2.5-7B | MMLU | 71.6% | 71.6% |
| InternLM2.5-20b | MMLU | 73.3% | 74.2% | InternLM3-8b | MMLU | 76.6% | 76.6% |
| Yi1.5-6B | MMLU | 63.2% | 63.5% | Yi1.5-9B | MMLU | 69.2% | 69.5% |
| Yi1.5-34B | MMLU | 76.9% | 77.1% | CodeQWen2.5-7B | HumanEval | 66.5% | 61.6% |
| Qwen2.5-Math-7B | MMLU-STEM | 67.8% | 67.8% | Qwen2.5-Math-72B | MMLU-STEM | 83.7% | 82.8% |
| MiniCPM3-4B | MMLU | 63.7% | 64.6% | Phi-3.5-mini-instruct | MMLU | 64.39% | 64.34% |
| Phi-3.5-MoE-instruct | MMLU | 78.5% | 78.9% | DeepSeek-Math-7B | MMLU-STEM | 56.5% | 56.5% |
| DeepSeek-V2.5 | MMLU | 79.3% | 80.6% | DeepSeek-V2-236B | MMLU | 78.1% | 78.5% |
| LLaMA3.3-70B-Instruct | MMLU | 82.7% | -- | QwQ-32B | MMLU | 81.19% | -- |
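
The HumanEval entries in the table (CodeLLaMA-34B, CodeQwen1.5-7B, CodeQWen2.5-7B) are conventionally reported as pass@k: n completions are sampled per problem, c of them pass the unit tests, and the score uses the standard unbiased estimator pass@k = 1 - C(n-c, k) / C(n, k). A hedged sketch with made-up per-problem results (this is the general formula, not MindSpeed-LLM's internal code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct."""
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem (n = k = 1) this reduces to plain accuracy,
# which matches how single-percentage HumanEval scores are usually quoted.
per_problem = [1, 0, 0, 1, 0, 0, 0, 0]  # 1 = passed its tests (made up)
score = sum(pass_at_k(1, c, 1) for c in per_problem) / len(per_problem)
print(f"pass@1 = {score:.1%}")  # 2/8 problems solved -> 25.0%
```

Averaging the per-problem estimates over the whole test split gives the single percentage shown in the table.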

Evaluation Guide

For the MindSpeed-LLM evaluation walkthrough, see: evaluation_guide.md

Benchmark Introductions

MMLU evaluation introduction

CMMLU evaluation introduction

BoolQ evaluation introduction

CEval evaluation introduction

GSM8k evaluation introduction

BBH evaluation introduction

Hellaswag evaluation introduction

AGIEval evaluation introduction

HumanEval evaluation introduction
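
Unlike the multiple-choice benchmarks, GSM8k is scored by exact match on the final numeric answer: the reference solution ends with `#### <answer>`, and a common convention (details vary by harness, so treat this as an illustrative sketch rather than MindSpeed-LLM's exact logic) is to take the last number in the generated solution:

```python
import re

def last_number(text: str) -> str:
    """Return the last number in the text, with thousands separators removed."""
    nums = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return nums[-1].replace(",", "") if nums else ""

# Made-up generation; GSM8k references end with "#### <final answer>".
generation = "She sells 16 - 3 - 4 = 9 eggs, so she makes 9 * 2 = 18 dollars."
reference = "#### 18"

pred = last_number(generation)
gold = reference.split("####")[-1].strip()
print(pred == gold)  # True
```

Accuracy over the test split is then the fraction of problems where this exact match succeeds.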
