Ascend/MindSpeed-LLM

models_evaluation.md 13.68 KB
shenjiarun committed 2 months ago · !2775 Refine Evaluation README

MindSpeed-LLM supports accuracy evaluation of large models on public benchmark datasets. The currently supported benchmarks are listed below:

| Benchmark | Download Link | Eval Split | MindSpeed-LLM | OpenCompass |
|---|---|---|---|---|
| MMLU | GitHub | test | 45.73% | 45.3% |
| CEval | HuggingFace | val | 33.87% | 32.5% |
| BoolQ | GitHub | dev | 75.44% | 74.9% |
| BBH | GitHub | test | 34.4% | 32.5% |
| AGIEval | GitHub | test | 20.6% | 20.6% |
| HumanEval | GitHub | test | 12.8% | 12.2% |
| CMMLU | Kaggle | test | -- | -- |
| GSM8k | GitHub | -- | -- | -- |
| Hellaswag | GitHub | -- | -- | -- |
| Needlebench | HuggingFace | -- | -- | -- |
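
Most of the benchmarks above (MMLU, CEval, CMMLU, AGIEval; BoolQ with yes/no labels) are scored as plain multiple-choice accuracy: extract the model's predicted choice from its output and compare it with the gold label. A minimal sketch, not MindSpeed-LLM's actual harness, with made-up data and a hypothetical `extract_choice` helper:

```python
import re

# Hypothetical toy data: generated answer strings paired with gold choice
# letters. Real harnesses read these from the benchmark files listed above.
predictions = ["A", "The answer is C", "B", "D"]
references = ["A", "C", "C", "D"]

def extract_choice(text: str) -> str:
    """Return the first standalone A-D letter found in the model output."""
    match = re.search(r"\b([ABCD])\b", text)
    return match.group(1) if match else ""

def accuracy(preds, refs) -> float:
    # Exact match between extracted choice and reference label.
    correct = sum(extract_choice(p) == r for p, r in zip(preds, refs))
    return correct / len(refs)

print(f"{accuracy(predictions, references):.2%}")  # 3 of 4 correct -> 75.00%
```

The percentages in the table are exactly this kind of accuracy over the listed split, which is why MindSpeed-LLM and OpenCompass numbers are directly comparable.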

Evaluation results for the large models already supported by MindSpeed-LLM are summarized below:

| Model | Task | MindSpeed-LLM | Community | Model | Task | MindSpeed-LLM | Community |
|---|---|---|---|---|---|---|---|
| Aquila-7B | BoolQ | 77.3% | -- | Aquila2-7B | BoolQ | 77.8% | -- |
| Aquila2-34B | BoolQ | 88.0% | -- | Baichuan-7B | BoolQ | 69.0% | 67.0% |
| Baichuan-13B | BoolQ | 74.7% | 73.6% | Baichuan2-7B | BoolQ | 70.0% | 63.2% |
| Baichuan2-13B | BoolQ | 78.0% | 67.0% | Bloom-7B | MMLU | 25.1% | -- |
| Bloom-176B | BoolQ | 64.5% | -- | ChatGLM3-6B | MMLU | 61.5% | -- |
| GLM4-9B | MMLU | 74.5% | 74.7% | CodeQwen1.5-7B | HumanEval | 54.8% | 51.8% |
| CodeLLaMA-34B | HumanEval | 48.8% | 48.8% | Gemma-2B | MMLU | 39.6% | -- |
| Gemma-7B | MMLU | 52.2% | -- | InternLM-7B | MMLU | 48.7% | 51.0% |
| Gemma2-9B | MMLU | 70.7% | 71.3% | Gemma2-27B | MMLU | 75.5% | 75.2% |
| LLaMA-7B | BoolQ | 74.6% | 75.4% | LLaMA-13B | BoolQ | 79.6% | 78.7% |
| LLaMA-33B | BoolQ | 83.2% | 83.1% | LLaMA-65B | BoolQ | 85.7% | 86.6% |
| LLaMA2-7B | MMLU | 45.7% | -- | LLaMA2-13B | BoolQ | 82.2% | 81.7% |
| LLaMA2-34B | BoolQ | 82.0% | -- | LLaMA2-70B | BoolQ | 86.4% | -- |
| LLaMA3-8B | MMLU | 65.2% | -- | LLaMA3-70B | BoolQ | 78.4% | -- |
| LLaMA3.1-8B | MMLU | 65.3% | -- | LLaMA3.1-70B | MMLU | 81.8% | -- |
| LLaMA3.2-1B | MMLU | 31.8% | 32.2% | LLaMA3.2-3B | MMLU | 56.3% | 58.0% |
| Mistral-7B | MMLU | 56.3% | -- | Mixtral-8x7B | MMLU | 70.6% | 70.6% |
| Mistral-8x22B | MMLU | 77% | 77.8% | MiniCPM-MoE-8x2B | BoolQ | 83.9% | -- |
| QWen-7B | MMLU | 58.1% | 58.2% | Qwen-14B | MMLU | 65.3% | 66.3% |
| QWen-72B | MMLU | 74.6% | 77.4% | QWen1.5-0.5B | MMLU | 39.1% | -- |
| QWen1.5-1.8b | MMLU | 46.2% | 46.8% | QWen1.5-4B | MMLU | 59.0% | 56.1% |
| QWen1.5-7B | MMLU | 60.3% | 61.0% | QWen1.5-14B | MMLU | 67.3% | 67.6% |
| QWen1.5-32B | MMLU | 72.5% | 73.4% | QWen1.5-72B | MMLU | 76.4% | 77.5% |
| Qwen1.5-110B | MMLU | 80.4% | 80.4% | Yi-34B | MMLU | 76.3% | 75.8% |
| QWen2-0.5B | MMLU | 44.6% | 45.4% | QWen2-1.5B | MMLU | 54.7% | 56.5% |
| QWen2-7B | MMLU | 70.3% | 70.3% | QWen2-57B-A14B | MMLU | 75.6% | 76.5% |
| QWen2-72B | MMLU | 83.6% | 84.2% | MiniCPM-2B | MMLU | 51.6% | 53.4% |
| DeepSeek-V2-Lite-16B | MMLU | 58.1% | 58.3% | QWen2.5-0.5B | MMLU | 47.67% | 47.5% |
| QWen2.5-1.5B | MMLU | 59.4% | 60.9% | QWen2.5-3B | MMLU | 65.6% | 65.6% |
| QWen2.5-7B | MMLU | 73.8% | 74.2% | QWen2.5-14B | MMLU | 79.4% | 79.7% |
| QWen2.5-32B | MMLU | 83.3% | 83.3% | QWen2.5-72B | MMLU | 85.59% | 86.1% |
| InternLM2.5-1.8b | MMLU | 51.3% | 53.5% | InternLM2.5-7B | MMLU | 71.6% | 71.6% |
| InternLM2.5-20b | MMLU | 73.3% | 74.2% | InternLM3-8b | MMLU | 76.6% | 76.6% |
| Yi1.5-6B | MMLU | 63.2% | 63.5% | Yi1.5-9B | MMLU | 69.2% | 69.5% |
| Yi1.5-34B | MMLU | 76.9% | 77.1% | CodeQWen2.5-7B | HumanEval | 66.5% | 61.6% |
| Qwen2.5-Math-7B | MMLU-STEM | 67.8% | 67.8% | Qwen2.5-Math-72B | MMLU-STEM | 83.7% | 82.8% |
| MiniCPM3-4B | MMLU | 63.7% | 64.6% | Phi-3.5-mini-instruct | MMLU | 64.39% | 64.34% |
| Phi-3.5-MoE-instruct | MMLU | 78.5% | 78.9% | DeepSeek-Math-7B | MMLU-STEM | 56.5% | 56.5% |
| DeepSeek-V2.5 | MMLU | 79.3% | 80.6% | DeepSeek-V2-236B | MMLU | 78.1% | 78.5% |
| LLaMA3.3-70B-Instruct | MMLU | 82.7% | -- | QwQ-32B | MMLU | 81.19% | -- |
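
The HumanEval entries in the table (CodeLLaMA-34B, CodeQwen1.5-7B, CodeQWen2.5-7B) are conventionally reported as pass@k: n completions are sampled per problem, c of them pass the unit tests, and the score uses the standard unbiased estimator pass@k = 1 - C(n-c, k) / C(n, k). A hedged sketch with made-up per-problem results (this is the general formula, not MindSpeed-LLM's internal code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct."""
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem (n = k = 1) this reduces to plain accuracy,
# which matches how single-percentage HumanEval scores are usually quoted.
per_problem = [1, 0, 0, 1, 0, 0, 0, 0]  # 1 = passed its tests (made up)
score = sum(pass_at_k(1, c, 1) for c in per_problem) / len(per_problem)
print(f"pass@1 = {score:.1%}")  # 2/8 problems solved -> 25.0%
```

Averaging the per-problem estimates over the whole test split gives the single percentage shown in the table.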

Evaluation Guide

For the MindSpeed-LLM evaluation walkthrough, see: evaluation_guide.md

Benchmark Introductions

MMLU evaluation introduction

CMMLU evaluation introduction

BoolQ evaluation introduction

CEval evaluation introduction

GSM8k evaluation introduction

BBH evaluation introduction

Hellaswag evaluation introduction

AGIEval evaluation introduction

HumanEval evaluation introduction
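
Unlike the multiple-choice benchmarks, GSM8k is scored by exact match on the final numeric answer: the reference solution ends with `#### <answer>`, and a common convention (details vary by harness, so treat this as an illustrative sketch rather than MindSpeed-LLM's exact logic) is to take the last number in the generated solution:

```python
import re

def last_number(text: str) -> str:
    """Return the last number in the text, with thousands separators removed."""
    nums = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return nums[-1].replace(",", "") if nums else ""

# Made-up generation; GSM8k references end with "#### <final answer>".
generation = "She sells 16 - 3 - 4 = 9 eggs, so she makes 9 * 2 = 18 dollars."
reference = "#### 18"

pred = last_number(generation)
gold = reference.split("####")[-1].strip()
print(pred == gold)  # True
```

Accuracy over the test split is then the fraction of problems where this exact match succeeds.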
