MindSpeed-LLM supports accuracy evaluation of large models on public benchmark datasets. The currently supported benchmarks are listed below:
Benchmark | Download | Split | MindSpeed-LLM | OpenCompass |
---|---|---|---|---|
MMLU | GitHub | test | 45.73% | 45.3% |
CEval | HuggingFace | val | 33.87% | 32.5% |
BoolQ | GitHub | dev | 75.44% | 74.9% |
BBH | GitHub | test | 34.4% | 32.5% |
AGIEval | GitHub | test | 20.6% | 20.6% |
HumanEval | GitHub | test | 12.8% | 12.2% |
CMMLU | Kaggle | test | -- | -- |
GSM8k | GitHub | -- | -- | -- |
Hellaswag | GitHub | -- | -- | -- |
Needlebench | HuggingFace | -- | -- | -- |
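The MindSpeed-LLM and OpenCompass columns above both report exact-match accuracy: the fraction of questions whose predicted answer equals the reference answer. As a minimal sketch of that metric (not the MindSpeed-LLM implementation; all names below are illustrative):

```python
# Minimal sketch of the exact-match accuracy metric that benchmarks such as
# MMLU, CEval, and BoolQ report. Function and variable names are illustrative
# and are not part of the MindSpeed-LLM API.

def exact_match_accuracy(predictions, references):
    """Return the percentage of predictions that equal their references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

# Example: 4 multiple-choice questions, 3 answered correctly -> 75.00%
preds = ["A", "C", "B", "D"]
golds = ["A", "C", "B", "A"]
print(f"{exact_match_accuracy(preds, golds):.2f}%")  # 75.00%
```

Small differences between the two columns for the same model usually come from prompt templates, few-shot settings, and answer-extraction rules rather than from the metric itself.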
Evaluation results for the large models already supported by MindSpeed-LLM are summarized below:
Model | Task | MindSpeed-LLM | Community | Model | Task | MindSpeed-LLM | Community |
---|---|---|---|---|---|---|---|
Aquila-7B | BoolQ | 77.3% | -- | Aquila2-7B | BoolQ | 77.8% | -- |
Aquila2-34B | BoolQ | 88.0% | -- | Baichuan-7B | BoolQ | 69.0% | 67.0% |
Baichuan-13B | BoolQ | 74.7% | 73.6% | Baichuan2-7B | BoolQ | 70.0% | 63.2% |
Baichuan2-13B | BoolQ | 78.0% | 67.0% | Bloom-7B | MMLU | 25.1% | -- |
Bloom-176B | BoolQ | 64.5% | -- | ChatGLM3-6B | MMLU | 61.5% | -- |
GLM4-9B | MMLU | 74.5% | 74.7% | CodeQwen1.5-7B | HumanEval | 54.8% | 51.8% |
CodeLLaMA-34B | HumanEval | 48.8% | 48.8% | Gemma-2B | MMLU | 39.6% | -- |
Gemma-7B | MMLU | 52.2% | -- | InternLM-7B | MMLU | 48.7% | 51.0% |
Gemma2-9B | MMLU | 70.7% | 71.3% | Gemma2-27B | MMLU | 75.5% | 75.2% |
LLaMA-7B | BoolQ | 74.6% | 75.4% | LLaMA-13B | BoolQ | 79.6% | 78.7% |
LLaMA-33B | BoolQ | 83.2% | 83.1% | LLaMA-65B | BoolQ | 85.7% | 86.6% |
LLaMA2-7B | MMLU | 45.7% | -- | LLaMA2-13B | BoolQ | 82.2% | 81.7% |
LLaMA2-34B | BoolQ | 82.0% | -- | LLaMA2-70B | BoolQ | 86.4% | -- |
LLaMA3-8B | MMLU | 65.2% | -- | LLaMA3-70B | BoolQ | 78.4% | -- |
LLaMA3.1-8B | MMLU | 65.3% | -- | LLaMA3.1-70B | MMLU | 81.8% | -- |
LLaMA3.2-1B | MMLU | 31.8% | 32.2% | LLaMA3.2-3B | MMLU | 56.3% | 58.0% |
Mistral-7B | MMLU | 56.3% | -- | Mixtral-8x7B | MMLU | 70.6% | 70.6% |
Mixtral-8x22B | MMLU | 77.0% | 77.8% | MiniCPM-MoE-8x2B | BoolQ | 83.9% | -- |
QWen-7B | MMLU | 58.1% | 58.2% | Qwen-14B | MMLU | 65.3% | 66.3% |
QWen-72B | MMLU | 74.6% | 77.4% | QWen1.5-0.5B | MMLU | 39.1% | -- |
QWen1.5-1.8B | MMLU | 46.2% | 46.8% | QWen1.5-4B | MMLU | 59.0% | 56.1% |
QWen1.5-7B | MMLU | 60.3% | 61.0% | QWen1.5-14B | MMLU | 67.3% | 67.6% |
QWen1.5-32B | MMLU | 72.5% | 73.4% | QWen1.5-72B | MMLU | 76.4% | 77.5% |
Qwen1.5-110B | MMLU | 80.4% | 80.4% | Yi-34B | MMLU | 76.3% | 75.8% |
QWen2-0.5B | MMLU | 44.6% | 45.4% | QWen2-1.5B | MMLU | 54.7% | 56.5% |
QWen2-7B | MMLU | 70.3% | 70.3% | QWen2-57B-A14B | MMLU | 75.6% | 76.5% |
QWen2-72B | MMLU | 83.6% | 84.2% | MiniCPM-2B | MMLU | 51.6% | 53.4% |
DeepSeek-V2-Lite-16B | MMLU | 58.1% | 58.3% | QWen2.5-0.5B | MMLU | 47.67% | 47.5% |
QWen2.5-1.5B | MMLU | 59.4% | 60.9% | QWen2.5-3B | MMLU | 65.6% | 65.6% |
QWen2.5-7B | MMLU | 73.8% | 74.2% | QWen2.5-14B | MMLU | 79.4% | 79.7% |
QWen2.5-32B | MMLU | 83.3% | 83.3% | QWen2.5-72B | MMLU | 85.59% | 86.1% |
InternLM2.5-1.8B | MMLU | 51.3% | 53.5% | InternLM2.5-7B | MMLU | 71.6% | 71.6% |
InternLM2.5-20B | MMLU | 73.3% | 74.2% | InternLM3-8B | MMLU | 76.6% | 76.6% |
Yi1.5-6B | MMLU | 63.2% | 63.5% | Yi1.5-9B | MMLU | 69.2% | 69.5% |
Yi1.5-34B | MMLU | 76.9% | 77.1% | CodeQWen2.5-7B | HumanEval | 66.5% | 61.6% |
Qwen2.5-Math-7B | MMLU-STEM | 67.8% | 67.8% | Qwen2.5-Math-72B | MMLU-STEM | 83.7% | 82.8% |
MiniCPM3-4B | MMLU | 63.7% | 64.6% | Phi-3.5-mini-instruct | MMLU | 64.39% | 64.34% |
Phi-3.5-MoE-instruct | MMLU | 78.5% | 78.9% | DeepSeek-Math-7B | MMLU-STEM | 56.5% | 56.5% |
DeepSeek-V2.5 | MMLU | 79.3% | 80.6% | DeepSeek-V2-236B | MMLU | 78.1% | 78.5% |
LLaMA3.3-70B-Instruct | MMLU | 82.7% | -- | QwQ-32B | MMLU | 81.19% | -- |
For the MindSpeed-LLM evaluation user guide, see evaluation_guide.md.