diff --git a/README.md b/README.md
index ab4c280200221e15d3a261ad95a6a69cbc04ef0b..6e0ab015e87280f1c9b0190b6a03c9bcd958f200 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,11 @@

•**硬件规格:** 支持单机、双机、四机、大集群

-•**镜像地址:** hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.1.0-aarch64-800I-A2-openeuler24.03-lts-sp2 hub.oepkgs.net/oedeploy/openeuler/x86_64/intelligence_boom:0.1.0-x86_64-800I-A2-openeuler24.03-lts-sp2
+•**镜像地址:**
+
+hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.2.0-aarch64-800I-A2-openeuler24.03-lts-sp2
+
+hub.oepkgs.net/oedeploy/openeuler/x86_64/intelligence_boom:0.2.0-x86_64-800I-A2-openeuler24.03-lts-sp2

@@ -14,7 +18,11 @@

•**硬件规格:** 支持单机、双机

-•**镜像地址:** hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.1.0-aarch64-300I-Duo-openeuler24.03-lts-sp2 hub.oepkgs.net/oedeploy/openeuler/x86_64/intelligence_boom:0.1.0-x86_64-300I-Duo-openeuler24.03-lts-sp2
+•**镜像地址:**
+
+hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.2.0-aarch64-300I-Duo-openeuler24.03-lts-sp2
+
+hub.oepkgs.net/oedeploy/openeuler/x86_64/intelligence_boom:0.2.0-x86_64-300I-Duo-openeuler24.03-lts-sp2

@@ -22,17 +30,23 @@

•**硬件规格:** 支持单机单卡、单机多卡

-•**镜像地址:** hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.1.0-aarch64-A100-openeuler24.03-lts-sp2 hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.1.0-aarch64-syshax-openeuler24.03-lts-sp2-
+•**镜像地址:**
+
+hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.2.0-aarch64-A100-openeuler24.03-lts-sp2
+
+hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.2.0-aarch64-syshax-openeuler24.03-lts-sp2

-**我们的愿景:** 基于 openEuler 构建开源的 AI 基础软件事实标准,推动企业智能应用生态的繁荣。
+**我们的愿景:** 基于 openEuler 构建开源的 AI 基础软件事实标准,推动企业智能应用生态的繁荣。

-**当大模型遇见产业落地,我们为何需要全栈方案?**
+
+**当大模型遇见产业落地,我们为何需要全栈方案?**

DeepSeek创新降低大模型落地门槛,AI进入“杰文斯悖论”时刻,需求大幅增加、多模态交互突破硬件限制、低算力需求重构部署逻辑,标志着AI从“技术验证期”迈入“规模落地期”。然而,产业实践中最核心的矛盾逐渐显现:

-**产业痛点​**
+
+**产业痛点​**

**适配难​:** 不同行业(如金融、制造、医疗)的业务场景对推理延迟、算力成本、多模态支持的要求差异极大,单一模型或工具链难以覆盖多样化需求;

@@ -45,6 +59,8 @@ DeepSeek创新降低大模型落地门槛,AI进入“杰文斯悖论”时刻

**资源协同低效​:** CPU/GPU/NPU异构算力调度依赖人工经验,内存/显存碎片化导致资源闲置;

+**训练成本高昂​:** 随着模型规模的增长、上下文序列的延长,模型微调训练时对显存和计算资源的诉求飙升;
+

为了解决以上问题,我们通过开源社区协同,加速开源推理方案Intelligence BooM成熟。

@@ -56,64 +72,104 @@ DeepSeek创新降低大模型落地门槛,AI进入“杰文斯悖论”时刻

#### **智能应用平台:让您的业务快速“接轨”AI​**

-**组件构成 :** openHermes(智能体引擎,利用平台公共能力,Agent应用货架化,提供行业典型应用案例、多模态交互中间件,轻量框架,业务流编排、提示词工程等能力)、deeplnsight(业务洞察平台,提供多模态识别、Deep Research能力)
+**组件构成 :** 智能应用平台(任务规划编排、OS领域模型、智能体MCP服务),多种智能Agent(智能调优、智能运维、智能问答、深度研究)
+
+【openEuler Intelligence开源地址】https://gitee.com/openeuler/euler-copilot-framework
+
+【deepInsight开源地址】https://gitee.com/openeuler/deepInsight

**核心价值**

-**低代码开发:** openHermes提供自然语言驱动的任务编排能力,业务人员可通过对话式交互生成AI应用原型;
-
-**效果追踪​:** deeplnsight实时监控模型推理效果(如准确率、延迟、成本),结合业务指标(如转化率、故障率)给出优化建议,实现“数据-模型-业务”闭环。
+**智能化赋能操作系统,化被动调优/运维为半主动,实现操作系统智能辅助驾驶**
+
+**智能调优**
+突破多层系统负载感知、AI启发复杂系统调优策略等技术,实现典型场景性能提升10%。
+
+**智能运维**
+构建OS智能运维助手,实现从命令行到自然语言的运维方式转变,典型运维命令覆盖100%,提升系统运维易用性,支撑生态推广。突破全栈协同分析与慢节点故障诊断技术,将智算AI训推场景的故障定位效率从天级提升至小时级。
+
+**深度研究**
+基于多智能体的协同:通过构筑多智能体(大纲规划、信息检索、评估反思、报告生成),突破单智能体的能力边界,提升复杂领域的研究效果。采用上下文工程技术:结合长短期记忆、语义压缩、结构化输入输出等技术,优化上下文信息的改写、选择、压缩机制,让深度研究智能体在复杂任务中聚焦研究目标,降低幻觉效应。内容冲突检测:分析多信息源(知识库、内网、公网)间存在的内容冲突,确保报告内容的真实客观性,增强用户对研究结果的信任度。
+
+**智能应用平台**
+智能助手、调优、运维等通用能力下沉,构筑智能体服务、领域知识及系统记忆服务等技术。

-#### **推理服务:让模型“高效跑起来”**
+#### **推理服务:让模型“高效跑起来”**

-**组件构成​:** vLLM(高性能大模型推理框架)、SGLang(多模态推理加速库)
-【vLLM开源地址】https://vllm.hyper.ai/docs
+**组件构成​:** vLLM、SGLang、LLaMA Factory

**核心价值​**

-**动态扩缩容:** vLLM支持模型按需加载,结合K8s自动扩缩容策略,降低70%以上空闲算力成本;
+**动态扩缩容:** vLLM支持模型按需加载,结合K8s自动扩缩容策略,降低70%以上空闲算力成本;
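+
+下面给出一个最小示意(假设环境已安装 vLLM,模型名仅为示例,实际请以部署为准),演示上述推理服务的离线批量推理用法;vLLM 会对多条请求自动进行连续批处理:
+
+```python
+# 示意代码:vLLM 离线批量推理(模型名为假设值)
+from vllm import LLM, SamplingParams
+
+prompts = ["openEuler 是什么?", "简述 PagedAttention 的作用。"]
+sampling = SamplingParams(temperature=0.7, max_tokens=128)
+
+llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # 假设值:任一兼容的模型标识
+for output in llm.generate(prompts, sampling):  # 多条请求自动连续批处理
+    print(output.outputs[0].text)
+```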
+
+**大模型优化​:** vLLM通过PagedAttention、连续批处理等技术,将万亿参数模型的推理延迟降低50%,吞吐量提升3倍;
+
+**低成本模型微调:** 开箱即用,一站式从数据生成到微调增训,中小模型场景支持Atlas 3000等低成本硬件;大模型与多模态场景支持显存友好的Atlas 800 A2高效训推,同时提供昇腾亲和的并行策略调优工具。

-**大模型优化​:** vLLM通过PagedAttention、连续批处理等技术,将万亿参数模型的推理延迟降低50%,吞吐量提升3倍;

#### **加速层:让推理“快人一步”​​**

-**组件构成​:** sysHAX、expert-kit、ktransformers
-【sysHAX开源地址】https://gitee.com/openeuler/sysHAX
-【expert-kit开源地址】https://gitee.com/openeuler/expert-kit
+**组件构成​:** sysHAX、expert-kit、LMCache
+
+【sysHAX开源地址】https://gitee.com/openeuler/sysHAX
+
+【expert-kit开源地址】https://gitee.com/openeuler/expert-kit
+
+【LMCache开源地址】https://gitee.com/openeuler/LMCache-mindspore、https://github.com/LMCache/LMCache

**核心价值​**

-**异构算力协同分布式推理加速引擎:** 整合CPU、NPU、GPU等不同架构硬件的计算特性,通过动态任务分配实现"专用硬件处理专用任务"的优化,将分散的异构算力虚拟为统一资源池,实现细粒度分配与弹性伸缩;
+**异构算力协同分布式推理加速引擎:** 整合CPU、NPU、GPU等不同架构硬件的计算特性,通过动态任务分配实现“专用硬件处理专用任务”的优化,将分散的异构算力虚拟为统一资源池,实现细粒度分配与弹性伸缩。LMCache提供管理大规模kvcache的内存池能力,能够串联HBM、DDR、Disk以及远端存储池;其性能提升主要基于Prefix Caching(多实例间共享kvcache)、CacheGen(压缩kvcache,节约传输时间)与CacheBlend(提高缓存命中率)。

#### **框架层:让模型“兼容并蓄”**

-**组件构成​:** MindSpore(全场景框架)、PyTorch(Meta通用框架)、TensorFlow(Google工业框架)
+**组件构成​:** MindSpore(全场景框架)、PyTorch(Meta通用框架)、MS-InferRT(MindSpore框架下的推理优化组件,兼容PyTorch)

【MindSpore开源地址】https://gitee.com/mindspore

**核心价值​**

**多框架兼容:** 通过统一API接口,支持用户直接调用任意框架训练的模型,无需重写代码;

**动态图优化​:** 针对大模型的动态控制流(如条件判断、循环),提供图优化能力,推理稳定性提升30%;

-​**社区生态复用​:** 完整继承PyTorch/TensorFlow的生态工具(如Hugging Face模型库),降低模型迁移成本。
+​**社区生态复用​:** 完整继承PyTorch的生态工具(如Hugging Face模型库),降低模型迁移成本。

#### **数据工程、向量检索、数据融合分析:从原始数据到推理燃料的转化​**

-**组件构成​:** DataJuicer、Oasis、九天计算引擎、PG Vector、Milvus、GuassVector、Lotus、融合分析引擎
+**组件构成​:** openGauss、PG Vector、DataJuicer等
+
+【openGauss开源地址】https://gitee.com/opengauss

**核心价值**

@@ -125,7 +181,9 @@ DeepSeek创新降低大模型落地门槛,AI进入“杰文斯悖论”时刻

#### **任务管理平台:让资源“聪明调度”​​**

-**组件构成​:** openFuyao(任务编排引擎)、K8S(容器编排)、RAY(分布式计算)、oeDeploy(一键部署工具)
+**组件构成​:** openYuanrong、openFuyao(任务编排引擎)、K8S(容器编排)、RAY(分布式计算)、oeDeploy(一键部署工具)
+
+【openYuanrong社区地址】https://www.openeuler.openatom.cn/zh/projects/yuanrong/

【openFuyao开源地址】https://gitcode.com/openFuyao

@@ -135,112 +193,98 @@ DeepSeek创新降低大模型落地门槛,AI进入“杰文斯悖论”时刻

**核心价值​**

+**分布式计算引擎:** 提供一套统一Serverless架构,支持AI、大数据、微服务等各类分布式应用;提供多语言函数编程接口,以单机编程体验简化分布式应用开发;同时提供分布式动态调度和数据共享能力,实现分布式应用的高性能运行和集群资源的高效利用。
+
-**端边云协同:** 根据任务类型(如实时推理/离线批处理)和硬件能力(如边缘侧NPU/云端GPU),自动分配执行节点;
-**全生命周期管理​:** 从模型上传、版本迭代、依赖安装到服务启停,提供“一站式”运维界面;
-​**故障自愈​:** 实时监控任务状态,自动重启异常进程、切换备用节点,保障服务高可用性。
+**端边云协同:** 根据任务类型(如实时推理/离线批处理)和硬件能力(如边缘侧NPU/云端GPU),自动分配执行节点;
+
+**全生命周期管理​:** 从模型上传、版本迭代、依赖安装到服务启停,提供“一站式”运维界面;
+
+​**故障自愈​:** 实时监控任务状态,自动重启异常进程、切换备用节点,保障服务高可用性。

#### **编译器:让代码“更懂硬件”​​**

-**组件构成​:** 异构融合编译器(Bisheng)
+**组件构成​:** 异构融合编译器AscendNPUIR、算子自动生成工具AKG
+
+【AKG开源地址】https://gitee.com/mindspore/akg

**核心价值**

-**跨硬件优化:** 针对CPU(x86/ARM)、GPU(CUDA)、NPU(昇腾/CANN)的指令集差异,自动转换计算逻辑,算力利用率大幅提升%;
+**跨硬件优化:** 针对CPU(x86/ARM)、GPU(CUDA)、NPU(昇腾/CANN)的指令集差异,自动转换计算逻辑,算力利用率大幅提升;

**混合精度支持​:** 动态调整FP32/FP16/INT8精度,在精度损失可控的前提下,推理速度大幅提升;

​**内存优化​:** 通过算子融合、内存复用等技术,减少30%显存/内存占用,降低硬件成本。

#### **操作系统:让全栈“稳如磐石”**

-**组件构成​:** openEuler(开源欧拉操作系统)
+**组件构成​:** openEuler(开源欧拉操作系统)、FalconFS(高性能分布式存储池)、GMEM(异构融合内存)、XSched(异构算力切分)、xMig(XPU迁移)、ModelFS(可编程页缓存)

【openEuler开源地址】https://gitee.com/openeuler

+【FalconFS开源地址】https://gitee.com/openeuler/FalconFS
+
+【GMEM开源地址】https://gitee.com/openeuler/kernel
+
+【XSched开源地址】https://gitee.com/openeuler/libXSched
+
+【xMig开源地址】https://gitee.com/openeuler/xmig
+
+【ModelFS开源地址】https://gitee.com/openeuler/kernel/tree/OLK-6.6/fs/mfs

**核心价值**

**异构资源管理:** 原生支持CPU/GPU/NPU的统一调度,提供硬件状态监控、故障隔离等能力;

**安全增强​:** 集成国密算法、权限隔离、漏洞扫描模块,满足金融、政务等行业的合规要求。

+**模型权重快速加载:** 可编程页缓存以及动态缓存,权重加载速度倍级提升。

#### **硬件使能与硬件层:让算力“物尽其用”**

**组件构成​:** CANN(昇腾AI使能套件)、CUDA(英伟达计算平台)、CPU(x86/ARM)、NPU(昇腾)、GPU(英伟达/国产GPU)

**核心价值**

**硬件潜能释放:** CANN针对昇腾NPU的达芬奇架构优化矩阵运算、向量计算,算力利用率大幅提升;CUDA提供成熟的GPU并行计算框架,支撑通用AI任务;

**异构算力融合​:** 通过统一编程接口(如OpenCL),实现CPU/NPU/GPU的协同计算,避免单一硬件性能瓶颈;

-#### **互联技术:让硬件“高速对话”​​**
-
-**组件构成​:** CXL(计算与内存扩展)、NvLink(英伟达高速互联)、SUE
-
-**核心价值**
-
-**低延迟通信:** CXL/NvLink提供内存级互联带宽(>1TB/s),减少跨设备数据拷贝开销
-
-**灵活扩展:** 支持从单机(多GPU)到集群(跨服务器)的无缝扩展,适配不同规模企业的部署需求。

## 全栈解决方案部署教程

-目前方案已支持**DeepSeek**/**Qwen**/**Llama**/**GLM**/**TeleChat**等50+主流模型,以下我们选取DeepSeek V3&R1 模型和 openEuler Intelligence 应用的部署来
+目前方案已支持**DeepSeek**/**Qwen**/**Llama**/**GLM**/**TeleChat**等50+主流模型,以下以 DeepSeek V3&R1 模型和 openEuler Intelligence 应用的部署为例。

### DeepSeek V3&R1部署

参考[部署指南](https://gitee.com/openeuler/llm_solution/blob/master/doc/deepseek/DeepSeek-V3&R1%E9%83%A8%E7%BD%B2%E6%8C%87%E5%8D%97.md),使用一键式部署脚本,20min完成推理服务拉起。

### 一键式部署DeepSeek 模型和openEuler Intelligence智能应用

-参考[一键式部署openEuler Intelligence ](https://gitee.com/openeuler/llm_solution/tree/master/script/mindspore-intelligence),搭建本地知识库并协同DeepSeek大模型完成智能调优、智能运维等应用;
+参考[一键式部署openEuler Intelligence](https://gitee.com/openeuler/llm_solution/tree/master/script/mindspore-intelligence),搭建本地知识库并协同DeepSeek大模型完成智能调优、智能运维等应用。

-## 性能
-
-### 精度
-
-本方案使用8bit权重量化、SmoothQuant-8bit量化和混合量化等技术,最终以CEval精度损失2分的代价,实现了DeepSeek-R1w8a8的大模型部署。
-
-| 模型 | CEval精度 |
-| ---------------------- | --------- |
-| Claude-3.5-Sonnet-1022 | 76.7 |
-| GPT-4o 0513 | 76 |
-| DeepSeek V3 | 86.5 |
-| GPT-4o 0513 | 76 |
-| OpenAI o1-mini | 68.9 |
-| DeepSeek R1 | 91.8 |
-| Deepseek R1 w8a8 | 89.52 |
-| Deepseek R1 W4A16 | 88.78 |
-| Deepseek V3 0324 W4A16 | 87.82 |
-
-### 吞吐
-
-测试环境:
-
-1. 两台Atlas 800I A2(8\*64G)。
-2. Ascend HDK Driver 24.1.0版本,Firmware 7.5.0.3.22版本。
-3. openEuler 22.03 LTS版本(内核 5.10)。
-
-| 并发数 | 吞吐(Token/s) |
-| ------ | ------------- |
-| 1 | 22.4 |
-| 192 | 2600 |
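+
+推理服务拉起后,可用如下示意脚本做快速验证(服务地址与模型名为假设值,需按实际部署修改;假设服务暴露 OpenAI 兼容接口):
+
+```python
+# 示意代码:向部署好的推理服务发送一条对话请求
+import requests
+
+BASE_URL = "http://127.0.0.1:8000"  # 假设值:按实际服务地址修改
+payload = {
+    "model": "DeepSeek-R1",  # 假设值:按实际加载的模型名修改
+    "messages": [{"role": "user", "content": "你好,请介绍一下 openEuler。"}],
+    "max_tokens": 64,
+}
+resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=60)
+resp.raise_for_status()
+print(resp.json()["choices"][0]["message"]["content"])
+```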
diff --git a/README_en.md b/README_en.md
index 6ab365fe5dd2761593bace4b3f8f235b5a261d9c..fb9e67266408136c02d0c61ecee66332bc871bf1 100644
--- a/README_en.md
+++ b/README_en.md
@@ -6,33 +6,52 @@

**Hardware specifications:** Supports single-node system, two-node cluster, four-node cluster, and large cluster.

-**image path:** hub.oepkgs.net/oedeploy/openEuler/aarch64/intelligence_boom:0.1.0-aarch64-800I-A2-openEuler24.03-lts-sp2 hub.oepkgs.net/oedeploy/openEuler/x86_64/intelligence_boom:0.1.0-x86_64-800I-A2-openEuler24.03-lts-sp2
+**image path:**
+
+hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.2.0-aarch64-800I-A2-openeuler24.03-lts-sp2
+
+hub.oepkgs.net/oedeploy/openeuler/x86_64/intelligence_boom:0.2.0-x86_64-800I-A2-openeuler24.03-lts-sp2

**CPU+NPU (300I Duo)**

-**Hardware specifications:** Single-node system and two-node cluster are supported.
+**Hardware specifications:** Single-node system and two-node cluster are supported.

-**image path:** hub.oepkgs.net/oedeploy/openEuler/aarch64/intelligence_boom:0.1.0-aarch64-300I-Duo-openEuler24.03-lts-sp2 hub.oepkgs.net/oedeploy/openEuler/x86_64/intelligence_boom:0.1.0-x86_64-300I-Duo-openEuler24.03-lts-sp2
+**image path:**
+
+hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.2.0-aarch64-300I-Duo-openeuler24.03-lts-sp2
+
+hub.oepkgs.net/oedeploy/openeuler/x86_64/intelligence_boom:0.2.0-x86_64-300I-Duo-openeuler24.03-lts-sp2

**CPU+GPU (NVIDIA A100)**

-·**Hardware specifications:** Supports single-node single-card and single-node multi-card.
+**Hardware specifications:** Supports single-node single-card and single-node multi-card.

-**image path:** hub.oepkgs.net/oedeploy/openEuler/aarch64/intelligence_boom:0.1.0-aarch64-A100-openEuler24.03-lts-sp2 hub.oepkgs.net/oedeploy/openEuler/aarch64/intelligence_boom:0.1.0-aarch64-syshax-openEuler24.03-lts-sp2-
+**image path:**
+
+hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.2.0-aarch64-A100-openeuler24.03-lts-sp2
+
+hub.oepkgs.net/oedeploy/openeuler/aarch64/intelligence_boom:0.2.0-aarch64-syshax-openeuler24.03-lts-sp2

-**Our vision:** Build open-source AI basic software de facto standards based on openEuler to promote the prosperity of the enterprise intelligent application ecosystem.
+**Our vision:** Build open-source AI basic software de facto standards based on openEuler to promote the prosperity of the enterprise intelligent application ecosystem.

**When big models meet industry implementation, why do we need a full-stack solution?**

-DeepSeek innovation lowers the threshold for implementing large models. AI enters the "Jervins Paradox" moment. Requirements increase significantly, multi-modal interaction breaks through hardware restrictions, and low computing power requirements reconstruct deployment logic, marking the transition from the "technical verification period" to the "scale implementation period". However, the core contradictions in industry practice gradually emerged:
+DeepSeek's innovations lower the threshold for implementing large models, and AI is entering its "Jevons paradox" moment: demand increases significantly, multi-modal interaction breaks through hardware restrictions, and low computing-power requirements reshape deployment logic, marking the transition from the "technical verification period" to the "scale implementation period". However, the core contradictions in industry practice have gradually emerged:

**Industry pain points**

-**Difficult adaptation:** The requirements for inference delay, computing cost, and multi-modal support vary greatly in different industries (such as finance, manufacturing, and healthcare). A single model or tool chain cannot cover diversified requirements.
+**Difficult adaptation:** The requirements for inference latency, computing cost, and multi-modal support vary greatly across industries (such as finance, manufacturing, and healthcare); a single model or tool chain cannot cover such diversified requirements.

-**High cost:** From model training to deployment, collaboration between (PyTorch/TensorFlow/MindSpore), hardware (CPU/GPU/NPU), and storage (relational database/vector database) is required. Hardware resource utilization is low and O&M complexity increases exponentially.
+**High cost:** From model training to deployment, collaboration among frameworks (PyTorch/TensorFlow/MindSpore), hardware (CPU/GPU/NPU), and storage (relational/vector databases) is required. Hardware resource utilization is low, and O&M complexity increases exponentially.

-**Ecosystem fragmentation:** Tool chains of hardware vendors (such as Huawei and NVIDIA), framework vendors (such as Meta, and Google) are incompatible with each other. Patchwork deployment leads to long development cycles and inefficient iterations. Technical challenges
+**Ecosystem fragmentation:** Tool chains of hardware vendors (such as Huawei and NVIDIA) and framework vendors (such as Meta and Google) are incompatible with each other; patchwork deployment leads to long development cycles and inefficient iterations.

**Inference efficiency bottleneck:** The parameter scale of large models exceeds trillions. Traditional inference engines do not support dynamic graph calculation, sparse activation, and hybrid precision, causing a serious waste of computing power.

@@ -46,43 +65,64 @@ To solve the preceding problems, we collaborate with the open source community t

#### **Intelligent Application Platform: Quickly Connect Your Business to AI** ####

-**Component:** openHermes (Agent-Tone Engine, which uses the public capabilities of the platform and provides the following capabilities: Typical application cases, multi-modal interaction middleware, lightweight framework, service flow orchestration, and prompt word engineering.) and deeplnsight (Service insight platform, providing multi-modal identification and deep research capabilities)
+**Component:** Smart Application Platform (task planning and orchestration, OS domain model, agent MCP service) and multiple intelligent agents (smart tuning, intelligent O&M, intelligent Q&A, deep research)
+
+\[openEuler Intelligence open source address\] https://gitee.com/openeuler/euler-copilot-framework
+
+\[deepInsight open source address\] https://gitee.com/openeuler/deepInsight

**Core Value**

-**Low-code development:** OpenHermes provides the natural language-driven task orchestration capability, allowing service personnel to generate AI application prototypes through dialogue-based interaction.
+**Intelligent empowerment of operating systems turns passive tuning and O&M into semi-active operations, enabling intelligent assisted driving for the operating system.**

+**Intelligent optimization:** Breakthroughs in multi-layer system load awareness, AI-inspired tuning strategies for complex systems, and related technologies deliver a performance improvement of over 10% in typical scenarios.

-**Effect tracking:** Deeplnsight monitors the model inference effect (such as accuracy, delay, and cost) in real time, and provides optimization suggestions based on service indicators (such as conversion rate and failure rate), implementing closed-loop management of data, models, and services.
+**Intelligent O&M:** Build an intelligent OS O&M assistant that turns command-line operations into natural-language operations, covering 100% of typical O&M commands, improving O&M usability, and supporting ecosystem promotion. Breakthroughs in full-stack collaborative analysis and slow-node fault diagnosis reduce fault-locating time in AI training and inference scenarios from days to hours.
+
+**Deep research:** Multi-agent collaboration: by constructing multiple agents (outline planning, information retrieval, evaluation and reflection, and report generation), the solution breaks through the capability limits of a single agent and improves research quality in complex domains. Context engineering: long- and short-term memory, semantic compression, and structured input/output optimize how contextual information is rewritten, selected, and compressed, keeping the deep-research agent focused on its research goal in complex tasks and reducing hallucination. Content conflict detection: conflicts among multiple information sources (knowledge bases, intranets, and public networks) are analyzed to ensure that report content is truthful and objective, strengthening user trust in the research results.
+
+**Intelligent application platform:** General capabilities such as intelligent assistants, tuning, and O&M sink into the platform, which builds agent services, domain knowledge, and system memory services.

#### **Inference service: enabling models to run efficiently** ####

+**Components:** vLLM, SGLang, and LLaMA Factory
-**Components:** vLLM (high-performance large model inference framework) and SGLang (multi-modal inference acceleration library)

\[vLLM open source address\] https://docs.vllm.ai/en/latest/

**Core Value**

**Dynamic scaling:** vLLM supports on-demand model loading and uses the Kubernetes automatic scaling policy to reduce the idle computing power cost by more than 70%.

-**Big model optimization:** VLLM uses technologies such as PagedAttention and continuous batch processing to reduce the inference delay of trillion-parameter models by 50% and improve the throughput by three times.
+**Big model optimization:** vLLM uses technologies such as PagedAttention and continuous batching to reduce the inference delay of trillion-parameter models by 50% and triple the throughput.

+**Low-cost model fine-tuning:** Ready to use out of the box, with a one-stop pipeline from data generation to fine-tuning and incremental training. Small and medium model scenarios support low-cost hardware such as Atlas 3000; large-model and multimodal scenarios support memory-friendly, efficient training and inference on Atlas 800 A2, together with Ascend-affinity parallel-strategy tuning tools.
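+
+As a quick illustration of the serving workflow above, the following sketch queries a vLLM-style OpenAI-compatible endpoint; the base URL and model name are assumptions and must match your actual deployment:
+
+```python
+# Sketch: query an OpenAI-compatible inference endpoint (URL/model are placeholders)
+from openai import OpenAI
+
+client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")
+resp = client.chat.completions.create(
+    model="DeepSeek-R1",  # placeholder model name
+    messages=[{"role": "user", "content": "Summarize what vLLM does."}],
+    max_tokens=64,
+)
+print(resp.choices[0].message.content)
+```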
#### **Acceleration layer: Make reasoning "one step faster"** ####

-**Components:** sysHAX, expert-kit, and ktransformers
+**Components:** sysHAX, expert-kit, and LMCache

-\[sysHAX open source address\] https://gitee.com/openEuler/sysHAX
-\[Expert-Kit open source address\] https://gitee.com/openEuler/expert-kit
+\[sysHAX open source address\] https://gitee.com/openeuler/sysHAX
+
+\[expert-kit open source address\] https://gitee.com/openeuler/expert-kit
+
+\[LMCache open source address\] https://gitee.com/openeuler/LMCache-mindspore, https://github.com/LMCache/LMCache

**Core Value**

-**Heterogeneous computing power collaboration distributed inference acceleration engine:** Integrates the computing features of different architecture hardware such as CPU, NPU, and GPU, optimizes dedicated hardware processing dedicated tasks through dynamic task allocation, virtualizes scattered heterogeneous computing power into a unified resource pool, implementing fine-grained allocation and elastic scaling.
+**Heterogeneous computing power collaboration distributed inference acceleration engine:** By integrating the computational characteristics of different hardware architectures such as CPUs, NPUs, and GPUs, dynamic task allocation applies the principle of "dedicated hardware for dedicated tasks", virtualizing scattered heterogeneous computing power into a unified resource pool for fine-grained allocation and elastic scaling. LMCache provides a memory pool for managing large-scale KV caches that spans HBM, DDR, disk, and remote storage pools; its performance gains come primarily from Prefix Caching (sharing the KV cache among multiple instances), CacheGen (compressing the KV cache to save transmission time), and CacheBlend (improving cache hit rates).

#### **Framework layer: Make the model "inclusive"** ####

-**Components:** MindSpore (all-scenario framework), PyTorch (Meta general framework), and TensorFlow (Google industrial framework)
+**Components:** MindSpore (all-scenario framework), PyTorch (Meta general framework), and MS-InferRT (inference optimization component under the MindSpore framework, compatible with PyTorch)

\[MindSpore open source address\] https://gitee.com/mindspore

@@ -92,9 +132,12 @@ To solve the preceding problems, we collaborate with the open source community t

-**Dynamic graph optimization:** For dynamic control flows (such as condition judgment and loop) of large models, the graph optimization capability is provided, improving the inference stability by 30%. Community ecosystem reuse: Inherit ecosystem tools (such as Hugging Face model library) of PyTorch/TensorFlow, reducing model migration costs.
+**Dynamic graph optimization:** For dynamic control flows (such as conditional judgments and loops) in large models, graph optimization improves inference stability by 30%.

+**Community ecosystem reuse:** Fully inherits PyTorch's ecosystem tools (such as the Hugging Face model library), reducing the cost of model migration.
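+
+To illustrate the ecosystem-reuse point above, here is a minimal sketch (assuming the `transformers` package is installed; the model ID is illustrative) showing how a PyTorch model from the Hugging Face hub is loaded without rewriting any code:
+
+```python
+# Sketch: reuse a Hugging Face model directly in PyTorch (model ID is illustrative)
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # any hub model ID works the same way
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(model_id)
+
+inputs = tokenizer("Hello, openEuler!", return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=32)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```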
#### **Data engineering, vector retrieval, and data fusion analysis:** transformation from raw data to inference fuel ####

-**Components:** DataJuicer, Oasis, nine-day computing engine, PG Vector, Milvus, GuassVector, Lotus, and converged analysis engine
+**Components:** openGauss, PG Vector, DataJuicer, etc.

**Core Value**

@@ -104,108 +147,122 @@ To solve the preceding problems, we collaborate with the open source community t

#### **Task management platform: smart resource scheduling** ####

-**Components:** OpenFuyao (task orchestration engine), K8S (container orchestration), RAY (distributed computing), and oeDeploy (one-click deployment tool)
+**Components:** openYuanrong, openFuyao (task orchestration engine), K8S (container orchestration), RAY (distributed computing), and oeDeploy (one-click deployment tool)

-\[OpenFuyao open source address\] https://gitcode.com/openFuyao
-\[Ray open source address\] https://gitee.com/src-openEuler/ray
-\[Open source address of the oeDeploy\] https://gitee.com/openEuler/oeDeploy
+\[openYuanrong community address\] https://www.openeuler.openatom.cn/zh/projects/yuanrong/
+
+\[openFuyao open source address\] https://gitcode.com/openFuyao
+
+\[Ray open source address\] https://gitee.com/src-openEuler/ray
+
+\[oeDeploy open source address\] https://gitee.com/openEuler/oeDeploy

**Core Value**

+**Distributed computing engine:** Provides a unified serverless architecture that supports various distributed applications such as AI, big data, and microservices. It offers multi-language function programming interfaces to simplify the development of distributed applications with a single-machine programming experience, and provides distributed dynamic scheduling and data sharing capabilities for high-performance applications and efficient cluster resource utilization.
+
-**Device-edge-cloud synergy:** Automatically allocates execution nodes based on task types (such as real-time inference and offline batch processing) and hardware capabilities (such as edge NPUs and cloud GPUs).
+**Device-edge-cloud synergy:** Automatically allocates execution nodes based on task types (such as real-time inference and offline batch processing) and hardware capabilities (such as edge NPUs and cloud GPUs).

-**Full-lifecycle management:** provides a one-stop O&M interface, including model upload, version iteration, dependency installation, and service startup and shutdown. Fault self-healing: Monitors the task status in real time, automatically restarts abnormal processes, and switches services to the standby node, ensuring high service availability.
+**Full-lifecycle management:** Provides a one-stop O&M interface covering model upload, version iteration, dependency installation, and service startup and shutdown.

+**Self-healing:** Monitors task status in real time, automatically restarts abnormal processes, and switches to standby nodes to ensure high service availability.
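+
+The distributed-computing model described above can be sketched with Ray, one of the listed components; the workload below is a toy example, not part of the platform itself:
+
+```python
+# Sketch: express a parallel job as Ray remote tasks (toy workload)
+import ray
+
+ray.init()  # connects to a local or existing cluster
+
+@ray.remote
+def preprocess(shard):
+    # placeholder for a real preprocessing step
+    return sum(x * x for x in shard)
+
+shards = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
+futures = [preprocess.remote(s) for s in shards]  # scheduled across the cluster
+print(ray.get(futures))  # [14, 77, 194]
+```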
#### **Compiler: Making Code "More Hardware-Savvy"** ####

-**Component composition:** Heterogeneous integration compiler (Bisheng)
+**Component composition:** Heterogeneous fusion compiler AscendNPUIR and operator auto-generation tool AKG
+
+\[AKG open source address\] https://gitee.com/mindspore/akg

**Core Value**

-**Cross-hardware optimization:** Automatically converts computing logic based on instruction set differences between CPU (x86/ARM), GPU (CUDA), and NPU (Ascend/CANN), greatly improving computing power utilization by%.
+**Cross-hardware optimization:** Automatically converts computing logic to address differences in instruction sets among CPUs (x86/ARM), GPUs (CUDA), and NPUs (Ascend/CANN), significantly improving computing power utilization.

-**Mixed precision support:** Dynamically adjust the FP32/FP16/INT8 precision, greatly improving the inference speed while the precision loss is controllable. Memory optimization:**Reduces the video memory and memory usage by 30% and reduces hardware costs by using technologies such as operator convergence and memory overcommitment.
+**Mixed precision support:** Dynamically adjusts FP32/FP16/INT8 precision, greatly improving inference speed while keeping precision loss controllable.

+**Memory optimization:** Techniques such as operator fusion and memory reuse reduce GPU memory and RAM usage by 30%, lowering hardware costs.

#### **Operating System: Make the Full Stack "Stand as a Rock"** ####

-**Component:** openEuler (open-source EulerOS)
+**Component:** openEuler (open-source Euler operating system), FalconFS (high-performance distributed storage pool), GMEM (heterogeneous converged memory), XSched (heterogeneous computing power partitioning), xMig (XPU migration), and ModelFS (programmable page cache)

-\[openEuler open source address\] https://gitee.com/openEuler
+\[openEuler open source address\] https://gitee.com/openeuler
+
+\[FalconFS open source address\] https://gitee.com/openeuler/FalconFS
+
+\[GMEM open source address\] https://gitee.com/openeuler/kernel
+
+\[XSched open source address\] https://gitee.com/openeuler/libXSched
+
+\[xMig open source address\] https://gitee.com/openeuler/xmig
+
+\[ModelFS open source address\] https://gitee.com/openeuler/kernel/tree/OLK-6.6/fs/mfs

**Core Value**

**Heterogeneous resource management:** Supports unified scheduling of CPUs, GPUs, and NPUs, and provides capabilities such as hardware status monitoring and fault isolation.

**Security enhancement:** Integrates the Chinese national cryptographic algorithm, permission isolation, and vulnerability scanning modules to meet compliance requirements of industries such as finance and government.

+**Fast model weight loading:** A programmable page cache and dynamic caching speed up weight loading several-fold.
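+
+A toy sketch of the page-cache-friendly idea behind the fast-weight-loading feature above (using NumPy's memmap; the file name and shape are illustrative — this demonstrates lazy, on-demand loading only, not the ModelFS implementation):
+
+```python
+# Toy sketch: memory-map a weight file so pages are faulted in on demand
+import numpy as np
+
+SHAPE = (4096, 4096)
+np.lib.format.open_memmap("weights.npy", mode="w+", dtype=np.float16, shape=SHAPE)
+
+# "Loading" maps the file without reading it all; pages are pulled in lazily
+weights = np.load("weights.npy", mmap_mode="r")
+print(weights[0, :4])  # touching a slice faults in only the pages it needs
+```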
-**Components:** CANN (Ascend AI enablement suite), CUDA (Nvidia computing platform), CPU (x86/ARM), NPU (Ascend), GPU (Nvidia/GPU in China) +**Fast Model Weight Loading:** Programmable Page Cache and Dynamic Caching, Doubles the Speed of Weight Loading. -**Core Value** -**Hardware potential release:** CANN optimizes matrix computing and vector computing for Ascend NPU Da Vinci architecture, greatly improving computing power utilization. CUDA provides a mature GPU parallel computing framework to support common AI tasks. +#### **Hardware Enablement and Hardware Layer: Make the Most of Computing Power** #### + + +**Components:** CANN (Ascend AI enablement suite), CUDA (Nvidia computing platform), CPU (x86/ARM), NPU (Ascend), GPU (Nvidia/GPU in China) + + +**Core Value** + + +**Hardware potential release:** CANN optimizes matrix computing and vector computing for Ascend NPU Da Vinci architecture, greatly improving computing power utilization. CUDA provides a mature GPU parallel computing framework to support common AI tasks. + -**Heterogeneous computing power convergence:** Using unified programming interfaces (such as OpenCL) to implement collaborative computing among CPUs, NPUs, and GPUs, avoiding performance bottlenecks of a single hardware. +**Heterogeneous computing power convergence:** Using unified programming interfaces (such as OpenCL) to implement collaborative computing among CPUs, NPUs, and GPUs, avoiding performance bottlenecks of a single hardware. -#### **Connected technology: "high-speed conversation" with hardware** #### -**Component composition:** CXL (computing and memory expansion), NVLink (Nvidia high-speed interconnection), SUE +#### **Connected technology: "high-speed conversation" with hardware** #### -**Core Values** -**Low latency communication:** CXL/NvLink provides memory-class interconnect bandwidth (>1 TB/s) to reduce cross-device data copy overhead -**Flexible expansion:** Supports seamless expansion from a single-node system (multi-GPU) to a cluster (cross-server) to meet the deployment requirements of enterprises of different scales. +## Full-Stack Solution Deployment Tutorial ## -## Full-Stack Solution Deployment Tutorial ## -Currently, the solution supports more than 50 mainstream models, such as DeepSeek/Qwen/Llama/GLM/TeleChat. The following describes how to deploy the DeepSeek V3&R1 model and openEuler Intelligence application. +Currently, the solution supports more than 50 mainstream models, such as DeepSeek/Qwen/Llama/GLM/TeleChat. The following describes how to deploy the DeepSeek V3&R1 model and openEuler Intelligence application. -### DeepSeek V3 and R1 deployment ### -Reference[Deployment Guide](https://gitee.com/openEuler/llm_solution/blob/master/doc/deepseek/DeepSeek-V3%26R1Deployment%20Guide_en.md) Use the one-click deployment script to start the inference service within 20 minutes. +### DeepSeek V3 and R1 deployment ### -### One-click deployment of the DeepSeek model and openEuler Intelligence intelligent application ### -Reference[One-click deployment of openEuler Intelligence](https://gitee.com/openEuler/llm_solution/tree/master/script/mindspore-intelligence/README_en.md) Build a local knowledge base and collaborate with the DeepSeek big model to complete applications such as intelligent optimization and intelligent O&M. +Reference[Deployment Guide](https://gitee.com/openEuler/llm_solution/blob/master/doc/deepseek/DeepSeek-V3%26R1Deployment%20Guide_en.md) Use the one-click deployment script to start the inference service within 20 minutes. 
-## Performance ## -### Precision ### +### One-click deployment of the DeepSeek model and openEuler Intelligence intelligent application ### -This solution uses 8-bit weight quantization, SmoothQuant 8-bit quantization, and hybrid quantization technologies, and finally realizes the deployment of DeepSeek-R1w8a8 large model at the cost of CEval precision loss of 2 points. -| modelling | CEval Accuracy | -| ---------------------- | -------------- | -| Claude-3.5-Sonnet-1022 | 76.7 | -| GPT-4o 0513 | 76 | -| DeepSeek V3 | 86.5 | -| GPT-4o 0513 | 76 | -| OpenAI o1-mini | 68.9 | -| DeepSeek R1 | 91.8 | -| Deepseek R1 w8a8 | 89.52 | -| Deepseek R1 W4A16 | 88.78 | -| Deepseek V3 0324 W4A16 | 87.82 | +Reference[One-click deployment of openEuler Intelligence](https://gitee.com/openEuler/llm_solution/tree/master/script/mindspore-intelligence/README_en.md) Build a local knowledge base and collaborate with the DeepSeek big model to complete applications such as intelligent optimization and intelligent O&M. -### Throughput ### -Test environment: -1. Two Atlas 800I A2 (8 x 64 GB) -2. Ascend HDK Driver 24.1.0 and Firmware 7.5.0.3.22. -3. openEuler 22.03 LTS version (kernel 5.10). +## Participation and contribution ## -| Number of concurrent requests | Throughput (Token/s) | -| ----------------------------- | -------------------- | -| 1 | 22.4 | -| 192 | 2600 | -## Participation and contribution ## +Welcome to provide your valuable suggestions in the issue mode to build a full-stack open source inference solution with excellent out-of-the-box and leading performance. -Welcome to provide your valuable suggestions in the issue mode to build a full-stack open source inference solution with excellent out-of-the-box and leading performance. # llm_solution # diff --git a/doc/deepseek/asserts/IntelligenceBoom.png b/doc/deepseek/asserts/IntelligenceBoom.png index 53464ffee59f86edb420a85e51966813fab25bd7..adae850e413af5943360401ddc918cb28cf3d686 100644 Binary files a/doc/deepseek/asserts/IntelligenceBoom.png and b/doc/deepseek/asserts/IntelligenceBoom.png differ diff --git a/doc/deepseek/asserts/IntelligenceBoom_en.png b/doc/deepseek/asserts/IntelligenceBoom_en.png index 8be7b74d3a06a9cf911467fa5d2789d464110b5b..202fb71a6c32b6e7f83e70150daa4c2e092d740e 100644 Binary files a/doc/deepseek/asserts/IntelligenceBoom_en.png and b/doc/deepseek/asserts/IntelligenceBoom_en.png differ