# omni-ops

**Repository Path**: omniai/omni-ops

## Basic Information

- **Project Name**: omni-ops
- **Description**: No description available
- **Primary Language**: C++
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 64
- **Created**: 2026-05-26
- **Last Updated**: 2026-06-18

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# omni-ops

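A repository of high-performance custom operators tuned for Ascend hardware, aimed at large-model training and inference. It provides operators implemented in AscendC, PyPTO, and Triton, and exposes them through the PyTorch Adapter (PTA) as `torch.ops.custom.npu_*` interfaces so that they can be called directly from PyTorch.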
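As a minimal sketch of how a caller might confirm that a custom operator is registered before using it, the helper below probes a namespace object for `npu_*` attributes. The operator names `npu_example_op` and `npu_missing_op` are hypothetical placeholders, not names confirmed by this repository, and a stand-in namespace replaces `torch.ops.custom` so the snippet runs without an NPU.

```python
from types import SimpleNamespace


def available_ops(namespace, candidates):
    """Return the candidate names that resolve as attributes of `namespace`.

    `namespace` is assumed to behave like `torch.ops.custom`: looking up an
    unregistered operator raises AttributeError, which `hasattr` reports as False.
    """
    return [name for name in candidates
            if name.startswith("npu_") and hasattr(namespace, name)]


# Stand-in for torch.ops.custom with one hypothetical operator registered.
fake_custom = SimpleNamespace(npu_example_op=lambda x: x)

found = available_ops(fake_custom, ["npu_example_op", "npu_missing_op"])
print(found)
```

On a machine with `torch`, `torch_npu`, and one of the extension packages installed, the same helper could presumably be called with `torch.ops.custom` as the namespace. Which import actually registers the operators is not stated here, so treat that step as an assumption.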

## Project Structure

```
omni-ops/
├── inference/                                                  # Inference operators
│   └── ascendc/                                                # AscendC implementation
│       ├── cmake/                                              # CMake build modules and scripts
│       ├── scripts/                                            # Build/deployment helper scripts
│       ├── src/                                                # Source directory
│       │   ├── ops-transformer/                                # Transformer operators
│       │   │   ├── attention/                                  # Attention operators
│       │   │   │   ├── ai_infra_sparse_flash_attention_gqa/    # Sparse Flash Attention (GQA)
│       │   │   │   ├── ai_infra_sparse_flash_attention_pioneer/ # Sparse Flash Attention (Pioneer)
│       │   │   │   ├── ai_infra_kv_quant_sparse_flash_attention/ # KV-quantized Sparse Flash Attention
│       │   │   │   ├── ai_infra_fused_infer_attention_sink/    # Fused inference Attention Sink
│       │   │   │   ├── ai_infra_fused_infer_attention_sink_metadata/ # Attention Sink metadata
│       │   │   │   ├── ai_infra_fused_causal_conv1d/           # Fused causal 1D convolution
│       │   │   │   ├── ai_infra_causal_conv1d_add/             # Causal 1D convolution with add
│       │   │   │   ├── ai_infra_chunk_gated_delta_rule_recurrence/ # Chunked gated Delta Rule recurrence
│       │   │   │   ├── ai_infra_esa_select_topk/               # ESA TopK selection
│       │   │   │   ├── ai_infra_lower_triangular_inverse/      # Lower-triangular matrix inverse
│       │   │   │   └── ai_infra_quant_lightning_indexer/        # Quantized Lightning Indexer
│       │   │   ├── mhc/                                        # Manifold Constrained Hyper Connection
│       │   │   │   ├── ai_infra_mhc_pre_split_post_res/        # MHC Pre Split Post Res
│       │   │   │   └── ai_infra_mhc_sandwich_norm_post_preonly/ # MHC Sandwich Norm Post
│       │   │   ├── index/                                      # Index operators
│       │   │   │   └── ai_infra_scatter_block_update/          # Scatter Block Update
│       │   │   └── posembedding/                               # Position-embedding operators
│       │   │       └── ai_infra_kv_rms_norm_rope_cache/        # KV RMSNorm RoPE Cache
│       │   ├── tests/                                          # Test framework code
│       │   └── utils/                                          # Common utilities
│       └── torch_ops_extension/                                # PyTorch operator extension interface
│           └── omni_custom_ops/                                # Inference custom operator package
│
├── training/                                                   # Training operators
│   ├── ascendc/                                                # AscendC implementation
│   │   ├── cmake/                                              # CMake build modules and scripts
│   │   ├── src/                                                # Source directory
│   │   │   ├── ops-transformer/                                # Transformer operators
│   │   │   │   ├── attention/                                  # Attention operators
│   │   │   │   │   ├── flash_attention_score_enhance/          # Enhanced Flash Attention Score
│   │   │   │   │   ├── flash_attention_score_grad_enhance/     # Enhanced Flash Attention Score backward
│   │   │   │   │   ├── sparse_flash_attention_enhance/         # Enhanced Sparse Flash Attention
│   │   │   │   │   ├── sparse_flash_attention_grad_enhance/    # Enhanced Sparse Flash Attention backward
│   │   │   │   │   ├── lightning_indexer_enhance/              # Enhanced Lightning Indexer
│   │   │   │   │   └── sparse_lightning_indexer_grad_kl_loss_enhance/ # Sparse Lightning Indexer KL Loss backward
│   │   │   │   ├── mhc/                                       # MHC operators (forward and backward)
│   │   │   │   │   ├── ai_infra_manifold_constrained_hyper_connection_pre/ # MHC pre-processing
│   │   │   │   │   ├── ai_infra_manifold_constrained_hyper_connection_pre_grad/ # MHC pre-processing backward
│   │   │   │   │   ├── ai_infra_manifold_constrained_hyper_connection_post/ # MHC post-processing
│   │   │   │   │   ├── ai_infra_manifold_constrained_hyper_connection_post_grad/ # MHC post-processing backward
│   │   │   │   │   ├── ai_infra_mhc_post_grad/                # MHC Post backward
│   │   │   │   │   ├── ai_infra_sinkhorn_grad/                 # Sinkhorn backward
│   │   │   │   │   └── manifold_constrained_hyper_connection_sinkhorn_enhance/ # Enhanced MHC Sinkhorn
│   │   │   │   ├── mome/                                      # MoME mixture-of-experts operators
│   │   │   │   │   ├── ai_infra_aggregate_hidden/              # Aggregate hidden state
│   │   │   │   │   └── ai_infra_aggregate_hidden_grad/         # Aggregate hidden state backward
│   │   │   │   └── common/                                    # Common components
│   │   │   ├── ops-communication/                              # Collective communication extension operators (based on HCCL)
│   │   │   │   ├── common/                                     # Common components
│   │   │   │   └── ai_infra_all_gather_batch/                  # Batched AllGather
│   │   │   ├── tests/                                          # Test framework code
│   │   │   └── utils/                                          # Common utilities
│   │   └── torch_ops_extension/                                # PyTorch operator extension interface
│   │       └── omni_training_custom_ops/                       # Training custom operator package
│   │
│   ├── pypto/                                                  # PyTorch PTO implementation
│   │   └── src/                                                # Source directory
│   │       └── ops-nn/                                         # NN operators
│   │           └── quant/                                      # Quantization operators
│   │
│   └── triton/                                                 # Triton implementation
│       └── src/                                                # Source directory
│           └── ops_transformer/                                # Transformer operators
│               └── attention/                                  # Attention operators
│                   └── gated_delta_net/                         # Gated Delta Net
│
└── .gitee/                                                     # Gitee configuration (issue templates, etc.)
```
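The layout above can be walked programmatically. The following is a rough sketch, assuming operators sit exactly three levels below a `src/` directory (`group/category/operator`) and that `tests`, `utils`, and `common` hold shared code. The function name `list_operators` is illustrative only, and the sketch would miss operators placed one level shallower, such as those under `ops-communication/`.

```python
import tempfile
from pathlib import Path

SKIP = {"tests", "utils", "common"}  # shared code, not operators


def list_operators(src_root):
    """Map 'group/category' -> sorted operator directory names under src_root."""
    result = {}
    for op_dir in Path(src_root).glob("*/*/*"):
        group, category, op = op_dir.relative_to(src_root).parts
        # Ignore plain files and anything inside a shared-code directory.
        if not op_dir.is_dir() or SKIP & {group, category, op}:
            continue
        result.setdefault(f"{group}/{category}", []).append(op)
    return {key: sorted(ops) for key, ops in result.items()}


# Build a small throwaway copy of part of the layout to demonstrate.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "inference" / "ascendc" / "src"
    for rel in [
        "ops-transformer/attention/ai_infra_esa_select_topk",
        "ops-transformer/index/ai_infra_scatter_block_update",
        "tests",
        "utils",
    ]:
        (src / rel).mkdir(parents=True)
    operators = list_operators(src)

print(operators)
```

Pointing `list_operators` at a real checkout, for example `inference/ascendc/src`, should group the attention, MHC, index, and position-embedding operators by category, provided the checkout matches the tree shown here.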

## Operator List

### Inference Operators (inference/ascendc)

| Category | Operator | Description |
|------|--------|------|
| Attention | ai_infra_sparse_flash_attention_gqa | Sparse Flash Attention (GQA) |
| Attention | ai_infra_sparse_flash_attention_pioneer | Sparse Flash Attention (Pioneer) |
| Attention | ai_infra_kv_quant_sparse_flash_attention | KV-quantized Sparse Flash Attention |
| Attention | ai_infra_fused_infer_attention_sink | Fused inference Attention Sink |
| Attention | ai_infra_fused_causal_conv1d | Fused causal 1D convolution |
| Attention | ai_infra_causal_conv1d_add | Causal 1D convolution with add |
| Attention | ai_infra_chunk_gated_delta_rule_recurrence | Chunked gated Delta Rule recurrence |
| Attention | ai_infra_esa_select_topk | ESA TopK selection |
| Attention | ai_infra_lower_triangular_inverse | Lower-triangular matrix inverse |
| Attention | ai_infra_quant_lightning_indexer | Quantized Lightning Indexer |
| MHC | ai_infra_mhc_pre_split_post_res | MHC Pre Split Post Res |
| MHC | ai_infra_mhc_sandwich_norm_post_preonly | MHC Sandwich Norm Post |
| Index | ai_infra_scatter_block_update | Scatter Block Update |
| PosEmbedding | ai_infra_kv_rms_norm_rope_cache | KV RMSNorm RoPE Cache |

### Training Operators (training/ascendc)

| Category | Operator | Description |
|------|--------|------|
| Attention | flash_attention_score_enhance | Enhanced Flash Attention Score |
| Attention | flash_attention_score_grad_enhance | Enhanced Flash Attention Score backward |
| Attention | sparse_flash_attention_enhance | Enhanced Sparse Flash Attention |
| Attention | sparse_flash_attention_grad_enhance | Enhanced Sparse Flash Attention backward |
| Attention | lightning_indexer_enhance | Enhanced Lightning Indexer |
| Attention | sparse_lightning_indexer_grad_kl_loss_enhance | Sparse Lightning Indexer KL Loss backward |
| MHC | ai_infra_manifold_constrained_hyper_connection_pre | MHC pre-processing |
| MHC | ai_infra_manifold_constrained_hyper_connection_pre_grad | MHC pre-processing backward |
| MHC | ai_infra_manifold_constrained_hyper_connection_post | MHC post-processing |
| MHC | ai_infra_manifold_constrained_hyper_connection_post_grad | MHC post-processing backward |
| MHC | ai_infra_mhc_post_grad | MHC Post backward |
| MHC | ai_infra_sinkhorn_grad | Sinkhorn backward |
| MHC | manifold_constrained_hyper_connection_sinkhorn_enhance | Enhanced MHC Sinkhorn |
| MoME | ai_infra_aggregate_hidden | Aggregate hidden state |
| MoME | ai_infra_aggregate_hidden_grad | Aggregate hidden state backward |
| HCCL | ai_infra_all_gather_batch | Batched AllGather |

### Training Operators (training/triton)

| Category | Operator | Description |
|------|--------|------|
| Attention | gated_delta_net | Gated Delta Net |

### Training Operators (training/pypto)

| Category | Operator | Description |
|------|--------|------|
| Quant | quant | Quantization operators |

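## Technology Stack

- **AscendC**: Ascend's native C/C++ operator development framework. It operates NPU hardware resources directly and offers the best performance.
- **Triton**: Operator development in the Triton language, suited to rapid prototyping.
- **PyPTO**: A PyTorch-based approach to operator development.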

## Hardware Support

- Ascend 910B (Atlas A2)
- Ascend 910C (Atlas A3)
- Ascend 950PR (Atlas A5)

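## 📝 Related Information

- [License](./LICENSE)

    For operators in the omni-ops repository, if a License file exists in the operator's directory, that License prevails. If the operator's directory contains no License, the operator is governed by the CANN 2.0 agreement, whose text is available in [LICENSE](./LICENSE).
- [Disclaimer](./DISCLAIMER.md)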