From 6b7eef54aab9dfb483407e326978676403d35cb9 Mon Sep 17 00:00:00 2001
From: broad-sea-life <1457653356@qq.com>
Date: Tue, 9 Sep 2025 10:11:02 +0800
Subject: [PATCH 1/2] Optimize README.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
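Note on the submission layout described in the updated README: a minimal sketch of a local pre-submission check, assuming the four file names shown in the new directory tree (torchcode.py, cudacode.py, run_code.py, prompt.txt). The helper name `missing_files` is hypothetical, and treating all four files as mandatory is an assumption; the README only states that run_code.py must keep its exact name.

```python
from pathlib import Path

# File names taken from the directory tree in the updated README.
# Assumption: all four are required; the real review may check more or less.
REQUIRED_FILES = ("torchcode.py", "cudacode.py", "run_code.py", "prompt.txt")


def missing_files(submission_dir):
    """Return the required file names that are absent from submission_dir."""
    root = Path(submission_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

Calling `missing_files` on a submission directory (for example the folder named after your issue id under S1) returns an empty list when nothing is missing.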
 README.md | 99 +++++++++++++++++++++++++------------------------------
 1 file changed, 45 insertions(+), 54 deletions(-)
diff --git a/README.md b/README.md
index a3f512e..4c6f305 100644
--- a/README.md
+++ b/README.md
@@ -23,45 +23,54 @@
 
 ## 📥 How to participate
 
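-* Go to the [GPUCodeForces competition home page](https://gitee.com/ccf-ai-infra/GPUCodeForces) and log in with the Gitee account you are using for this round of the competition. Once one of your submissions has been merged into the repository, you have successfully entered! Timing and method are up to you: whenever inspiration strikes, start coding~
+1. Visit the [GPUCodeForces competition home page](https://gitee.com/ccf-ai-infra/GPUCodeForces)
+
+2. Log in to your Gitee account and sign the CLA
+
+3. Fork the repository to your own account
+
+4. Create or choose a problem (issue) to optimize
+
+5. Submit a PR and wait for review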
 
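    ### 🌰 A worked example
 
-   * After logging in to (or registering) your Gitee account, open the competition home page and look through the files in the repository. Read [how-to-contribute.md](https://gitee.com/ccf-ai-infra/GPUCodeForces/blob/main/how-to-contribute.md) carefully, sign the CLA, and get familiar with the submission process.
+   * After logging in to (or registering) your Gitee account, open the competition home page and look through the files in the repository. Read [how-to-contribute.md](https://gitee.com/ccf-ai-infra/GPUCodeForces/blob/main/how-to-contribute.md) carefully, fork the repository to your own Gitee account, then sign the CLA and get familiar with the submission process.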
 
    * Among the files in the repository there is an example folder:
 
-     <img src="./images/readme_sample_check.png">
+     <img src="./images/readme_sample_check.png" class="center">
    
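-     This is a sample we provide. Next we use it as the basis for one complete operator-optimization submission (we encourage you to find better operators of your own to optimize).
+     This is a sample we provide. Focus on four files: torchcode.py, prompt.txt, cudacode_ori.py, and example_cudacode.py. Next we walk through one complete operator optimization based on these files; **these four are also the code files you ultimately need to submit** (we encourage you to find better operators of your own to optimize).
 
-   * We clone the sample to our own computer
-     and focus on four files: torchcode.py, prompt.txt, cudacode_ori.py, and example_cudacode.py; these four are the code files that ultimately need to be submitted. This competition uses 沐曦 compute on the [模力方舟](https://ai.gitee.com/compute) platform, and you need to purchase an instance with compute vouchers:
+   * This competition uses cloud compute on the [模力方舟](https://ai.gitee.com/compute) platform, and you need to purchase an instance with compute vouchers: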
 
-     <img src="./images/readme_git_compute.png">
+     <img src="./images/readme_git_compute.png" class="center">
 
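      You can then modify the code on the cloud instance. For how to obtain compute vouchers, see the [compute platform usage guide](https://ai.gitee.com/docs/compute/container) and [compute voucher issuance and redemption](https://ai.gitee.com/docs/billing/coupons).
 
    * Then create a new issue in the competition repository and fill in the problem. Here we are optimizing the example-001 operator, so the issue title can be "Performance optimization for the 001-example dataset":
 
-     <img src="./images/readme_sample_issue.png">
+     <img src="./images/readme_sample_issue.png" class="center">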
    
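      Note the "#ICVXKH" here: this is the issue id. Each operator optimization or new operator should be bound to its own issue id (the number of issues that pass review is the number of successful submissions). When you are about to submit, create a folder named after this id (without the # sign) under the S1 folder of the competition repository. The folder contains the four required files plus any supplementary material you need (such as a readme file or other datasets you used):
 
-     <img src="./images/readme_sample_folder.png">
+     <img src="./images/readme_sample_folder.png" class="center">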
 
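    * With the preparation done, let us look at the code in example-001:
 
      **example_torchcode.py:** the baseline model. The sample provides a simple PyTorch model that contains only a ReLU activation function.
-      
-      * <span style="background-color: grey; color: black; user-select: none;">get_inputs()</span>: generates the input data the model needs at run time.
-      * <span style="background-color: grey; color: black; user-select: none;">get_init_inputs()</span>: generates the arguments required to initialize the model (here, the weight matrix weight).
 
-     <span style="background-color: grey; color: black; user-select: none;">example_cudacode.py</span>: the optimized model. The sample uses PyTorch's load_inline to compile and load CUDA code directly, and defines a new model class that replaces PyTorch's ReLU with a custom CUDA implementation.
+     where
+
+      * **get_inputs()**: generates the input data the model needs at run time.
+      * **get_init_inputs()**: generates the arguments required to initialize the model (here, the weight matrix weight).
 
-     <span style="background-color: grey; color: black; user-select: none;">run_code.py</span>: the verification and benchmarking script. It checks that the custom CUDA implementation matches the original PyTorch implementation numerically, compares the performance of the two implementations, and computes the speedup.
+     **example_cudacode.py**: the optimized model. The sample uses PyTorch's load_inline to compile and load CUDA code directly, and defines a new model class that replaces PyTorch's ReLU with a custom CUDA implementation.
 
-     <span style="background-color: grey; color: black; user-select: none;">prompt.txt</span>: here participants are given some hints:
+     **run_code.py**: the verification and benchmarking script. It checks that the custom CUDA implementation matches the original PyTorch implementation numerically, compares the performance of the two implementations, and computes the speedup.
+
+     **prompt.txt**: this file gives participants some hints:
 
        *  Asks for custom CUDA kernels that replace PyTorch operators to obtain a speedup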
 
@@ -73,13 +82,13 @@
 
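     * Next, let us optimize this operator:
 
-       <span style="background-color: skyblue; color: black; user-select: none;">example_torchcode.py</span>: extended from a single ReLU operation to a composite matrix multiplication + ReLU operation, with a learnable weight parameter (weight) added and the expected input and output tensor shapes specified.
+       **Optimized example_torchcode.py**: extended from a single ReLU operation to a composite matrix multiplication + ReLU operation, with a learnable weight parameter (weight) added and the expected input and output tensor shapes specified.
 
-       <span style=" background-color: skyblue; color: black; user-select: none;">example_cudacode.py</span>: uses a partial-replacement strategy; only the ReLU is replaced by a CUDA implementation, while PyTorch's highly optimized matrix multiplication is kept.
+       **Optimized example_cudacode.py**: uses a partial-replacement strategy; only the ReLU is replaced by a CUDA implementation, while PyTorch's highly optimized matrix multiplication is kept.
 
-       <span style="background-color: skyblue; color: black; user-select: none;">run_code.py</span>: adds maximum-difference and mean-difference calculations, uses a stricter tolerance (1e-05), adds a warm-up step so that the GPU state is stable, and raises the iteration count (to 1000) for more accurate measurement.
+       **Optimized run_code.py**: adds maximum-difference and mean-difference calculations, uses a stricter tolerance (1e-05), adds a warm-up step so that the GPU state is stable, and raises the iteration count (to 1000) for more accurate measurement.
 
-       <span style="background-color: skyblue; color: black; user-select: none;">prompt.txt</span>: shows two different optimization examples, an addition operation and a ReLU operation, and stresses that you may freely choose your optimization strategy, including advanced techniques such as operator fusion.
+       **Optimized prompt.txt**: shows two different optimization examples, an addition operation and a ReLU operation, and stresses that you may freely choose your optimization strategy, including advanced techniques such as operator fusion.
 
     * Once the code is optimized, you can run it on a 模力方舟 instance: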
 
@@ -91,43 +100,26 @@
     * Next, save the optimized code locally, then follow [how-to-contribute.md](https://gitee.com/ccf-ai-infra/GPUCodeForces/blob/main/how-to-contribute.md) to submit it to the repository and get it merged.
 
     * Finally, successfully submitted code is merged under S1/#your_issue id, and the associated PR is closed, like this:
-    <img src="./images/readme_sample_merge.png">
+    <img src="./images/readme_sample_merge.png" class="center">
   
   🌳 That is one complete submission flow. We look forward to seeing everyone play to their strengths and show what they can do!
 
 
-### 📦 PR submission content
+### 📦 PR submission format
 
 *   **One PR contains one sample directory** [submission example](https://gitee.com/ccf-ai-infra/GPUCodeForces/tree/main/example/001-example)
     
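-*   Each submission directory should contain the following:
-    
-    1.  **Sample code:** the torch code sample
-        
-    2.  **Comparison code:** the CUDA code corresponding to the torch code
-        
-    3.  **Test entry point:** run\_code.py (you must use this exact name; submitted PRs are tested on the GPU based on this name)
-        
-    4.  **Other files (or directories):** prompt (a sample prompt for having an LLM generate CUDA code from the torch code) or other optimization code
-        
-    5.  **PR directory description file:** https://gitee.com/ccf-ai-infra/GPUCodeForces/blob/main/example/001-example/readme.md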
-        
 
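-### 📦 PR submission format
-
-We recommend creating a problem (issue) before you start working, and linking your PR to the problem you created. Under the directory for each competition period (for example, S1 for the first period, S2 for the second, S3 for the third...), participants create a directory named after the problem ID (ICTXSZ), for example: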
-
-```plaintext
-.
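-├── S1 (note: name of the first season of the competition)
-│   ├── ICTXSZ (note: directory named after the problem ID, holding the sample submitted in the PR)
-|   |   ├── sample code
-│   |   ├── comparison code
-|   |   └── run_code.py (note: test entry point; every submission must include this entry test program <the name must not be changed>, though its contents may be modified.)
-│   └── ……
-└── S2 (second season of the competition)
-    └── problem 1
+*   Each submission directory should contain the following: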
 
+```text
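+S1/ (competition period)
+└── [Issue_ID]/ (directory named after the problem ID)
+    ├── torchcode.py (sample code)
+    ├── cudacode.py (comparison code)
+    ├── run_code.py (test entry point; this exact name is required)
+    ├── prompt.txt (LLM generation prompt)
+    └── other necessary files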
 ```
 
 ### ⭐ Review process
@@ -146,10 +138,9 @@
 
 ### ⚠️ Notes
 
-1.  Do not plagiarize other people's code or results
+1.  Do not plagiarize other people's code
     
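-2.  Do not make abusive submissions (such as submitting the same operator multiple times or resubmitting near-identical optimizations)  
-    \---Near-identical optimization: a submission for the same test case whose speedup after optimization is no better, or is lower
+2.  Avoid resubmitting identical or near-identical optimizations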
     
 3.  Follow the submission format requirements and content guidelines
     
@@ -211,8 +202,8 @@
 
 ## 📬 Contact and help
 
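-For more information or format details, see the official documentation or submit an [idea](https://gitee.com/ccf-ai-infra/GPUCodeForces/issues/new?template=feature.yml) in this repository for discussion.  Good luck with the challenge, and may you contribute a high-quality GPU benchmark dataset! 🚀
-
-## FAQ
+Need more information? See: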
 
-[Season 1 FAQ reference](FAQ.md)
\ No newline at end of file
+* [Official documentation](https://gitee.com/ccf-ai-infra/GPUCodeForces)
+* [Submit an idea for discussion](https://gitee.com/ccf-ai-infra/GPUCodeForces/issues/new/choose)
+* [Season 1 FAQ](https://gitee.com/ccf-ai-infra/GPUCodeForces/blob/main/FAQ.md)
\ No newline at end of file
-- 
Gitee


From a4d4953fd32c4dfdea09c6e54ef0478a6668f3b6 Mon Sep 17 00:00:00 2001
From: broad-sea-life <1457653356@qq.com>
Date: Wed, 10 Sep 2025 12:12:44 +0800
Subject: [PATCH 2/2] add 002-example

---
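Note on the checks in run_code.py: the script accepts the CUDA result only when the maximum absolute error is below 1e-4 and the mean absolute error is below 1e-5, and it reports the speedup as the PyTorch time divided by the CUDA time. A minimal sketch of the same logic in plain Python, runnable without a GPU; the helper names `swish`, `precision_ok` and `speedup` are hypothetical and do not appear in the patch.

```python
import math


def swish(x):
    """Reference Swish: x * sigmoid(x), written as x / (1 + exp(-x)).

    Fine for moderate inputs; math.exp overflows for x below about -709.
    """
    return x / (1.0 + math.exp(-x))


def precision_ok(reference, candidate, max_tol=1e-4, mean_tol=1e-5):
    """Mirror the run_code.py rule: pass only if both error bounds hold."""
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    max_diff = max(diffs)
    mean_diff = sum(diffs) / len(diffs)
    return max_diff < max_tol and mean_diff < mean_tol


def speedup(baseline_time, optimized_time):
    """Baseline time over optimized time; 0 when the divisor is not positive."""
    return baseline_time / optimized_time if optimized_time > 0 else 0
```

Because both bounds must hold, a single large outlier fails the check even when the mean error is small.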
 example/002-example/example_cudacode.py  | 48 +++++++++++++++
 example/002-example/example_torchcode.py | 20 ++++++
 example/002-example/prompt.txt           | 31 ++++++++++
 example/002-example/run_code.py          | 78 ++++++++++++++++++++++++
 4 files changed, 177 insertions(+)
 create mode 100644 example/002-example/example_cudacode.py
 create mode 100644 example/002-example/example_torchcode.py
 create mode 100644 example/002-example/prompt.txt
 create mode 100644 example/002-example/run_code.py

diff --git a/example/002-example/example_cudacode.py b/example/002-example/example_cudacode.py
new file mode 100644
index 0000000..03da087
--- /dev/null
+++ b/example/002-example/example_cudacode.py
@@ -0,0 +1,48 @@
+import torch
+from torch.utils.cpp_extension import load_inline
+
+# CUDA implementation of the Swish activation function (x * sigmoid(x))
+swish_source = """
+#include <torch/extension.h>
+#include <cuda_runtime.h>
+
+__global__ void swish_kernel(const float* x, float* y, int size) {
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if (idx < size) {
+        // Compute Swish: x * (1 / (1 + exp(-x)))
+        float val = x[idx];
+        float sigmoid = 1.0f / (1.0f + expf(-val));
+        y[idx] = val * sigmoid;
+    }
+}
+
+torch::Tensor swish_cuda(torch::Tensor x) {
+    auto size = x.numel();
+    auto y = torch::empty_like(x);
+    const int block_size = 256;
+    int num_blocks = (size + block_size - 1) / block_size;
+    swish_kernel<<<num_blocks, block_size>>>(x.data_ptr<float>(), y.data_ptr<float>(), size);
+    return y;
+}
+"""
+
+swish_cpp_source = """
+torch::Tensor swish_cuda(torch::Tensor x);
+"""
+
+# Compile the inline CUDA code
+swish = load_inline(
+    name="swish",
+    cpp_sources=swish_cpp_source,
+    cuda_sources=swish_source,
+    functions=["swish_cuda"],
+    verbose=True
+)
+
+class ModelNew(torch.nn.Module):
+    def __init__(self):
+        super(ModelNew, self).__init__()
+        self.swish = swish  # module exposing the custom Swish operator
+
+    def forward(self, x):
+        return self.swish.swish_cuda(x)
\ No newline at end of file
diff --git a/example/002-example/example_torchcode.py b/example/002-example/example_torchcode.py
new file mode 100644
index 0000000..cac6a94
--- /dev/null
+++ b/example/002-example/example_torchcode.py
@@ -0,0 +1,20 @@
+import torch
+import torch.nn as nn
+
+class Model(nn.Module):
+    def __init__(self):
+        super(Model, self).__init__()
+    
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        # Swish (SiLU) in place of the original ReLU
+        return x * torch.sigmoid(x)  # Swish composed from PyTorch ops; same function as torch.nn.functional.silu
+
+batch_size = 16
+dim = 16384
+
+def get_inputs():
+    x = torch.randn(batch_size, dim)
+    return [x]
+
+def get_init_inputs():
+    return []  # no special initialization inputs needed
\ No newline at end of file
diff --git a/example/002-example/prompt.txt b/example/002-example/prompt.txt
new file mode 100644
index 0000000..a9f5eb1
--- /dev/null
+++ b/example/002-example/prompt.txt
@@ -0,0 +1,31 @@
+Write a custom CUDA kernel that fuses matrix multiplication with GELU activation.
+
+The original architecture performs:
+1. Matrix multiplication: output = input @ weight.T + bias
+2. GELU activation: gelu_output = gelu(output)
+
+You should fuse these two operations into a single CUDA kernel to avoid:
+- Storing the intermediate matrix multiplication result to global memory
+- Reading it back for the GELU operation
+
+The GELU activation function can be approximated as:
+  gelu(x) = 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))
+
+Considerations:
+- Use 2D grid and block dimensions to parallelize over batch size and hidden features
+- Implement efficient shared memory usage for tiling if possible
+- Ensure numerical stability and precision
+
+You are given the following architecture:
+
+import torch
+import torch.nn as nn
+
+class Model(nn.Module):
+    def __init__(self, in_features=16384, hidden_features=4096):
+        super(Model, self).__init__()
+        self.linear = nn.Linear(in_features, hidden_features)
+    
+    def forward(self, x):
+        x = self.linear(x)
+        return torch.nn.functional.gelu(x)
\ No newline at end of file
diff --git a/example/002-example/run_code.py b/example/002-example/run_code.py
new file mode 100644
index 0000000..fbd9757
--- /dev/null
+++ b/example/002-example/run_code.py
@@ -0,0 +1,78 @@
+import torch
+import time
+from example_torchcode import Model, get_inputs, get_init_inputs
+from example_cudacode import ModelNew
+
+def run_benchmark():
+    if not torch.cuda.is_available():
+        print("CUDA is not available")
+        return False, 0.0  # keep the (precision_flag, speedup) shape the caller unpacks
+    
+    device = torch.device("cuda")
+    
+    # Prepare the input data
+    inputs = [x.cuda(device=device) for x in get_inputs()]
+    init_inputs = [x.cuda(device=device) if isinstance(x, torch.Tensor) else x for x in get_init_inputs()]
+
+    # Initialize the models
+    torch_model = Model(*init_inputs).cuda()
+    cuda_model = ModelNew(*init_inputs).cuda()
+
+    torch_model.eval()
+    cuda_model.eval()
+
+    print("-------------------- Precision check --------------------")
+    with torch.no_grad():
+        # Warm up the GPU
+        _ = torch_model(*inputs)
+        _ = cuda_model(*inputs)
+        
+        # Actual run
+        output_torch = torch_model(*inputs)
+        output_cuda = cuda_model(*inputs)
+
+    # Precision check
+    abs_diff = torch.abs(output_torch - output_cuda)
+    max_diff = torch.max(abs_diff).item()
+    mean_diff = torch.mean(abs_diff).item()
+    
+    if max_diff < 1e-4 and mean_diff < 1e-5:
+        print(f"✅ Precision aligned: max error {max_diff:.6f}, mean error {mean_diff:.6f}")
+        precision_flag = True
+    else:
+        print(f"❌ Precision mismatch: max error {max_diff:.6f}, mean error {mean_diff:.6f}")
+        precision_flag = False
+
+    print("\n-------------------- Speedup benchmark --------------------")
+    num_iterations = 100
+    
+    # Warm up the GPU
+    for _ in range(10):
+        _ = torch_model(*inputs)
+        _ = cuda_model(*inputs)
+    
+    # Time the PyTorch model
+    torch.cuda.synchronize()
+    start_time = time.perf_counter()
+    for _ in range(num_iterations):
+        _ = torch_model(*inputs)
+    torch.cuda.synchronize()
+    torch_time = (time.perf_counter() - start_time) / num_iterations
+    
+    # Time the custom CUDA kernel
+    torch.cuda.synchronize()
+    start_time = time.perf_counter()
+    for _ in range(num_iterations):
+        _ = cuda_model(*inputs)
+    torch.cuda.synchronize()
+    cuda_time = (time.perf_counter() - start_time) / num_iterations
+    
+    print(f"PyTorch Swish average execution time: {torch_time:.6f} s")
+    print(f"Custom CUDA Swish average execution time: {cuda_time:.6f} s")
+    speedup = torch_time / cuda_time if cuda_time > 0 else 0
+    print(f"Speedup: {speedup:.2f}x")
+    
+    return precision_flag, speedup
+
+if __name__ == "__main__":
+    precision_flag, speedup = run_benchmark()
\ No newline at end of file
-- 
Gitee