# AoC2025

**Repository Path**: zrainx/AoC2025

## Basic Information

- **Project Name**: AoC2025
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-10-31
- **Last Updated**: 2025-10-31

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# AoC2025

A VGG accelerator designed with reference to Eyeriss.

## AI model

| Layer (type)   | Output Shape      | Param # |
| -------------- | ----------------- | ------- |
| Conv2d-1       | [-1, 32, 32, 32]  | 896     |
| BatchNorm2d-2  | [-1, 32, 32, 32]  | 64      |
| ReLU-3         | [-1, 32, 32, 32]  | 0       |
| Conv2d-4       | [-1, 32, 32, 32]  | 9,248   |
| BatchNorm2d-5  | [-1, 32, 32, 32]  | 64      |
| ReLU-6         | [-1, 32, 32, 32]  | 0       |
| MaxPool2d-7    | [-1, 32, 16, 16]  | 0       |
| Conv2d-8       | [-1, 64, 16, 16]  | 18,496  |
| BatchNorm2d-9  | [-1, 64, 16, 16]  | 128     |
| ReLU-10        | [-1, 64, 16, 16]  | 0       |
| Conv2d-11      | [-1, 64, 16, 16]  | 36,928  |
| BatchNorm2d-12 | [-1, 64, 16, 16]  | 128     |
| ReLU-13        | [-1, 64, 16, 16]  | 0       |
| MaxPool2d-14   | [-1, 64, 8, 8]    | 0       |
| Conv2d-15      | [-1, 128, 8, 8]   | 73,856  |
| BatchNorm2d-16 | [-1, 128, 8, 8]   | 256     |
| ReLU-17        | [-1, 128, 8, 8]   | 0       |
| Conv2d-18      | [-1, 128, 8, 8]   | 147,584 |
| BatchNorm2d-19 | [-1, 128, 8, 8]   | 256     |
| ReLU-20        | [-1, 128, 8, 8]   | 0       |
| MaxPool2d-21   | [-1, 128, 4, 4]   | 0       |
| Conv2d-22      | [-1, 256, 4, 4]   | 295,168 |
| BatchNorm2d-23 | [-1, 256, 4, 4]   | 512     |
| ReLU-24        | [-1, 256, 4, 4]   | 0       |
| Conv2d-25      | [-1, 256, 4, 4]   | 590,080 |
| BatchNorm2d-26 | [-1, 256, 4, 4]   | 512     |
| ReLU-27        | [-1, 256, 4, 4]   | 0       |
| MaxPool2d-28   | [-1, 256, 2, 2]   | 0       |
| Linear-29      | [-1, 256]         | 262,400 |
| ReLU-30        | [-1, 256]         | 0       |
| Linear-31      | [-1, 128]         | 32,896  |
| ReLU-32        | [-1, 128]         | 0       |
| Linear-33      | [-1, 10]          | 1,290   |

- Total params: 1,470,762
- Trainable params: 1,470,762
- Non-trainable params: 0
- Input size (MB): 0.01
- Forward/backward pass size (MB): 2.94
- Params size (MB): 5.61
- Estimated Total Size (MB): 8.56

### Hyperparameters

| Hyperparameter | Loss function    | Optimizer | Scheduler         | Weight decay or Momentum           | Epoch |
| -------------- | ---------------- | --------- | ----------------- | ---------------------------------- | ----- |
| Value          | CrossEntropyLoss | SGD       | CosineAnnealingLR | Weight decay: 5e-4, Momentum: 0.9  | 50    |

### Quantization

Given a **linear layer (128 → 10)** with an input shape of 1×128 and an output shape of 1×10, along with the energy costs for different data types, we use the table below to estimate the total energy consumed by one inference pass through this fully connected layer under the following two scenarios:

1. Full precision (FP32)
2. 8-bit integer, power-of-2, static, uniform symmetric quantization
    - activation: UINT8
    - weight: INT8

| Operation             | Energy consumption (pJ) |
| --------------------- | ----------------------- |
| FP32 Multiply         | 3.7                     |
| FP32 Add              | 0.9                     |
| INT32 Add             | 0.1                     |
| INT8 / UINT8 Multiply | 0.2                     |
| INT8 / UINT8 Add      | 0.03                    |
| Bit Shift             | 0.01                    |

## Hardware

### Hardware Architecture

![hw](./picture/Hardware_Design.png)

#### PE : Load Ifmap

When loading the ifmap, the zero-point is subtracted first. Because activations use UINT8 symmetric quantization, the zero-point is 128.

The PE computes a 1-D convolution primitive. It waits for the testbench to assert `PE_en`. Once `PE_en` is high, `PE_config` becomes valid, and the PE reads it to obtain the parameters of the current 1-D convolution. The testbench then sends and retrieves data through a handshake mechanism until the computation is complete.
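The zero-point subtraction and the 1-D convolution primitive described above can be illustrated with a small software model. This is only a sketch, not the RTL behavior: the function names are hypothetical, and it assumes stride 1, no padding, and a sliding-window (cross-correlation) form of convolution. Only the zero-point of 128 comes from the text.

```python
ZERO_POINT = 128  # zero-point for UINT8 activations, as stated in the text


def subtract_zero_point(ifmap_u8):
    """Map UINT8 activations (0..255) to signed values (-128..127)."""
    for value in ifmap_u8:
        if not 0 <= value <= 255:
            raise ValueError(f"not a UINT8 value: {value}")
    return [value - ZERO_POINT for value in ifmap_u8]


def conv1d_row(ifmap_u8, weights_s8):
    """Hypothetical model of one 1-D convolution (assumed stride 1, no padding).

    Python integers do not overflow, so the sums here stand in for a wide
    accumulator; the real accumulator width is not specified in the text.
    """
    signed = subtract_zero_point(ifmap_u8)
    taps = len(weights_s8)
    return [
        sum(signed[i + j] * weights_s8[j] for j in range(taps))
        for i in range(len(signed) - taps + 1)
    ]


if __name__ == "__main__":
    print(subtract_zero_point([0, 128, 255]))
    print(conv1d_row([128, 129, 130, 131], [1, 0, -1]))
```

A reference model like this can be handy for generating expected outputs to compare against the testbench, provided its assumptions are first aligned with the actual `PE_config` fields.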
The following is the definition of `PE_config`:

![config](./picture/image.png)

- `mode = 0` means `CONV`
- `mode = 1` means `FC Layer`

#### PE Array

![GINON](./picture/GINON.png)

#### PPU

#### GLB

#### Controller

#### DMA

#### Instruction

- PE test

```
make pe%
```

## HAL

![HAL](./picture/HAL_structrue.png)

In real embedded systems, software (the runtime) typically interacts with hardware through Memory-Mapped I/O (MMIO) or device drivers, which give it indirect access to registers and peripherals. In this lab, however, no actual hardware is available. Instead, we use Verilator to translate the Verilog RTL into a cycle-accurate C++ model (a Verilated model) for simulation.

## Compiler

![Py_Arc](./picture/CodeArc.png)

In this lab, we implement a lightweight C code generator in Python, which allows faster prototyping without recompiling TVM. This Python-based flow simplifies the integration of customized code generation logic while remaining compatible with the Relay model structure.

**`codegen.py`**: Generates the full C source code required for model inference. This includes emitting function declarations and layer-wise computations, as well as embedding model weights.

**`datagen.py`**: Transforms sample input datasets into a runtime-friendly binary format. A lightweight header is added to assist with input parsing during execution.

**`note.py`**: Serves as a configuration and pattern-matching module. It defines wildcard variable names for pattern recognition and maps fused composite functions to their corresponding C code templates.

This modular design improves readability and reusability and keeps concerns clearly separated, making the system easier to maintain and to extend to different target hardware or model structures.
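To make the role of `datagen.py` more concrete, the sketch below packs samples into a binary blob with a small header and reads them back. The header layout (sample count and bytes per sample as little-endian 32-bit integers) and the function names are assumptions chosen for illustration; the repository's actual format is not described here and may differ.

```python
import struct

# Hypothetical header: number of samples, bytes per sample (little-endian uint32).
HEADER_FMT = "<II"


def pack_dataset(samples):
    """Serialize equally sized lists of UINT8 values behind a small header."""
    if not samples:
        raise ValueError("at least one sample is required")
    size = len(samples[0])
    if any(len(sample) != size for sample in samples):
        raise ValueError("all samples must have the same length")
    header = struct.pack(HEADER_FMT, len(samples), size)
    return header + b"".join(bytes(sample) for sample in samples)


def unpack_dataset(blob):
    """Parse the header, then slice the payload back into samples."""
    count, size = struct.unpack_from(HEADER_FMT, blob, 0)
    offset = struct.calcsize(HEADER_FMT)
    return [
        list(blob[offset + i * size: offset + (i + 1) * size])
        for i in range(count)
    ]


if __name__ == "__main__":
    blob = pack_dataset([[1, 2, 3], [4, 5, 6]])
    print(len(blob), unpack_dataset(blob))
```

Putting the counts in a fixed-size header lets a C runtime read the sizes first and then allocate or index the payload without scanning it.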
AI compilers enable the deployment of models from high-level frameworks such as TensorFlow and PyTorch onto various hardware platforms, including CPUs, GPUs, and AI accelerators, by transforming high-level code into low-level executable code.

In this lab, the objective is to take the input Relay, apply fusion techniques to combine specific operators, and generate a fused Relay. Code generation is then performed on the fused Relay to produce the output files marked in green, located along the designated path on the right.

![IR](./picture/IR_afterfuse.svg)
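The fuse-then-generate flow can be sketched on a toy operator list. This does not use TVM or Relay APIs: the pattern names, operator names, C templates, and buffer naming below are all hypothetical, chosen only to mirror the Conv/BatchNorm/ReLU/MaxPool structure of the model.

```python
# Hypothetical fusion patterns, longest first so the greediest match wins.
PATTERNS = [
    ("conv2d_bn_relu_maxpool", ["conv2d", "batch_norm", "relu", "max_pool2d"]),
    ("conv2d_bn_relu", ["conv2d", "batch_norm", "relu"]),
    ("dense_relu", ["dense", "relu"]),
]

# Hypothetical C templates for fused composite functions.
TEMPLATES = {
    "conv2d_bn_relu_maxpool": "conv2d_bn_relu_maxpool({src}, {dst});",
    "conv2d_bn_relu": "conv2d_bn_relu({src}, {dst});",
    "dense_relu": "dense_relu({src}, {dst});",
}


def fuse(ops):
    """Replace each matching operator sequence with its composite name."""
    fused, i = [], 0
    while i < len(ops):
        for name, sequence in PATTERNS:
            if ops[i:i + len(sequence)] == sequence:
                fused.append(name)
                i += len(sequence)
                break
        else:
            # No pattern starts here, so keep the operator unfused.
            fused.append(ops[i])
            i += 1
    return fused


def emit_c(fused_ops):
    """Emit one C call per (fused) operator, chaining hypothetical buffers."""
    lines = []
    for index, op in enumerate(fused_ops):
        template = TEMPLATES.get(op, op + "({src}, {dst});")
        lines.append(template.format(src=f"buf{index}", dst=f"buf{index + 1}"))
    return lines


if __name__ == "__main__":
    graph = ["conv2d", "batch_norm", "relu", "max_pool2d", "dense", "relu", "dense"]
    print("\n".join(emit_c(fuse(graph))))
```

Listing longer patterns first is a simple way to keep a shorter pattern from claiming operators that belong to a larger fused block.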