# AIALoopMode

**Repository Path**: xfluidsolid/aialoop-mode

## Basic Information

- **Project Name**: AIALoopMode
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-04-09
- **Last Updated**: 2021-12-19

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# OptimizationPlatform

The platform is used to optimize host code on the GPU with CUDA C. The code runs on a single GPU.

## 1. How to compile and run

### 1.1 Compile the code

```
[username]$ make
```

Running `make` builds the executable `OptimizeGPU`.

### 1.2 Run the code

```
[username]$ cd Example
[username]$ yhrun -n 1 ../OptimizeGPU
```

### 1.3 Precision

Both single (float) and double precision are supported. The precision is chosen in the makefile. For example, the following setting selects single precision:

> PRECISION = SINGLEPRECISION
> ##PRECISION = DOUBLEPRECISION

Similarly, double precision is chosen by:

> ##PRECISION = SINGLEPRECISION
> PRECISION = DOUBLEPRECISION

### 1.4 Simulation results

The test case is in `Example`, which contains the geometry topology of an unstructured mesh. All GPU functions and host functions in the `runXXXX` framework are executed, and the results are validated like:

```
setNcountQNodeTNodeZeroDevice
Validate CellLoopFinalNoAtomic
The maximum error is 3.996999740600585937500000000000e+00 on 2, 15039 term of qNode
The maximum error is 2.609254121780395507812500000000e+00 on 0, 2989 term of tNode
setNcountQNodeTNodeZeroDevice
Validate CellLoopFinal
The maximum error is 1.430511474609375000000000000000e-06 on 2, 115185 term of qNode
The maximum error is 9.536743164062500000000000000000e-07 on 0, 44647 term of tNode
...
...
```

The maximum error is the difference between the host and GPU functions. After the simulation, `performanceCPU` and `performanceGPU` hold the final results.
The execution times are recorded in `performanceCPU` and `performanceGPU`.

### 1.5 Explanation of performanceGPU and performanceCPU

```
No.  Name               Parent                                   Elapsed Time  frequency
1    GPUNodeLoopFinal1  CallGPUNodeLoopNCountQNodeTNodeCalFinal  0.038987      100
2    GPUNodeLoopFinal2  CallGPUNodeLoopNCountQNodeTNodeCalFinal  0.044714      1
```

- `No.` is the label of a timer.
- `Name` is the name of a timer.
- `Parent` is the name of the function containing the timed computing spot.
- `Elapsed Time` is the time spent in the computing spot.
- `frequency` is the number of times the computing spot is executed.

## 2. Code structure

```
Main
RunOptimizations
HostFunctions
GPUKernels
Validation
GlobalVariables
DeviceControl
Timer
```

- Main: chooses the optimization module to run; creates and outputs timers
- RunOptimizations: the optimization framework
- HostFunctions: functions to be ported to the GPU; the original face loop is often changed into a volume loop or node loop
- GPUKernels: GPU kernels corresponding to the host functions, plus some optimized variants
- Validation: test functions comparing results between GPUKernels and HostFunctions
- GlobalVariables: host variables and device variables
- DeviceControl: controls the GPU
- Timer: records the execution time of CPU and GPU code

## 3. Programmer guide

### 3.1 Main/OptimizationFrame.cu: main function

`runXXXX` is the framework for host and GPU functions.

```c++
//runBoundaryQlQrFix();           //Optimize GPUBoundaryQlQrFix
runQTNcountCompInterior();        //Optimize GPUInteriorFaceNCountQNodeTNodeCal
//runCompGradientGGNodeFaceCal(); //Optimize CompGradientGGNodeInteriorFaceCal
//runReconstructFaceValue();      //Optimize GPUReconstructFaceValue
```

Four frameworks have been established. In one simulation, it is best to run only one framework; running several at once produces too many results.

### 3.2 RunOptimizations/runXXXXX.cu: framework collecting GPU and host functions

Taking `runQTNcountCompInterior.cu` for instance: the entrance function is `runQTNcountCompInterior()`.
```c++
void runQTNcountCompInterior(){
    preProcessQTNcountCompInterior();
    int loopID;
    for (loopID = 0; loopID < LOOPNUM; loopID++){
        ... ...; //host and gpu functions, or validation functions
    }
}
```

The function first runs `preProcessQTNcountCompInterior()`. Then the GPU, host, and validation functions are run inside one loop, which is used to average the execution time. In the loop, the GPU and host functions are called as below:

```c++
for (loopID = 0; loopID < LOOPNUM; loopID++){
    CallHostCellLoopNCountQNodeTNodeCalFinal(loopID);
    CallGPUNodeLoopNCountQNodeTNodeCalFinal(loopID);
    CallHostNodeLoopNCountQNodeTNodeCalFinal(loopID);
    CallGPUFaceNCountQNodeTNodeCal(loopID);
    CallHostFaceNCountQNodeTNodeCal(loopID);
    CallGPUCellLoopNCountQNodeTNodeCalFinal(loopID);
}
```

- `CallHostXXX` is a host function in `HostFunctions/HostXXXX.cu`.
- `CallGPUXXX` is a GPU function in `GPUKernels/GPUXXXX.cu`.