Performance analysis of coremark in mbc mode

Status: To do
Created: 2022-05-26 09:25

Hot-function analysis: the goal is to find out which functions are slower under maple than under gcc.

For how to build coremark with maple, see:

https://gitee.com/openarkcompiler/OpenArkCompiler/wikis/lmbc-testsuits

perf collection commands:

perf record -e cycles,instructions -g ./coremark.exe 0x0 0x0 0x66 200000 7 1 2000
perf report -n --no-children  -f -i perf.data
  • Hot-function distribution for gcc
Samples: 41K of event 'cycles', Event count (approx.): 26784499276, DSO: coremark.exe
  Overhead       Samples  Command       Symbol
+   20.38%          8459  coremark.exe  [.] core_list_find
+   17.18%          7136  coremark.exe  [.] core_bench_list
+   15.48%          6421  coremark.exe  [.] core_state_transition
+   13.28%          5517  coremark.exe  [.] crcu16
+    7.83%          3251  coremark.exe  [.] matrix_mul_matrix_bitextract
+    6.44%          2673  coremark.exe  [.] matrix_test
+    6.09%          2529  coremark.exe  [.] matrix_mul_matrix
+    4.83%          2006  coremark.exe  [.] core_bench_state
+    4.20%          1743  coremark.exe  [.] core_list_mergesort
+    1.37%           568  coremark.exe  [.] calc_func
+    1.16%           481  coremark.exe  [.] cmp_idx
+    0.65%           268  coremark.exe  [.] matrix_mul_vect
+    0.57%           236  coremark.exe  [.] cmp_complex
     0.31%           129  coremark.exe  [.] crcu32
     0.16%            68  coremark.exe  [.] crc16
     0.02%            10  coremark.exe  [.] core_bench_matrix
     0.01%             3  coremark.exe  [.] iterate
  • Hot-function distribution for maple
Samples: 77K of event 'cycles', Event count (approx.): 50332083930, DSO: maple_origin_coremark.exe
  Overhead       Samples  Command          Symbol
+   28.76%         22409  maple_origin_co  [.] crcu16
+   21.81%         16990  maple_origin_co  [.] core_state_transition
+   13.85%         10788  maple_origin_co  [.] core_bench_list
+   12.13%          9447  maple_origin_co  [.] core_list_find
+    8.10%          6313  maple_origin_co  [.] matrix_test
+    6.85%          5336  maple_origin_co  [.] matrix_mul_matrix_bitextract
+    2.98%          2318  maple_origin_co  [.] core_list_mergesort
+    2.56%          1994  maple_origin_co  [.] core_bench_state
+    1.04%           811  maple_origin_co  [.] calc_func
+    0.67%           520  maple_origin_co  [.] cmp_idx
     0.49%           381  maple_origin_co  [.] crcu32
     0.32%           249  maple_origin_co  [.] crc16
     0.29%           227  maple_origin_co  [.] cmp_complex
     0.05%            40  maple_origin_co  [.] core_bench_matrix
     0.01%            11  maple_origin_co  [.] iterate

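The following three functions collect clearly more samples under maple than under gcc (for core_bench_list this holds for the absolute sample count, 10788 vs. 7136, although its share of the total is lower):

  • crcu16
  • core_state_transition
  • core_bench_list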
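Because the maple run collects almost twice as many samples in total (77K vs. 41K), percentages alone can hide how much more time a function really takes. The sketch below parses a few rows copied from the two `perf report -n` listings and compares absolute sample counts per symbol. It assumes both runs used the same sampling frequency and the same coremark arguments, so that sample counts are comparable; the helper names are illustrative only.

```python
import re

# A few rows copied from the gcc and maple `perf report -n` listings.
GCC_REPORT = """\
+   20.38%          8459  coremark.exe  [.] core_list_find
+   17.18%          7136  coremark.exe  [.] core_bench_list
+   15.48%          6421  coremark.exe  [.] core_state_transition
+   13.28%          5517  coremark.exe  [.] crcu16
"""
MAPLE_REPORT = """\
+   28.76%         22409  maple_origin_co  [.] crcu16
+   21.81%         16990  maple_origin_co  [.] core_state_transition
+   13.85%         10788  maple_origin_co  [.] core_bench_list
+   12.13%          9447  maple_origin_co  [.] core_list_find
"""

# Columns: optional '+', overhead %, sample count, command, '[.]', symbol.
ROW = re.compile(r"^\+?\s*([\d.]+)%\s+(\d+)\s+\S+\s+\[\.\]\s+(\S+)\s*$")


def parse_samples(report):
    """Return {symbol: sample_count} for each row of a listing."""
    samples = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            samples[match.group(3)] = int(match.group(2))
    return samples


def sample_ratios(gcc_report, maple_report):
    """Return {symbol: maple_samples / gcc_samples} for symbols in both listings."""
    gcc = parse_samples(gcc_report)
    maple = parse_samples(maple_report)
    return {sym: maple[sym] / gcc[sym] for sym in maple if sym in gcc}


ratios = sample_ratios(GCC_REPORT, MAPLE_REPORT)
for symbol, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{symbol:24s} {ratio:.2f}x")
```

Sorting by ratio rather than by overhead puts crcu16 and core_state_transition at the top, which matches the functions singled out above.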

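Replacing crcu16 in the maple build with the gcc-compiled version

crcu16 is defined in core_util.c. Compile that file to core_util.s with the cross-compilation toolchain, then replace the crcu16 assembly in the maple-generated core_util.s with the gcc-generated one.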

# gcc
$MAPLE_ROOT/tools/gcc-linaro-7.5.0/bin/aarch64-linux-gnu-gcc  -O2 -Ilinux -Iposix -I. -DFLAGS_STR=\""-O2 -DPERFORMANCE_RUN=1  -lrt"\" -DITERATIONS=0 -DPERFORMANCE_RUN=1 -S core_util.c -o core_util.s -lrt

# maple
$MAPLE_ROOT/build/tools/common/maplec -O2 -Ilinux -Iposix -I. -DFLAGS_STR=\""-O2 -DPERFORMANCE_RUN=1  -lrt"\" -DITERATIONS=0 -DPERFORMANCE_RUN=1 core_util.c -c -s core_util.s -lrt

Then relink.

The final results are as follows:

             gcc        maple      maple_replaced(crcu16)  maple_replaced(crcu16+core_state_transition)
total times  10.262000  19.163932  14.582007               11.222703
percent      100%       53.54%     70.37%                  91.51%
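The percent row appears to be the gcc time divided by each build's time. The sketch below recomputes it from the posted times under that assumption; the recomputed values agree with the posted ones to within about 0.1 percentage point, so small differences from rounding or from a slightly different gcc reference run are to be expected.

```python
# Total times in seconds, copied from the results table.
times = {
    "gcc": 10.262000,
    "maple": 19.163932,
    "maple_replaced(crcu16)": 14.582007,
    "maple_replaced(crcu16+core_state_transition)": 11.222703,
}


def relative_performance(times, baseline="gcc"):
    """Performance of each build relative to the baseline, in percent.

    Assumes "percent" means baseline_time / build_time (higher is better).
    """
    base = times[baseline]
    return {name: 100.0 * base / t for name, t in times.items()}


perf_pct = relative_performance(times)
for name, pct in perf_pct.items():
    print(f"{name:46s} {pct:6.2f}%")
```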

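Combined with the conclusions of 文权's earlier analysis, crcu8 has a similar problem (its arguments differ, which may lead to a different execution path). For both functions the code maple generates is about twice the size of gcc's, and both are hot functions.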

core_state_transition

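  • The critical path is probably not the switch jump table: replacing the whole function improves the score by 20%+, but disabling the jump table yields almost no overall improvement.
  • Instruction alignment is not the cause either: gcc aligns to 4 bytes, maple to 32 bytes (the machine's cache line is 64 bytes).
  • branch-misses differ considerably.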

maple:

Samples: 57K of event 'branch-misses', Event count (approx.): 266571429, DSO: replaced_crcu16_maple_coremark.exe
  Children      Self  Command          Symbol
+  100.00%     0.00%  replaced_crcu16  [.] _start
+   33.99%    33.97%  replaced_crcu16  [.] core_state_transition
+   29.24%    29.23%  replaced_crcu16  [.] core_bench_list
+   10.10%    10.08%  replaced_crcu16  [.] core_list_find
+    6.33%     6.32%  replaced_crcu16  [.] matrix_test
+    5.87%     5.86%  replaced_crcu16  [.] core_list_mergesort
+    4.19%     4.19%  replaced_crcu16  [.] core_bench_state
+    3.29%     3.29%  replaced_crcu16  [.] matrix_mul_matrix_bitextract
+    2.57%     2.57%  replaced_crcu16  [.] calc_func
+    1.86%     1.86%  replaced_crcu16  [.] crcu16
+    1.79%     1.79%  replaced_crcu16  [.] cmp_idx
+    0.61%     0.61%  replaced_crcu16  [.] cmp_complex
     0.13%     0.13%  replaced_crcu16  [.] crc16
     0.01%     0.01%  replaced_crcu16  [.] crcu32
     0.01%     0.01%  replaced_crcu16  [.] core_bench_matrix
     0.00%     0.00%  replaced_crcu16  [.] main

gcc:

Samples: 41K of event 'branch-misses', Event count (approx.): 90178643, DSO: coremark.exe
  Children      Self  Command       Symbol
+   99.99%     0.00%  coremark.exe  [.] _start
+   99.99%     0.00%  coremark.exe  [.] main
+   99.98%     0.00%  coremark.exe  [.] iterate
+   89.04%     6.06%  coremark.exe  [.] core_bench_list
+   77.76%    17.44%  coremark.exe  [.] core_list_mergesort
+   59.98%     2.34%  coremark.exe  [.] cmp_complex
+   56.32%     4.75%  coremark.exe  [.] calc_func
+   28.92%     0.02%  coremark.exe  [.] core_bench_matrix
+   11.87%     5.52%  coremark.exe  [.] core_bench_state
+   10.92%    10.90%  coremark.exe  [.] crcu16
+   10.45%    10.45%  coremark.exe  [.] matrix_test
+    9.78%     9.73%  coremark.exe  [.] core_state_transition
+    9.27%     9.24%  coremark.exe  [.] matrix_mul_matrix_bitextract
+    9.22%     9.19%  coremark.exe  [.] core_list_find
+    8.32%     8.30%  coremark.exe  [.] matrix_mul_matrix
+    4.96%     4.96%  coremark.exe  [.] cmp_idx
+    0.86%     0.85%  coremark.exe  [.] matrix_mul_vect
     0.02%     0.02%  coremark.exe  [.] crcu32
     0.01%     0.01%  coremark.exe  [.] crc16

Comments (30)

wangshuai created the task

maple: [image]
mbc: [image]
lmbc: [image]

result:

             gcc        maple      mbc        lmbc
total times  10.605992  19.268120  19.090785  19.305054

             gcc        maple      mbc        lmbc
total times  10.581693  19.197274  18.812054  19.202711
size         29024      29712      31534      30274

fredchow added fredchow as a collaborator

Please use:

maple --run=me --option=-O2 --genlmbc

for producing the .lmbc file. The size should be smaller.

The inlining optimization can be performed after reading the mbc/lmbc input and before invoking mplcg:

maple foo.lmbc --run=mpl2mpl:mplcg --option=-O2:-O2

             gcc        maple      mbc        lmbc
total times  10.591438  18.864150  18.920330  19.220367
size         29024      29728      26504      25282

With PR1153 added:

             gcc        maple      mbc        lmbc
total times  10.591438  17.972678  17.797790  18.674899
size         29024      29728      26504      25282

PR1153 has not been merged; its CodeReady status has been withdrawn.

             gcc        maple      mbc        lmbc
total times  10.592007  18.236831  18.222710  18.735336
size         29024      29728      26455      25242

PR1153 has been abandoned because it had too many conflicts. It is superseded by other PRs.

             gcc        maple      mbc        lmbc
total times  10.574882  17.925919  17.945642  18.384274
size         29024      29728      26465      25252

             gcc        maple      mbc        lmbc
total times  10.593235  17.945715  17.976522  18.442803
size         29024      29728      26465      25252

             gcc        maple      mbc        lmbc
total times  10.585182  14.989517  15.002101  Segmentation fault
size         29024      29968      27482      26219

[image]

mplcg has regressed over the past week. Starting from the .lmbc file and using the mplcg from a week ago, the problem does not appear.

             gcc        maple      mbc        lmbc
total times  10.598195  14.996783  14.983015  14.689677
size         29024      29968      27465      26319

             gcc        maple      mbc        lmbc
total times  10.590252  15.024478  14.994562  14.645256
size         29024      29968      27309      26142

             gcc        maple      mbc        lmbc
total times  10.596213  15.010635  14.997347  Segmentation fault
size         29024      29968      27365      26496

[image]

             gcc        maple      mbc        lmbc
total times  10.597565  15.015716  15.009774  13.724592
size         29024      29968      27365      26496

Data with PR1220 added.
There is an error when running lmbc.
[image]

             gcc        maple      mbc        lmbc
total times  10.584101  15.036662  14.996647  13.740912
size         29024      29968      27357      26488

             gcc        maple      mbc        lmbc
total times  10.588940  14.660109  14.674740  14.564830
size         29024      29968      27326      26457

             gcc        maple      mbc        lmbc
total times  10.583092  14.656732  14.678527  14.451514
size         29024      29968      27326      26457

             gcc        maple      mbc        lmbc
total times  10.599833  14.650416  14.669978  14.527161
size         29024      29968      27326      26457

             gcc        maple      mbc        lmbc
total times  10.603636  14.860165  14.848948  14.648584
size         29024      29968      27326      26457

             gcc        maple      mbc        lmbc
total times  10.614331  14.877399  14.851216  14.683168
size         29024      29968      27326      26457

             gcc        maple      mbc        lmbc
total times  10.625604  14.872185  14.878506  14.668003
size         29024      29992      27326      26457

             gcc        maple      mbc        lmbc
total times  10.598562  13.904174  13.891701  14.460430
size         29024      29992

             gcc        maple      mbc        lmbc
total times  10.599657  11.487677  11.449768  11.328723
size         29024      29992

             gcc        maple      mbc        lmbc
total times  10.590581  11.478291  11.464892  11.346839
size         29024      29992

             gcc        maple      mbc        lmbc
total times  10.615672  11.475406  11.450076  11.330784
size         29024      29992

             gcc        maple      mbc        lmbc
total times  10.613873  11.477163  11.452273  11.340038
size         29024      29992
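To summarize how far the gap to gcc closed over the course of this thread, the sketch below computes each build's slowdown relative to gcc for the first and the last result posted in the comments. The numbers are copied from those two results; the function and variable names are illustrative.

```python
# Total times in seconds from the first and the last result in the comments.
first = {"gcc": 10.605992, "maple": 19.268120, "mbc": 19.090785, "lmbc": 19.305054}
last = {"gcc": 10.613873, "maple": 11.477163, "mbc": 11.452273, "lmbc": 11.340038}


def slowdown_vs_gcc(row):
    """Return {build: build_time / gcc_time} for every non-gcc build in the row."""
    return {name: t / row["gcc"] for name, t in row.items() if name != "gcc"}


before = slowdown_vs_gcc(first)
after = slowdown_vs_gcc(last)
for name in before:
    print(f"{name:6s} {before[name]:.2f}x -> {after[name]:.2f}x")
```

By this measure all three maple-based builds go from roughly 1.8x the gcc time to under 1.1x.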
