This article describes how to use MindSpore Profiler for performance debugging on Ascend AI processors.
For example, if the profile data directory generated during training is /home/user/code/data/, the summary-base-dir should be set to /home/user/code. After MindInsight is started, access the visualization page via the IP address and port number. The default access address is http://127.0.0.1:8080.
To enable performance profiling of neural networks, the MindSpore Profiler APIs should be added to the script. First, the MindSpore Profiler object needs to be created after set_context is called and before the network and HCCL are initialized. Then, at the end of the training, Profiler.analyse() should be called to finish profiling and generate the performance analysis results.
The parameters of Profiler are described in the API documentation: https://www.mindspore.cn/docs/api/en/r1.5/api_python/mindspore.profiler.html
The sample code is as follows:
import numpy as np
from mindspore import nn, context
from mindspore import Model
import mindspore.dataset as ds
from mindspore.profiler import Profiler


class Net(nn.Cell):
    def __init__(self):
        super(Net, self).__init__()
        self.fc = nn.Dense(2, 2)

    def construct(self, x):
        return self.fc(x)


def generator():
    for i in range(2):
        yield (np.ones([2, 2]).astype(np.float32), np.ones([2]).astype(np.int32))


def train(net):
    optimizer = nn.Momentum(net.trainable_params(), 1, 0.9)
    loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True)
    data = ds.GeneratorDataset(generator, ["data", "label"])
    model = Model(net, loss, optimizer)
    model.train(1, data)


if __name__ == '__main__':
    context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")

    # Init Profiler
    # Note that the Profiler should be initialized after context.set_context and before model.train
    # If you are running in parallel mode on Ascend, the Profiler should be initialized before HCCL
    # is initialized.
    profiler = Profiler(output_path='./profiler_data')

    # Train Model
    net = Net()
    train(net)

    # Profiler end
    profiler.analyse()
For the MindInsight launch command, refer to MindInsight Commands.
Users can access the training performance page by selecting a specific training from the training list and clicking the performance profiling link.
Figure 1: Overall Performance
Figure 1 displays the overall performance of the training, including the overall data of Step Trace, Operator Performance, Data Preparation Performance and Timeline. The data shown in these components include:
Users can click the detail link to see the details of each component. In addition, MindInsight Profiler will try to analyse the performance data, and the assistant on the left will show performance tuning suggestions for this training.
The Step Trace Component is used to show the general performance of the stages in the training. Step Trace will divide the training into several stages:
Step Gap (The time between the end of one step and the computation of next step), Forward/Backward Propagation, All Reduce and Parameter Update. It will show the execution time for each stage, and help to find the bottleneck stage quickly.
Step Trace does not support heterogeneous training currently.
Figure 2: Step Trace Analysis
Figure 2 displays the Step Trace page. The Step Trace detail will show the start/finish time for each stage. By default, it shows the average time for all the steps. Users can also choose a specific step to see its step trace statistics.
The graphs at the bottom of the page show how the execution time of Step Interval, Forward/Backward Propagation, and Step Tail (the time between the end of Backward Propagation and the end of Parameter Update) changes across steps, which helps decide whether the performance of some stages can be optimized. Here are more details:
In order to divide the stages, the Step Trace Component needs to identify the forward propagation start operator and the backward propagation end operator. MindSpore will figure out the two operators automatically to reduce the profiler configuration work. The first operator after get_next will be selected as the forward start operator, and the operator before the last all reduce will be selected as the backward end operator.
However, the Profiler does not guarantee that the automatically selected operators will meet the user's expectation in all cases. Users can set the two operators manually as follows:
Set PROFILING_FP_START to configure the forward start operator, for example: export PROFILING_FP_START=fp32_vars/conv2d/BatchNorm
Set PROFILING_BP_END to configure the backward end operator, for example: export PROFILING_BP_END=loss_scale/gradients/AddN_70
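When the training is launched from a Python script rather than a shell, the same variables can be set in-process before the Profiler is created. The operator names below are taken from the examples above; they are placeholders to be replaced with operators from your own network's graph.

```python
import os

# Hypothetical operator full names from the examples above -- replace
# them with the boundary operators of your own network.
os.environ["PROFILING_FP_START"] = "fp32_vars/conv2d/BatchNorm"
os.environ["PROFILING_BP_END"] = "loss_scale/gradients/AddN_70"

# The variables must be set before the Profiler object is created,
# so that Step Trace picks up the manually chosen boundary operators.
```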
The operator performance analysis component is used to display the execution time of the operators (AICORE/AICPU/HOSTCPU) during the MindSpore run.
Figure 3: Statistics for Operator Types
Figure 3 displays the statistics for the operator types, including:
Figure 4: Statistics for Operators
Figure 4 displays the statistics table for the operators, including:
Statistics for the information related to calculation quantity of AICORE operator, including operator level and model level information.
The Calculation Quantity Analysis module shows the actual calculation quantity data, including calculation quantity data at operator granularity, scope-level granularity, and model granularity. The actual calculation quantity refers to the amount of calculation running on the device, which differs from the theoretical calculation quantity: for example, the matrix computing unit on the Ascend 910 device processes matrices in blocks of 16x16, so at runtime the original matrix will be padded to multiples of 16x16. Only the calculation quantity on AICORE devices is supported currently. The information about calculation quantity has three indicators:
Figure 5: Calculation Quantity Analysis
The red box in Figure 5 includes calculation quantity data on operator granularity, scope level granularity, and model granularity. Click the "details" to see the scope level calculation quantity data.
Figure 6: Scope Level FLOPs
Figure 6 is a Sankey diagram that presents the data in the structure of a tree; hovering the cursor over a scope shows its specific FLOPs value.
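The gap between theoretical and actual calculation quantity caused by 16x16 padding can be illustrated with a small arithmetic sketch. This is a simplified illustrative model, not the Profiler's exact formula.

```python
import math

def padded_matmul_flops(m, k, n, block=16):
    """Estimate device-side FLOPs of an (m x k) @ (k x n) matmul when every
    dimension is padded up to a multiple of `block`, as on the Ascend 910
    cube unit. Illustrative model only."""
    pm = math.ceil(m / block) * block
    pk = math.ceil(k / block) * block
    pn = math.ceil(n / block) * block
    return 2 * pm * pk * pn  # 2 * M * K * N multiply-accumulate operations

theoretical = 2 * 10 * 10 * 10            # 2,000 FLOPs on paper
actual = padded_matmul_flops(10, 10, 10)  # padded to 16x16x16: 8,192 FLOPs
```

For small matrices the actual calculation quantity can thus be several times the theoretical one, which is why the module reports device-side numbers.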
The Data preparation performance analysis component is used to analyse the execution of data input pipeline for the training. The data input pipeline can be divided into three stages:
the data process pipeline, data transfer from host to device and data fetch on device. The component will analyse the performance of each stage in detail and display the results.
Figure 7: Data Preparation Performance Analysis
Figure 7 displays the page of data preparation performance analysis component. It consists of two tabs: the step gap and the data process.
The step gap page is used to analyse whether there is performance bottleneck in the three stages. We can get our conclusion from the data queue graphs:
Figure 8: Data Process Pipeline Analysis
Figure 8 displays the page of data process pipeline analysis. The data queues are used to exchange data between the data processing operators. The data size of a queue reflects the data consumption speed of the operators and can be used to infer the bottleneck operator. The queue usage percentage is the average data size in the queue divided by the maximum queue size; the higher the usage percentage, the more data is accumulated in the queue. The graph at the bottom of the page shows the data processing pipeline operators together with the data queues. The user can click one queue to see how its data size changes over time, as well as the operators connected to the queue. The data process pipeline can be analysed as follows:
To optimize the performance of data processing operators, there are some suggestions:
If the dataset loading operator is the bottleneck, try to increase its num_parallel_workers.
If a GeneratorDataset operator is the bottleneck, try to increase its num_parallel_workers, and consider replacing it with MindRecordDataset.
If a map operator is the bottleneck, try to increase its num_parallel_workers. If it is a Python operator, try to optimize the training script.
If a batch operator is the bottleneck, try to adjust the prefetch_size.
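Assuming a MindRecord-based pipeline, the suggestions above might be applied as in the following sketch. The file path and all parameter values are hypothetical placeholders to be tuned against the queue usage percentages shown on this page.

```python
import mindspore.dataset as ds

# Illustrative values only -- tune against the observed queue usage.
ds.config.set_prefetch_size(16)  # raise prefetch_size if the batch stage lags

dataset = ds.MindDataset("/path/to/train.mindrecord",   # hypothetical file
                         num_parallel_workers=8)        # parallelize a slow loading stage
dataset = dataset.map(operations=[lambda x: x],         # stand-in for a real transform
                      input_columns=["data"],
                      num_parallel_workers=8)           # parallelize a slow map stage
dataset = dataset.batch(32)
```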
The Timeline component can display:
In order to display the Scope Name of operators, the number of Scope Name levels to show can be selected before downloading the corresponding timeline file. For example, if the full name of an operator is Default/network/lenet5/Conv2D-op11, its first Scope Name is Default and its second Scope Name is network. If two Scope Name levels are selected for each operator, both Default and network will be displayed.
Users can get the most detailed information from the Timeline:
Users can click the download button on the overall performance page to view Timeline details. The Timeline data file (JSON format) will be stored on the local machine and can be displayed by visualization tools. We suggest using chrome://tracing or Perfetto to visualize the Timeline.
Figure 9: Timeline Analysis
The Timeline consists of the following parts:
Device and Stream List: It will show the stream list on each device. Each stream consists of a series of tasks. One rectangle stands for one task, and the area stands for the execution time of the task.
Each color block represents the starting time and length of operator execution. The detailed explanation of timeline is as follows:
The Operator Information: When we click one task, the corresponding operator of this task will be shown at the bottom.
W/A/S/D can be applied to zoom in and out of the Timeline graph.
Resource utilization includes CPU usage analysis and memory usage analysis.
Figure 10: Overview of resource utilization
Overview of resource utilization: includes CPU utilization analysis and memory usage analysis. You can view the details by clicking the View Details button in the upper right corner.
CPU utilization analysis is mainly used to assist performance debugging. After the performance bottleneck is determined according to the queue size, performance can be tuned according to the CPU utilization (if the user utilization is too low, increase the number of threads; if the system utilization is too high, decrease the number of threads). CPU utilization covers the whole machine, the process, and individual Data pipeline operators.
Figure 11: CPU utilization of the whole machine
CPU utilization of the whole machine: shows the overall CPU usage of the device during training, including user utilization, system utilization, idle utilization, IO utilization, the current number of active processes, and the number of context switches. If the user utilization is low, try increasing the number of operator threads to raise CPU utilization; if the system utilization is high and the numbers of context switches and of CPUs waiting for processing are large, the number of threads should be reduced accordingly.
Figure 12: Process utilization
Process utilization: Show the CPU usage of a single process. The combination of whole machine utilization and process utilization can determine whether other processes affect the training process.
Figure 13: Operator utilization
Operator utilization: shows the CPU utilization of a single Data pipeline operator. The number of threads of the corresponding operator can be adjusted according to the actual situation. If an operator uses few threads but takes up a lot of CPU, consider whether the code needs to be optimized.
Common scenarios of CPU utilization:
The default sampling interval is 1000 ms. The current interval can be queried with mindspore.dataset.config.get_monitor_sampling_interval() and changed with mindspore.dataset.config.set_monitor_sampling_interval().
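For example, the sampling interval could be halved as in the following sketch (the 500 ms value is illustrative):

```python
import mindspore.dataset as ds

# Change the Data pipeline monitor sampling interval from the default
# 1000 ms to 500 ms; get_monitor_sampling_interval() reads it back.
ds.config.set_monitor_sampling_interval(500)
current = ds.config.get_monitor_sampling_interval()
```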
This page is used to show the memory usage of the neural network model on the device, which is an ideal prediction based on the theoretical calculation results. The content of the page includes:
the Memory Allocation Overview, the Memory Usage line chart, and the Operator Memory Allocation table.
Memory Analysis does not support heterogeneous training currently.
Figure 14: Memory Analysis
Users can obtain a summary of memory usage via the Memory Allocation Overview. In addition, they can obtain more detailed information from Memory Usage, including:
Line chart: the chart marks the start of Forward Propagation and the end of Backward Propagation of the model.
Operator Memory Allocation: the table shows the memory decomposition at the corresponding execution position, i.e., the output tensors of which operators occupy the memory allocated at the current execution position. The module provides users with abundant information, including tensor name, tensor size, tensor type, data type, shape, format, and the active lifetime of the tensor memory.
Figure 15: Memory Statistics
To limit the amount of data generated by the Profiler, MindInsight suggests that for large neural networks the number of profiled steps should be less than 10.
The number of steps can be controlled by limiting the size of the training dataset. For example, the num_samples parameter in mindspore.dataset.MindDataset can control the size of the dataset. For details, please refer to: https://www.mindspore.cn/docs/api/en/r1.5/api_python/dataset/mindspore.dataset.MindDataset.html
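For example, assuming a batch size of 32, capping the dataset at 320 samples keeps one epoch to roughly 10 profiled steps. The file name below is a hypothetical placeholder.

```python
import mindspore.dataset as ds

# Hypothetical MindRecord file; num_samples caps the dataset at 320
# samples, so one epoch with batch size 32 runs only 10 steps.
data = ds.MindDataset("/path/to/train.mindrecord", num_samples=320)
data = data.batch(32)
```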
Parsing the Timeline data is time-consuming, and usually the data of a few steps is enough for analysis. In order to speed up data parsing and UI display, the Profiler will show at most 20MB of data (containing 10+ steps of information for large networks).