Ascend / pytorch
Problem after 30 training iterations when reproducing the GLIP_for_PyTorch code
DONE  #I9W1YH
Label: Training issue
lxz-dream opened this issue 2024-06-07 17:44
1. Problem (with error log context):
On an 8-card 910B machine, while reproducing GLIP_for_PyTorch from Ascend / ModelZoo-PyTorch (Gitee address: https://portrait.gitee.com/ascend/ModelZoo-PyTorch/tree/master/PyTorch/built-in/others/GLIP_for_PyTorch), training ran for iterations 0~29 after executing train_full_8p.sh and then terminated. The error reported on NPU 1 is:

AllReduce error, see details in Ascend logs.
EZ9999: Inner Error!
EZ9999 The error from device(chipId:1, dieId:0), serial number is 6, there is an aivec error exception, core id is 9, error code = 0x800000, dump info: pc start: 0x1240c13f6000, current: 0x1240c13f69c0, vec error info: 0xf114939d35, mte error info: 0xda03044e44, ifu error info: 0x6a9b11e14c980, ccu error info: 0x378da35e19dc30bd, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433400.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1138]
TraceBack (most recent call last):
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x3044e44, fixp_error1 info: 0xda fsmId:1, tslot:3, thread:0, ctxid:0, blk:29, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1150]
Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:task_info.cc][LINE:1612]
AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1483]
Aicore kernel execute failed, device_id=1, stream_id=2, report_stream_id=2, task_id=24781, flip_num=15, fault kernel_name=MaskedScatter_9091eb90132dd3cea3334d8ee06a93da_high_performance__kernel0, program id=123, hash=11664980099145024469.[FUNC:GetError][FILE:stream.cc][LINE:1483]
[AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1483]
rtStreamSynchronize execute failed, reason=[vector core exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
Traceback (most recent call last):
  File "/home/ma-user/work/GLIP_for_PyTorch/tools/train_net.py", line 275, in <module>
    main()
  File "/home/ma-user/work/GLIP_for_PyTorch/tools/train_net.py", line 265, in main
    model = train(cfg=cfg,
  File "/home/ma-user/work/GLIP_for_PyTorch/tools/train_net.py", line 144, in train
    do_train(
  File "/home/ma-user/work/GLIP_for_PyTorch/maskrcnn_benchmark/engine/trainer.py", line 211, in do_train
    losses.backward()
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: ACL stream synchronize failed, error code:507035
[W NPUStream.cpp:372] Warning: NPU warning, error code is 507035[Error]: [Error]: The vector core execution is abnormal. Rectify the fault based on the error information in the log, or you can ask us at follwing gitee link by issues: https://gitee.com/ascend/pytorch/issue
EH9999: Inner Error!
rtDeviceSynchronize execute failed, reason=[vector core exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
EH9999 wait for compute device to finish failed, runtime result = 507035.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last): (function npuSynchronizeDevice)

2. Software versions:
-- CANN version (e.g., CANN 3.0.x, 5.x.x): 7.0.RC1
-- TensorFlow/PyTorch/MindSpore version: PyTorch 2.1.0
-- Python version (e.g., Python 3.7.5): Python 3.9.18
-- MindStudio version (e.g., MindStudio 2.0.0 (beta3)):
-- Operating system version (e.g., Ubuntu 18.04): EulerOS release 2.0 (SP8)

3. Test steps:
On an Ascend 910B (64 GB x 8) machine, reproduce https://portrait.gitee.com/ascend/ModelZoo-PyTorch/tree/master/PyTorch/built-in/others/GLIP_for_PyTorch and run sh test/train_full_8p.sh.

4. Log information:
/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/modeling_utils.py:1052: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/nn/functional.py:4140: UserWarning: nn.functional.upsample_bilinear is deprecated. Use nn.functional.interpolate instead.
  warnings.warn("nn.functional.upsample_bilinear is deprecated. Use nn.functional.interpolate instead.")
/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/autograd/__init__.py:251: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance. grad.sizes() = [1, 256, 1, 1], strides() = [256, 1, 1, 1] bucket_view.sizes() = [256], strides() = [1] (Triggered internally at /usr1/02/workspace/j_vqN6BFvg/pytorch/torch_npu/csrc/distributed/reducer.cpp:314.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
AllReduce error, see details in Ascend logs.
[The log then contains the same EZ9999 error, Python traceback, and RuntimeError quoted in section 1, followed by the "[W NPUStream.cpp:372] Warning: NPU warning, error code is 507035 ... (function npuSynchronizeDevice)" block repeated several more times.]
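A minimal, hypothetical isolation sketch (not part of the original report): since the faulting kernel is a MaskedScatter AI Vector kernel and the failure only surfaces at stream synchronization inside losses.backward(), one way to narrow it down is to run masked_scatter_ by itself on the same NPU with kernel launches forced synchronous. The tensor shapes below are placeholders (not taken from GLIP), and the ASCEND_LAUNCH_BLOCKING variable and torch.npu.* helpers are assumed to be available in this torch_npu 2.1.0 environment.

```python
# Hypothetical standalone check; shapes are placeholders, not GLIP's real shapes.
import os
os.environ.setdefault("ASCEND_LAUNCH_BLOCKING", "1")  # assumed env var: launch kernels synchronously

import torch
import torch_npu  # noqa: F401  -- registers the "npu" device type with PyTorch

device = "npu:1"                 # the device that reported the aivec error in the log
torch.npu.set_device(device)

dst = torch.zeros(4, 256, 100, 152, device=device)        # placeholder destination tensor
mask = torch.rand_like(dst) > 0.5                          # boolean mask
src = torch.randn(int(mask.sum().item()), device=device)   # at least as many elements as True entries

dst.masked_scatter_(mask, src)   # the op behind the faulting MaskedScatter_*_kernel0
torch.npu.synchronize()          # force any device-side exception to surface here
print("masked_scatter_ completed without a device exception")
```

If this standalone call runs cleanly, the out-of-range DDR access is more likely tied to the specific shapes or dtypes GLIP feeds into the op during training, and re-running train_full_8p.sh with synchronous launches should point at the exact call site.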
Comments (1)
Status: DONE
Assignees: Not set
Labels: Not set
Projects: Unprojected
Milestones: No related milestones
Pull Requests: None yet
Branches: No related branch