ACL stream synchronize failed with error EZ9999

(y5) root@baixin-1:/home/y5/Yolov5# python3 train.py --weights yolov5s.pt --cfg models/yolov5s.yaml
[W OperatorEntry.cpp:121] Warning: Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_has_compatible_shallow_copy_type(Tensor self, Tensor from) -> (bool)
    registered at /usr1/workspace/FPTA_Daily_Plugin_open/CODE/build/aten/src/ATen/RegisterSchema.cpp:20
  dispatch key: Math
  previous kernel: registered at /usr1/workspace/FPTA_Daily_Plugin_open/CODE/build/aten/src/ATen/RegisterMath.cpp:5686
       new kernel: registered at /usr1/workspace/FPTA_Daily_Plugin_open/Plugin/torch_npu/csrc/aten/ops/HasCompatibleShallowCopyType.cpp:37 (function registerKernel)
1p training
Using NPU 0 to train
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

                 from  n    params  module                                  arguments
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     18816  models.common.C3                        [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  2    115712  models.common.C3                        [128, 128, 2]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  3    625152  models.common.C3                        [256, 256, 3]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]
 24      [17, 20, 23]  1     18879  models.yolo.Detect                      [2, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 270 layers, 7025023 parameters, 7025023 gradients, 16.0 GFLOPs

Transferred 342/349 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: NpuFusedSGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
combine_grad           : None
combine_ddp            : None
ddp_replica_count      : 4
check_combined_tensors : None
user_cast_preferred    : None
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : 128.0
combine_grad           : True
combine_ddp            : None
ddp_replica_count      : 4
check_combined_tensors : None
user_cast_preferred    : None
Use npu fused optimizer
WARNING: DP not recommended, use torch.distributed.run for best DDP Multi-GPU results.
See Multi-GPU Tutorial at https://github.com/ultralytics/yolov5/issues/475 to get started.
train: Scanning '/home/y5/datasets/coco128/labels/train2017_yolov5_v6.cache' images and labels... 100 found, 0 missing, 0 empty, 0 corrupt: 100%|█|
val: Scanning '/home/y5/datasets/coco128/labels/train2017_yolov5_v6.cache' images and labels... 100 found, 0 missing, 0 empty, 0 corrupt: 100%|█| 10
Plotting labels to runs/train/exp11/labels.jpg...

AutoAnchor: 4.66 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp11
Starting training for 50 epochs...

Epoch      step   gpu_mem       box       obj       cls    labels  img_size       FPS
EZ9999: Inner Error!
EZ9999  Kernel task happen error, retCode=0x28, [aicpu timeout].[FUNC:PreCheckTaskErr][FILE:task.cc][LINE:1064]
        Aicpu kernel execute failed, device_id=0, stream_id=3, task_id=3351.[FUNC:PrintAicpuErrorInfo][FILE:task.cc][LINE:773]
        Aicpu kernel execute failed, device_id=0, stream_id=3, task_id=3351, fault op_name=Index[FUNC:GetError][FILE:stream.cc][LINE:921]
        rtStreamSynchronize execute failed, reason=[aicpu timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
        synchronize stream failed, runtime result = 507017[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:162]
        Solution: Please contact support engineer.

DEVICE[0] PID[596958]:
EXCEPTION STREAM:
  Exception info:TGID=596958, model id=65535, stream id=3, stream phase=3
  Message info[0]:RTS_HWTS: Aicpu timeout, slot_id=17, stream_id=3, task_id=3350
    Other info[0]:time=2022-12-21-09:18:01.672.386, function=process_hwts_timeout_exception, line=3412, error code=0x28
Traceback (most recent call last):
  File "train.py", line 622, in <module>
    main(opt)
  File "train.py", line 520, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 342, in train
    loss, loss_items = compute_loss(pred, targets.to(device))  # loss scaled by batch_size
  File "/home/y5/Yolov5/utils/loss.py", line 138, in __call__
    tcls, tbox, indices, anchors, targets_mask, targets_sum_mask = self.build_targets(p, targets, self.model)  # targets
  File "/home/y5/Yolov5/utils/loss.py", line 267, in build_targets
    b = t.index_select(0, torch.tensor([0], device=targets.device)).long().view(-1)  # (3072 * 5)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/torch_npu/utils/device_guard.py", line 35, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/torch_npu/utils/torch_funcs.py", line 31, in _tensor
    return torch_npu.tensor(*args, **kwargs)
RuntimeError: ACL stream synchronize failed, error code:507017
THPModule_npu_shutdown success.
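
Note on reading the log above: the Python traceback ends in a `torch.tensor` call, while the runtime block names `Index` as the fault op and gives `aicpu timeout` as the reason, so the Python line shown may only be where the stream was next synchronized rather than the op that actually hung. For attaching a compact summary to a report like this one, the key fields can be pulled out of the EZ9999 block with a few regular expressions. This is a minimal sketch: `summarize_ez9999` is a hypothetical helper name, and the patterns assume only the message layout seen in this single log, not a documented CANN format.

```python
import re

# Excerpt of the EZ9999 block copied from the log above.
LOG = """\
EZ9999: Inner Error!
EZ9999  Kernel task happen error, retCode=0x28, [aicpu timeout].[FUNC:PreCheckTaskErr][FILE:task.cc][LINE:1064]
        Aicpu kernel execute failed, device_id=0, stream_id=3, task_id=3351, fault op_name=Index[FUNC:GetError][FILE:stream.cc][LINE:921]
        synchronize stream failed, runtime result = 507017[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:162]
"""

# Field name -> regex with one capture group.
# Assumption: the layout matches the messages in this one log.
PATTERNS = {
    "ret_code": r"retCode=(0x[0-9a-fA-F]+)",
    "reason": r"retCode=0x[0-9a-fA-F]+, \[([^\]]+)\]",
    "device_id": r"device_id=(\d+)",
    "stream_id": r"stream_id=(\d+)",
    "task_id": r"task_id=(\d+)",
    "fault_op": r"fault op_name=(\w+)",
    "runtime_result": r"runtime result = (\d+)",
}


def summarize_ez9999(log_text):
    """Return the first match of each field, or None if it is absent."""
    summary = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, log_text)
        summary[name] = match.group(1) if match else None
    return summary


if __name__ == "__main__":
    for key, value in summarize_ez9999(LOG).items():
        print(f"{key}: {value}")
```

Fields that do not appear in the text come back as `None`, so the same helper can be run over a full training log without failing on unrelated lines.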
