
Ascend/pytorch

torch_npu errors out under multiple concurrent requests: 219 NPU error, error code is 500002

DONE
Requirement
Created on 2024-04-09 21:04

Environment: single 910B machine with 16 NPUs
1. Problem symptom (with error log context):
Running a 3-concurrency test against the APE large model fails with the error below.

(py39) root@gzxj-sys-rpm46kwprrx:~/APE# ./run_test.sh
/root/miniconda3/envs/py39/lib/python3.9/site-packages/torchvision/transforms/functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be **removed in 0.17**. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
  warnings.warn(
/root/miniconda3/envs/py39/lib/python3.9/site-packages/torchvision/transforms/functional_pil.py:5: UserWarning: The torchvision.transforms.functional_pil module is deprecated in 0.15 and will be **removed in 0.17**. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
  warnings.warn(
[04/10 06:46:19 detectron2]: Arguments: Namespace(config_file='configs/LVISCOCOCOCOSTUFF_O365_OID_VGR_SA1B_REFCOCO_GQA_PhraseCut_Flickr30k/ape_deta/ape_deta_vitl_eva02_clip_vlf_lsj1024_cp_16x4_1080k.py', webcam=False, video_input=None, input=None, output=None, confidence_threshold=0.1, opts=['train.init_checkpoint=/root/APE/model_final.pth', 'model.model_language.cache_dir=', 'model.model_vision.select_box_nums_for_evaluation=500', 'model.model_vision.text_feature_bank_reset=True', 'model.model_vision.backbone.net.xattn=False'], text_prompt=None, with_box=True, with_mask=False, with_sseg=False)
Please 'pip install xformers'
Please 'pip install xformers'
Please 'pip install apex'
Please 'pip install xformers'
=========== args.opts ============ ['train.init_checkpoint=/root/APE/model_final.pth', 'model.model_language.cache_dir=', 'model.model_vision.select_box_nums_for_evaluation=500', 'model.model_vision.text_feature_bank_reset=True', 'model.model_vision.backbone.net.xattn=False']
ANTLR runtime and generated code versions disagree: 4.9.3!=4.8
ANTLR runtime and generated code versions disagree: 4.9.3!=4.8
======== shape of rope freq torch.Size([1024, 64]) ========
======== shape of rope freq torch.Size([4096, 64]) ========
[04/10 06:46:24 ape.data.detection_utils]: Using builtin metadata 'image_count' for dataset '['lvis_v1_train+coco_panoptic_separated']'
[04/10 06:46:24 ape.modeling.ape_deta.deformable_criterion]: fed_loss_cls_weights: torch.Size([1203]) num_classes: 1256
[04/10 06:46:24 ape.modeling.ape_deta.deformable_criterion]: pad fed_loss_cls_weights with type cat and value 0
[04/10 06:46:24 ape.modeling.ape_deta.deformable_criterion]: pad fed_loss_classes with tensor([1203, 1204, 1205, 1206, 1207, 1208, 1209, 1210, 1211, 1212, 1213, 1214,
        1215, 1216, 1217, 1218, 1219, 1220, 1221, 1222, 1223, 1224, 1225, 1226,
        1227, 1228, 1229, 1230, 1231, 1232, 1233, 1234, 1235, 1236, 1237, 1238,
        1239, 1240, 1241, 1242, 1243, 1244, 1245, 1246, 1247, 1248, 1249, 1250,
        1251, 1252, 1253, 1254, 1255])
[04/10 06:46:24 ape.modeling.ape_deta.deformable_criterion]: fed_loss_cls_weights: tensor([ 1.0000,  1.0000,  3.1623,  7.3485, 43.8520, 25.0998,  5.5678,  8.3066,
         2.6458,  3.3166,  1.0000,  5.4772,  7.0711,  6.7082,  5.2915, 10.6771,
        13.8924,  4.5826,  9.5394,  5.5678, 38.3275, 43.8634,  9.3274,  8.7750,
         3.3166,  6.8557,  4.5826,  6.8557,  8.3666, 42.8719,  4.3589, 23.0434,
         3.3166, 46.6798, 10.6301,  5.0990,  2.2361,  7.4833,  8.5440,  5.6569,
        11.3137, 24.9600,  3.4641,  7.2111,  3.3166, 41.0731,  9.0000,  0.0000,
         0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
         0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
         0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
         0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
         0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
         0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
         0.0000,  0.0000,  0.0000,  0.0000])
[04/10 06:46:24 ape.modeling.ape_deta.deformable_criterion]: fed_loss_cls_weights: torch.Size([1256]) num_classes: 1256
[04/10 06:46:24 ape.data.detection_utils]: Using builtin metadata 'image_count' for dataset '['openimages_v6_train_bbox_nogroup']'
[04/10 06:46:24 ape.modeling.ape_deta.deformable_criterion]: fed_loss_cls_weights: torch.Size([601]) num_classes: 601
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_id: 0
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_name: lvis_v1_train+coco_panoptic_separated
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_entity: thing+stuff
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_id: 1
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_name: objects365_train_fixname
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_entity: thing
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_id: 2
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_name: openimages_v6_train_bbox_nogroup
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_entity: thing
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_id: 3
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_name: visualgenome_77962_box_and_region
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_entity: thing
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_id: 4
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_name: sa1b_6m
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_entity: thing
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_id: 5
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_name: refcoco-mixed_group-by-image
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_entity: thing
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_id: 6
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_name: gqa_region_train
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_entity: thing
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_id: 7
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_name: phrasecut_train
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_entity: thing
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_id: 8
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_name: flickr30k_separateGT_train
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_entity: thing
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_id: 9
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_name: refcoco-mixed
[04/10 06:46:24 ape.modeling.ape_deta.deformable_detr]: dataset_entity: thing
[04/10 06:47:13 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from /root/APE/model_final.pth ...
[04/10 06:47:13 fvcore.common.checkpoint]: [Checkpointer] Loading from /root/APE/model_final.pth ...
Namespace(config_file='configs/LVISCOCOCOCOSTUFF_O365_OID_VGR_SA1B_REFCOCO_GQA_PhraseCut_Flickr30k/ape_deta/ape_deta_vitl_eva02_clip_vlf_lsj1024_cp_16x4_1080k.py', webcam=False, video_input=None, input=None, output=None, confidence_threshold=0.1, opts=['train.init_checkpoint=/root/APE/model_final.pth', 'model.model_language.cache_dir=', 'model.model_vision.select_box_nums_for_evaluation=500', 'model.model_vision.text_feature_bank_reset=True', 'model.model_vision.backbone.net.xattn=False'], text_prompt=None, with_box=True, with_mask=False, with_sseg=False)
INFO:     Started server process [75357]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8198 (Press CTRL+C to quit)
/root/APE/ape/modeling/text/clip_wrapper_eva02.py:117: UserWarning: AutoNonVariableTypeMode is deprecated and will be removed in 1.10 release. For kernel implementations please use AutoDispatchBelowADInplaceOrView instead, If you are looking for a user facing API to enable running your inference-only workload, please use c10::InferenceMode. Using AutoDispatchBelowADInplaceOrView in user code is under risk of producing silent wrong result in some edge cases. See Note [AutoDispatchBelowAutograd] for more details. (Triggered internally at /opt/_internal/cpython-3.9.0/lib/python3.9/site-packages/torch/include/ATen/core/LegacyTypeDispatch.h:74.)
  attention_mask[i, : end_token_idx[i] + 1] = 1
/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3526.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
[WARNING] nms proposals (0) < 900, running naive topk
[WARNING] nms proposals (0) < 900, running naive topk
INFO:     10.92.54.160:60802 - "POST /infer HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 407, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/fastapi/routing.py", line 193, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/concurrency.py", line 42, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/anyio/to_thread.py", line 28, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread
    return await future
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 754, in run
    result = context.run(func, *args)
  File "/root/APE/demo/api.py", line 144, in interface
    predictions, visualized_output, visualized_outputs, metadata = demo.run_on_image(
  File "/root/APE/demo/predictor_lazy.py", line 212, in run_on_image
    predictions = self.predictor(image, text_prompt, mask_prompt)
  File "/root/APE/ape/engine/defaults.py", line 99, in __call__
    predictions = self.model([inputs])[0]
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/APE/ape/modeling/ape_deta/ape_deta.py", line 39, in forward
    losses = self.model_vision(batched_inputs, do_postprocess=do_postprocess)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/APE/ape/modeling/ape_deta/deformable_detr_segm_vl.py", line 428, in forward
    ) = self.transformer(
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/APE/ape/modeling/ape_deta/deformable_transformer_vl.py", line 605, in forward
    keep_inds_topk = keep_inds[keep_inds_mask]
RuntimeError: InnerRun:/usr1/02/workspace/j_ywhtRpPk/pytorch/torch_npu/csrc/framework/OpParamMaker.cpp:219 NPU error, error code is 500002
[Error]: A GE error occurs in the system.
        Rectify the fault based on the error information in the log, or you can ask us at follwing gitee link by issues: https://gitee.com/ascend/pytorch/issue
EH9999: Inner Error!
EH9999  [Exec][Op]Execute op failed. op type = NonZero, ge result = 4294967295[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
        TraceBack (most recent call last):

[WARNING] nms proposals (0) < 900, running naive topk
INFO:     10.92.54.160:60898 - "POST /infer HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/uvicorn/protocols/http/h11_impl.py", line 407, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/fastapi/routing.py", line 193, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/starlette/concurrency.py", line 42, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/anyio/to_thread.py", line 28, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread
    return await future
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 754, in run
    result = context.run(func, *args)
  File "/root/APE/demo/api.py", line 144, in interface
    predictions, visualized_output, visualized_outputs, metadata = demo.run_on_image(
  File "/root/APE/demo/predictor_lazy.py", line 212, in run_on_image
    predictions = self.predictor(image, text_prompt, mask_prompt)
  File "/root/APE/ape/engine/defaults.py", line 99, in __call__
    predictions = self.model([inputs])[0]
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/APE/ape/modeling/ape_deta/ape_deta.py", line 39, in forward
    losses = self.model_vision(batched_inputs, do_postprocess=do_postprocess)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/APE/ape/modeling/ape_deta/deformable_detr_segm_vl.py", line 428, in forward
    ) = self.transformer(
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/APE/ape/modeling/ape_deta/deformable_transformer_vl.py", line 605, in forward
    keep_inds_topk = keep_inds[keep_inds_mask]
RuntimeError: InnerRun:/usr1/02/workspace/j_ywhtRpPk/pytorch/torch_npu/csrc/framework/OpParamMaker.cpp:219 NPU error, error code is 500002
[Error]: A GE error occurs in the system.
        Rectify the fault based on the error information in the log, or you can ask us at follwing gitee link by issues: https://gitee.com/ascend/pytorch/issue
EH9999: Inner Error!
EH9999  [Exec][Op]Execute op failed. op type = NonZero, ge result = 4294967295[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
        TraceBack (most recent call last):

[04/10 06:53:56 detectron2]: ./tmp/1712731968.7509215.jpg: detected 7 instances in 67.40s
INFO:     10.92.54.160:60698 - "POST /infer HTTP/1.1" 200 OK
[WARNING] nms proposals (0) < 900, running naive topk
[WARNING] nms proposals (0) < 900, running naive topk
[WARNING] nms proposals (0) < 900, running naive topk
[04/10 06:54:44 detectron2]: ./tmp/1712732025.6573904.jpg: detected 14 instances in 58.55s
INFO:     10.92.54.160:35880 - "POST /infer HTTP/1.1" 200 OK
[04/10 06:54:45 detectron2]: ./tmp/1712732026.2869112.jpg: detected 14 instances in 59.58s
INFO:     10.92.54.160:36146 - "POST /infer HTTP/1.1" 200 OK
[04/10 06:54:56 detectron2]: ./tmp/1712732039.161774.jpg: detected 14 instances in 57.36s
INFO:     10.92.54.160:36822 - "POST /infer HTTP/1.1" 200 OK
[WARNING] nms proposals (0) < 900, running naive topk
[WARNING] nms proposals (0) < 900, running naive topk
[WARNING] nms proposals (0) < 900, running naive topk
[04/10 06:55:44 detectron2]: ./tmp/1712732085.083786.jpg: detected 14 instances in 59.38s
INFO:     10.92.54.160:39896 - "POST /infer HTTP/1.1" 200 OK
EH9999: Inner Error!
        rtStreamSynchronize execute failed, reason=[vector core exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
EH9999  synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
        TraceBack (most recent call last):

E90003: Compile operator failed, cause: Template constraint, detailed information: check_op_cap func failed, check_type: op_select_format, op_type:LinSpace failed, failure details:
Compile_info: empty_compile_info
Inputs: {'shape': (1,), 'ori_shape': (), 'format': 'ND', 'sub_format': 0, 'ori_format': 'ND', 'dtype': 'float32', 'addr_type': 0, 'ddr_base_prop': 0, 'total_shape': [1], 'slice_offset': (), 'L1_addr_offset': 0, 'L1_fusion_type': -1, 'L1_workspace_size': -1, 'valid_shape': (), 'split_index': 0, 'is_first_layer': False, 'range': (), 'ori_range': (), 'atomic_type': '', 'input_c_values': -1}
{'shape': (1,), 'ori_shape': (), 'format': 'ND', 'sub_format': 0, 'ori_format': 'ND', 'dtype': 'float32', 'addr_type': 0, 'ddr_base_prop': 0, 'total_shape': [1], 'slice_offset': (), 'L1_addr_offset': 0, 'L1_fusion_type': -1, 'L1_workspace_size': -1, 'valid_shape': (), 'split_index': 0, 'is_first_layer': False, 'range': (), 'ori_range': (), 'atomic_type': '', 'input_c_values': -1}
{'shape': (1,), 'ori_shape': (1,), 'format': 'ND', 'sub_format': 0, 'ori_format': 'ND', 'dtype': 'int32', 'addr_type': 0, 'ddr_base_prop': 0, 'total_shape': [1], 'slice_offset': (), 'L1_addr_offset': 0, 'L1_fusion_type': -1, 'L1_workspace_size': -1, 'valid_shape': (), 'split_index': 0, 'is_first_layer': False, 'range': (), 'ori_range': (), 'atomic_type': '', 'input_c_values': -1}
Outputs: {'shape': (-2,), 'ori_shape': (-2,), 'format': 'ND', 'sub_format': 0, 'ori_format': 'ND', 'dtype': 'float32', 'addr_type': 0, 'ddr_base_prop': 0, 'total_shape': [-2], 'slice_offset': (), 'L1_addr_offset': 0, 'L1_fusion_type': -1, 'L1_workspace_size': -1, 'valid_shape': (), 'split_index': 0, 'range': (), 'ori_range': (), 'atomic_type': '', 'input_c_values': -1}
Attrs: [].
        TraceBack (most recent call last):
        The error from device(chipId:1, dieId:0), serial number is 12, there is an aivec error exception, core id is 34, error code = 0x800000, dump info: pc start: 0x1240c140d0a8, current: 0x1240c140d300, vec error info: 0xd115465300, mte error info: 0x3403000096, ifu error info: 0x23c9f37f1f880, ccu error info: 0x5f082e8a43b6e9e3, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x1241803e6400.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1165]
        The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x3000096, fixp_error1 info: 0x34 fsmId:1, tslot:0, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1177]
        Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:task_info.cc][LINE:1677]
        AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1454]
        Aicore kernel execute failed, device_id=1, stream_id=2, report_stream_id=2, task_id=49049, flip_num=0, fault kernel_name=Cast_e87590d11ccda8b259ab6b1ea7212319_high_performance_210000000, program id=121, hash=3394887288916785353.[FUNC:GetError][FILE:stream.cc][LINE:1454]
        [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1454]
        rtStreamSynchronizeWithTimeout execute failed, reason=[vector core exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
        Assert ((rt_ret) == 0) failed[FUNC:DoRtStreamSyncWithTimeout][FILE:utils.cc][LINE:40]
        [Exec][Op]Execute op failed. op type = NonMaxSuppressionV3, ge result = 1343225857[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

./run_test.sh: line 54: 75357 Aborted                 (core dumped) python3.9 demo/api.py --config-file configs/LVISCOCOCOCOSTUFF_O365_OID_VGR_SA1B_REFCOCO_GQA_PhraseCut_Flickr30k/ape_deta/ape_deta_vitl_eva02_clip_vlf_lsj1024_cp_16x4_1080k.py --with-box --opts train.init_checkpoint="/root/APE/model_final.pth" model.model_language.cache_dir="" model.model_vision.select_box_nums_for_evaluation=500 model.model_vision.text_feature_bank_reset=True model.model_vision.backbone.net.xattn=False
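Both tracebacks fail at the same line, `keep_inds_topk = keep_inds[keep_inds_mask]` in `deformable_transformer_vl.py`, and the GE error names the failing op as `NonZero`: boolean-mask indexing on a tensor is lowered to a NonZero kernel, which is what the device rejects here. A minimal sketch of the equivalent operation (made-up tensor values, CPU only, for illustration):

```python
import torch

# Boolean-mask indexing, as in keep_inds[keep_inds_mask] at the failing line.
# Under the hood the mask is converted with nonzero(), which is the
# "op type = NonZero" reported in the GE error above.
keep_inds = torch.tensor([3, 7, 1, 9])
keep_inds_mask = torch.tensor([True, False, True, True])

keep_inds_topk = keep_inds[keep_inds_mask]
# Equivalent explicit form:
keep_inds_topk_explicit = keep_inds[keep_inds_mask.nonzero().squeeze(1)]
```

On CPU both forms return `tensor([3, 1, 9])`; on the NPU it is this NonZero step that fails once several requests run concurrently.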

2. Software versions:
-- CANN version: 7.0.RC1.10
-- Python version: 3.9
-- OS version: Ubuntu 18.04
-- Arch: x86_64

3. Test steps
The APE large model was adapted to the 910B and tested with the following FastAPI script:

# Copyright (c) Facebook, Inc. and its affiliates.
import argparse
import json
import multiprocessing as mp
import os
import tempfile
import time
import warnings
from collections import abc

import sys
import numpy as np
import tqdm

import torch
import torch_npu
from detectron2.config import LazyConfig, get_cfg
from detectron2.data.detection_utils import read_image
from detectron2.evaluation.coco_evaluation import instances_to_coco_json

# from detectron2.projects.deeplab import add_deeplab_config
# from detectron2.projects.panoptic_deeplab import add_panoptic_deeplab_config
from detectron2.utils.logger import setup_logger
from predictor_lazy import VisualizationDemo

import base64
import io
import gc
import uvicorn
import requests
from ctypes import *
from PIL import Image
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# constants
WINDOW_NAME = "APE"

def setup_cfg(args):
    # load config from file and command-line arguments
    cfg = LazyConfig.load(args.config_file)
    print ("=========== args.opts ============", args.opts)
    cfg = LazyConfig.apply_overrides(cfg, args.opts)

    if "output_dir" in cfg.model:
        cfg.model.output_dir = cfg.train.output_dir
    if "model_vision" in cfg.model and "output_dir" in cfg.model.model_vision:
        cfg.model.model_vision.output_dir = cfg.train.output_dir
    if "train" in cfg.dataloader:
        if isinstance(cfg.dataloader.train, abc.MutableSequence):
            for i in range(len(cfg.dataloader.train)):
                if "output_dir" in cfg.dataloader.train[i].mapper:
                    cfg.dataloader.train[i].mapper.output_dir = cfg.train.output_dir
        else:
            if "output_dir" in cfg.dataloader.train.mapper:
                cfg.dataloader.train.mapper.output_dir = cfg.train.output_dir

    if "model_vision" in cfg.model:
        cfg.model.model_vision.test_score_thresh = args.confidence_threshold
    else:
        cfg.model.test_score_thresh = args.confidence_threshold

    # default_setup(cfg, args)

    setup_logger(name="ape")
    setup_logger(name="timm")

    return cfg


def get_parser():
    parser = argparse.ArgumentParser(description="Detectron2 demo for builtin configs")
    parser.add_argument(
        "--config-file",
        default="configs/quick_schedules/mask_rcnn_R_50_FPN_inference_acc_test.yaml",
        metavar="FILE",
        help="path to config file",
    )
    parser.add_argument("--webcam", action="store_true", help="Take inputs from webcam.")
    parser.add_argument("--video-input", help="Path to video file.")
    parser.add_argument(
        "--input",
        nargs="+",
        help="A list of space separated input images; "
        "or a single glob pattern such as 'directory/*.jpg'",
    )
    parser.add_argument(
        "--output",
        help="A file or directory to save output visualizations. "
        "If not given, will show output in an OpenCV window.",
    )

    parser.add_argument(
        "--confidence-threshold",
        type=float,
        default=0.1,
        help="Minimum score for instance predictions to be shown",
    )
    parser.add_argument(
        "--opts",
        help="Modify config options using the command-line 'KEY VALUE' pairs",
        default=[],
        nargs=argparse.REMAINDER,
    )

    parser.add_argument("--text-prompt", default=None)

    parser.add_argument("--with-box", action="store_true", help="show box of instance")
    parser.add_argument("--with-mask", action="store_true", help="show mask of instance")
    parser.add_argument("--with-sseg", action="store_true", help="show mask of class")

    return parser

class Req(BaseModel):
    image: str
    text: str

@app.post('/infer')
def interface(req: Req):
    image, text_prompt = req.image, req.text
    if not image or not text_prompt or '<' in text_prompt or '>' in text_prompt:
        return {"error": "input error"}
    image = Image.open(io.BytesIO(base64.b64decode(image))).convert("RGB")
    fn = time.time()
    try:
        images = []
        os.makedirs('./tmp', exist_ok=True)
        image_path = f"./tmp/{fn}.jpg"
        image.save(image_path)
        images.append(image_path)

        for path in tqdm.tqdm(images, disable=not args.output):
            # use PIL, to be consistent with evaluation
            try:
                img = read_image(path, format="BGR")
            except Exception as e:
                print("*" * 60)
                print("fail to open image: ", e)
                print("*" * 60)
                continue
            start_time = time.time()
            predictions, visualized_output, visualized_outputs, metadata = demo.run_on_image(
                img,
                text_prompt=text_prompt,
                with_box=args.with_box,
                with_mask=args.with_mask,
                with_sseg=args.with_sseg,
            )
            logger.info(
                "{}: {} in {:.2f}s".format(
                    path,
                    "detected {} instances".format(len(predictions["instances"]))
                    if "instances" in predictions
                    else "finished",
                    time.time() - start_time,
                )
            )
            results = []
            if "instances" in predictions:
                results = instances_to_coco_json(
                    predictions["instances"].to(demo.cpu_device), path
                )
                for result in results:
                    result["category_name"] = metadata.thing_classes[result["category_id"]]
                    result["image_name"] = result["image_id"]

            if args.output:
                os.makedirs(args.output, exist_ok=True)
                if os.path.isdir(args.output):
                    assert os.path.isdir(args.output), args.output
                    out_filename = os.path.join(args.output, os.path.basename(path))
                else:
                    assert len(args.input) == 1, "Please specify a directory with args.output"
                    out_filename = args.output
                out_filename = out_filename.replace(".webp", ".png")
                out_filename = out_filename.replace(".crdownload", ".png")
                out_filename = out_filename.replace(".jfif", ".png")
                visualized_output.save(out_filename)

                for i in range(len(visualized_outputs)):
                    out_filename = (
                        os.path.join(args.output, os.path.basename(path)) + "." + str(i) + ".png"
                    )
                    visualized_outputs[i].save(out_filename)

                with open(out_filename + ".json", "w") as outp:
                    json.dump(results, outp)
        gc.collect()
    finally:
        os.remove(f'./tmp/{fn}.jpg')
    return {'result': results}

if __name__ == "__main__":
    import setproctitle
    setproctitle.setproctitle("APE")
    torch.npu.set_device('npu:1')

    # init model
    mp.set_start_method("spawn", force=True)
    args = get_parser().parse_args()
    setup_logger(name="fvcore")
    setup_logger(name="ape")
    logger = setup_logger()
    logger.info("Arguments: " + str(args))

    cfg = setup_cfg(args)
    demo = VisualizationDemo(cfg, args=args)

    uvicorn.run(app, port=8198, host="0.0.0.0")
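For reference, a single request to this endpoint can be reproduced with a minimal stdlib client. This is a sketch, not part of the original script; the port and field names simply mirror the `uvicorn.run(...)` call and the `Req` model above:

```python
import base64
import json
import urllib.request

def build_payload(image_bytes: bytes, text: str) -> bytes:
    # Fields mirror the Req model: "image" is base64-encoded bytes,
    # "text" is the prompt string.
    return json.dumps({
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "text": text,
    }).encode("utf-8")

def post_infer(image_path: str, text: str,
               url: str = "http://127.0.0.1:8198/infer") -> dict:
    # Port 8198 matches the uvicorn.run call in the server script.
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), text)
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```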

After starting this script, load was generated with JMeter. With 1 or 2 concurrent requests everything works; from 3 concurrent requests onward the errors appear and the program crashes.

Comments (4)

Agent-Chu created this issue 1 year ago

Could you provide the plog logs?


Here are the logs from ~/ascend/log/debug/plog:

[EVENT] PROFILING(75357,python3.9):2024-04-10-06:46:13.281.165 [msprof_callback_impl.cpp:324] >>> (tid:75357) Started to register profiling ctrl callback.
[EVENT] PROFILING(75357,python3.9):2024-04-10-06:46:13.281.241 [msprof_callback_impl.cpp:331] >>> (tid:75357) Started to register profiling hash id callback.
[EVENT] PROFILING(75357,python3.9):2024-04-10-06:46:13.281.244 [prof_atls_plugin.cpp:83] >>> (tid:75357) RegisterProfileCallback, callback type is 7
[EVENT] PROFILING(75357,python3.9):2024-04-10-06:46:13.281.246 [msprof_callback_impl.cpp:338] >>> (tid:75357) Started to register profiling enable host freq callback.
[EVENT] PROFILING(75357,python3.9):2024-04-10-06:46:13.281.248 [prof_atls_plugin.cpp:83] >>> (tid:75357) RegisterProfileCallback, callback type is 8
[EVENT] RUNTIME(75357,python3.9):2024-04-10-06:46:13.326.030 [runtime.cc:4300] 75357 GetVisibleDevices: ASCEND_RT_VISIBLE_DEVICES param was not set
[EVENT] PROFILING(75357,python3.9):2024-04-10-06:46:13.326.609 [prof_atls_plugin.cpp:160] >>> (tid:75357) Module[7] register callback of ctrl handle.
[EVENT] PROFILING(75357,cqy-APE):2024-04-10-06:46:15.262.283 [prof_atls_plugin.cpp:160] >>> (tid:75357) Module[48] register callback of ctrl handle.
[EVENT] PROFILING(75357,cqy-APE):2024-04-10-06:46:15.262.792 [prof_atls_plugin.cpp:160] >>> (tid:75357) Module[45] register callback of ctrl handle.
[EVENT] PROFILING(75357,cqy-APE):2024-04-10-06:46:15.331.890 [prof_atls_plugin.cpp:160] >>> (tid:75357) Module[6] register callback of ctrl handle.
[EVENT] PROFILING(75357,cqy-APE):2024-04-10-06:46:15.515.199 [msprof_callback_impl.cpp:79] >>> (tid:75357) MsprofCtrlCallback called, type: 255
[EVENT] PROFILING(75357,cqy-APE):2024-04-10-06:46:15.515.328 [ai_drv_dev_api.cpp:333] >>> (tid:75357) Succeeded to DrvGetApiVersion version: 0x71a09
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:46:15.735.923 [device.cc:341] 75357 Init: isDoubledie:0, topologytype:0
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:46:15.748.754 [npu_driver.cc:5301] 75487 GetDeviceStatus: GetDeviceStatus status=1.
[TRACE] GE(75357,cqy-APE):2024-04-10-06:46:15.752.580 [status:INIT] [ge_api.cc:206]75357 GEInitializeImpl:GEInitialize start
[EVENT] PROFILING(75357,cqy-APE):2024-04-10-06:46:15.807.632 [msprof_callback_impl.cpp:79] >>> (tid:75357) MsprofCtrlCallback called, type: 255
[EVENT] PROFILING(75357,cqy-APE):2024-04-10-06:46:15.807.643 [ai_drv_dev_api.cpp:333] >>> (tid:75357) Succeeded to DrvGetApiVersion version: 0x71a09
[TRACE] GE(75357,cqy-APE):2024-04-10-06:46:15.834.967 [status:RUNNING] [ge_api.cc:270]75357 GEInitializeImpl:Initializing environment
[EVENT] TUNE(75357,cqy-APE):2024-04-10-06:46:16.106.948 [cann_kb_pyfunc_mgr.cpp:72][CANNKB][Tid:75357]"CannKbPyfuncMgr: Enter PyObjectInit, reference_ is 0!"
[EVENT] TUNE(75357,cqy-APE):2024-04-10-06:46:16.106.966 [handle_manager.cpp:115][CANNKB][Tid:75357]"Start to run init functions to load dynamic python lib!"
[EVENT] TUNE(75357,cqy-APE):2024-04-10-06:46:16.107.027 [handle_manager.cpp:407][CANNKB][Tid:75357]"Init functions of loading dynamic python lib end!"
[EVENT] TUNE(75357,cqy-APE):2024-04-10-06:46:16.107.032 [cann_kb_pyfunc_mgr.cpp:24][CANNKB][Tid:75357]"CANN_KB_Py has already been initialized."
[EVENT] TUNE(75357,cqy-APE):2024-04-10-06:46:16.911.819 [cann_kb_pyfunc_mgr.cpp:117][CANNKB][Tid:75357]"CannKbPyfuncMgr: Run PyObjectInit successfully!"
[EVENT] TBE(75357,cqy-APE):2024-04-10-06:46:16.979.829 [../../../../../../latest/python/site-packages/tbe/common/repository_manager/utils/repository_manager_log.py:30][log] [../../../../../../latest/python/site-packages/tbe/common/repository_manager/route.py:312][repository_manager] get_compiler_core_num core_num = [8].
[EVENT] TBE(75357,cqy-APE):2024-04-10-06:46:18.250.012 [../../../../../../latest/python/site-packages/te_fusion/parallel_compilation.py:552][init] The time cost of random buffer compile is [84758] micro second.
[EVENT] TBE(75357,cqy-APE):2024-04-10-06:46:18.262.724 [../../../../../../latest/python/site-packages/te_fusion/parallel_compilation.py:552][init] The time cost of random buffer compile is [5776] micro second.
[TRACE] GE(75357,cqy-APE):2024-04-10-06:46:19.208.553 [status:STOP] [ge_api.cc:313]75357 GEInitializeImpl:GEInitialize finished
[EVENT] PROFILING(75357,cqy-APE):2024-04-10-06:47:13.517.563 [prof_atls_plugin.cpp:160] >>> (tid:75357) Module[61] register callback of ctrl handle.
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:49.077.231 [logger.cc:1046] 75975 ModelBindStream: model_id=320, stream_id=1024, flag=0.
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:49.078.393 [logger.cc:1059] 75975 ModelUnbindStream: model_id=320, stream_id=1024,
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:49.265.095 [logger.cc:1046] 75975 ModelBindStream: model_id=1600, stream_id=832, flag=0.
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:49.266.025 [logger.cc:1059] 75975 ModelUnbindStream: model_id=1600, stream_id=832,
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:49.631.236 [logger.cc:1046] 75975 ModelBindStream: model_id=1600, stream_id=832, flag=0.
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:49.632.042 [logger.cc:1059] 75975 ModelUnbindStream: model_id=1600, stream_id=832,
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:49.813.736 [logger.cc:1046] 75975 ModelBindStream: model_id=1600, stream_id=832, flag=0.
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:49.814.654 [logger.cc:1059] 75975 ModelUnbindStream: model_id=1600, stream_id=832,
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:50.041.245 [logger.cc:1046] 75975 ModelBindStream: model_id=1600, stream_id=832, flag=0.
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:50.042.079 [logger.cc:1059] 75975 ModelUnbindStream: model_id=1600, stream_id=832,
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:50.186.111 [logger.cc:1046] 75975 ModelBindStream: model_id=1600, stream_id=832, flag=0.
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:50.186.886 [logger.cc:1059] 75975 ModelUnbindStream: model_id=1600, stream_id=832,
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:50.300.641 [logger.cc:1046] 75975 ModelBindStream: model_id=1600, stream_id=832, flag=0.
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:50.301.319 [logger.cc:1059] 75975 ModelUnbindStream: model_id=1600, stream_id=832,
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:50.389.288 [logger.cc:1046] 75975 ModelBindStream: model_id=1600, stream_id=832, flag=0.
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:50.389.956 [logger.cc:1059] 75975 ModelUnbindStream: model_id=1600, stream_id=832,
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:50.513.523 [logger.cc:1046] 75975 ModelBindStream: model_id=1600, stream_id=832, flag=0.
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:50.514.358 [logger.cc:1059] 75975 ModelUnbindStream: model_id=1600, stream_id=832,
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:50.604.153 [logger.cc:1046] 75975 ModelBindStream: model_id=1600, stream_id=832, flag=0.
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:50.604.877 [logger.cc:1059] 75975 ModelUnbindStream: model_id=1600, stream_id=832,
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:50.933.862 [logger.cc:1046] 75975 ModelBindStream: model_id=1600, stream_id=832, flag=0.
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:52:50.934.377 [logger.cc:1059] 75975 ModelUnbindStream: model_id=1600, stream_id=832,
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:53:39.686.132 [logger.cc:1046] 75975 ModelBindStream: model_id=1600, stream_id=832, flag=0.
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:53:39.687.823 [logger.cc:1059] 75975 ModelUnbindStream: model_id=1600, stream_id=832,
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:53:42.312.607 [logger.cc:1046] 75977 ModelBindStream: model_id=1856, stream_id=1408, flag=0.
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:53:42.313.745 [logger.cc:1059] 75977 ModelUnbindStream: model_id=1856, stream_id=1408,
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:53:42.326.588 [logger.cc:1046] 76243 ModelBindStream: model_id=1856, stream_id=1216, flag=0.
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:53:42.327.475 [logger.cc:1059] 76243 ModelUnbindStream: model_id=1856, stream_id=1216,
[ERROR] GE(75357,cqy-APE):2024-04-10-06:53:42.329.551 [infer_shape.cc:223]76243 SetOutputShape: ErrorNo: 4294967295(failed) [FINAL][FINAL][SetOutputShape][SetOutputShape_NonZero13_4430]Node[NonZero13] output[0] dim_num=[4294967295] is greater than MaxDimNum[8]
[ERROR] ASCENDCL(75357,cqy-APE):2024-04-10-06:53:42.329.637 [op_executor.cpp:377]76243 DoExecuteAsync: [FINAL][FINAL][Exec][Op]Execute op failed. op type = NonZero, ge result = 4294967295
[ERROR] GE(75357,cqy-APE):2024-04-10-06:53:45.779.732 [infer_shape.cc:223]76426 SetOutputShape: ErrorNo: 4294967295(failed) [SetOutputShape][SetOutputShape_NonZero13_4430]Node[NonZero13] output[0] dim_num=[4294967295] is greater than MaxDimNum[8]
[ERROR] ASCENDCL(75357,cqy-APE):2024-04-10-06:53:45.779.749 [op_executor.cpp:377]76426 DoExecuteAsync: [Exec][Op]Execute op failed. op type = NonZero, ge result = 4294967295
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:53:45.997.279 [logger.cc:1046] 75975 ModelBindStream: model_id=1600, stream_id=832, flag=0.
[EVENT] RUNTIME(75357,cqy-APE):2024-04-10-06:53:45.998.073 [logger.cc:1059] 75975 ModelUnbindStream: model_id=1600, stream_id=832,

The logs point to an operator failure (the NonZero infer-shape error above). Could you extract a minimal reproduction case?

huangyunlong changed the status from TODO to WIP 1 year ago
huangyunlong changed the status from WIP to DONE 11 months ago

Has this issue been resolved? I'm running into a similar situation.
