
Ascend/pytorch

Fine-tuning a large model with torch_npu on multiple NPUs: after training for a certain number of steps, an error is raised

DONE
Defect
Created on 2024-11-26 21:06

Comments (10)

wangXuPen created the issue 6 months ago
wangXuPen edited the description 6 months ago

The error points to a driver-level failure. We have contacted the responsible engineer to look into it and will reply here once there are results.

Please provide the logs under /root/ascend/log and the device logs; see the pinned issue for how to collect them. Also provide the chip information.
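For bundling those host-side logs, a minimal sketch of one way to pack the plog directory into a zip for attaching to the issue (assuming the default location /root/ascend/log; adjust if your installation or ASCEND_PROCESS_LOG_PATH points elsewhere):

import shutil
from pathlib import Path

# Default host-side plog location on Ascend machines; adjust to your setup.
log_dir = Path("/root/ascend/log")

if log_dir.is_dir():
    # Creates ascend-plog.zip in the current working directory.
    archive = shutil.make_archive("ascend-plog", "zip", root_dir=log_dir)
    print(f"Packed {log_dir} into {archive}")
else:
    print(f"{log_dir} not found; check where your CANN install writes plog files.")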

Driver and chip information:

+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2                 Version: 24.1.rc2                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B2               | OK            | 96.3        40                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          3358 / 65536         |
+===========================+===============+====================================================+

CANN: 8.0.RC1.alpha003
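As a cross-check of the npu-smi output above, a small sketch that queries the same device from Python, assuming torch_npu exposes the CUDA-style device API under torch.npu (names of the printed fields are illustrative):

import torch
import torch_npu  # registers the "npu" device and the torch.npu namespace

print("torch:", torch.__version__)
print("torch_npu:", torch_npu.__version__)

if torch.npu.is_available():
    for i in range(torch.npu.device_count()):
        # e.g. the 910B2 chip shown in the npu-smi table above
        print(f"npu:{i}", torch.npu.get_device_name(i))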

Please provide the logs under /root/ascend/log and the device logs; see the pinned issue for how to collect them.

I reran the job last night with torch-npu upgraded to 2.1.0.post8; I had previously tried post3, post4, and post7, and all of them hit the same error.
The logs are here: https://gitee.com/wangXuPen/ascend-errorlog/blob/master/ascend-plog-20241129.zip

Could you also provide the console (screen) output? Judging from the current plog, some processes were killed, so the information is incomplete.
The plog provided so far only shows FFTS+ task timeouts. Could you disable the elastic kill, rerun, and see where exactly the error is raised?
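On the console-log request: when the launcher interleaves or swallows per-rank output, one simple way to keep a copy is to tee each rank's stdout/stderr into its own file from inside the training script. A self-contained sketch (file names and the RANK/LOCAL_RANK fallback are illustrative):

import os
import sys

class Tee:
    """Write everything to the original stream and to a per-rank log file."""
    def __init__(self, stream, path):
        self._stream = stream
        self._log = open(path, "a", buffering=1)  # line-buffered

    def write(self, data):
        self._stream.write(data)
        self._log.write(data)

    def flush(self):
        self._stream.flush()
        self._log.flush()

rank = os.environ.get("RANK", os.environ.get("LOCAL_RANK", "0"))
sys.stdout = Tee(sys.stdout, f"console_rank{rank}.log")
sys.stderr = Tee(sys.stderr, f"console_rank{rank}.log")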

torch\distributed\elastic\agent\server\api.py
[screenshot: the corresponding code commented out]

torch\distributed\elastic\multiprocessing\api.py
[screenshot: the corresponding code commented out]

From this part there is no obvious error; it only shows that one of the processes died and the job was then shut down.

{'loss': 4.2566, 'grad_norm': 79.8895263671875, 'learning_rate': 4.885185185185185e-05, 'epoch': 0.08, 'num_input_tokens_seen': 43520}
  3%|▎         | 40/1350 [29:19<16:22:07, 44.98s/it][INFO|trainer.py:3829] 2024-11-26 21:53:44,560 >> 9, 'num_input_tokens_seen': 45808}
***** Running Evaluation *****                 
[INFO|trainer.py:3831] 2024-11-26 21:53:44,560 >>   Num examples = 400
[INFO|trainer.py:3834] 2024-11-26 21:53:44,560 >>   Batch size = 1

  3%|▎         | 40/1350 [30:04<16:22:07, 44.98s/it]635, 'eval_samples_per_second': 8.962, 'eval_steps_per_second': 1.12, 'epoch': 0.09, 'num_input_tokens_seen': 45808}
{'loss': 3.9987, 'grad_norm': 36.59013366699219, 'learning_rate': 4.8703703703703704e-05, 'epoch': 0.09, 'num_input_tokens_seen': 48040}
{'loss': 3.7563, 'grad_norm': 27.110736846923828, 'learning_rate': 4.862962962962963e-05, 'epoch': 0.1, 'num_input_tokens_seen': 50360}
{'loss': 3.7847, 'grad_norm': 37.74220275878906, 'learning_rate': 4.855555555555556e-05, 'epoch': 0.1, 'num_input_tokens_seen': 52664}
{'loss': 3.6238, 'grad_norm': 26.89166831970215, 'learning_rate': 4.848148148148148e-05, 'epoch': 0.11, 'num_input_tokens_seen': 55008}
{'loss': 3.5754, 'grad_norm': 22.43907356262207, 'learning_rate': 4.840740740740741e-05, 'epoch': 0.11, 'num_input_tokens_seen': 57328}
{'loss': 3.4844, 'grad_norm': 20.496408462524414, 'learning_rate': 4.8333333333333334e-05, 'epoch': 0.12, 'num_input_tokens_seen': 59632}
EE9999: Inner Error!
EE9999  Task execute failed, device_id=0, stream_id=6, task_id=34319, flip_num=4, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1512]
        TraceBack (most recent call last):
        rtStreamSynchronizeWithTimeout execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        synchronize stream failed, runtime result = 107020[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
  4%|▍         | 52/1350 [38:57<16:15:14, 45.08s/it]Traceback (most recent call last):
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/llamafactory/launcher.py", line 23, in <module>
    launch()
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/llamafactory/launcher.py", line 19, in launch
    run_exp()
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/llamafactory/train/tuner.py", line 47, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/llamafactory/train/sft/workflow.py", line 87, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/transformers/trainer.py", line 1948, in train
    return inner_training_loop(
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/transformers/trainer.py", line 2289, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/transformers/trainer.py", line 3359, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/accelerate/accelerator.py", line 2151, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
    self.engine.backward(loss, **kwargs)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2011, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2214, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 288, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1133, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1484, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1225, in reduce_independent_p_g_buckets_and_remove_grads
    self.__reduce_and_partition_ipg_grads()
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1277, in __reduce_and_partition_ipg_grads
    self.partition_grads(self.params_in_ipg_bucket, grad_partitions)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1435, in partition_grads
    grad_buffer.copy_(grad_partition, non_blocking=True)
RuntimeError: ACL stream synchronize failed.
EE9999: Inner Error!
EE9999  Task execute failed, device_id=1, stream_id=6, task_id=34295, flip_num=4, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1512]
        TraceBack (most recent call last):
        rtStreamSynchronizeWithTimeout execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        synchronize stream failed, runtime result = 107020[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
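The EVENT_WAIT timeout above only tells us which stream synchronize noticed the problem, not which operator actually hung. One way to narrow that down, assuming your torch_npu build honours ASCEND_LAUNCH_BLOCKING (the NPU analogue of CUDA_LAUNCH_BLOCKING; verify against the torch_npu documentation for your version), is to rerun a short reproduction with kernels launched synchronously so the Python traceback stops at the failing call instead of at a later synchronize:

import os

# Assumption: ASCEND_LAUNCH_BLOCKING is supported by this torch_npu version;
# it must be set before torch / torch_npu are imported.
os.environ["ASCEND_LAUNCH_BLOCKING"] = "1"

import torch        # noqa: E402
import torch_npu    # noqa: E402,F401  (registers the "npu" device)

# ... build the model and run a few training steps as usual; expect it to be
# noticeably slower, but the stack trace will point at the real failing op.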

What is the purpose of commenting that code out? Or should we just increase the timeout a bit?

This looks like a plain timeout error. The plog provided earlier seems incomplete and does not show what caused the timeout. The point of turning off the elastic kill is to let the root cause of the timeout be reported; please then provide the plog again.

The screenshots above show the corresponding code commented out, so that elastic does not kill the worker processes when an error occurs, which would otherwise prevent the error information from being reported.

Would it be convenient for you to rerun and collect the log information again?
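On the "just increase the timeout" question above: raising the collective/runtime timeouts can keep the job alive through a slow step, but it only hides the symptom if a rank is genuinely hung. If you do want to try it while re-collecting logs, a hedged sketch for a script that controls process-group setup itself (the HCCL variable names are assumptions; check the CANN environment-variable reference for your version):

import os
from datetime import timedelta

import torch.distributed as dist

# Assumed HCCL knobs (verify against your CANN docs): connection-setup timeout
# and collective-execution timeout, both in seconds. Set before initialization.
os.environ.setdefault("HCCL_CONNECT_TIMEOUT", "1200")
os.environ.setdefault("HCCL_EXEC_TIMEOUT", "3600")

# The process-group timeout itself is standard torch.distributed API.
dist.init_process_group(backend="hccl", timeout=timedelta(minutes=60))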

huangyunlong changed the task status from TODO to WIP 6 months ago
huangyunlong changed the task status from WIP to DONE 5 months ago
