The error points to a driver-level failure. We have contacted the corresponding engineer to look into it and will reply here once there is a result.
Please provide the logs under /root/ascend/log together with the device logs; see the pinned issue for how to collect them. Please also provide the chip information.
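For reference, a minimal sketch of bundling the host-side logs for upload (assuming the default plog location /root/ascend/log; adjust if your ASCEND_PROCESS_LOG_PATH points elsewhere):

```python
# Sketch only: package /root/ascend/log for attaching to the issue.
# Assumption: the default plog directory; change the path if your
# ASCEND_PROCESS_LOG_PATH is set to a different location.
import shutil

shutil.make_archive("ascend-plog", "zip", "/root/ascend/log")
# produces ascend-plog.zip in the current working directory
```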
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2                        Version: 24.1.rc2                                        |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page) |
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)         |
+===========================+===============+====================================================+
| 0     910B2               | OK            | 96.3        40                0    / 0              |
| 0                         | 0000:C1:00.0  | 0           0    / 0          3358 / 65536          |
+===========================+===============+====================================================+
CANN: 8.0.RC1.alpha003
Please provide the logs under /root/ascend/log and the device logs; see the pinned issue for how to collect them.
I reran it last night with torch-npu upgraded to 2.1.0.post8. I had previously tried post3, post4, and post7, and all of them hit the same error as above.
Log details: https://gitee.com/wangXuPen/ascend-errorlog/blob/master/ascend-plog-20241129.zip
Could you also provide the console (stdout) logs? Looking at the plog, it seems some processes were killed, so the information is incomplete.
The plog provided so far shows FFTS+ task timeouts across the board. Could you disable the elastic kill and rerun, so we can see where the error is actually raised? The relevant files are:
torch\distributed\elastic\agent\server\api.py
torch\distributed\elastic\multiprocessing\api.py
From what we can see here, there is no obvious error; it only shows that one process died and the job then ended.
{'loss': 4.2566, 'grad_norm': 79.8895263671875, 'learning_rate': 4.885185185185185e-05, 'epoch': 0.08, 'num_input_tokens_seen': 43520}
3%|▎ | 40/1350 [29:19<16:22:07, 44.98s/it][INFO|trainer.py:3829] 2024-11-26 21:53:44,560 >> 9, 'num_input_tokens_seen': 45808}
***** Running Evaluation *****
[INFO|trainer.py:3831] 2024-11-26 21:53:44,560 >> Num examples = 400
[INFO|trainer.py:3834] 2024-11-26 21:53:44,560 >> Batch size = 1
3%|▎ | 40/1350 [30:04<16:22:07, 44.98s/it]635, 'eval_samples_per_second': 8.962, 'eval_steps_per_second': 1.12, 'epoch': 0.09, 'num_input_tokens_seen': 45808}
{'loss': 3.9987, 'grad_norm': 36.59013366699219, 'learning_rate': 4.8703703703703704e-05, 'epoch': 0.09, 'num_input_tokens_seen': 48040}
{'loss': 3.7563, 'grad_norm': 27.110736846923828, 'learning_rate': 4.862962962962963e-05, 'epoch': 0.1, 'num_input_tokens_seen': 50360}
{'loss': 3.7847, 'grad_norm': 37.74220275878906, 'learning_rate': 4.855555555555556e-05, 'epoch': 0.1, 'num_input_tokens_seen': 52664}
{'loss': 3.6238, 'grad_norm': 26.89166831970215, 'learning_rate': 4.848148148148148e-05, 'epoch': 0.11, 'num_input_tokens_seen': 55008}
{'loss': 3.5754, 'grad_norm': 22.43907356262207, 'learning_rate': 4.840740740740741e-05, 'epoch': 0.11, 'num_input_tokens_seen': 57328}
{'loss': 3.4844, 'grad_norm': 20.496408462524414, 'learning_rate': 4.8333333333333334e-05, 'epoch': 0.12, 'num_input_tokens_seen': 59632}
EE9999: Inner Error!
EE9999 Task execute failed, device_id=0, stream_id=6, task_id=34319, flip_num=4, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1512]
TraceBack (most recent call last):
rtStreamSynchronizeWithTimeout execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
synchronize stream failed, runtime result = 107020[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
4%|▍ | 52/1350 [38:57<16:15:14, 45.08s/it]Traceback (most recent call last):
File "/usr/local/python3.9.2/lib/python3.9/site-packages/llamafactory/launcher.py", line 23, in <module>
launch()
File "/usr/local/python3.9.2/lib/python3.9/site-packages/llamafactory/launcher.py", line 19, in launch
run_exp()
File "/usr/local/python3.9.2/lib/python3.9/site-packages/llamafactory/train/tuner.py", line 47, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/usr/local/python3.9.2/lib/python3.9/site-packages/llamafactory/train/sft/workflow.py", line 87, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/usr/local/python3.9.2/lib/python3.9/site-packages/transformers/trainer.py", line 1948, in train
return inner_training_loop(
File "/usr/local/python3.9.2/lib/python3.9/site-packages/transformers/trainer.py", line 2289, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/usr/local/python3.9.2/lib/python3.9/site-packages/transformers/trainer.py", line 3359, in training_step
self.accelerator.backward(loss, **kwargs)
File "/usr/local/python3.9.2/lib/python3.9/site-packages/accelerate/accelerator.py", line 2151, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/usr/local/python3.9.2/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
self.engine.backward(loss, **kwargs)
File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2011, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2214, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/usr/local/python3.9.2/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/usr/local/python3.9.2/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/usr/local/python3.9.2/lib/python3.9/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/usr/local/python3.9.2/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 288, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/usr/local/python3.9.2/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1133, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param)
File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1484, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param)
File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1225, in reduce_independent_p_g_buckets_and_remove_grads
self.__reduce_and_partition_ipg_grads()
File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/python3.9.2/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1277, in __reduce_and_partition_ipg_grads
self.partition_grads(self.params_in_ipg_bucket, grad_partitions)
File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1435, in partition_grads
grad_buffer.copy_(grad_partition, non_blocking=True)
RuntimeError: ACL stream synchronize failed.
EE9999: Inner Error!
EE9999 Task execute failed, device_id=1, stream_id=6, task_id=34295, flip_num=4, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1512]
TraceBack (most recent call last):
rtStreamSynchronizeWithTimeout execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
synchronize stream failed, runtime result = 107020[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
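The two EE9999 blocks above only report the stream synchronization that noticed the timeout, not the op that caused it. One way to localize the failing op is synchronous launch mode; a minimal sketch, assuming this torch_npu build honors the documented ASCEND_LAUNCH_BLOCKING switch:

```python
# Sketch: make NPU ops execute synchronously so the Python traceback
# points at the op that failed, rather than at a later stream sync.
# Assumption: ASCEND_LAUNCH_BLOCKING is honored by this torch_npu build.
import os

os.environ["ASCEND_LAUNCH_BLOCKING"] = "1"  # must be set before importing torch_npu

import torch
import torch_npu  # noqa: F401  (registers the NPU backend)
```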
What is the effect of commenting that out? Or should I just increase the timeout a bit?
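If raising the timeout is attempted, the usual knobs are the HCCL timeout environment variables; a hedged sketch (assuming HCCL_CONNECT_TIMEOUT and HCCL_EXEC_TIMEOUT govern this EVENT_WAIT timeout — verify against the CANN documentation for your version):

```python
# Sketch only: raise the collective-communication timeouts before the
# distributed job initializes. The values (in seconds) are placeholders.
import os

os.environ["HCCL_CONNECT_TIMEOUT"] = "1200"  # connection setup timeout
os.environ["HCCL_EXEC_TIMEOUT"] = "3600"     # collective execution timeout
```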
This looks like a plain timeout error. The plog provided earlier seems incomplete, and the cause of the timeout is not visible in it. We would like to disable the elastic kill so that the root cause of the timeout actually gets reported, and then have you provide the plog again.
The screenshot above shows commenting out the corresponding code, so that elastic does not kill the processes when an error occurs; otherwise the error information never gets reported.
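For readers without the screenshot: the change amounts to suppressing the elastic agent's teardown of sibling workers. A rough, version-dependent sketch of an equivalent monkeypatch (the method name and signature match the torch 2.1.x layout; treat them as assumptions, and use this for debugging only):

```python
# debug_keep_workers.py -- debugging sketch, NOT for production use.
# Assumption: the torch 2.1.x layout of torch.distributed.elastic; the
# method below may have a different signature in other torch versions.
from torch.distributed.elastic.agent.server.local_elastic_agent import LocalElasticAgent

def _no_stop(self, worker_group):
    # Skip killing sibling workers when one rank fails, so the failing
    # rank can finish flushing its error report and plog to disk.
    print("[debug] elastic _stop_workers suppressed")

LocalElasticAgent._stop_workers = _no_stop
```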
Would it be convenient for you to rerun and collect the logs again?