
Ascend/pytorch

Fine-tuning a large model with torch_npu on multiple NPUs: after training for a certain number of steps, an error is raised

DONE
Defect
Created on 2024-11-26 21:06

Comments (10)

wangXuPen created the issue 6 months ago
wangXuPen edited the description 6 months ago

The error points to a driver-level failure. We have contacted the responsible engineer to look into it and will reply here once there are results.

Please provide the logs under /root/ascend/log and the device logs; see the pinned issue for how to collect them. Also provide the chip information.
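For bundling those host-side logs, a minimal sketch of one way to pack the plog directory into a zip for attaching to the issue (assuming the default location /root/ascend/log; adjust if your installation or ASCEND_PROCESS_LOG_PATH points elsewhere):

import shutil
from pathlib import Path

# Default host-side plog location on Ascend machines; adjust to your setup.
log_dir = Path("/root/ascend/log")

if log_dir.is_dir():
    # Creates ascend-plog.zip in the current working directory.
    archive = shutil.make_archive("ascend-plog", "zip", root_dir=log_dir)
    print(f"Packed {log_dir} into {archive}")
else:
    print(f"{log_dir} not found; check where your CANN install writes plog files.")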

Driver and chip information:

+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2                 Version: 24.1.rc2                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B2               | OK            | 96.3        40                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          3358 / 65536         |
+===========================+===============+====================================================+

CANN: 8.0.RC1.alpha003
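As a cross-check of the npu-smi output above, a small sketch that queries the same device from Python, assuming torch_npu exposes the CUDA-style device API under torch.npu (names of the printed fields are illustrative):

import torch
import torch_npu  # registers the "npu" device and the torch.npu namespace

print("torch:", torch.__version__)
print("torch_npu:", torch_npu.__version__)

if torch.npu.is_available():
    for i in range(torch.npu.device_count()):
        # e.g. the 910B2 chip shown in the npu-smi table above
        print(f"npu:{i}", torch.npu.get_device_name(i))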

Please provide the logs under /root/ascend/log and the device logs; see the pinned issue for how to collect them.

I reran the job last night with torch-npu upgraded to 2.1.0.post8; I had previously tried post3, post4, and post7, and all of them hit the same error.
The logs are here: https://gitee.com/wangXuPen/ascend-errorlog/blob/master/ascend-plog-20241129.zip

Could you also provide the console (screen) output? Judging from the current plog, some processes were killed, so the information is incomplete.
The plog provided so far only shows FFTS+ task timeouts. Could you disable the elastic kill, rerun, and see where exactly the error is raised?
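On the console-log request: when the launcher interleaves or swallows per-rank output, one simple way to keep a copy is to tee each rank's stdout/stderr into its own file from inside the training script. A self-contained sketch (file names and the RANK/LOCAL_RANK fallback are illustrative):

import os
import sys

class Tee:
    """Write everything to the original stream and to a per-rank log file."""
    def __init__(self, stream, path):
        self._stream = stream
        self._log = open(path, "a", buffering=1)  # line-buffered

    def write(self, data):
        self._stream.write(data)
        self._log.write(data)

    def flush(self):
        self._stream.flush()
        self._log.flush()

rank = os.environ.get("RANK", os.environ.get("LOCAL_RANK", "0"))
sys.stdout = Tee(sys.stdout, f"console_rank{rank}.log")
sys.stderr = Tee(sys.stderr, f"console_rank{rank}.log")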

torch\distributed\elastic\agent\server\api.py
[screenshot: the corresponding code commented out]

torch\distributed\elastic\multiprocessing\api.py
[screenshot: the corresponding code commented out]

From this part there is no obvious error; it only shows that one of the processes died and the job was then shut down.

{'loss': 4.2566, 'grad_norm': 79.8895263671875, 'learning_rate': 4.885185185185185e-05, 'epoch': 0.08, 'num_input_tokens_seen': 43520}
  3%|▎         | 40/1350 [29:19<16:22:07, 44.98s/it][INFO|trainer.py:3829] 2024-11-26 21:53:44,560 >> 9, 'num_input_tokens_seen': 45808}
***** Running Evaluation *****                 
[INFO|trainer.py:3831] 2024-11-26 21:53:44,560 >>   Num examples = 400
[INFO|trainer.py:3834] 2024-11-26 21:53:44,560 >>   Batch size = 1

  3%|▎         | 40/1350 [30:04<16:22:07, 44.98s/it]635, 'eval_samples_per_second': 8.962, 'eval_steps_per_second': 1.12, 'epoch': 0.09, 'num_input_tokens_seen': 45808}
{'loss': 3.9987, 'grad_norm': 36.59013366699219, 'learning_rate': 4.8703703703703704e-05, 'epoch': 0.09, 'num_input_tokens_seen': 48040}
{'loss': 3.7563, 'grad_norm': 27.110736846923828, 'learning_rate': 4.862962962962963e-05, 'epoch': 0.1, 'num_input_tokens_seen': 50360}
{'loss': 3.7847, 'grad_norm': 37.74220275878906, 'learning_rate': 4.855555555555556e-05, 'epoch': 0.1, 'num_input_tokens_seen': 52664}
{'loss': 3.6238, 'grad_norm': 26.89166831970215, 'learning_rate': 4.848148148148148e-05, 'epoch': 0.11, 'num_input_tokens_seen': 55008}
{'loss': 3.5754, 'grad_norm': 22.43907356262207, 'learning_rate': 4.840740740740741e-05, 'epoch': 0.11, 'num_input_tokens_seen': 57328}
{'loss': 3.4844, 'grad_norm': 20.496408462524414, 'learning_rate': 4.8333333333333334e-05, 'epoch': 0.12, 'num_input_tokens_seen': 59632}
EE9999: Inner Error!
EE9999  Task execute failed, device_id=0, stream_id=6, task_id=34319, flip_num=4, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1512]
        TraceBack (most recent call last):
        rtStreamSynchronizeWithTimeout execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        synchronize stream failed, runtime result = 107020[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
  4%|▍         | 52/1350 [38:57<16:15:14, 45.08s/it]Traceback (most recent call last):
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/llamafactory/launcher.py", line 23, in <module>
    launch()
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/llamafactory/launcher.py", line 19, in launch
    run_exp()
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/llamafactory/train/tuner.py", line 47, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/llamafactory/train/sft/workflow.py", line 87, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/transformers/trainer.py", line 1948, in train
    return inner_training_loop(
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/transformers/trainer.py", line 2289, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/transformers/trainer.py", line 3359, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/accelerate/accelerator.py", line 2151, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
    self.engine.backward(loss, **kwargs)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2011, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2214, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 288, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1133, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1484, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1225, in reduce_independent_p_g_buckets_and_remove_grads
    self.__reduce_and_partition_ipg_grads()
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1277, in __reduce_and_partition_ipg_grads
    self.partition_grads(self.params_in_ipg_bucket, grad_partitions)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/python3.9.2/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1435, in partition_grads
    grad_buffer.copy_(grad_partition, non_blocking=True)
RuntimeError: ACL stream synchronize failed.
EE9999: Inner Error!
EE9999  Task execute failed, device_id=1, stream_id=6, task_id=34295, flip_num=4, task_type=3(EVENT_WAIT).[FUNC:GetError][FILE:stream.cc][LINE:1512]
        TraceBack (most recent call last):
        rtStreamSynchronizeWithTimeout execute failed, reason=[task timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        synchronize stream failed, runtime result = 107020[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
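The EVENT_WAIT timeout above only tells us which stream synchronize noticed the problem, not which operator actually hung. One way to narrow that down, assuming your torch_npu build honours ASCEND_LAUNCH_BLOCKING (the NPU analogue of CUDA_LAUNCH_BLOCKING; verify against the torch_npu documentation for your version), is to rerun a short reproduction with kernels launched synchronously so the Python traceback stops at the failing call instead of at a later synchronize:

import os

# Assumption: ASCEND_LAUNCH_BLOCKING is supported by this torch_npu version;
# it must be set before torch / torch_npu are imported.
os.environ["ASCEND_LAUNCH_BLOCKING"] = "1"

import torch        # noqa: E402
import torch_npu    # noqa: E402,F401  (registers the "npu" device)

# ... build the model and run a few training steps as usual; expect it to be
# noticeably slower, but the stack trace will point at the real failing op.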

What is the purpose of commenting that code out? Or should we just increase the timeout a bit?

This looks like a plain timeout error. The plog provided earlier seems incomplete and does not show what caused the timeout. The point of turning off the elastic kill is to let the root cause of the timeout be reported; please then provide the plog again.

The screenshots above show the corresponding code commented out, so that elastic does not kill the worker processes when an error occurs, which would otherwise prevent the error information from being reported.

Would it be convenient for you to rerun and collect the log information again?
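On the "just increase the timeout" question above: raising the collective/runtime timeouts can keep the job alive through a slow step, but it only hides the symptom if a rank is genuinely hung. If you do want to try it while re-collecting logs, a hedged sketch for a script that controls process-group setup itself (the HCCL variable names are assumptions; check the CANN environment-variable reference for your version):

import os
from datetime import timedelta

import torch.distributed as dist

# Assumed HCCL knobs (verify against your CANN docs): connection-setup timeout
# and collective-execution timeout, both in seconds. Set before initialization.
os.environ.setdefault("HCCL_CONNECT_TIMEOUT", "1200")
os.environ.setdefault("HCCL_EXEC_TIMEOUT", "3600")

# The process-group timeout itself is standard torch.distributed API.
dist.init_process_group(backend="hccl", timeout=timedelta(minutes=60))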

huangyunlong changed the task status from TODO to WIP 6 months ago
huangyunlong changed the task status from WIP to DONE 5 months ago
