Error when fine-tuning SD3.5

- accelerate.accelerator - Saving current state to /opt/stable-diffusion-3.5/diffusers/logs/checkpoint-500
12/13/2024 15:09:01 - INFO - accelerate.accelerator - Saving DeepSpeed Model and Optimizer
EI0006: [PID: 19049] 2024-12-13-15:29:01.296.184 Getting socket times out. Reason: Remote Rank did not send the data in time. Please check the reason for the rank being stuck
        Solution: 1. Check the rank service processes with other errors or no errors in the cluster.2. If this error is reported for all NPUs, check whether the time difference between the earliest and latest errors is greater than the connect timeout interval (120s by default). If so, adjust the timeout interval by using the HCCL_CONNECT_TIMEOUT environment variable.3. Check the connectivity of the communication link between nodes. (For details, see the TLS command and HCCN connectivity check examples.). For details:https://www.hiascend.com/document
        TraceBack (most recent call last):
        Transport init error. Reason: [Create][DestLink]Create Dest error! createLink para:rank[0]-localUserrank[0]-localIpAddr[172.17.0.5], dst_rank[2]-remoteUserrank[2]-remote_ip_addr[172.17.0.5]
        Transport init error. Reason: [Create][DestLink]Create Dest error! createLink para:rank[0]-localUserrank[0]-localIpAddr[172.17.0.5], dst_rank[1]-remoteUserrank[1]-remote_ip_addr[172.17.0.5]
        Transport init error. Reason: [Create][DestLink]Create Dest error! createLink para:rank[0]-localUserrank[0]-localIpAddr[172.17.0.5], dst_rank[3]-remoteUserrank[3]-remote_ip_addr[172.17.0.5]
        Transport init error. Reason: [Create][DestLink]Create Dest error! createLink para:rank[0]-localUserrank[0]-localIpAddr[172.17.0.5], dst_rank[4]-remoteUserrank[4]-remote_ip_addr[172.17.0.5]
        Transport init error. Reason: [Create][DestLink]Create Dest error! createLink para:rank[0]-localUserrank[0]-localIpAddr[172.17.0.5], dst_rank[5]-remoteUserrank[5]-remote_ip_addr[172.17.0.5]
        Transport init error. Reason: [Create][DestLink]Create Dest error! createLink para:rank[0]-localUserrank[0]-localIpAddr[172.17.0.5], dst_rank[6]-remoteUserrank[6]-remote_ip_addr[172.17.0.5]
        Transport init error. Reason: [Create][DestLink]Create Dest error! createLink para:rank[0]-localUserrank[0]-localIpAddr[172.17.0.5], dst_rank[7]-remoteUserrank[7]-remote_ip_addr[172.17.0.5]
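
Step 2 of the EI0006 solution text above asks whether the gap between the earliest and latest error is larger than the connect timeout (120s by default). A minimal sketch of that check follows, assuming the per-rank logs have been collected into a single string. The helper names `error_times` and `spread_seconds` are hypothetical, and the sub-second part of each timestamp is ignored.

```python
import re
from datetime import datetime

# Matches headers such as "EI0006: [PID: 19049] 2024-12-13-15:29:01.296.184"
# and captures the error code, the PID, and the timestamp down to the second.
HEADER = re.compile(
    r"(EI\d{4}): \[PID: (\d+)\] (\d{4}-\d{2}-\d{2}-\d{2}:\d{2}:\d{2})"
)


def error_times(log_text):
    """Return a list of (code, pid, datetime) for every EIxxxx header found."""
    found = []
    for code, pid, stamp in HEADER.findall(log_text):
        when = datetime.strptime(stamp, "%Y-%m-%d-%H:%M:%S")
        found.append((code, int(pid), when))
    return found


def spread_seconds(events, code):
    """Seconds between the earliest and latest report of one error code.

    Returns None when the code does not appear in the events at all.
    """
    times = [when for c, _, when in events if c == code]
    if not times:
        return None
    return (max(times) - min(times)).total_seconds()


# Header lines shortened from the log in this post.
sample = """\
EI0006: [PID: 19049] 2024-12-13-15:29:01.296.184 Getting socket times out.
EI0002: [PID: 19054] 2024-12-13-15:39:42.626.496 The wait execution of the Notify register times out.
EI0002: [PID: 19055] 2024-12-13-15:39:42.848.828 The wait execution of the Notify register times out.
"""

events = error_times(sample)
for code in ("EI0006", "EI0002"):
    print(code, spread_seconds(events, code))
```

If the spread for an error code exceeds the relevant timeout, the ranks did not fail together, and the earliest rank to report is the one worth inspecting first.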

Traceback (most recent call last):
  File "/opt/stable-diffusion-3.5/diffusers/./examples/dreambooth/train_dreambooth_lora_sd3.py", line 1934, in <module>
    main(args)
  File "/opt/stable-diffusion-3.5/diffusers/./examples/dreambooth/train_dreambooth_lora_sd3.py", line 1817, in main
    accelerator.save_state(save_path)
  File "/usr/local/python3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 2952, in save_state
    model.save_checkpoint(output_dir, ckpt_id, **save_model_func_kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3137, in save_checkpoint
    dist.barrier()
  File "/usr/local/python3.10/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 408, in barrier
    return cdb.barrier(group=group, async_op=async_op)
  File "/usr/local/python3.10/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 330, in barrier
    return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
  File "/usr/local/python3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3703, in barrier
    work.wait()
RuntimeError: The Inner error is reported as above.
Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, pleace set the environment variable ASCEND_LAUNCH_BLOCKING=1.
[ERROR] 2024-12-13-15:29:01 (PID:19049, Device:0, RankID:0) ERR00100 PTA call acl api failed
Steps:  25%|██▌       | 500/2000 [33:06<1:39:18,  3.97s/it, loss=0.104, lr=1e-5]
Inner error, see details in Ascend logs.Inner error, see details in Ascend logs.EI0002: [PID: 19054] 2024-12-13-15:39:42.626.496 The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank [unknown].base information: [streamID:[2550120140], taskID[10817], tag[AllReduce_172.17.0.5%eth0_60000_0_1734072939759954], AlgType(level 0-1-2):[fullmesh-ring-ring].] task information: [
there are(is) 1 abnormal device(s):
        Cluster Exception Location[IP/ID]:[172.17.0.5/0], Arrival Time:[Fri Dec 13 15:27:15 2024], ExceptionType:[Stuck Occurred], Possible Reason:1.Host process is stuck, 2.Device task is stuck
]
        Possible Cause: 1. An exception occurs during the execution on some NPUs in the cluster. As a result, collective communication operation failed.2. The execution speed on some NPU in the cluster is too slow to complete a communication operation within the timeout interval. (default 1800s, You can set the interval by using HCCL_EXEC_TIMEOUT.)3. The number of training samples of each NPU is inconsistent.4. Packet loss or other connectivity problems occur on the communication link.
        Solution: 1. If this error is reported on part of these ranks, check other ranks to see whether other errors have been reported earlier.2. If this error is reported for all ranks, check whether the error reporting time is consistent (the maximum difference must not exceed 1800s). If not, locate the cause or adjust the locate the cause or set the HCCL_EXEC_TIMEOUT environment variable to a larger value.3. Check whether the completion queue element (CQE) of the error exists in the plog(grep -rn 'error cqe'). If so, check the network connection status. (For details, see the TLS command and HCCN connectivity check examples.)4. Ensure that the number of training samples of each NPU is consistent. For details:https://www.hiascend.com/document
        TraceBack (most recent call last):
        The error from device(chipId:5, dieId:0), serial number is 14, hccl fftsplus task timeout occurred during task execution, stream_id:4, sq_id:4, task_id:10817, stuck notify num:1, timeout:1836.[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1645]
        The 0 stuck notify wait context info:(context_id=7, notify_id=127).[FUNC:ProcessStarsHcclFftsPlusTimeoutErrorInfo][FILE:device_error_proc.cc][LINE:1652]
@@@
The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank [unknown].base information: [streamID:[2550120140], taskID[10817], tag[AllReduce_172.17.0.5%eth0_60000_0_1734072939759954], AlgType(level 0-1-2):[fullmesh-ring-ring].] task information: [
there are(is) 1 abnormal device(s):
        Cluster Exception Location[IP/ID]:[172.17.0.5/0], Arrival Time:[Fri Dec 13 15:27:15 2024], ExceptionType:[Stuck Occurred], Possible Reason:1.Host process is stuck, 2.Device task is stuck
]
        rtStreamSynchronize execute failed, reason=[fftsplus timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        synchronize stream failed, runtime result = 507048[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

EI0002: [PID: 19055] 2024-12-13-15:39:42.848.828 The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank [unknown].base information: [streamID:[2514751180], taskID[10817], tag[AllReduce_172.17.0.5%eth0_60000_0_1734072939759954], AlgType(level 0-1-2):[fullmesh-ring-ring].] task information: [
there are(is) 1 abnormal device(s):
        Cluster Exception Location[IP/ID]:[172.17.0.5/0], Arrival Time:[Fri Dec 13 15:27:14 2024], ExceptionType:[Stuck Occurred], Possible Reason:1.Host process is stuck, 2.Device task is stuck
]
        Possible Cause: 1. An exception occurs during the execution on some NPUs in the cluster. As a result, collective communication operation failed.2. The execution speed on some NPU in the cluster is too slow to complete a communication operation within the timeout interval. (default 1800s, You can set the interval by using HCCL_EXEC_TIMEOUT.)3. The number of training samples of each NPU is inconsistent.4. Packet loss or other connectivity problems occur on the communication link.
        Solution: 1. If this error is reported on part of these ranks, check other ranks to see whether other errors have been reported earlier.2. If this error is reported for all ranks, check whether the error reporting time is consistent (the maximum difference must not exceed 1800s). If not, locate the cause or adjust the locate the cause or set the HCCL_EXEC_TIMEOUT environment variable to a larger value.3. Check whether the completion queue element (CQE) of the error exists in the plog(grep -rn 'error cqe'). If so, check the network connection status. (For details, see the TLS command and HCCN connectivity check examples.)4. Ensure that the number of training samples of each NPU is consistent. For details:https://www.hiascend.com/document
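
The messages above name three environment variables: `HCCL_CONNECT_TIMEOUT`, `HCCL_EXEC_TIMEOUT`, and `ASCEND_LAUNCH_BLOCKING`. A minimal sketch of setting them for a retry follows. The helper name `hccl_debug_env` is hypothetical, and the timeout values are placeholders rather than recommendations; the accepted ranges should be checked against the Ascend documentation. Raising the timeouts may only postpone the failure if rank 0 is genuinely stuck while saving the checkpoint.

```python
import os


def hccl_debug_env(connect_timeout_s=600, exec_timeout_s=3600,
                   blocking=True, base=None):
    """Build an environment dict with the variables named in the error log.

    connect_timeout_s and exec_timeout_s are placeholder values (assumptions),
    not tuned recommendations. `base` defaults to a copy of os.environ.
    """
    env = dict(os.environ if base is None else base)
    env["HCCL_CONNECT_TIMEOUT"] = str(connect_timeout_s)
    env["HCCL_EXEC_TIMEOUT"] = str(exec_timeout_s)
    if blocking:
        # Per the log message, this yields a more accurate stack trace.
        env["ASCEND_LAUNCH_BLOCKING"] = "1"
    return env


# Built on an empty base here so the printed result is deterministic.
env = hccl_debug_env(base={})
print(env)
```

The returned dict can be passed as `env=` to `subprocess.run` together with whatever launch command is already in use. Alternatively, the same three variables can be exported in the shell before starting training.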

Ascend/MindSpeed-MM