LoRA fine-tuning of DeepSeek fails (RuntimeError: copy_d2d_baseformat_opapi:torch_npu/csrc/aten/ops/op_api/CopyKernelOpApi.cpp:164 NPU error, error code is 507015)

1. Symptom (with error log context):
(DTS) [ma-user DTS-SQL]$python schema_linking_finetuning.py 
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:03<00:00, 31.66s/it]
'NoneType' object has no attribute 'cadam32bit_grad_fp32'
Detected kernel version 4.19.90, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[2024-10-09 10:31:28,134] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to npu (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.18.3
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
  0%|                                                                                                                                                                                                                 | 0/795 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/data/tyc/DTS-SQL/schema_linking_finetuning.py", line 161, in <module>
    trainer.train()
  File "/home/ma-user/anaconda3/envs/DTS/lib/python3.9/site-packages/trl/trainer/sft_trainer.py", line 434, in train
    output = super().train(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/DTS/lib/python3.9/site-packages/transformers/trainer.py", line 2052, in train
    return inner_training_loop(
  File "/home/ma-user/anaconda3/envs/DTS/lib/python3.9/site-packages/transformers/trainer.py", line 2434, in _inner_training_loop
    _grad_norm = self.accelerator.clip_grad_norm_(
  File "/home/ma-user/anaconda3/envs/DTS/lib/python3.9/site-packages/accelerate/accelerator.py", line 2347, in clip_grad_norm_
    return torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=norm_type)
  File "/home/ma-user/anaconda3/envs/DTS/lib/python3.9/site-packages/torch/nn/utils/clip_grad.py", line 61, in clip_grad_norm_
    total_norm = torch.linalg.vector_norm(torch.stack([norm.to(first_device) for norm in norms]), norm_type)
  File "/home/ma-user/anaconda3/envs/DTS/lib/python3.9/site-packages/torch/nn/utils/clip_grad.py", line 61, in <listcomp>
    total_norm = torch.linalg.vector_norm(torch.stack([norm.to(first_device) for norm in norms]), norm_type)
RuntimeError: copy_d2d_baseformat_opapi:torch_npu/csrc/aten/ops/op_api/CopyKernelOpApi.cpp:164 NPU error, error code is 507015
[ERROR] 2024-10-09-10:32:25 (PID:113114, Device:2, RankID:-1) ERR00100 PTA call acl api failed
[Error]: The aicore execution is abnormal. 
        Rectify the fault based on the error information in the ascend log.
EZ9999: Inner Error!
EZ9999: [PID: 113114] 2024-10-09-10:32:25.284.271 The error from device(chipId:2, dieId:0), serial number is 8, there is an fftsplus aivector error exception, core id is 30, error code = 0, dump info: pc start: 0x124c6be093dc, current: 0x124c6be098d8, vec error info: 0x730d501e08, mte error info: 0xf730cc096, ifu error info: 0x1e7fff104f000, ccu error info: 0xbb97a81000000000, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124380240080.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1218]
        TraceBack (most recent call last):
       The extend info: errcode:(0, 0x8000, 0) errorStr: When the D-cache reads and writes data to the UB, the response value returned by the bus is a non-zero value. fixp_error0 info: 0x30cc096, fixp_error1 info: 0xf fsmId:0, tslot:7, thread:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1230]
       Kernel task happen error, retCode=0x26, [aicore exception].[FUNC:PreCheckTaskErr][FILE:davinic_kernel_task.cc][LINE:1221]
       AICORE Kernel task happen error, retCode=0x26.[FUNC:GetError][FILE:stream.cc][LINE:1062]
       Aicore kernel execute failed, device_id=2, stream_id=2, report_stream_id=2, task_id=25726, flip_num=0, fault kernel_name=LpNormV2_f1a7bd09d5043bf610882013edb3df3e_high_performance_1100400_mix_aiv, fault kernel info ext=none, program id=235, hash=4067327519633039264.[FUNC:GetError][FILE:stream.cc][LINE:1062]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1062]
       rtStreamSynchronizeWithTimeout execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
       synchronize stream failed, runtime result = 507015[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]

wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /data/tyc/DTS-SQL/wandb/offline-run-20241009_103151-ggnwxy9b
wandb: Find logs at: wandb/offline-run-20241009_103151-ggnwxy9b/logs
[W NPUStream.cpp:382] Warning: NPU warning, error code is 507014[Error]: 
[Error]: The aicore execution times out. 
        Rectify the fault based on the error information in the ascend log.
EZ9999: Inner Error!
EZ9999: [PID: 113114] 2024-10-09-10:41:49.828.441 The error from device(chipId:3, dieId:0), serial number is 7, there is an fftsplus aivector error exception, core id is 0, error code = 0, dump info: pc start: 0x124cabc093dc, current: 0x124cabc098c8, vec error info: 0xc10a5ad222, mte error info: 0x180f1f09b9, ifu error info: 0x1dfeb0fa508c0, ccu error info: 0x851c3b0d101f8010, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x1244c0240080.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1218]
        TraceBack (most recent call last):
       The extend info: errcode:(0, 0, 0) errorStr: timeout or trap error. fixp_error0 info: 0xf1f09b9, fixp_error1 info: 0x18 fsmId:0, tslot:7, thread:0, ctxid:0, blk:8, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1230]
       Kernel task happen error, retCode=0x25, [aicore timeout].[FUNC:PreCheckTaskErr][FILE:davinic_kernel_task.cc][LINE:1221]
       AICORE Kernel task happen error, retCode=0x25.[FUNC:GetError][FILE:stream.cc][LINE:1062]
       Aicore kernel execute failed, device_id=3, stream_id=2, report_stream_id=2, task_id=25714, flip_num=0, fault kernel_name=LpNormV2_f1a7bd09d5043bf610882013edb3df3e_high_performance_1100400_mix_aiv, fault kernel info ext=none, program id=235, hash=4067327519633039264.[FUNC:GetError][FILE:stream.cc][LINE:1062]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1062]
       rtDeviceSynchronize execute failed, reason=[aicore timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
       wait for compute device to finish failed, runtime result = 507014.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
 (function npuSynchronizeUsedDevices)
[W NPUStream.cpp:365] Warning: NPU warning, error code is 507014[Error]: 
[Error]: The aicore execution times out. 
        Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
        rtDeviceSynchronize execute failed, reason=[aicore timeout][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: [PID: 113114] 2024-10-09-10:41:50.269.791 wait for compute device to finish failed, runtime result = 507014.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
        TraceBack (most recent call last):
 (function npuSynchronizeDevice)
[... the same 507014 "aicore execution times out" warning block (function npuSynchronizeDevice) repeats 7 more times, timestamps 2024-10-09-10:41:50.505.915 through 10:41:52.300.898 ...]
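
To pull the key fields out of a log like the one above (which device failed, which kernel, which runtime result code), a small helper along these lines may be handy. It is only a sketch: the function name `summarize_ascend_errors` and the regular expressions are assumptions inferred from the lines shown in this report, not an official Ascend tool or a documented log format.

```python
import re
from collections import Counter

# Patterns inferred from the log lines in this report (an assumption,
# not an official Ascend log format specification).
KERNEL_RE = re.compile(r"device_id=(\d+),.*?fault kernel_name=([^,\s]+)")
RESULT_RE = re.compile(r"runtime result = (\d+)")


def summarize_ascend_errors(log_text):
    """Count failing (device_id, kernel_name) pairs and runtime result codes."""
    kernels = Counter()
    results = Counter()
    for line in log_text.splitlines():
        match = KERNEL_RE.search(line)
        if match:
            kernels[(int(match.group(1)), match.group(2))] += 1
        for code in RESULT_RE.findall(line):
            results[int(code)] += 1
    return kernels, results


if __name__ == "__main__":
    # Two lines copied (shortened) from the log in this report.
    sample = (
        "Aicore kernel execute failed, device_id=2, stream_id=2, "
        "report_stream_id=2, task_id=25726, flip_num=0, fault "
        "kernel_name=LpNormV2_f1a7bd09d5043bf610882013edb3df3e_high_performance_1100400_mix_aiv, "
        "fault kernel info ext=none\n"
        "synchronize stream failed, runtime result = 507015[FUNC:ReportCallError]\n"
    )
    kernels, results = summarize_ascend_errors(sample)
    print(kernels)
    print(results)
```

Run over the full log, this should show the same `LpNormV2` kernel failing on both device 2 (507015) and device 3 (507014), which matches the `clip_grad_norm_` frame in the Python traceback.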

2. Software versions:
-- CANN version (e.g., CANN 3.0.x, 5.x.x):
     8.0.T37
-- TensorFlow/PyTorch/MindSpore version:
    pytorch=2.1.0, torch_npu=2.1.0.post3
-- Python version (e.g., Python 3.7.5):
    Python 3.9.19
-- MindStudio version (e.g., MindStudio 2.0.0 (beta3)):
-- OS version (e.g., Ubuntu 18.04):
    EulerOS 2.0 (SP10)

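3. Steps to reproduce:
LoRA fine-tuning of DeepSeek. Loading the model with device_map="auto" triggers the error above; without auto, placing the whole model on npu0 runs out of memory.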
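
One thing that might help narrow this down is limiting how much of each NPU `device_map="auto"` is allowed to fill, using the `max_memory` argument that `from_pretrained` accepts together with `device_map`. The helper below only builds that dictionary. Its name, the device count, and the memory figures are placeholders chosen for illustration, and whether this avoids the 507015 error has not been verified.

```python
def build_max_memory(num_devices, per_device_gib, cpu_gib=None):
    """Build a max_memory mapping: device index -> size string such as '24GiB'."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    max_memory = {idx: f"{per_device_gib}GiB" for idx in range(num_devices)}
    if cpu_gib is not None:
        # Optional budget for CPU offload.
        max_memory["cpu"] = f"{cpu_gib}GiB"
    return max_memory


if __name__ == "__main__":
    # Hypothetical figures: 4 NPUs with a 24 GiB budget each.
    print(build_max_memory(4, 24))
    # Intended use (not executed here; needs transformers and the checkpoint,
    # and model_path is a placeholder):
    # model = AutoModelForCausalLM.from_pretrained(
    #     model_path, device_map="auto", max_memory=build_max_memory(4, 24))
```

Passing a smaller device count keeps the model on fewer NPUs, which could show whether the failure depends on the cross-device copy seen in the traceback.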

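4. Log information:
xxxx
Please collect logs as described below according to your environment. If the issue involves operator development, it is recommended to also provide logs from UT/ST tests and single-operator integration tests.

How to provide logs:
Package the logs and upload them as an attachment. If the logs exceed the attachment size limit, upload them to an external file-sharing service and provide the link.

For how to obtain the logs, see the wiki:
https://gitee.com/ascend/modelzoo/wikis/%E5%A6%82%E4%BD%95%E8%8E%B7%E5%8F%96%E6%97%A5%E5%BF%97%E5%92%8C%E8%AE%A1%E7%AE%97%E5%9B%BE?sort_id=4097825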
