1. Problem description (with error log context):
Training is run inside Docker on an on-premises D910b4 machine.
Single-card, single-process training works fine:
ASCEND_RT_VISIBLE_DEVICES=0 RANK=0 LOCAL_RANK=0 WORLD_SIZE=1 python torchrun_main.py --opt configs/options/segment_mt_npu.yml
But as soon as I switch to multiple cards, it fails with an error:
ASCEND_RT_VISIBLE_DEVICES=6,7 python -m torch.distributed.launch --nproc_per_node 2 torchrun_main.py --opt configs/options/segment_mt_npu.yml
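(Side note: torch.distributed.launch is deprecated in this PyTorch version, so for reference the equivalent torchrun invocation is shown below; it assumes the script reads LOCAL_RANK from the environment, as the deprecation warning in the log suggests, and is listed only for reference, not as a fix:)
ASCEND_RT_VISIBLE_DEVICES=6,7 torchrun --nproc_per_node 2 torchrun_main.py --opt configs/options/segment_mt_npu.yml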
Error message:
Traceback (most recent call last):
File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/torchrun_main.py", line 53, in <module>
main(project_path)
File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/torchrun_main.py", line 41, in main
build_train(opt)(None, project_path)
File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/core_code/core/segment_train.py", line 101, in train_segment
model = build_model(opt)
File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/third_party/basicsr/models/__init__.py", line 27, in build_model
model = MODEL_REGISTRY.get(opt['model_type'])(opt)
File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/models/seg_model.py", line 107, in __init__
self.net_g = self.model_to_device(self.net_g)
File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/third_party/basicsr/models/base_model.py", line 153, in model_to_device
net = DistributedDataParallel(
File "/usr/local/python3/lib/python3.9/site-packages/torch_npu/contrib/transfer_to_npu.py", line 67, in decorated
return fn(*args, **kwargs)
File "/usr/local/python3/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 641, in __init__
dist._verify_params_across_processes(self.process_group, parameters)
RuntimeError: [ERROR] HCCL error in: torch_npu/csrc/distributed/ProcessGroupHCCL.cpp:81
[ERROR] 2000-02-20-12:13:33 (PID:2887, Device:0, RankID:0) ERR02200 DIST call hccl api failed.
EJ0001: Failed to initialize the HCCP process. Reason: Maybe the last training process is running.
Solution: Wait for 10s after killing the last training process and try again.
TraceBack (most recent call last):
tsd client wait response fail, device response code[1]. unknown device error.[FUNC:WaitRsp][FILE:process_mode_manager.cpp][LINE:283]
2. Software versions:
Host driver: Ascend-hdk-910b-npu-driver_23.0.3_linux-aarch64.run
Firmware: Ascend-hdk-910b-npu-firmware_7.1.0.5.220.run
Version=23.0.3
ascendhal_version=7.35.19
aicpu_version=1.0
tdt_version=1.0
log_version=1.0
prof_version=2.0
dvppkernels_version=1.1
tsfw_version=1.0
Innerversion=V100R001C15SPC005B220
compatible_version=[V100R001C30],[V100R001C13],[V100R001C15]
compatible_version_fw=[7.0.0,7.1.99]
package_version=23.0.3
-- CANN version: 7.2.T7.0.B121:8.0.T5
-- PyTorch version: 1.11.0
-- torch-npu: tried both 1.11.0.post11 and 1.11.0.post12; the problem is identical
-- Python version: 3.9.18
-- MindStudio version: not applicable
-- OS version: eulerosv2r10 (Linux 26e28b339282 4.19.90-vhulk2209.2.0.h1327.eulerosv2r10.aarch64)
3. Test steps:
Docker launch command:
docker run -itd --privileged --pid=host --cap-add=SYS_PTRACE --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 --device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc -v /usr/local/dcmi:/usr/local/dcmi -v /etc/ascend_install.info:/etc/ascend_install.info -v /sys/fs/cgroup:/sys/fs/cgroup:ro -v /usr/local/Ascend/driver:/usr/local/Ascend/driver -v /sbin/dmidecode:/sbin/dmidecode -v /dev/mem:/dev/mem -v /usr/bin/hostname:/usr/bin/hostname -v /usr/bin/hccn_tool:/usr/bin/hccn_tool -v /opt/mnt1/:/opt/mnt1/ -p 3006:22 --name test_env npu_train:v0.1 /bin/bash
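Before launching, the device state inside the container can be checked roughly as follows (a minimal sketch; the exact npu-smi output depends on the driver/firmware version, and test_env is the container name from the command above):
docker exec -it test_env /bin/bash
ls /dev/davinci*     # confirm davinci6 and davinci7 are visible inside the container
npu-smi info         # check NPU health and whether any process still occupies the devices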
4. Log information:
See the attachment for details.
Full error output:
ASCEND_RT_VISIBLE_DEVICES=6,7 python -m torch.distributed.launch --nproc_per_node 2 torchrun_main.py --opt configs/options/segment_mt_npu.yml
/usr/local/python3/lib/python3.9/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/usr/local/python3/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
/usr/local/python3/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
/usr/local/python3/lib/python3.9/site-packages/torchvision/transforms/functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be **removed in 0.17**. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
warnings.warn(
/usr/local/python3/lib/python3.9/site-packages/torchvision/transforms/functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be **removed in 0.17**. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
warnings.warn(
/usr/local/python3/lib/python3.9/site-packages/torch_npu/contrib/transfer_to_npu.py:163: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
2000-02-20 12:13:18,907 INFO:
666666
Version Information:
PyTorch: 1.11.0
TorchVision: 0.15.0
/usr/local/python3/lib/python3.9/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /root/pytorch/aten/src/ATen/native/TensorShape.cpp:2227.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
/usr/local/python3/lib/python3.9/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /root/pytorch/aten/src/ATen/native/TensorShape.cpp:2227.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
2000-02-20 12:13:30,241 INFO: Network [SwinUNetOneOut] is created.
2000-02-20 12:13:30,241 INFO: Loading SwinUNetOneOut model from ../pretrained_models/seg_mt_SwinUNetOneOut_iter0_random_init.pth.
Traceback (most recent call last):
File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/torchrun_main.py", line 53, in <module>
main(project_path)
File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/torchrun_main.py", line 41, in main
build_train(opt)(None, project_path)
File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/core_code/core/segment_train.py", line 101, in train_segment
model = build_model(opt)
File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/third_party/basicsr/models/__init__.py", line 27, in build_model
model = MODEL_REGISTRY.get(opt['model_type'])(opt)
File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/models/seg_model.py", line 107, in __init__
self.net_g = self.model_to_device(self.net_g)
File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/third_party/basicsr/models/base_model.py", line 153, in model_to_device
net = DistributedDataParallel(
File "/usr/local/python3/lib/python3.9/site-packages/torch_npu/contrib/transfer_to_npu.py", line 67, in decorated
return fn(*args, **kwargs)
File "/usr/local/python3/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 641, in __init__
dist._verify_params_across_processes(self.process_group, parameters)
RuntimeError: [ERROR] HCCL error in: torch_npu/csrc/distributed/ProcessGroupHCCL.cpp:81
[ERROR] 2000-02-20-12:13:33 (PID:2887, Device:0, RankID:0) ERR02200 DIST call hccl api failed.
EJ0001: Failed to initialize the HCCP process. Reason: Maybe the last training process is running.
Solution: Wait for 10s after killing the last training process and try again.
TraceBack (most recent call last):
tsd client wait response fail, device response code[1]. unknown device error.[FUNC:WaitRsp][FILE:process_mode_manager.cpp][LINE:283]
Traceback (most recent call last):
File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/torchrun_main.py", line 53, in <module>
main(project_path)
File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/torchrun_main.py", line 41, in main
build_train(opt)(None, project_path)
File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/core_code/core/segment_train.py", line 101, in train_segment
model = build_model(opt)
File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/third_party/basicsr/models/__init__.py", line 27, in build_model
model = MODEL_REGISTRY.get(opt['model_type'])(opt)
File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/models/seg_model.py", line 107, in __init__
self.net_g = self.model_to_device(self.net_g)
File "/opt/mnt1/ql/code/202405_NPU_train/Swin_s/0506_tuner_with_npu_ql/third_party/basicsr/models/base_model.py", line 153, in model_to_device
net = DistributedDataParallel(
File "/usr/local/python3/lib/python3.9/site-packages/torch_npu/contrib/transfer_to_npu.py", line 67, in decorated
return fn(*args, **kwargs)
File "/usr/local/python3/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 641, in __init__
dist._verify_params_across_processes(self.process_group, parameters)
RuntimeError: [ERROR] HCCL error in: torch_npu/csrc/distributed/ProcessGroupHCCL.cpp:81
[ERROR] 2000-02-20-12:13:33 (PID:2888, Device:1, RankID:1) ERR02200 DIST call hccl api failed.
EJ0001: Failed to initialize the HCCP process. Reason: Maybe the last training process is running.
Solution: Wait for 10s after killing the last training process and try again.
TraceBack (most recent call last):
tsd client wait response fail, device response code[1]. unknown device error.[FUNC:WaitRsp][FILE:process_mode_manager.cpp][LINE:283]
Fail to get sq reg virtual addr, deviceId=7, sqId=4.[FUNC:Setup][FILE:stream.cc][LINE:1102]
stream setup failed, retCode=0x7020010.[FUNC:SyncGetDevMsg][FILE:api_impl.cc][LINE:4608]
Sync get device msg failed, retCode=0x7020010.[FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:4669]
rtGetDevMsg execute failed, reason=[driver error:internal error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2887 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 1 (pid: 2888) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/python3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/python3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/python3/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/usr/local/python3/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/usr/local/python3/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/usr/local/python3/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/usr/local/python3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/python3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
torchrun_main.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2000-02-20_12:13:40
host : d16eae4371b2
rank : 1 (local_rank: 1)
exitcode : -11 (pid: 2888)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 2888
======================================================
Additional note: I tried killing all leftover processes and waiting a while before running again.
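A rough cleanup-and-retry sequence following the EJ0001 hint looks like the following (a sketch only; the pkill pattern assumes the leftover workers can be matched by the script name torchrun_main.py):
pkill -9 -f torchrun_main.py     # kill any leftover training workers
sleep 15                         # wait longer than the 10s suggested in the EJ0001 message
npu-smi info                     # confirm no stale process still occupies devices 6/7
ASCEND_RT_VISIBLE_DEVICES=6,7 python -m torch.distributed.launch --nproc_per_node 2 torchrun_main.py --opt configs/options/segment_mt_npu.yml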