
Ascend/pytorch

Weight conversion fails with "Cannot re-initialize NPU in forked subprocess. To use NPU with multiprocessing, you must use the 'spawn' start method"

DONE
Bug
Created on 2024-07-17 15:30

torch:2.3.1
torch_npu:2.3.1
cann:8.0rc2
modellink-master

Traceback (most recent call last):
  File "/usr/local/python3.8.19/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/python3.8.19/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/job/code/tools/checkpoint/saver_megatron.py", line 553, in save_model_checkpoint
    models = get_models(args.target_tensor_parallel_size, md.params_dtype, True, post_process)
  File "/job/code/tools/checkpoint/saver_megatron.py", line 515, in get_models
    models = [model_provider(pre_process, post_process).to(dtype) for _ in range(count)]
  File "/job/code/tools/checkpoint/saver_megatron.py", line 515, in <listcomp>
    models = [model_provider(pre_process, post_process).to(dtype) for _ in range(count)]
  File "/job/code/pretrain_gpt.py", line 87, in model_provider
    model = megatron.legacy.model.GPTModel(
  File "/job/code/modellink/model/gpt_model.py", line 43, in __init__
    self.language_model, self._language_model_key = get_language_model(
  File "/job/code/megatron/legacy/model/language_model.py", line 67, in get_language_model
    language_model = TransformerLanguageModel(
  File "/job/code/modellink/model/language_model.py", line 97, in transformer_language_model_init
    self.rotary_pos_emb = RotaryEmbedding(
  File "/tmp/MindSpeed/mindspeed/core/fusions/rotary_pos_embedding.py", line 30, in wrapper
    fn(self, *args, **kwargs)
  File "/job/code/megatron/core/models/common/embeddings/rotary_pos_embedding.py", line 77, in __init__
    torch.arange(0, dim, 2, dtype=torch.float32, device=torch.cuda.current_device())
  File "/usr/local/python3.8.19/lib/python3.8/site-packages/torch_npu/npu/utils.py", line 59, in current_device
    torch_npu.npu._lazy_init()
  File "/usr/local/python3.8.19/lib/python3.8/site-packages/torch_npu/npu/__init__.py", line 210, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize NPU in forked subprocess. To use NPU with multiprocessing, you must use the 'spawn' start method

Comments (1)

xlianhao created the bug 11 months ago

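The error message already states the cause: the NPU cannot be re-initialized inside a forked process. Start the subprocess with the 'spawn' start method instead.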
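As a rough illustration of that suggestion, the sketch below starts a worker through an explicit 'spawn' context from Python's standard `multiprocessing` module. The names `save_worker` and `launch_saver` are hypothetical stand-ins for the saver entry point seen in the traceback (`save_model_checkpoint`); how ModelLink actually launches that process is not shown in this issue, so treat this as a pattern rather than a patch.

```python
import multiprocessing as mp
import os


def save_worker(rank):
    # Hypothetical stand-in for the saver's save_model_checkpoint entry point.
    # With torch_npu installed, the first NPU call would happen here, inside
    # a freshly started interpreter rather than a forked copy of the parent.
    print(f"worker {rank} running in pid {os.getpid()}")


def launch_saver(rank=0):
    # Ask for an explicit 'spawn' context instead of relying on the platform
    # default, which is 'fork' on Linux for Python 3.8.
    ctx = mp.get_context("spawn")
    proc = ctx.Process(target=save_worker, args=(rank,))
    proc.start()
    proc.join()
    return proc.exitcode


if __name__ == "__main__":
    # The guard matters with 'spawn': the child re-imports this module,
    # and unguarded top-level code would run again in the child.
    print("exit code:", launch_saver())
```

An alternative is to call `multiprocessing.set_start_method("spawn")` once at program start, before any process is created. Note that with 'spawn' the target function and its arguments must be picklable, since they are sent to a new interpreter instead of being inherited through fork.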

huangyunlong changed the task status from TODO to DONE 11 months ago
