name | about | labels |
---|---|---|
Bug Report | Use this template for reporting a bug | kind/bug |
[transformer_dynamic][pynative][910/GPU] Network training fails and leaves residual processes, causing subsequent test cases to fail
Hardware environment (Ascend/GPU/CPU); please delete the backends not involved:
/device /ascend/
MindSpore version: r2.1_master_20230602121730_cf123905
Run package: HiAI/HISI_C30/20230518
Execution mode (PyNative/Graph); please delete the modes not involved:
/mode graph
Test case repository path: solution_test/cases/02network/02nlp/transformer_dynamic/train/
Test case: ms_transformer_dynamic_wmt_english_german_pynative_train_infer_epoch3_910_8p_0001
Expected behavior: the transformer_dynamic network runs normally in PyNative mode, with few warning logs.
raise exp
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/tbe/common/repository_manager/route.py", line 58, in wrapper
func(*args, **kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/tbe/common/repository_manager/route.py", line 275, in task_distribute
key, func_name, detail = resource_proxy[TASK_QUEUE].get()
File "<string>", line 2, in get
File "/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/managers.py", line 819, in _callmethod
kind, result = conn.recv()
File "/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 91 leaked semaphores to clean up at shutdown
len(cache))
[ERROR] TBE(40275,python):2023-06-03-06:11:06.149.860 [../repository_manager/utils/repository_manager_log.py:30][log] [utils/common.py:100][repository_manager] The main process does not exist. We would kill multiprocess manager process: 39242.
[ERROR] TBE(40274,python):2023-06-03-06:11:06.440.555 [../repository_manager/utils/repository_manager_log.py:30][log] [route.py:60][repository_manager] Subprocess[task_distribute] raise error[]
/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 235 leaked semaphores to clean up at shutdown
len(cache))
/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 235 leaked semaphores to clean up at shutdown
len(cache))
/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 91 leaked semaphores to clean up at shutdown
len(cache))
/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 235 leaked semaphores to clean up at shutdown
len(cache))
/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 235 leaked semaphores to clean up at shutdown
len(cache))
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 0 on node 10-90-66-65 exited on signal 6 (Aborted).
Assigning to Luo Chao.
Please assign a maintainer to check this issue.
@sunjiawei999
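Please add labels (comp or sig). You can find more labels at https://gitee.com/mindspore/community/blob/master/sigs/dx/docs/labels.md
To get your code reviewed as soon as possible, add a component (comp) or special interest group (sig) label to your Pull Request. Labeled PRs are pushed directly to the responsible owner for review.
For example, if you are submitting code for the data component, you can comment:
//comp/data
You can also invite the data SIG to review the code by writing:
//sig/data
You can also mark the type of the PR, for example as a bug fix or a feature request:
//kind/bug or //kind/feature
Congratulations, you now know how to add labels with commands. Go ahead and add a label in the comments below!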
Root cause analysis: a race condition introduced by multithreading.
Solution: the fix involves adding locks.
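For readers unfamiliar with this class of bug, the general pattern behind the root cause and fix noted above can be sketched in Python. This is only an illustration under assumptions: the thread does not show the actual MindSpore change, and `SafeCounter` and `run_workers` are hypothetical names invented for this sketch. It shows shared state that several threads update, with a lock making each read-modify-write step atomic.

```python
import threading


class SafeCounter:
    """Hypothetical shared state; every access is guarded by one lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, the read-modify-write below could interleave
        # between threads and lose updates (a race condition).
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def run_workers(counter, n_threads=8, n_increments=10_000):
    """Increment the counter from several threads and return the total."""

    def worker():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(run_workers(SafeCounter()))
```

With the lock in place, the final count always equals `n_threads * n_increments`. The real fix may guard different resources and may well live in C++ rather than Python.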
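This network is blocked by this issue; everything is being assigned to baihuawei.
[ST][MS][2.1][asr-dynamic][pynative][ascend 8p] Network training reports an error: DropOutGenMask operator execution failed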
https://e.gitee.com/mind_spore/dashboard?issue=I7F2BX
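Regression version: r2.1_master_20230712_5cb3ff66
Build time: 20230713
Regression steps: follow the reproduction steps in the issue
Basic functionality: the problem is resolved
Test conclusion: regression passed
Regression tester: Sun Jiawei
Regression date: 20230713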