MindSpore / mindspore

[ST][MS][master][NET][transformer_dynamic/ASR-dynamic][pynative][910] Network training core dump

DONE
Bug-Report
Created on 2023-06-06 15:22

Describe the current behavior (Mandatory)

[transformer_dynamic][pynative][910/GPU] Network training fails and leaves residual processes behind, which causes subsequent test cases to fail.
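
As a harness-side mitigation for the leftover processes described above (not a fix for the crash itself), a test runner could launch each case in its own process group and kill that group when the case ends. The following is a minimal sketch under assumptions: the function name is hypothetical, it is POSIX-only, and it will not reach workers that a launcher such as mpirun moves into a different session.

```python
import os
import signal
import subprocess


def run_in_own_group(cmd, timeout=None):
    """Run a shell command in a new session and kill whatever it leaves behind."""
    # start_new_session=True makes the child the leader of a new process group,
    # so its process-group ID equals its PID.
    proc = subprocess.Popen(cmd, shell=True, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    finally:
        try:
            # Kill anything still alive in the group (for example, background workers).
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group is already empty
        proc.wait()  # make sure the shell itself is collected, even after a timeout


# Illustrative command: the shell exits at once but leaves a background sleep behind.
exit_code = run_in_own_group("sleep 30 & exit 3")
print(exit_code)
```

Killing the group in `finally` means the cleanup runs whether the case passes, fails, or times out, so a crashed run cannot hold devices or ports that the next case needs.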

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device /ascend/

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):

MindSpore version: r2.1_master_20230602121730_cf123905
Run package: HiAI/HISI_C30/20230518

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

Related testcase (Mandatory)

Test case repository path: solution_test/cases/02network/02nlp/transformer_dynamic/train/
Test case:
ms_transformer_dynamic_wmt_english_german_pynative_train_infer_epoch3_910_8p_0001

Steps to reproduce the issue (Mandatory)

  1. get code from MODEL_INTERNAL_ROOT
  2. cd /transformer_dynamic/
  3. set mode=pynative
  4. model_train = """cd {0};bash -x scripts/run_distribute_train_ascend.sh 8 {1} {2} {3} > {4} 2>&1 """.format(
    self.model_path, self.net_param_dict.get("EpochNum"), self.train_data, self.config_file, self.sh_train_log)
  5. The transformer_dynamic network should run normally in PyNative mode, with few warning logs
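
The command from step 4 can be sketched as a standalone helper. Only the script name, the device count of 8, and the argument order come from the step above; the function name and every path or file name in the usage line are placeholders, not values from the test case.

```python
import shlex


def build_train_command(model_path, epoch_num, train_data, config_file, log_file,
                        device_num=8):
    """Build the distributed-training shell command with each argument quoted."""
    script = "scripts/run_distribute_train_ascend.sh"
    args = " ".join(
        shlex.quote(str(a)) for a in (device_num, epoch_num, train_data, config_file)
    )
    # stdout and stderr both go to the log file, as in the original command.
    return "cd {0}; bash -x {1} {2} > {3} 2>&1".format(
        shlex.quote(model_path), script, args, shlex.quote(log_file)
    )


# Hypothetical values for illustration only.
cmd = build_train_command(
    "/path/to/transformer_dynamic", 3, "/path/to/data", "config.yaml", "train.log"
)
print(cmd)
```

Quoting each argument with `shlex.quote` keeps the command intact if a path contains spaces, which the plain `str.format` version in step 4 does not guard against.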

Describe the expected behavior (Mandatory)

The transformer_dynamic network runs normally in PyNative mode, with few warning logs.

Related log / screenshot (Mandatory)

  raise exp
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/tbe/common/repository_manager/route.py", line 58, in wrapper
    func(*args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/tbe/common/repository_manager/route.py", line 275, in task_distribute
    key, func_name, detail = resource_proxy[TASK_QUEUE].get()
  File "<string>", line 2, in get
  File "/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/managers.py", line 819, in _callmethod
    kind, result = conn.recv()
  File "/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 91 leaked semaphores to clean up at shutdown
  len(cache))
[ERROR] TBE(40275,python):2023-06-03-06:11:06.149.860 [../repository_manager/utils/repository_manager_log.py:30][log] [utils/common.py:100][repository_manager] The main process does not exist. We would kill multiprocess manager process: 39242.
[ERROR] TBE(40274,python):2023-06-03-06:11:06.440.555 [../repository_manager/utils/repository_manager_log.py:30][log] [route.py:60][repository_manager] Subprocess[task_distribute] raise error[]
/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 235 leaked semaphores to clean up at shutdown
  len(cache))
/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 235 leaked semaphores to clean up at shutdown
  len(cache))
/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 91 leaked semaphores to clean up at shutdown
  len(cache))
/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 235 leaked semaphores to clean up at shutdown
  len(cache))
/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 235 leaked semaphores to clean up at shutdown
  len(cache))
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 0 on node 10-90-66-65 exited on signal 6 (Aborted).
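
For triage, the key facts in a log like the one above (which rank aborted, on which signal, and how many semaphore-leak warnings were printed) can be extracted mechanically. This is a minimal sketch that assumes the message formats match the lines shown in this log; the function name and the returned dictionary layout are illustrative.

```python
import re

# Patterns modeled on the mpirun and semaphore_tracker lines in the log above.
MPIRUN_ABORT = re.compile(
    r"mpirun noticed that process rank (?P<rank>\d+) with PID (?P<pid>\d+) "
    r"on node (?P<node>\S+) exited on signal (?P<signal>\d+) \((?P<name>[^)]+)\)"
)
LEAKED = re.compile(r"There appear to be (\d+) leaked semaphores")


def summarize_log(text):
    """Return the abort details and semaphore-leak counts found in a training log."""
    abort = MPIRUN_ABORT.search(text)
    leaks = [int(n) for n in LEAKED.findall(text)]
    return {
        "abort": abort.groupdict() if abort else None,
        "leak_warnings": len(leaks),
        "max_leaked": max(leaks, default=0),
    }


sample = (
    "/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/semaphore_tracker.py:144: "
    "UserWarning: semaphore_tracker: There appear to be 91 leaked semaphores to clean up at shutdown\n"
    "/home/miniconda3/envs/ci/lib/python3.7/multiprocessing/semaphore_tracker.py:144: "
    "UserWarning: semaphore_tracker: There appear to be 235 leaked semaphores to clean up at shutdown\n"
    "mpirun noticed that process rank 7 with PID 0 on node 10-90-66-65 "
    "exited on signal 6 (Aborted).\n"
)
summary = summarize_log(sample)
print(summary)
```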

Special notes for this issue (Optional)

Assign to Luo Chao (罗超).

Comments (7)

sunjiawei999 created the Bug-Report
sunjiawei999 added the label attr/function
sunjiawei999 added the label kind/bug
sunjiawei999 added the label stage/func-debug
sunjiawei999 added the label v2.1.0
sunjiawei999 added the label sig/pynative

Please assign maintainer to check this issue.
@sunjiawei999

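Please add labels (comp or sig), also you can visit https://gitee.com/mindspore/community/blob/master/sigs/dx/docs/labels.md to find more.
So that the code can be reviewed as soon as possible, please add a component (comp) or special interest group (sig) label to your Pull Request; a labeled PR is pushed directly to the responsible reviewer.
For example, if you are submitting code for the data component, you can comment:
//comp/data
You can also invite the data SIG to review the code by writing:
//sig/data
You can additionally mark the type of the PR, for example a bug fix or a feature request:
//kind/bug or //kind/feature
Congratulations, you have learned how to add labels with commands. Now add the labels in the comments below!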

sunjiawei999 edited the description
sunjiawei999 edited the description
zhongjicheng changed the assignee from zhongjicheng to luochao60
luochao60 changed the task status from TODO to WIP

Root cause analysis: a race condition introduced by multithreading
Solution: add locking
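
The comment above gives no details of the actual fix. As a generic illustration of the pattern it names (shared state raced on by several threads, then serialized with a lock), here is a minimal Python sketch. The class and function names are hypothetical and unrelated to MindSpore's code.

```python
import threading


class SharedCounter:
    """Hypothetical shared resource updated from several threads."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # "+=" is a read-modify-write sequence; holding the lock keeps another
        # thread from interleaving between the read and the write.
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def run_workers(workers=8, iterations=10_000):
    """Start `workers` threads that each increment the counter `iterations` times."""
    counter = SharedCounter()

    def work():
        for _ in range(iterations):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


# With every update made under the lock, the total equals workers * iterations.
print(run_workers())
```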

luochao60 added the label rct/newfeature
luochao60 added the label rca/others
luochao60 added the label ctl/solutiontest
luochao60 changed the milestone from B-SIG-PYNATIVE to B-SolutionTest
luochao60 changed the task status from WIP to VALIDATION
luochao60 added collaborator luochao60
luochao60 changed the assignee from luochao60 to sunjiawei999

This network is blocked by this problem; routing everything to baihuawei.
[ST][MS][2.1][asr-dynamic][pynative][ascend 8p] Network training error: DropOutGenMask operator execution failed
https://e.gitee.com/mind_spore/dashboard?issue=I7F2BX

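sunjiawei999 changed the assignee from sunjiawei999 to baihuawei
sunjiawei999 changed the task status from VALIDATION to TODO
baihuawei added collaborator baihuawei
baihuawei changed the assignee from baihuawei to nomindcarry
wuweikang changed the milestone from B-SIG-Parallel to B-SIG-PYNATIVE
nomindcarry changed the task status from TODO to WIP
i-robot added the label foruda
nomindcarry changed the task status from WIP to VALIDATION
nomindcarry added collaborator nomindcarry
nomindcarry changed the assignee from nomindcarry to sunjiawei999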

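Regression version: r2.1_master_20230712_5cb3ff66

Build date: 20230713

Regression steps: follow the reproduction steps in this issue

Basic functionality: the problem is resolved

Test conclusion: regression passed

Regression tester: Sun Jiawei (孙佳伟)

Regression date: 20230713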

sunjiawei999 changed the task status from VALIDATION to DONE
