[MindSpore][Ascend][Class C] BGCF network model on r1.3 reaches a throughput of only 31579 in the arm+ubuntu training environment, below the Class B performance spec of 36825

DONE
Bug-Report
Opened this issue 2022-01-13 15:36

- Software environment:
-- Source version: --
-- Python version (e.g. Python 3.7.5): Python 3.7.5
-- OS platform and distribution (e.g. Linux Ubuntu 18.04): Linux Ubuntu 18.04 + aarch64
-- MindSpore version: mindspore-ascend 1.3.0

## Steps to reproduce
Run 1p (single-device) training of the network model by following the instructions in readme.md.

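## Symptoms (with error log context):
Training issue:
The Class B performance spec is guarded at per step time: 0.135779; converted to throughput, batch_size/time = 5000/0.135779 ≈ 36825.
Training performance of the model in the arm+ubuntu environment does not meet the spec; the converted result is batch_size×12/Average(cost) = 5000×12/1.9 ≈ 31579.
The training log is as follows: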

{'dataset': 'choose which dataset', 'datapath': 'minddata path', 'Ks': 'top K', 'workers': 'number of process to generate data', 'ckptpath': 'checkpoint path', 'epsilon': 'optimizer parameter', 'learning_rate': 'learning rate', 'l2': 'l2 coefficient', 'activation': "activation function, choices in ['relu', 'tanh'].", 'neighbor_dropout': 'dropout ratio for different aggregation layer', 'log_name': 'log name', 'num_epoch': 'epoch sizes for training', 'input_dim': 'user and item embedding dimension, choices in [64, 128]', 'batch_pairs': 'batch size', 'eval_interval': 'evaluation interval', 'num_neg': 'negative sampling rate ', 'raw_neighs': 'num of sampling neighbors in raw graph', 'gnew_neighs': 'num of sampling neighbors in sample graph', 'embedded_dimension': 'output embedding dim', 'dist_reg': 'distance loss coefficient', 'device_target': "device target, choices in ['Ascend', GPU]", 'device_id': 'Device id', 'ckpt_file': 'Checkpoint file path.', 'file_name': 'output file name.', 'file_format': "file format, choices in ['AIR', 'ONNX', 'MINDIR']", 'row_neighs': 'num of sampling neighbors in raw graph'}
[WARNING] ME(109451:281473040572432,_GeneratorWorkerMp-1):2022-01-12-06:24:46.311.233 [mindspore/dataset/engine/queue.py:106] Using shared memory queue, but rowsize is larger than allocated memory max_rowsize 6291456 current rowwize 13200000
[WARNING] ME(109455:281473040572432,_GeneratorWorkerMp-5):2022-01-12-06:24:46.312.081 [mindspore/dataset/engine/queue.py:106] Using shared memory queue, but rowsize is larger than allocated memory max_rowsize 6291456 current rowwize 13200000
[WARNING] ME(109456:281473040572432,_GeneratorWorkerMp-6):2022-01-12-06:24:46.315.739 [mindspore/dataset/engine/queue.py:106] Using shared memory queue, but rowsize is larger than allocated memory max_rowsize 6291456 current rowwize 13200000
[WARNING] ME(109452:281473040572432,_GeneratorWorkerMp-2):2022-01-12-06:24:46.315.900 [mindspore/dataset/engine/queue.py:106] Using shared memory queue, but rowsize is larger than allocated memory max_rowsize 6291456 current rowwize 13200000
[WARNING] ME(109454:281473040572432,_GeneratorWorkerMp-4):2022-01-12-06:24:46.316.132 [mindspore/dataset/engine/queue.py:106] Using shared memory queue, but rowsize is larger than allocated memory max_rowsize 6291456 current rowwize 13200000
[WARNING] ME(109458:281473040572432,_GeneratorWorkerMp-8):2022-01-12-06:24:46.317.834 [mindspore/dataset/engine/queue.py:106] Using shared memory queue, but rowsize is larger than allocated memory max_rowsize 6291456 current rowwize 13200000
[WARNING] ME(109457:281473040572432,_GeneratorWorkerMp-7):2022-01-12-06:24:46.318.209 [mindspore/dataset/engine/queue.py:106] Using shared memory queue, but rowsize is larger than allocated memory max_rowsize 6291456 current rowwize 13200000
[WARNING] ME(109453:281473040572432,_GeneratorWorkerMp-3):2022-01-12-06:24:46.318.207 [mindspore/dataset/engine/queue.py:106] Using shared memory queue, but rowsize is larger than allocated memory max_rowsize 6291456 current rowwize 13200000
[WARNING] DEVICE(109290,fffe377fe1f0,python):2022-01-12-06:24:59.617.830 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:284] TagRaiseReduce] node:[DropoutGenMask]reduce precision from int64 to int32
[WARNING] DEVICE(109290,fffe377fe1f0,python):2022-01-12-06:24:59.618.135 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:284] TagRaiseReduce] node:[DropoutGenMask]reduce precision from int64 to int32
[WARNING] DEVICE(109290,fffe377fe1f0,python):2022-01-12-06:24:59.618.267 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:284] TagRaiseReduce] node:[DropoutGenMask]reduce precision from int64 to int32
[WARNING] DEVICE(109290,fffe377fe1f0,python):2022-01-12-06:24:59.618.401 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:284] TagRaiseReduce] node:[DropoutGenMask]reduce precision from int64 to int32
[WARNING] DEVICE(109290,fffe377fe1f0,python):2022-01-12-06:24:59.618.518 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:284] TagRaiseReduce] node:[DropoutGenMask]reduce precision from int64 to int32
[WARNING] DEVICE(109290,fffe377fe1f0,python):2022-01-12-06:24:59.618.663 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:284] TagRaiseReduce] node:[DropoutGenMask]reduce precision from int64 to int32
[WARNING] SESSION(109290,fffe377fe1f0,python):2022-01-12-06:25:02.078.296 [mindspore/ccsrc/backend/session/ascend_session.cc:1377] SelectKernel] There are 2 node/nodes used raise precision to selected the kernel!
[WARNING] SESSION(109290,fffe377fe1f0,python):2022-01-12-06:25:02.078.394 [mindspore/ccsrc/backend/session/ascend_session.cc:1381] SelectKernel] There are 6 node/nodes used reduce precision to selected the kernel!
Epoch 001 iter 12 loss 34696.863, cost:52.4288
Epoch 002 iter 12 loss 34288.637, cost:1.3063
Epoch 003 iter 12 loss 30985.635, cost:1.2516
Epoch 004 iter 12 loss 22491.91, cost:1.2746
Epoch 005 iter 12 loss 21087.371, cost:1.9760
Epoch 006 iter 12 loss 19150.377, cost:1.9044
Epoch 007 iter 12 loss 18561.326, cost:2.2398
Epoch 008 iter 12 loss 18068.207, cost:1.9947
Epoch 009 iter 12 loss 16396.041, cost:2.0078
Epoch 010 iter 12 loss 15766.55, cost:1.7990
Epoch 011 iter 12 loss 14308.345, cost:1.9249
...
Epoch 595 iter 12 loss 3645.4429, cost:1.9038
Epoch 596 iter 12 loss 3667.8376, cost:1.9782
Epoch 597 iter 12 loss 3667.6663, cost:1.9115
Epoch 598 iter 12 loss 3664.8555, cost:1.9984
Epoch 599 iter 12 loss 3681.8513, cost:1.9173
Epoch 600 iter 12 loss 3674.1487, cost:2.0402
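
The throughput conversion used in this report (batch_size × iterations per epoch / average epoch cost) can be reproduced from log lines of the form shown above. The following is a minimal sketch, not the reporter's script: the helper names `parse_costs` and `throughput` are hypothetical, and skipping the first epoch as warm-up is an assumption, since the report does not say which epochs were averaged.

```python
import re

# Matches lines such as "Epoch 002 iter 12 loss 34288.637, cost:1.3063".
LINE_RE = re.compile(r"Epoch\s+(\d+)\s+iter\s+(\d+)\s+loss\s+([\d.]+),\s*cost:([\d.]+)")


def parse_costs(log_text):
    """Return a list of (epoch, iters, cost_seconds) tuples found in the log."""
    rows = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            rows.append((int(match.group(1)), int(match.group(2)), float(match.group(4))))
    return rows


def throughput(rows, batch_size=5000, skip_epochs=1):
    """Samples per second: batch_size * iters / mean(cost).

    Epochs numbered <= skip_epochs are ignored (assumed warm-up/compilation).
    """
    kept = [(iters, cost) for epoch, iters, cost in rows if epoch > skip_epochs]
    if not kept:
        raise ValueError("no epochs left after skipping warm-up")
    mean_cost = sum(cost for _, cost in kept) / len(kept)
    iters = kept[0][0]  # iterations per epoch, 12 in the log above
    return batch_size * iters / mean_cost


sample = """Epoch 001 iter 12 loss 34696.863, cost:52.4288
Epoch 002 iter 12 loss 34288.637, cost:1.3063
Epoch 003 iter 12 loss 30985.635, cost:1.2516"""
rows = parse_costs(sample)
print(round(throughput(rows)))
```

With the rounded average cost quoted in the report, `5000 * 12 / 1.9` gives the 31579 figure.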

## Special notes

Comments (6)

wangxingang created Bug-Report

Please assign a maintainer to check this issue.
@fangwenyi @chengxiaoli

Please add labels (comp or sig); you can also visit https://gitee.com/mindspore/community/blob/master/sigs/dx/docs/labels.md to find more.
To get your code reviewed as soon as possible, please add a component (comp) or special interest group (sig) label to the Pull Request. A labeled PR can be pushed directly to the responsible owner for review.
For example, if you are submitting code for the data component, you can comment:
//comp/data
You can also invite the data SIG to review the code by writing:
//sig/data
In addition, you can mark the type of the PR, for example a bugfix or a feature request:
//kind/bug or //kind/feature
Congratulations, you now know how to add labels with commands. Go ahead and add labels in the comments below!

fangwenyi changed issue state from TODO to ACCEPTED
fangwenyi set assignee to oacjiewen
fangwenyi set priority to Main
fangwenyi added kind/bug label
fangwenyi added mindspore-assistant label
fangwenyi set milestone to B-SIG-ModelZoo
oacjiewen assigned collaborator oacjiewen
oacjiewen changed assignee from oacjiewen to zhouneng

Appearance & Root Cause
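CPU differences cause performance on ARM to fall short of the x64 architecture. The number of concurrent threads that data processing can use for each D card has now been increased to improve data-processing performance. Self-testing shows the spec is now met.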
Fix Solution
https://gitee.com/mindspore/models/pulls/1908
https://gitee.com/mindspore/models/pulls/1909
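
The contents of the two PRs are not reproduced in this thread, so the following is only a sketch of the general idea described above: giving each device a larger share of host CPU cores for data processing. The helper `workers_per_device`, its `reserve` and `cap` parameters, and their default values are hypothetical. In MindSpore, a value like this would typically be passed as the `num_parallel_workers` argument of a dataset such as `GeneratorDataset`.

```python
import os


def workers_per_device(device_num=1, cpu_count=None, reserve=2, cap=32):
    """Pick a data-processing worker count for each device.

    device_num: number of devices sharing the host CPUs.
    cpu_count:  host core count; detected with os.cpu_count() when None.
    reserve:    cores left free for the main training process (assumed value).
    cap:        upper bound on workers per device (assumed value).
    """
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    share = (cores - reserve) // max(device_num, 1)
    # Always return at least one worker, and never more than the cap.
    return max(1, min(cap, share))


# Example: a single device on a 16-core host.
print(workers_per_device(device_num=1, cpu_count=16))
```

Dividing by `device_num` keeps multi-device runs from oversubscribing the host, which is the main design concern when raising the per-device worker count.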

zhouneng changed issue state from ACCEPTED to VALIDATION
zhouneng changed milestone from B-SIG-ModelZoo to B-SolutionTest
zhouneng assigned collaborator zhouneng
zhouneng changed assignee from zhouneng to fangwenyi
fangwenyi assigned collaborator fangwenyi
fangwenyi changed assignee from fangwenyi to zhouneng
fangwenyi unassigned collaborator zhouneng

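@wangxingang Please use the PRs provided by zhouneng to resolve this, thanks.
Since there has been no feedback for a long time, this issue is being closed for now. If needed, please provide further information and change the issue state to WIP, and we will follow up.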

fangwenyi changed issue state from VALIDATION to DONE
wangxingang changed issue state from DONE to VALIDATION
xiangjiawei007 changed milestone from B-SolutionTest to B-Models-Test(deleted)
fangwenyi removed mindspore-assistant label
liangyongxiong changed priority from Main to Not specified
liangyongxiong changed milestone from B-Models-Test(deleted) to B-SIG-ModelZoo
wangxingang changed issue state from VALIDATION to DONE
wangxingang changed issue state from DONE to VALIDATION

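The BGCF network model was trained with r1.3 in the arm+ubuntu environment using 1p training with epoch set to 12. Training performance meets the spec: batch_size×12/Average(cost) = 5000×12/1.3006 ≈ 41465. Verification passed.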

ed memory max_rowsize 6291456 current rowwize 13200000
[WARNING] ME(112819:281473100517392,_GeneratorWorkerMp-12):2022-03-11-08:08:00.756.0 [mindspore/dataset/engine/queue.py:106] Using shared memory queue, but rowsize is larger than allocated memory max_rowsize 6291456 current rowwize 13200000
[WARNING] DEVICE(112592,fffdf0ff91f0,python):2022-03-11-08:08:10.565.966 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:284] TagRaiseReduce] node:[DropoutGenMask]reduce precision from int64 to int32
[WARNING] DEVICE(112592,fffdf0ff91f0,python):2022-03-11-08:08:10.566.160 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:284] TagRaiseReduce] node:[DropoutGenMask]reduce precision from int64 to int32
[WARNING] DEVICE(112592,fffdf0ff91f0,python):2022-03-11-08:08:10.566.245 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:284] TagRaiseReduce] node:[DropoutGenMask]reduce precision from int64 to int32
[WARNING] DEVICE(112592,fffdf0ff91f0,python):2022-03-11-08:08:10.566.321 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:284] TagRaiseReduce] node:[DropoutGenMask]reduce precision from int64 to int32
[WARNING] DEVICE(112592,fffdf0ff91f0,python):2022-03-11-08:08:10.566.392 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:284] TagRaiseReduce] node:[DropoutGenMask]reduce precision from int64 to int32
[WARNING] DEVICE(112592,fffdf0ff91f0,python):2022-03-11-08:08:10.566.454 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:284] TagRaiseReduce] node:[DropoutGenMask]reduce precision from int64 to int32
[WARNING] SESSION(112592,fffdf0ff91f0,python):2022-03-11-08:08:12.420.210 [mindspore/ccsrc/backend/session/ascend_session.cc:1377] SelectKernel] There are 2 node/nodes used raise precision to selected the kernel!
[WARNING] SESSION(112592,fffdf0ff91f0,python):2022-03-11-08:08:12.420.254 [mindspore/ccsrc/backend/session/ascend_session.cc:1381] SelectKernel] There are 6 node/nodes used reduce precision to selected the kernel!
Epoch 001 iter 12 loss 34699.824, cost:49.4840
Epoch 002 iter 12 loss 34308.19, cost:1.3330
Epoch 003 iter 12 loss 31289.824, cost:1.2455
Epoch 004 iter 12 loss 21596.953, cost:1.1656
Epoch 005 iter 12 loss 20476.37, cost:1.2068
Epoch 006 iter 12 loss 18696.338, cost:1.6327
Epoch 007 iter 12 loss 18009.941, cost:1.5594
Epoch 008 iter 12 loss 17321.035, cost:1.5514
Epoch 009 iter 12 loss 16437.7, cost:1.4787
Epoch 010 iter 12 loss 15583.92, cost:1.4958
Epoch 011 iter 12 loss 14881.423, cost:1.5666
Epoch 012 iter 12 loss 14206.686, cost:1.6879
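
As a cross-check, the throughput can be recomputed directly from the twelve epochs logged above. This is a sketch under one assumption: epoch 1 is excluded as warm-up, which the comment does not state. Note that 5000×12/1.3006 evaluates to about 46,133 rather than 41,465, so one of the two figures quoted in the comment appears to be mistyped. The mean over epochs 2 to 12 gives about 41,450, and all of these values exceed the 36825 spec.

```python
# Epoch costs in seconds for epochs 2-12, copied from the verification log above.
costs = [1.3330, 1.2455, 1.1656, 1.2068, 1.6327, 1.5594,
         1.5514, 1.4787, 1.4958, 1.5666, 1.6879]

BATCH_SIZE = 5000       # batch size used in the report's conversion
ITERS_PER_EPOCH = 12    # "iter 12" in each log line

mean_cost = sum(costs) / len(costs)
measured = BATCH_SIZE * ITERS_PER_EPOCH / mean_cost   # samples per second
spec = BATCH_SIZE / 0.135779                          # Class B spec from the report

print(f"mean cost {mean_cost:.4f} s, throughput {measured:.0f}, spec {spec:.0f}")
print("meets spec" if measured >= spec else "below spec")
```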

wangxingang changed issue state from VALIDATION to DONE
