MindSpore / models

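[MindSpore][Ascend][Class C] BGCF network model on r1.3 in an arm+ubuntu training environment reaches a performance of only 31579, below the Class B performance specification of 36825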

Status: DONE
Type: Bug-Report
Created: 2022-01-13 16:17

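1. Problem description (with error log context):
The Class B performance specification is guarded at a per step time of 0.135779, which converts to batch_size / time = 5000 / 0.135779 ≈ 36825.
When trained in the arm+ubuntu environment, the model does not meet this specification. The converted result is batch_size × 12 / Average(cost) = 5000 × 12 / 1.9 ≈ 31579, where 12 is the number of iterations per epoch shown in the log and Average(cost) is the average per-epoch cost.
The training log is as follows: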

{'dataset': 'choose which dataset', 'datapath': 'minddata path', 'Ks': 'top K', 'workers': 'number of process to generate data', 'ckptpath': 'checkpoint path', 'epsilon': 'optimizer parameter', 'learning_rate': 'learning rate', 'l2': 'l2 coefficient', 'activation': "activation function, choices in ['relu', 'tanh'].", 'neighbor_dropout': 'dropout ratio for different aggregation layer', 'log_name': 'log name', 'num_epoch': 'epoch sizes for training', 'input_dim': 'user and item embedding dimension, choices in [64, 128]', 'batch_pairs': 'batch size', 'eval_interval': 'evaluation interval', 'num_neg': 'negative sampling rate ', 'raw_neighs': 'num of sampling neighbors in raw graph', 'gnew_neighs': 'num of sampling neighbors in sample graph', 'embedded_dimension': 'output embedding dim', 'dist_reg': 'distance loss coefficient', 'device_target': "device target, choices in ['Ascend', GPU]", 'device_id': 'Device id', 'ckpt_file': 'Checkpoint file path.', 'file_name': 'output file name.', 'file_format': "file format, choices in ['AIR', 'ONNX', 'MINDIR']", 'row_neighs': 'num of sampling neighbors in raw graph'}
[WARNING] ME(109451:281473040572432,_GeneratorWorkerMp-1):2022-01-12-06:24:46.311.233 [mindspore/dataset/engine/queue.py:106] Using shared memory queue, but rowsize is larger than allocated memory max_rowsize 6291456 current rowwize 13200000
[WARNING] ME(109455:281473040572432,_GeneratorWorkerMp-5):2022-01-12-06:24:46.312.081 [mindspore/dataset/engine/queue.py:106] Using shared memory queue, but rowsize is larger than allocated memory max_rowsize 6291456 current rowwize 13200000
[WARNING] ME(109456:281473040572432,_GeneratorWorkerMp-6):2022-01-12-06:24:46.315.739 [mindspore/dataset/engine/queue.py:106] Using shared memory queue, but rowsize is larger than allocated memory max_rowsize 6291456 current rowwize 13200000
[WARNING] ME(109452:281473040572432,_GeneratorWorkerMp-2):2022-01-12-06:24:46.315.900 [mindspore/dataset/engine/queue.py:106] Using shared memory queue, but rowsize is larger than allocated memory max_rowsize 6291456 current rowwize 13200000
[WARNING] ME(109454:281473040572432,_GeneratorWorkerMp-4):2022-01-12-06:24:46.316.132 [mindspore/dataset/engine/queue.py:106] Using shared memory queue, but rowsize is larger than allocated memory max_rowsize 6291456 current rowwize 13200000
[WARNING] ME(109458:281473040572432,_GeneratorWorkerMp-8):2022-01-12-06:24:46.317.834 [mindspore/dataset/engine/queue.py:106] Using shared memory queue, but rowsize is larger than allocated memory max_rowsize 6291456 current rowwize 13200000
[WARNING] ME(109457:281473040572432,_GeneratorWorkerMp-7):2022-01-12-06:24:46.318.209 [mindspore/dataset/engine/queue.py:106] Using shared memory queue, but rowsize is larger than allocated memory max_rowsize 6291456 current rowwize 13200000
[WARNING] ME(109453:281473040572432,_GeneratorWorkerMp-3):2022-01-12-06:24:46.318.207 [mindspore/dataset/engine/queue.py:106] Using shared memory queue, but rowsize is larger than allocated memory max_rowsize 6291456 current rowwize 13200000
[WARNING] DEVICE(109290,fffe377fe1f0,python):2022-01-12-06:24:59.617.830 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:284] TagRaiseReduce] node:[DropoutGenMask]reduce precision from int64 to int32
[WARNING] DEVICE(109290,fffe377fe1f0,python):2022-01-12-06:24:59.618.135 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:284] TagRaiseReduce] node:[DropoutGenMask]reduce precision from int64 to int32
[WARNING] DEVICE(109290,fffe377fe1f0,python):2022-01-12-06:24:59.618.267 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:284] TagRaiseReduce] node:[DropoutGenMask]reduce precision from int64 to int32
[WARNING] DEVICE(109290,fffe377fe1f0,python):2022-01-12-06:24:59.618.401 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:284] TagRaiseReduce] node:[DropoutGenMask]reduce precision from int64 to int32
[WARNING] DEVICE(109290,fffe377fe1f0,python):2022-01-12-06:24:59.618.518 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:284] TagRaiseReduce] node:[DropoutGenMask]reduce precision from int64 to int32
[WARNING] DEVICE(109290,fffe377fe1f0,python):2022-01-12-06:24:59.618.663 [mindspore/ccsrc/runtime/device/ascend/kernel_select_ascend.cc:284] TagRaiseReduce] node:[DropoutGenMask]reduce precision from int64 to int32
[WARNING] SESSION(109290,fffe377fe1f0,python):2022-01-12-06:25:02.078.296 [mindspore/ccsrc/backend/session/ascend_session.cc:1377] SelectKernel] There are 2 node/nodes used raise precision to selected the kernel!
[WARNING] SESSION(109290,fffe377fe1f0,python):2022-01-12-06:25:02.078.394 [mindspore/ccsrc/backend/session/ascend_session.cc:1381] SelectKernel] There are 6 node/nodes used reduce precision to selected the kernel!
Epoch 001 iter 12 loss 34696.863, cost:52.4288
Epoch 002 iter 12 loss 34288.637, cost:1.3063
Epoch 003 iter 12 loss 30985.635, cost:1.2516
Epoch 004 iter 12 loss 22491.91, cost:1.2746
Epoch 005 iter 12 loss 21087.371, cost:1.9760
Epoch 006 iter 12 loss 19150.377, cost:1.9044
Epoch 007 iter 12 loss 18561.326, cost:2.2398
Epoch 008 iter 12 loss 18068.207, cost:1.9947
Epoch 009 iter 12 loss 16396.041, cost:2.0078
Epoch 010 iter 12 loss 15766.55, cost:1.7990
Epoch 011 iter 12 loss 14308.345, cost:1.9249
...
Epoch 595 iter 12 loss 3645.4429, cost:1.9038
Epoch 596 iter 12 loss 3667.8376, cost:1.9782
Epoch 597 iter 12 loss 3667.6663, cost:1.9115
Epoch 598 iter 12 loss 3664.8555, cost:1.9984
Epoch 599 iter 12 loss 3681.8513, cost:1.9173
Epoch 600 iter 12 loss 3674.1487, cost:2.0402
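
The conversion used in the problem description can be reproduced directly from log lines of the form shown above. The following is a minimal sketch, not part of the BGCF scripts: the function names are hypothetical, `cost` is assumed to be the wall-clock time of one epoch (as the issue's own formula implies), and skipping the first epoch as warm-up is an assumption made here because its cost (52.4288) is dominated by start-up overhead.

```python
import re

# Matches lines such as: "Epoch 002 iter 12 loss 34288.637, cost:1.3063"
LINE_RE = re.compile(
    r"Epoch\s+(\d+)\s+iter\s+(\d+)\s+loss\s+[\d.]+,\s*cost:([\d.]+)"
)


def parse_epoch_costs(log_text):
    """Return a list of (epoch, iters, cost) tuples found in the log text."""
    rows = []
    for match in LINE_RE.finditer(log_text):
        epoch, iters, cost = match.groups()
        rows.append((int(epoch), int(iters), float(cost)))
    return rows


def samples_per_time(batch_size, steps, elapsed):
    """Throughput as used in the issue: batch_size * steps / elapsed."""
    return batch_size * steps / elapsed


def average_throughput(log_text, batch_size=5000, skip_epochs=1):
    """Average throughput over all parsed epochs after the warm-up epochs.

    skip_epochs=1 drops epoch 001 (assumption: it includes compile time).
    """
    rows = [row for row in parse_epoch_costs(log_text) if row[0] > skip_epochs]
    if not rows:
        raise ValueError("no epoch lines left after skipping warm-up epochs")
    total_steps = sum(iters for _, iters, _ in rows)
    total_cost = sum(cost for _, _, cost in rows)
    return samples_per_time(batch_size, total_steps, total_cost)


if __name__ == "__main__":
    # Class B target: one step of 5000 samples in 0.135779
    print(round(samples_per_time(5000, 1, 0.135779)))
    # Issue's measured figure: 12 steps of 5000 samples in an average of 1.9
    print(round(samples_per_time(5000, 12, 1.9)))
```

Passing the full training log to `average_throughput` gives a figure that can be compared against the Class B target. The result depends on which epochs are included, so it may differ slightly from the rounded average of 1.9 used in the issue.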

2. Software versions:
-- CANN version: CANN 5.0.2 B058
-- Python version: Python 3.7.5
-- OS version (e.g., Ubuntu 18.04): Ubuntu 18.04

3. Test steps:
1. Follow the instructions in readme.md to run 1p training of the network model.

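Comments (2)

wangxingang created the Bug-Report
fangwenyi changed the task status from TODO to ACCEPTED
fangwenyi added the mindspore-assistant label
fangwenyi set the assignee to liubuyu
liubuyu changed the task status from ACCEPTED to VALIDATION
wangxingang changed the task status from VALIDATION to DONE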
