[ST][MS][NET][deeplabv3plus-s16/dbnet_r18/resnet50/inceptionv3/cyclegan and other networks][GPU] Excessive warning logs during network training cause test cases to fail: [GetFormatFromStrToEnum] The data format can not be converted to enum

Status: DONE
Type: Bug-Report
Created: 2023-11-29 10:43

Describe the current behavior (Mandatory)

[resnext50/vgg16][GPU] Excessive warning logs during network training cause the test cases to fail.
Model repository: https://gitee.com/mindspore/models/tree/master/official/cv/DeepLabV3P/scripts

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device GPU

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):
    MindSpore version: 2.2.10.20231125, commit_id = '[sha1]:1e6bd3d7,[branch]:(HEAD,origin/r2.2.10,r2.2.10)'
    run: Milan_C15/20231122

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

Related testcase (Mandatory)

test_ms_alexnet_cifar10_train_infer_910_gpu_1p_0001
test_ms_asr_dynamic_shape_an4_32_train_check_loss_gpu_1p_0001
test_ms_bert_finetune_ner_softmax_cluener_train_infer_0002
test_ms_bert_large_cn_news_pynative_train_check_loss_gpu_8p_0001
test_ms_bert_large_cn_news_pynative_train_check_perf_gpu_1p_0001
test_ms_cyclegan_cityscapes_ascend_gpu_cpu_train_check_loss_0002
test_ms_dbnet_r18_icdar2015_gpu_check_loss_8p_0005
test_ms_dbnet_r50_icdar2015_gpu_check_loss_8p_0011
test_ms_deeplabv3_vocaug_cpu_train_check_loss_0001
test_ms_deeplabv3plus_s16_gpu_check_fps_1p_0001
test_ms_deeplabv3plus_s16_gpu_check_loss_8p_0003
test_ms_deeplabv3plus_s16_pynative_train_check_loss_gpu_8p_0003
test_ms_deeplabv3plus_s8_gpu_check_fps_1p_0002
test_ms_deeplabv3plus_s8_gpu_check_loss_8p_0004
test_ms_deeplabv3plus_s8_pynative_train_check_loss_gpu_8p_0004
test_ms_dqn_train_infer_0001
test_ms_efficientnet_cifar10_cpu_train_check_loss_daily_0001
test_ms_efficientnet_imagenet2012_pynative_train_check_loss_gpu_8p_0001
test_ms_efficientnet_imagenet2012_train_check_loss_gpu_8p_0001
test_ms_efficientnetb3_imagenet2012_gpu_check_fps_1p_0001
test_ms_inceptionv3_imagenet2012_pynative_train_check_loss_gpu_8p_0001
test_ms_inceptionv3_imagenet2012_train_check_fps_gpu_1p_0004
test_ms_inceptionv3_imagenet2012_train_check_loss_gpu_8p_0005
test_ms_lstm_aclimdb_train_infer_gpu_1p_0001
test_ms_mobilenetv1_cifar10_cpu_train_check_loss_0001
test_ms_mobilenetv1_imagenet_gpu_check_fps_1p_0004
test_ms_mobilenetv1_imagenet_gpu_train_check_loss_4p_0003
test_ms_mobilenetv2_garbage_cpu_train_infer_0003
test_ms_mobilenetv2_imagenet2012_train_check_loss_gpu_8p_0005
test_ms_mobilenetv3_cpu_check_fps_1p_0001
test_ms_mobilenetv3_imagenet2012_gpu_check_loss_8p_0003
test_ms_pangu_alpha_gpu_train_8p_0001
test_ms_ppo_train_infer_200_episode_gpu_cpu_1p_0001
test_ms_resnet101_imagenet_pynative_train_check_loss_gpu_8p_0001
test_ms_resnet18_cifar10_pynative_train_check_perf_gpu_1p_0001
test_ms_resnet18_cifar10_pynative_train_infer_gpu_8p_0001
test_ms_resnet18_cifar10_train_check_fps_gpu_1p_0002
test_ms_resnet18_cifar10_train_infer_gpu_8p_0001
test_ms_resnet18_imagenet_pynative_train_check_loss_gpu_8p_0001
test_ms_resnet18_imagenet_train_check_pfs_gpu_1p_0002
test_ms_resnet50_benchmark_imagenet_pynative_train_check_perf_gpu_8p_0001
test_ms_resnet50_cifar10_pynative_train_check_perf_gpu_1p_0001
test_ms_resnet50_cifar10_pynative_train_infer_gpu_8p_0001
test_ms_resnet50_cifar10_train_check_loss_cpu_0001
test_ms_resnet50_cifar10_train_check_loss_gpu_8p_0002
test_ms_resnet50_cifar10_train_check_perf_gpu_1p_0003
test_ms_resnet50_imagenet_pynative_train_check_loss_gpu_8p_0001
test_ms_resnet50_imagenet_train_check_loss_gpu_8p_0002
test_ms_resnet50_imagenet_train_check_perf_gpu_1p_0003
test_ms_resnext50_imagenet2012_pynative_train_check_loss_gpu_8p_0001
test_ms_resnext50_imagenet2012_train_check_loss_gpu_8p_0001
test_ms_retinaface_resnet50_widerface_pynative_train_check_loss_gpu_4p_0001
test_ms_retinaface_resnet50_widerface_train_check_loss_gpu_8p_0001
test_ms_retinaface_resnet50_widerface_train_check_perf_gpu_1p_0001
test_ms_retinaface_resnet50_widerface_train_no_resume_check_loss_gpu_8p_0001
test_ms_shufflenetv1_gpu_check_fps_1p_0001
test_ms_shufflenetv2_imagenet2012_pynative_train_check_loss_gpu_8p_0001
test_ms_shufflenetv2_imagenet2012_train_check_fps_gpu_1p_0004
test_ms_shufflenetv2_imagenet2012_train_check_loss_gpu_8p_0005
test_ms_ssd_helmet_cpu_train_check_loss_0001
test_ms_ssd_mobilenetv1_fpn_coco2017_train_check_fps_gpu_0002
test_ms_ssd_mobilenetv1_fpn_coco2017_train_check_loss_gpu_8p_0003
test_ms_ssd_resnet50_fpn_coco2017_pynative_train_check_loss_gpu_8p_0001
test_ms_ssd_resnet50_fpn_coco2017_train_check_fps_gpu_0002
test_ms_ssd_resnet50_fpn_coco2017_train_check_loss_gpu_8p_0003
test_ms_ssd_vgg16_coco2017_pynative_train_check_loss_gpu_8p_0001
test_ms_ssd_vgg16_coco2017_train_check_fps_gpu_0002
test_ms_ssd_vgg16_coco2017_train_check_loss_gpu_8p_0003
test_ms_unet_plus_gpu_train_infer_1p_0001
test_ms_unet_plus_gpu_train_infer_8p_0002
test_ms_usability_benchmark_graph_cpu_cyclegan_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_cpu_deeplabv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_cpu_efficientnet_b0_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_cpu_inceptionv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_cpu_inceptionv4_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_cpu_mobilenetv1_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_cpu_mobilenetv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_cpu_resnet_50_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_cpu_ssd_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_alexnet_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_ctpn_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_deeplabv3_plus_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_dqn_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_efficientnet_b0_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_efficientnet_b3_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_inceptionv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_lstm_sentimentnet_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_mobilenetv1_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_mobilenetv2_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_mobilenetv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_ppo_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_resnet_18_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_resnet_50_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_resnext50_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_retinaface_resnet50_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_shufflenetv1_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_shufflenetv2_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_ssd_mobilenetv1_fpn_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_ssd_resnet50_fpn_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_ssd_vgg16_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_vgg16_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_yolov5_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_cpu_cyclegan_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_cpu_deeplabv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_cpu_efficientnet_b0_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_cpu_inceptionv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_cpu_inceptionv4_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_cpu_lstm_sentimentnet_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_cpu_mobilenetv1_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_cpu_mobilenetv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_cpu_resnet_50_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_cpu_ssd_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_alexnet_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_ctpn_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_cyclegan_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_deeplabv3_plus_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_dqn_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_efficientnet_b0_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_efficientnet_b3_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_facerecognition_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_fasterrcnn_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_inceptionv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_lstm_sentimentnet_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_mobilenetv1_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_mobilenetv2_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_mobilenetv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_ppo_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_resnet_101_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_resnet_18_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_resnet_50_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_resnext50_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_retinaface_resnet50_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_shufflenetv1_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_shufflenetv2_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_ssd_mobilenetv1_fpn_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_ssd_resnet50_fpn_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_ssd_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_ssd_vgg16_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_transformer_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_vgg16_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_yolov3_darknet53_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_yolov5_time_perf_loss_1p_0001
test_ms_vgg16_imagenet_check_loss_gpu_8p_0001
test_ms_vgg16_imagenet_perf_gpu_1p_0001
test_ms_vgg16_imagenet_pynative_train_check_loss_gpu_8p_0001
test_ms_wide_deep_criteo_host_device_mix_train_infer_gpu_8p_0001
test_ms_wide_deep_criteo_ps_train_check_perf_910_gpu_1p_0001

Steps to reproduce the issue (Mandatory)

1. Get the code from the models repository.
2. cd models/official/cv/DeepLabV3P/scripts
3. bash run_distribute_train_s16_r1_gpu.sh /PATH/TO/MINDRECORD_NAME /PATH/TO/PRETRAIN_MODEL
4. Verify that the network trains successfully, meets the performance target, and does not emit excessive warnings.

Describe the expected behavior (Mandatory)

Training succeeds, performance meets the target, and there are no excessive warnings.
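
The "no excessive warnings" check can be pictured as counting the log lines that carry the warning text and comparing the count with a limit. The sketch below is a minimal illustration only: the function names, the marker string passed in, and the threshold are all assumptions, and the actual rule used by the test harness is not shown in this issue.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Counts the lines of a log stream that contain the given marker text.
std::size_t CountLinesContaining(std::istream &log, const std::string &marker) {
  std::size_t count = 0;
  std::string line;
  while (std::getline(log, line)) {
    if (line.find(marker) != std::string::npos) {
      ++count;
    }
  }
  return count;
}

// Hypothetical pass/fail rule (an assumption, not the harness's real rule):
// the run is acceptable if the marker appears on at most max_warnings lines.
bool WarningCountAcceptable(std::istream &log, const std::string &marker,
                            std::size_t max_warnings) {
  return CountLinesContaining(log, marker) <= max_warnings;
}
```

For example, calling `WarningCountAcceptable(log, "can not be converted to enum", 0)` on a training log stream would return false as soon as the warning from the title appears on any line.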

Related log / screenshot (Mandatory)

[Screenshot]

Special notes for this issue (Optional)

Assign to 项敏珊

Comments (5)

魏鑫 created the Bug-Report
魏鑫 added the kind/bug label
魏鑫 added the attr/function label
魏鑫 added the stage/func-debug label
魏鑫 added the v2.3.0 label

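Please assign maintainer to check this issue.
@魏鑫

Thank you for your feedback. You can comment //mindspore-assistant to get help faster; for more labels, see the label list.

  1. If you are new to MindSpore, you may find the answer in the tutorials.
  2. If you are an experienced PyTorch user, you may need:
    Typical differences from PyTorch / PyTorch and MindSpore API mapping table
  3. If you hit a problem in dynamic graph (PyNative) mode, you can set mindspore.set_context(pynative_synchronize=True) to view the error stack and help locate the problem.
  4. For model accuracy tuning problems, refer to the tuning guide on the official website.
  5. If you are reporting a framework bug, please make sure the issue provides the information needed to locate it: the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, the official link to the training code, and how to launch the code that reproduces the error.
  6. If you have already identified the root cause, you are welcome to submit a PR to the MindSpore open source community; we will review it as soon as possible.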
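魏鑫 edited the description
魏鑫 edited the description
魏鑫 edited the title
xiangminshan changed the assignee from xiangminshan to fary86
魏鑫 edited the title
魏鑫 edited the title
魏鑫 edited the title
魏鑫 edited the title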

Root cause analysis

The current code passes an empty string directly as the format, and the function `Format GetFormatFromStrToEnum(const std::string &format_str)` in mindspore/ccsrc/kernel/format_utils.cc does not special-case the empty string, so every such call logs the warning.
The call stack that currently produces the warning is:

0. mindspore/ccsrc/kernel/format_utils.cc `Format GetFormatFromStrToEnum(const std::string &format_str)`
1. mindspore/ccsrc/kernel/kernel.cc `KernelTensor::KernelTensor(void *device_ptr, size_t size, const std::string &format, TypeId dtype_id, ...`
2. mindspore/ccsrc/runtime/device/device_address_utils.cc `void DeviceAddressUtils::CreateKernelWorkspaceDeviceAddress(const DeviceContext *device_context, const KernelGraphPtr &graph)`, which calls `auto kernel_tensor = std::make_shared<kernel::KernelTensor>(nullptr, workspace_sizes[i], "", kTypeUnknown, ShapeVector(), ...`
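
To illustrate the analysis above, here is a minimal, self-contained sketch of a string-to-enum lookup that treats the empty string as "format not specified" and stays silent for it, while still warning on non-empty unknown strings. It is not the MindSpore source and not necessarily the fix that was merged: the reduced `Format` enum, the table entries, the `GetFormatFromStrToEnumSketch` name, the warning counter, and the choice of default value are all assumptions made for illustration.

```cpp
#include <cassert>
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical, reduced stand-in for the framework's Format enum.
enum class Format { DEFAULT_FORMAT, NCHW, NHWC };

// Number of warnings emitted so far, kept only so the effect is observable.
static int g_warning_count = 0;

// Sketch of a lookup that returns a default for "" without logging,
// and warns only for non-empty strings that are not in the table.
Format GetFormatFromStrToEnumSketch(const std::string &format_str) {
  static const std::unordered_map<std::string, Format> kFormatMap = {
      {"DefaultFormat", Format::DEFAULT_FORMAT},
      {"NCHW", Format::NCHW},
      {"NHWC", Format::NHWC},
  };
  if (format_str.empty()) {
    return Format::DEFAULT_FORMAT;  // Assumed fallback for "format not specified".
  }
  const auto iter = kFormatMap.find(format_str);
  if (iter != kFormatMap.end()) {
    return iter->second;
  }
  ++g_warning_count;
  std::cerr << "[WARNING] The data format " << format_str
            << " can not be converted to enum." << std::endl;
  return Format::DEFAULT_FORMAT;
}

// Usage check: the empty string is silent, an unknown string warns once.
void CheckEmptyFormatIsSilent() {
  g_warning_count = 0;
  assert(GetFormatFromStrToEnumSketch("") == Format::DEFAULT_FORMAT);
  assert(g_warning_count == 0);
  assert(GetFormatFromStrToEnumSketch("NCHW") == Format::NCHW);
  assert(GetFormatFromStrToEnumSketch("NotAFormat") == Format::DEFAULT_FORMAT);
  assert(g_warning_count == 1);
}
```

An alternative with the same effect on the log would be to leave the lookup unchanged and have the caller at frame 2 pass a valid format string instead of `""`; which side should change is a design decision this sketch does not settle.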
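fary86 changed the milestone from B-SolutionTest to B-SIG-AKG
fary86 added collaborator fary86
fary86 changed the assignee from fary86 to huoxinyou
魏鑫 edited the title
i-robot added the gitee label
huoxinyou added collaborator huoxinyou
huoxinyou changed the assignee from huoxinyou to 魏鑫
huoxinyou added the rca/inf/msg label
huoxinyou added the rct/bugfix label
huoxinyou added the ctl/componenttest label
huoxinyou changed the milestone from B-SIG-AKG to B-SolutionTest
huoxinyou changed the task status from TODO to VALIDATION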

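Regression date: 2023/12/26
Regression steps: follow the steps in this issue
Regression version: 2.3
Regression result:
[Screenshot]
Regression conclusion: passed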

i-robot added the foruda label
魏鑫 changed the task status from VALIDATION to DONE
魏鑫 added the sig/akg label
