name	about	labels
Bug Report	Use this template for reporting a bug	kind/bug

Describe the current behavior (Mandatory)

[resnext50/vgg16][GPU] Network training emits too many warning logs, causing test cases to fail
Model repository: https://gitee.com/mindspore/models/tree/master/official/cv/DeepLabV3P/scripts

Environment (Mandatory)

Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device GPU

Software Environment (Mandatory):
-- MindSpore version (e.g., 1.7.0.Bxxx) :
-- Python version (e.g., Python 3.7.5) :
-- OS platform and distribution (e.g., Linux Ubuntu 16.04):
-- GCC/Compiler version (if compiled from source):
MindSpore version: 2.2.10.20231125, commit_id = '[sha1]:1e6bd3d7,[branch]:(HEAD,origin/r2.2.10,r2.2.10)'
run: Milan_C15/20231122
Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

Related testcase (Mandatory)

test_ms_alexnet_cifar10_train_infer_910_gpu_1p_0001
test_ms_asr_dynamic_shape_an4_32_train_check_loss_gpu_1p_0001
test_ms_bert_finetune_ner_softmax_cluener_train_infer_0002
test_ms_bert_large_cn_news_pynative_train_check_loss_gpu_8p_0001
test_ms_bert_large_cn_news_pynative_train_check_perf_gpu_1p_0001
test_ms_cyclegan_cityscapes_ascend_gpu_cpu_train_check_loss_0002
test_ms_dbnet_r18_icdar2015_gpu_check_loss_8p_0005
test_ms_dbnet_r50_icdar2015_gpu_check_loss_8p_0011
test_ms_deeplabv3_vocaug_cpu_train_check_loss_0001
test_ms_deeplabv3plus_s16_gpu_check_fps_1p_0001
test_ms_deeplabv3plus_s16_gpu_check_loss_8p_0003
test_ms_deeplabv3plus_s16_pynative_train_check_loss_gpu_8p_0003
test_ms_deeplabv3plus_s8_gpu_check_fps_1p_0002
test_ms_deeplabv3plus_s8_gpu_check_loss_8p_0004
test_ms_deeplabv3plus_s8_pynative_train_check_loss_gpu_8p_0004
test_ms_dqn_train_infer_0001
test_ms_efficientnet_cifar10_cpu_train_check_loss_daily_0001
test_ms_efficientnet_imagenet2012_pynative_train_check_loss_gpu_8p_0001
test_ms_efficientnet_imagenet2012_train_check_loss_gpu_8p_0001
test_ms_efficientnetb3_imagenet2012_gpu_check_fps_1p_0001
test_ms_inceptionv3_imagenet2012_pynative_train_check_loss_gpu_8p_0001
test_ms_inceptionv3_imagenet2012_train_check_fps_gpu_1p_0004
test_ms_inceptionv3_imagenet2012_train_check_loss_gpu_8p_0005
test_ms_lstm_aclimdb_train_infer_gpu_1p_0001
test_ms_mobilenetv1_cifar10_cpu_train_check_loss_0001
test_ms_mobilenetv1_imagenet_gpu_check_fps_1p_0004
test_ms_mobilenetv1_imagenet_gpu_train_check_loss_4p_0003
test_ms_mobilenetv2_garbage_cpu_train_infer_0003
test_ms_mobilenetv2_imagenet2012_train_check_loss_gpu_8p_0005
test_ms_mobilenetv3_cpu_check_fps_1p_0001
test_ms_mobilenetv3_imagenet2012_gpu_check_loss_8p_0003
test_ms_pangu_alpha_gpu_train_8p_0001
test_ms_ppo_train_infer_200_episode_gpu_cpu_1p_0001
test_ms_resnet101_imagenet_pynative_train_check_loss_gpu_8p_0001
test_ms_resnet18_cifar10_pynative_train_check_perf_gpu_1p_0001
test_ms_resnet18_cifar10_pynative_train_infer_gpu_8p_0001
test_ms_resnet18_cifar10_train_check_fps_gpu_1p_0002
test_ms_resnet18_cifar10_train_infer_gpu_8p_0001
test_ms_resnet18_imagenet_pynative_train_check_loss_gpu_8p_0001
test_ms_resnet18_imagenet_train_check_pfs_gpu_1p_0002
test_ms_resnet50_benchmark_imagenet_pynative_train_check_perf_gpu_8p_0001
test_ms_resnet50_cifar10_pynative_train_check_perf_gpu_1p_0001
test_ms_resnet50_cifar10_pynative_train_infer_gpu_8p_0001
test_ms_resnet50_cifar10_train_check_loss_cpu_0001
test_ms_resnet50_cifar10_train_check_loss_gpu_8p_0002
test_ms_resnet50_cifar10_train_check_perf_gpu_1p_0003
test_ms_resnet50_imagenet_pynative_train_check_loss_gpu_8p_0001
test_ms_resnet50_imagenet_train_check_loss_gpu_8p_0002
test_ms_resnet50_imagenet_train_check_perf_gpu_1p_0003
test_ms_resnext50_imagenet2012_pynative_train_check_loss_gpu_8p_0001
test_ms_resnext50_imagenet2012_train_check_loss_gpu_8p_0001
test_ms_retinaface_resnet50_widerface_pynative_train_check_loss_gpu_4p_0001
test_ms_retinaface_resnet50_widerface_train_check_loss_gpu_8p_0001
test_ms_retinaface_resnet50_widerface_train_check_perf_gpu_1p_0001
test_ms_retinaface_resnet50_widerface_train_no_resume_check_loss_gpu_8p_0001
test_ms_shufflenetv1_gpu_check_fps_1p_0001
test_ms_shufflenetv2_imagenet2012_pynative_train_check_loss_gpu_8p_0001
test_ms_shufflenetv2_imagenet2012_train_check_fps_gpu_1p_0004
test_ms_shufflenetv2_imagenet2012_train_check_loss_gpu_8p_0005
test_ms_ssd_helmet_cpu_train_check_loss_0001
test_ms_ssd_mobilenetv1_fpn_coco2017_train_check_fps_gpu_0002
test_ms_ssd_mobilenetv1_fpn_coco2017_train_check_loss_gpu_8p_0003
test_ms_ssd_resnet50_fpn_coco2017_pynative_train_check_loss_gpu_8p_0001
test_ms_ssd_resnet50_fpn_coco2017_train_check_fps_gpu_0002
test_ms_ssd_resnet50_fpn_coco2017_train_check_loss_gpu_8p_0003
test_ms_ssd_vgg16_coco2017_pynative_train_check_loss_gpu_8p_0001
test_ms_ssd_vgg16_coco2017_train_check_fps_gpu_0002
test_ms_ssd_vgg16_coco2017_train_check_loss_gpu_8p_0003
test_ms_unet_plus_gpu_train_infer_1p_0001
test_ms_unet_plus_gpu_train_infer_8p_0002
test_ms_usability_benchmark_graph_cpu_cyclegan_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_cpu_deeplabv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_cpu_efficientnet_b0_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_cpu_inceptionv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_cpu_inceptionv4_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_cpu_mobilenetv1_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_cpu_mobilenetv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_cpu_resnet_50_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_cpu_ssd_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_alexnet_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_ctpn_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_deeplabv3_plus_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_dqn_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_efficientnet_b0_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_efficientnet_b3_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_inceptionv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_lstm_sentimentnet_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_mobilenetv1_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_mobilenetv2_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_mobilenetv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_ppo_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_resnet_18_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_resnet_50_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_resnext50_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_retinaface_resnet50_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_shufflenetv1_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_shufflenetv2_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_ssd_mobilenetv1_fpn_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_ssd_resnet50_fpn_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_ssd_vgg16_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_vgg16_time_perf_loss_1p_0001
test_ms_usability_benchmark_graph_gpu_yolov5_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_cpu_cyclegan_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_cpu_deeplabv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_cpu_efficientnet_b0_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_cpu_inceptionv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_cpu_inceptionv4_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_cpu_lstm_sentimentnet_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_cpu_mobilenetv1_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_cpu_mobilenetv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_cpu_resnet_50_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_cpu_ssd_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_alexnet_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_ctpn_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_cyclegan_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_deeplabv3_plus_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_dqn_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_efficientnet_b0_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_efficientnet_b3_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_facerecognition_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_fasterrcnn_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_inceptionv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_lstm_sentimentnet_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_mobilenetv1_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_mobilenetv2_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_mobilenetv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_ppo_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_resnet_101_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_resnet_18_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_resnet_50_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_resnext50_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_retinaface_resnet50_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_shufflenetv1_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_shufflenetv2_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_ssd_mobilenetv1_fpn_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_ssd_resnet50_fpn_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_ssd_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_ssd_vgg16_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_transformer_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_vgg16_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_yolov3_darknet53_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_gpu_yolov5_time_perf_loss_1p_0001
test_ms_vgg16_imagenet_check_loss_gpu_8p_0001
test_ms_vgg16_imagenet_perf_gpu_1p_0001
test_ms_vgg16_imagenet_pynative_train_check_loss_gpu_8p_0001
test_ms_wide_deep_criteo_host_device_mix_train_infer_gpu_8p_0001
test_ms_wide_deep_criteo_ps_train_check_perf_910_gpu_1p_0001

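Steps to reproduce the issue (Mandatory)

1. Get the code from the models repository.
2. cd models/official/cv/DeepLabV3P/scripts
3. bash run_distribute_train_s16_r1_gpu.sh /PATH/TO/MINDRECORD_NAME /PATH/TO/PRETRAIN_MODEL
4. Verify that the network trains successfully, meets the performance target, and does not emit excessive warnings.

Describe the expected behavior (Mandatory)

Training succeeds, performance meets the target, and there are no excessive warnings.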

Related log / screenshot (Mandatory)

[Screenshot]

Special notes for this issue (Optional)

Assign to 项敏珊

Please assign maintainer to check this issue.
@魏鑫

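Root cause analysis

The current code passes an empty string directly as the format, and the function Format GetFormatFromStrToEnum(const std::string &format_str) in mindspore/ccsrc/kernel/format_utils.cc has no special handling for an empty string.
The call stack that currently produces the warning is: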

0. mindspore/ccsrc/kernel/format_utils.cc `Format GetFormatFromStrToEnum(const std::string &format_str)`
1. mindspore/ccsrc/kernel/kernel.cc `KernelTensor::KernelTensor(void *device_ptr, size_t size, const std::string &format, TypeId dtype_id, ...`
2. mindspore/ccsrc/runtime/device/device_address_utils.cc `void DeviceAddressUtils::CreateKernelWorkspaceDeviceAddress(const DeviceContext *device_context, const KernelGraphPtr &graph)`, which calls `auto kernel_tensor = std::make_shared<kernel::KernelTensor>(nullptr, workspace_sizes[i], "", kTypeUnknown, ShapeVector(), ...`

The cause is as described above; the log has been changed to debug level.
!62616: Fix the excessive format log issue
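
The root cause and the fix described above can be illustrated with a minimal, self-contained sketch. It is not the MindSpore implementation: the `Format` enum values, the `LogLevel` type, the `Log` and `WarningCount` helpers, the lookup table, and the `kUnknown` fallback are all hypothetical stand-ins, and only the function name `GetFormatFromStrToEnum` and the message text come from this issue. It shows one way a string-to-enum lookup might report an unrecognized or empty format at debug level, so that callers passing "" do not flood the warning log.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins; the real MindSpore Format enum and logging macros differ.
enum class Format { kDefaultFormat, kNCHW, kNHWC, kUnknown };
enum class LogLevel { kDebug, kWarning };

// Counts messages emitted at warning level (illustration only).
inline int &WarningCount() {
  static int count = 0;
  return count;
}

// Minimal logger: warnings are always printed, debug messages only when verbose.
inline void Log(LogLevel level, const std::string &msg, bool verbose = false) {
  if (level == LogLevel::kWarning) {
    ++WarningCount();
    std::cerr << "[WARNING] " << msg << '\n';
  } else if (verbose) {
    std::cerr << "[DEBUG] " << msg << '\n';
  }
}

// Sketch of a string-to-enum conversion. Unrecognized strings, including the
// empty string, are reported at debug level instead of warning level.
inline Format GetFormatFromStrToEnum(const std::string &format_str) {
  static const std::unordered_map<std::string, Format> kTable = {
      {"DefaultFormat", Format::kDefaultFormat},
      {"NCHW", Format::kNCHW},
      {"NHWC", Format::kNHWC},
  };
  const auto it = kTable.find(format_str);
  if (it != kTable.end()) {
    return it->second;
  }
  Log(LogLevel::kDebug,
      "The data format can not be converted to enum: '" + format_str + "'");
  return Format::kUnknown;  // hypothetical fallback value
}
```

In this sketch, calling `GetFormatFromStrToEnum("")` once per workspace tensor returns the fallback value without writing anything at warning level, which is the effect the log-level change is meant to have.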

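Regression date: 2023/12/26
Regression steps: follow the steps in the issue
Regression version: 2.3
Regression result:
[Screenshot]
Regression conclusion: regression passed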

[ST][MS][NET][deeplabv3plus-s16/dbnet_r18/resnet50/inceptionv3/cyclegan and other networks][GPU] Network training emits too many warning logs, causing test cases to fail: GetFormatFromStrToEnum] The data format can not be converted to enum
