MindSpore / mindspore

[ST][MS][NET][transformer-dynamic-shape][pynative][910 8p] Training log contains ERROR entries

DONE
Bug-Report
Created on 2023-11-09 16:21

Describe the current behavior (Mandatory)

When the transformer-dynamic-shape network is trained in PyNative mode on 8 devices (8p) in an Ascend 910 environment, the training log contains ERROR entries.

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device Ascend910A

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):
    Run package: Milan_C13/20231024/
    MindSpore version: r2.3_20231108021522_753c9f688

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode pynative

Related testcase (Mandatory)

Test case repository path: solution_test/cases/02network/02nlp/transformer_dynamic/pynative
Test case: test_ms_transformer_dynamic_shape_wmt_englisth_german_pynative_train_check_func_910_8p_0001.py

Steps to reproduce the issue (Mandatory)

1. Get the code from modelinternal
2. cd modelinternal/internal_case/transformer_dynamic
3. Set PyNative mode in train.py and set epoch_size=3 in default_config_large.yaml
4. sh scripts/run_distribute_train_ascend.sh WMT-English-German-Dynamic/database ./default_config_large.yaml
5. Verify that training succeeded: check that ./run_distribute_train/ckpt_0/transformer-3_17573.ckpt was generated, then configure model_file
6. python eval.py --device_target=Ascend --config_path=./default_config_large.yaml
7. sh scripts/process_output.sh ./WMT-English-German/data/newstest2014.tok.de ./output_eval.txt ./WMT-English-German/data/vocab.bpe.32000
8. perl multi-bleu.perl ./WMT-English-German/data/newstest2014.tok.de.forbleu < ./output_eval.txt.forbleu
9. Verify that the inference accuracy reaches 23
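
The two verification steps (5 and 9) can be scripted. The following is a minimal sketch, not part of the test case: the helper names are hypothetical, the checkpoint path is the one quoted in step 5, and the score parser assumes that the accuracy in step 9 is the BLEU score from step 8 and that multi-bleu.perl prints a line beginning with `BLEU = <score>,`.

```python
import re
from pathlib import Path
from typing import Optional

# Path quoted in step 5; adjust it if the run directory differs.
EXPECTED_CKPT = Path("./run_distribute_train/ckpt_0/transformer-3_17573.ckpt")
# Threshold from step 9 (assumed here to be a BLEU score).
BLEU_THRESHOLD = 23.0

# Assumed output shape of multi-bleu.perl: "BLEU = 23.88, 55.1/29.7/..."
BLEU_PATTERN = re.compile(r"BLEU\s*=\s*([0-9]+(?:\.[0-9]+)?)")


def checkpoint_ready(ckpt_path: Path) -> bool:
    """Return True if the checkpoint file exists and is not empty."""
    return ckpt_path.is_file() and ckpt_path.stat().st_size > 0


def parse_bleu(output: str) -> Optional[float]:
    """Extract the first BLEU score from the scorer output, or None."""
    match = BLEU_PATTERN.search(output)
    return float(match.group(1)) if match else None


def accuracy_ok(output: str, threshold: float = BLEU_THRESHOLD) -> bool:
    """Return True if a score was found and it reaches the threshold."""
    score = parse_bleu(output)
    return score is not None and score >= threshold


if __name__ == "__main__":
    state = "found" if checkpoint_ready(EXPECTED_CKPT) else "missing or empty"
    print(f"checkpoint {state}: {EXPECTED_CKPT}")
```

Passing the captured output of step 8 to `accuracy_ok` returns a boolean that a wrapper script can turn into an exit code.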

Describe the expected behavior (Mandatory)

The network trains successfully.

Related log / screenshot (Mandatory)

[ERROR] FE(127810,python):2023-11-09-01:32:05.521.100 [get_attr_by_type.cc:73]135155 GetIntAttrValue:"[SubGraphOpt][PreCompileOp] Op [Cast] get int attr [dst_type] value failed."
[ERROR] FE(127810,python):2023-11-09-01:32:05.521.143 [tbe_info_assembler.cc:1578]135155 AssembleTbeInfo:"[SubGraphOpt][PreCompileOp][AssembleInput][Op Cast, type Cast]: failed to feedAttrsToTbeOpInfo."
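
To check whether a training log contains entries like the two above, the lines can be parsed and counted. This is a rough sketch based only on the two sample lines: the field names are my own, and reading the number after the source location as a thread id is an assumption.

```python
import re
from collections import Counter

# Layout inferred from the two sample lines only:
# [LEVEL] MODULE(pid,process):timestamp [file:line]number Function:"message"
LOG_LINE = re.compile(
    r"^\[(?P<level>[A-Z]+)\] (?P<module>\w+)\((?P<pid>\d+),(?P<proc>[^)]*)\):"
    r"(?P<timestamp>\S+) \[(?P<file>[^:\]]+):(?P<line>\d+)\](?P<tid>\d+) "
    r"(?P<func>\w+):(?P<message>.*)$"
)


def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def count_errors(lines):
    """Count ERROR entries, grouped by (module, reporting function)."""
    counts = Counter()
    for line in lines:
        record = parse_log_line(line)
        if record and record["level"] == "ERROR":
            counts[(record["module"], record["func"])] += 1
    return counts


# The two lines quoted in this report.
SAMPLE = [
    '[ERROR] FE(127810,python):2023-11-09-01:32:05.521.100 [get_attr_by_type.cc:73]135155 GetIntAttrValue:"[SubGraphOpt][PreCompileOp] Op [Cast] get int attr [dst_type] value failed."',
    '[ERROR] FE(127810,python):2023-11-09-01:32:05.521.143 [tbe_info_assembler.cc:1578]135155 AssembleTbeInfo:"[SubGraphOpt][PreCompileOp][AssembleInput][Op Cast, type Cast]: failed to feedAttrsToTbeOpInfo."',
]

if __name__ == "__main__":
    for (module, func), n in count_errors(SAMPLE).items():
        print(f"{module} {func}: {n}")
```

Lines that do not follow this layout (for example loss printouts) are skipped, so the counter can be fed a whole training log.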

Special notes for this issue (Optional)

Assign to 蔡福壁

Comments (7)

zhongjicheng created the Bug-Report
zhongjicheng added the attr/function label
zhongjicheng added the stage/func-debug label
zhongjicheng added the kind/bug label
zhongjicheng added the v2.3.0 label

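Please assign maintainer to check this issue.
@zhongjicheng

Thank you for your feedback. You can comment //mindspore-assistant to get help faster; for more labels, see the label list.

  1. If you are new to MindSpore, you may find the answer in the tutorials.
  2. If you are an experienced PyTorch user, you may need:
    Typical differences from PyTorch / PyTorch and MindSpore API mapping table
  3. If you run into a dynamic-graph (PyNative) problem, you can set mindspore.set_context(pynative_synchronize=True) to view the error stack and help locate it.
  4. For model accuracy tuning problems, refer to the tuning guide on the official website.
  5. If you are reporting a framework bug, please confirm that the issue provides the information needed to locate it: the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, the official link to the training code, and how to launch the code that reproduces the error.
  6. If you have already identified the root cause, you are welcome to submit a PR to the MindSpore open-source community; we will review it as soon as possible.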
xiangminshan edited the description
zhongjicheng set the assignee to caifubi
zhongjicheng set the milestone to B-SIG-PYNATIVE
zhongjicheng edited the description

This problem affects 50 test cases; escalating it to a critical issue is recommended.
test_ms_bert_large_boost_en_wiki_pynative_train_check_func_910_1p_0001
test_ms_bert_large_cn_news_pynative_train_check_perf_910_1p_0001
test_ms_dbnet_mobilenetv3_icdar2015_pynative_train_check_perf_910_gpu_1p_0001
test_ms_deeplabv3_vocaug_s16_pynative_train_check_loss_910_8p_0001
test_ms_inceptionv3_imagenet_pynative_train_check_loss_910_8p_0001
test_ms_inceptionv4_imagenet_pynative_train_check_loss_910_8p_0001
test_ms_pynative_dbnet_r18_icdar2015_ascend_check_fps_1p_0001
test_ms_pynative_dbnet_r50_icdar2015_ascend_check_fps_1p_0003
test_ms_resnet101_imagenet_pynative_train_check_loss_910_8p_0001
test_ms_resnet18_cifar10_pynative_train_check_perf_910_1p_0001
test_ms_resnet18_cifar10_pynative_train_infer_910_8p_0001
test_ms_resnet18_imagenet_pynative_train_check_loss_910_8p_0001
test_ms_resnet50_cifar10_pynative_train_check_perf_910_1p_0001
test_ms_resnet50_imagenet_pynative_train_check_loss_910_8p_0001
test_ms_resnext50_imagenet2012_pynative_train_check_loss_910_8p_0001
test_ms_retinanet_coco2017_pynative_train_check_loss_910_8p_0001
test_ms_se_resnet50_imagenet_pynative_train_check_loss_910_8p_0001
test_ms_ssd_resnet50_fpn_coco2017_pynative_train_check_loss_910_8p_0001
test_ms_ssd_vgg16_coco2017_pynative_train_check_loss_910_8p_0001
test_ms_unet2d_pengcheng_luna_ascend_train_infer_1p_0003
test_ms_usability_benchmark_pynative_ascend_alexnet_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_ctpn_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_deepfm_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_deeplabv3_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_deeptext_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_dqn_time_perf_1p_0001
test_ms_usability_benchmark_pynative_ascend_facerecognition_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_fasterrcnn_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_inceptionv4_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_lstm_sentimentnet_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_mobilenetv1_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_mobilenetv2_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_resnet_101_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_resnet_18_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_resnet_50_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_resnext50_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_retinanet_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_se_resnet50_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_shufflenetv1_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_ssd_mobilenetv1_fpn_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_ssd_resnet50_fpn_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_ssd_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_ssd_vgg16_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_transformer_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_u_net3d_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_unet2d_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_vgg16_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_wide_deep_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_ascend_yolov4_time_perf_loss_1p_0001
test_ms_usability_benchmark_pynative_unet_plus_time_perf_loss_1p_0001

The problem affects a large number of test cases, so the priority is raised to Critical.

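zhongjicheng changed the priority from Minor to Critical
caifubi added caifubi as a collaborator
caifubi changed the assignee from caifubi to linqingke
caifubi changed the milestone from B-SIG-PYNATIVE to B-SIG-ASCEND
linqingke changed the task status from TODO to VALIDATION
linqingke added linqingke as a collaborator
linqingke changed the assignee from linqingke to zhongjicheng
linqingke changed the milestone from B-SIG-ASCEND to B-SolutionTest
linqingke added the ctl/solutiontest label
linqingke added the rca/inf/msg label
linqingke added the rct/refactor label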

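Root cause: the attribute type required by the HiSilicon Cast operator has changed to int, and MindSpore needs to adapt to it.
Solution: force-convert the attribute type to int in the GE adapter.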

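#Appearance & Root Cause
The attribute type required by the HiSilicon Cast operator has changed to int, and MindSpore needs to adapt to it.
#Fix Solution
Force-convert the attribute type to int in the GE adapter.
relevant pr:
https://e.gitee.com/mind_spore/repos/mindspore/mindspore/pulls/61849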
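
The idea behind the fix can be illustrated with a small sketch. It is not the code from the PR and does not use any MindSpore or GE API: the enum, its numeric codes, and both function names are hypothetical, and the choice of `dst_type` as the affected attribute is an assumption based on the error log.

```python
from enum import Enum


class HypotheticalDType(Enum):
    """Stand-in for a framework dtype object; the numeric codes are invented."""
    FLOAT32 = 0
    FLOAT16 = 1
    INT32 = 3


def to_int_attr(value) -> int:
    """Coerce an attribute value to a plain int for a consumer that requires one."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise TypeError("bool is not a valid dtype code")
    if isinstance(value, int):
        return value
    if isinstance(value, Enum):
        return int(value.value)
    raise TypeError(f"cannot convert {type(value).__name__} to an int attribute")


def adapt_cast_attrs(attrs: dict) -> dict:
    """Return a copy of the Cast attributes with dst_type stored as an int."""
    adapted = dict(attrs)
    adapted["dst_type"] = to_int_attr(attrs["dst_type"])
    return adapted


if __name__ == "__main__":
    print(adapt_cast_attrs({"dst_type": HypotheticalDType.FLOAT16}))
```

The conversion happens once at the adapter boundary, so the rest of the framework can keep its own dtype representation while the operator receives the int it expects.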

Self-test Report & DT Review
Need to add ST/UT: No

i-robot added the gitee label

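Regression version:
runpkg_version: Milan_C17/20240124

mindspore: r2.3_20240124223721_b30b274f4143
Regression steps: follow the reproduction steps in this issue
Basic function: the network cannot run in PyNative mode

Test conclusion: the problem is tracked by issue #I8Z8LW: [ST][doc][tutorial][910A][pynative][ssd/generative_diffusion/vision_transformer] tutorials fail in PyNative mode on the 910 with ValueNode<ValueTuple> ()anf_node is not CNode
Regression tester: zhongjicheng
Regression date: 2024-01-27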
