Ascend / modelzoo

[ECNU] Save path not found during checkpoint persistence

DONE
Consultation
Created on 2020-11-09 21:14

The persistence step saves checkpoints into a pre_trained folder under the directory containing the run script. In the run script I expose the save folder path as an input parameter; the algorithm's parameter settings are shown below:
[Image: parameter settings]
The code directory is shown below (the entry script is ex_acm3025.py):
[Image: code directory]
During the run the load_data folder is found and its files are read successfully, yet the job fails reporting that the pre_trained folder cannot be found. The exact error is:

Traceback (most recent call last):
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: /home/work/modelarts/outputs/pre_trained; No such file or directory
	 [[{{node save/SaveV2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 1176, in save
    {self.saver_def.filename_tensor_name: checkpoint_file})
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: /home/work/modelarts/outputs/pre_trained; No such file or directory
	 [[node save/SaveV2 (defined at usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'save/SaveV2':
  File "home/work/user-job-dir/HAN-master/ex_acm3025.py", line 180, in <module>
    saver = tf.train.Saver()
  File "usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 828, in __init__
    self.build()
  File "usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 840, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 878, in _build
    build_restore=build_restore)
  File "usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 505, in _build_internal
    save_tensor = self._AddSaveOps(filename_tensor, saveables)
  File "usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 206, in _AddSaveOps
    save = self.save_op(filename_tensor, saveables)
  File "usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 122, in save_op
    tensors)
  File "usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_io_ops.py", line 1946, in save_v2
    name=name)
  File "usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/work/user-job-dir/HAN-master/ex_acm3025.py", line 257, in <module>
    saver.save(sess, checkpt_file)
  File "/usr/local/ma/python3.7/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 1193, in save
    raise exc
ValueError: Parent directory of /home/work/modelarts/outputs/pre_trained/acm_allMP_multi_fea.ckpt doesn't exist, can't save.
2020-11-09 12:11:47.501699: I tf_adapter/util/ge_plugin.cc:56] [GePlugin] destroy constructor begin
2020-11-09 12:11:47.501756: I tf_adapter/util/ge_plugin.cc:195] [GePlugin] Ge has already finalized.
2020-11-09 12:11:47.501765: I tf_adapter/util/ge_plugin.cc:58] [GePlugin] destroy constructor end

How should the save-path parameter be set on the ModelArts platform? Could you give concrete parameter settings for this code directory of mine? I can provide further code details if needed.

Attachments
main.py (5.90 KB) — uploaded by zhutian, 2020-11-10 11:51

Comments (10)

GKsama created this Consult
GKsama set the associated repository to Ascend/modelzoo
GKsama edited the description (×3)

Your error log shows that the file path is invalid when the native TF API loads the pretrained model file:
the path /home/work/modelarts/outputs/pre_trained cannot be found on the backend training server.
Files copied over from OBS land by default under /home/work/user-job-dir inside the ModelArts backend training image container, so our guess is that you never actually copied the pretrained model over. Please check on your side, thanks!
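
(A quick way to verify what actually landed in the container is to walk the filesystem at the top of the training script and print the result to the job log. A minimal sketch; the two roots listed are assumptions based on the default ModelArts layout described above:)

import os

# Print the directory tree under the usual ModelArts mount points, so the
# job log shows exactly which files were copied into the container.
for root in ("/home/work/user-job-dir", "/home/work/modelarts"):
    if not os.path.isdir(root):
        print("missing:", root)
        continue
    for dirpath, _, filenames in os.walk(root):
        print(dirpath)
        for name in filenames:
            print("   ", name)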


@zhengtao How do I copy files from OBS into /home/work/user-job-dir inside the ModelArts backend training image container when running a job? Specifying the training output directory when submitting the job still has no effect; the image below shows the output-directory mapping parameters of the most recent job.
[Image: job output mapping]
Training job ID: trainjob-3dcd | job84a2614f
All files in my local code directory have already been uploaded to OBS. How can I tell whether the pretrained model was successfully copied into the training container's directory? Alternatively, could you give the general steps and the parameters that need to be set when configuring the output path? Please advise!

tensorflow.python.framework.errors_impl.NotFoundError: /home/work/modelarts/outputs/pre_trained; No such file or directory
This error means the checkpoint path passed into your training script is /home/work/modelarts/outputs/pre_trained. Is that a path you set yourself?

Original stack trace for 'save/SaveV2':
  File "home/work/user-job-dir/HAN-master/ex_acm3025.py", line 180, in <module>
    saver = tf.train.Saver()
This part of the log shows that on ModelArts your ex_acm3025.py file sits under home/work/user-job-dir/HAN-master/; from it you can roughly infer the directory path and structure of your project after it is copied onto ModelArts. So could the pre_trained path actually be under home/work/user-job-dir/HAN-master/?
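
(Wherever the directory ends up, the final ValueError — "Parent directory ... doesn't exist, can't save" — can also be avoided by creating the save directory before calling saver.save. A minimal sketch against the asker's script; the --pretrain argument name and the dummy variable are assumptions for illustration, not the asker's confirmed code:)

import argparse
import os
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument('--pretrain', default='./pre_trained',
                    help='checkpoint directory (assumed argument name)')
args, _ = parser.parse_known_args()

# tf.train.Saver refuses to save when the parent directory is missing,
# so create it first.
os.makedirs(args.pretrain, exist_ok=True)
checkpt_file = os.path.join(args.pretrain, 'acm_allMP_multi_fea.ckpt')

w = tf.Variable(0, name='w')  # stand-in for the real model variables
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, checkpt_file)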

You can also copy with mox:
Add a run parameter checkpoint_url holding the model's storage location, e.g. s3://ms-course (bucket name)/deeplabv3_example/checkpoint/.
The script needs to parse this argument into the args_opt variable so that the later code can use it:

parser = argparse.ArgumentParser(description="deeplabv3 training")
parser.add_argument('--checkpoint_url', default=None, help='Checkpoint path')
args_opt = parser.parse_args()

MindSpore does not yet provide an interface for accessing OBS data directly; you have to interact with OBS through the moxing framework bundled with ModelArts. Copy the dataset and checkpoint stored in OBS into the execution container:

import moxing as mox

mox.file.copy_parallel(src_url=args_opt.checkpoint_url, dst_url='checkpoint/')

Model training then uses the dataset and checkpoint that were copied into the execution container:

data_path = "./voc2012"
train_checkpoint_path = "./checkpoint/deeplabv3_train_14-1_1.ckpt"  # pretrained ckpt

For concrete usage, see: main.py

To copy training outputs (such as model checkpoint files) from the execution container back to OBS, refer to the following.

dst_url takes the form 's3://OBS/PATH'; once the checkpoint has been copied to OBS, it will appear under the args_opt.train_url directory in OBS:

import os  # needed for os.path.join below
import moxing
moxing.file.copy_parallel(src_url='checkpoint_deeplabv3-6_732.ckpt',
                          dst_url=os.path.join(args_opt.train_url, 'checkpoint_deeplabv3-6_732.ckpt'))
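
(For the asker's TensorFlow script the same pattern would look roughly as follows: save the checkpoint to a local directory inside the container, then copy that directory to the OBS output path. A sketch only; '/cache/pre_trained', the dummy variable, and the --train_url argument are assumptions, not confirmed settings:)

import argparse
import os
import tensorflow as tf
import moxing as mox

parser = argparse.ArgumentParser()
parser.add_argument('--train_url', default=None, help='OBS output path (assumed name)')
args_opt, _ = parser.parse_known_args()

# Save into a local scratch directory that the job can always write to.
local_ckpt_dir = '/cache/pre_trained'
os.makedirs(local_ckpt_dir, exist_ok=True)
checkpt_file = os.path.join(local_ckpt_dir, 'acm_allMP_multi_fea.ckpt')

w = tf.Variable(0, name='w')  # stand-in for the real model variables
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, checkpt_file)

# After training, push the whole checkpoint directory back to OBS in one call.
if args_opt.train_url:
    mox.file.copy_parallel(src_url=local_ckpt_dir, dst_url=args_opt.train_url)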

zhutian uploaded the attachment main.py


@zhutian
Here is the situation. In the main file of this project I defined three input parameters (Figure 1 below).
--loaddata is for dataset loading. Through the algorithm's input channel I set its mapped path to /home/work/modelarts/inputs/load_data (a path generated by the platform, Figure 2), and when creating the training job I also pointed this parameter at the corresponding file location in OBS, namely /hannet/HAN-master/load_data/ (Figure 4). This folder and the data inside it can be found.
The other parameter, --pretrain, specifies where the checkpoints are stored (the concrete ckpt path is in Figure 3). Likewise, when configuring the algorithm I used an output channel to set the mapped path to /home/work/modelarts/outputs/pre_trained (Figure 2), and the training-job parameter maps to /hannet/HAN-master/pre_trained/, yet during training the location given by --pretrain cannot be found.
Originally the input could not be found either. I once used os functions to walk the directory tree under the executing script, but got only the script's own directory, with no other files or folders underneath, so I suspected the data was stored somewhere else on the platform. Once the algorithm's input channel was configured, the data-input problem was solved, but now the model save location cannot be found. So do we now need to set the output path to a fixed absolute path?
[Figure 1]
[Figure 2]
[Figure 3]
[Figure 4]
Finally, could you enable read permission on the main.py file you mentioned? Thanks!

This main.py copies files from the OBS bucket straight into the ModelArts runtime environment:

import argparse
from mindspore import context
from mindspore.communication.management import init
from mindspore.nn.optim.momentum import Momentum
from mindspore import Model, ParallelMode
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from mindspore.train.callback import Callback, CheckpointConfig, ModelCheckpoint, TimeMonitor
from src.md_dataset import create_dataset
from src.losses import OhemLoss
from src.deeplabv3 import deeplabv3_resnet50
from src.config import config
from src.miou_precision import MiouPrecision

parser = argparse.ArgumentParser(description="Deeplabv3 training")
parser.add_argument("--distribute", type=str, default="false", help="Run distribute, default is false.")
parser.add_argument('--data_url', required=True, default=None, help='Train data url')
parser.add_argument('--train_url', required=True, default=None, help='Train data output url')
parser.add_argument('--checkpoint_url', default=None, help='Checkpoint path')
args_opt = parser.parse_args()
print(args_opt)
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")  # no need to specify DEVICE_ID

data_path = "./voc2012"
train_checkpoint_path = "./checkpoint/deeplabv3_train_14-1_1.ckpt"  # pretrained ckpt
eval_checkpoint_path = "./checkpoint_deeplabv3-%s_732.ckpt" % config.epoch_size  # ckpt saved at the end of training


class LossCallBack(Callback):
    """
    Monitor the loss in training.
    Note:
        If per_print_times is 0, do not print loss.
    Args:
        per_print_times (int): Print loss every times. Default: 1.
    """
    def __init__(self, per_print_times=1):
        super(LossCallBack, self).__init__()
        if not isinstance(per_print_times, int) or per_print_times < 0:
            raise ValueError("print_step must be int and >= 0")
        self._per_print_times = per_print_times

    def step_end(self, run_context):
        cb_params = run_context.original_args()
        print("epoch: {}, step: {}, outputs are {}".format(cb_params.cur_epoch_num, cb_params.cur_step_num,
                                                           str(cb_params.net_outputs)))


def model_fine_tune(flags, train_net, fix_weight_layer):
    path = flags.checkpoint_url
    if path is None:
        return
    path = train_checkpoint_path  # note: loads the local copy made by mox.file.copy_parallel below
    param_dict = load_checkpoint(path)
    load_param_into_net(train_net, param_dict)
    for para in train_net.trainable_params():
        if fix_weight_layer in para.name:
            para.requires_grad = False


if __name__ == "__main__":
    if args_opt.distribute == "true":
        context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL, mirror_mean=True)
        init()
    args_opt.base_size = config.crop_size
    args_opt.crop_size = config.crop_size

    import moxing as mox
    mox.file.copy_parallel(src_url=args_opt.data_url, dst_url='voc2012/')
    mox.file.copy_parallel(src_url=args_opt.checkpoint_url, dst_url='checkpoint/')

    # train
    train_dataset = create_dataset(args_opt, data_path, config.epoch_size, config.batch_size, usage="train")
    dataset_size = train_dataset.get_dataset_size()
    time_cb = TimeMonitor(data_size=dataset_size)
    callback = [time_cb, LossCallBack()]
    if config.enable_save_ckpt:
        config_ck = CheckpointConfig(save_checkpoint_steps=config.save_checkpoint_steps,
                                     keep_checkpoint_max=config.save_checkpoint_num)
        ckpoint_cb = ModelCheckpoint(prefix='checkpoint_deeplabv3', config=config_ck)
        callback.append(ckpoint_cb)
    net = deeplabv3_resnet50(config.seg_num_classes, [config.batch_size, 3, args_opt.crop_size, args_opt.crop_size],
                             infer_scale_sizes=config.eval_scales, atrous_rates=config.atrous_rates,
                             decoder_output_stride=config.decoder_output_stride, output_stride=config.output_stride,
                             fine_tune_batch_norm=config.fine_tune_batch_norm, image_pyramid=config.image_pyramid)
    net.set_train()
    model_fine_tune(args_opt, net, 'layer')
    loss = OhemLoss(config.seg_num_classes, config.ignore_label)
    opt = Momentum(filter(lambda x: 'beta' not in x.name and 'gamma' not in x.name and 'depth' not in x.name and 'bias' not in x.name, net.trainable_params()), learning_rate=config.learning_rate, momentum=config.momentum, weight_decay=config.weight_decay)
    model = Model(net, loss, opt)
    model.train(config.epoch_size, train_dataset, callback)

    # eval
    eval_dataset = create_dataset(args_opt, data_path, config.epoch_size, config.batch_size, usage="eval")
    net = deeplabv3_resnet50(config.seg_num_classes, [config.batch_size, 3, args_opt.crop_size, args_opt.crop_size],
                             infer_scale_sizes=config.eval_scales, atrous_rates=config.atrous_rates,
                             decoder_output_stride=config.decoder_output_stride, output_stride=config.output_stride,
                             fine_tune_batch_norm=config.fine_tune_batch_norm, image_pyramid=config.image_pyramid)
    param_dict = load_checkpoint(eval_checkpoint_path)
    load_param_into_net(net, param_dict)
    mIou = MiouPrecision(config.seg_num_classes)
    metrics = {'mIou': mIou}
    loss = OhemLoss(config.seg_num_classes, config.ignore_label)
    model = Model(net, loss, metrics=metrics)
    model.eval(eval_dataset)
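
(For reference, on ModelArts this script would be launched with run parameters along these lines; the bucket paths below are placeholders, not the asker's actual settings:)

python main.py --data_url=s3://ms-course/deeplabv3_example/voc2012/ \
               --train_url=s3://ms-course/deeplabv3_example/output/ \
               --checkpoint_url=s3://ms-course/deeplabv3_example/checkpoint/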

Do you have the complete log from the ModelArts run? Check whether it contains file-copy entries, to see whether the copy failed.


@zhutian The most recent log is here

The log doesn't show why. Go ahead and file a ModelArts support ticket directly; say that you configured the paths according to the algorithm job settings, but the code cannot find the corresponding directory when reading.

zhengtao set the assignee to zhutian

The ticket has been submitted; waiting for a reply. Thanks!

Has the problem been resolved? Thanks.

王位 changed the task status from ACCEPTED to DONE
吴定远 changed the associated repository from Ascend/modelzoo-his to Ascend/modelzoo
