Tensorflow OOM after freeze graph

Problem description

I'm running a seq2seq model with TensorFlow. The inference program runs fine when it loads parameters from a checkpoint file with tf.train.Saver. But after exporting the graph with freeze_graph.py (which uses tf.framework.graph_util.convert_variables_to_constants()) and importing it with tf.import_graph_def in the inference program, I get an OOM error.
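For reference, the freeze-and-import path described above can be sketched on a toy graph. This is only an illustration of the same API calls (convert_variables_to_constants followed by import_graph_def) using the TF 1.x-style API; the graph and names here are made up, not the asker's seq2seq model:

```python
import tensorflow.compat.v1 as tf  # TF 1.x-style API; also works on TF 2 installs

tf.disable_eager_execution()

# Build a toy graph with one variable, standing in for the real model.
with tf.Graph().as_default() as g:
    x = tf.placeholder(tf.float32, [None, 3], name="input")
    w = tf.Variable(tf.zeros([3, 1]), name="w")
    y = tf.identity(tf.matmul(x, w), name="output")
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # This is the core of what freeze_graph.py does:
        # every variable is baked into the GraphDef as a Const node.
        frozen = tf.graph_util.convert_variables_to_constants(
            sess, g.as_graph_def(), ["output"])

# Import the frozen GraphDef into a fresh graph, as the inference program does.
with tf.Graph().as_default():
    tf.import_graph_def(frozen, name="")
```

Note that after freezing, every weight lives in the GraphDef as a Const op, which is exactly the kind of node (`AttnV_0 = Const[...]`) the allocator fails on in the log below.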

Here is part of the error log:

W tensorflow/core/common_runtime/bfc_allocator.cc:274] ****************************************************************************************************
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 4.0KiB.  See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:983] Internal: Dst tensor is not initialized.
E tensorflow/core/common_runtime/executor.cc:594] Executor failed to create kernel. Internal: Dst tensor is not initialized.
     [[Node: embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/AttnV_0 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [1024] values: -0.016628871 -0.2054652 -0.045054652...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Traceback (most recent call last):
  File "inference.py", line 88, in console_main
    result = list(inference(source_sentence))
  File "inference.py", line 54, in inference
    for sequence in result:
  File "/data/experiment/decoder.py", line 115, in search_best_sequence
    State.batch_predict(self.session, self.model, self.context, beam)
  File "/data/experiment/decoder.py", line 82, in batch_predict
    state_list[0].depth)
  File "/data/experiment/seq2seq_model.py", line 452, in batch_feed_decoder
    log_softmax, attns, state = session.run(output_fetch, input_feed)
  File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 966, in _run
    feed_dict_string, options, run_metadata)
  File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1016, in _do_run
    target_list, options, run_metadata)
  File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1036, in _do_call
    raise type(e)(node_def, op, message)
InternalError: Dst tensor is not initialized.
     [[Node: embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/AttnV_0 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [1024] values: -0.016628871 -0.2054652 -0.045054652...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Caused by op u'embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/AttnV_0', defined at:
  File "inference.py", line 169, in <module>
    tf.app.run()
  File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "inference.py", line 165, in main
    console_main(session)
  File "inference.py", line 66, in console_main
    model = create_model(session, False)
  File "/data/experiment/model.py", line 145, in create_model
    tensor_name_pickle=tensor_name_pickle)
  File "/data/experiment/seq2seq_model.py", line 106, in __init__
    tf.import_graph_def(graph_def, name="")
  File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/framework/importer.py", line 287, in import_graph_def
    op_def=op_def)
  File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/.conda/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
    self._traceback = _extract_stack()

InternalError (see above for traceback): Dst tensor is not initialized.
     [[Node: embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/AttnV_0 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [1024] values: -0.016628871 -0.2054652 -0.045054652...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

I thought it might be caused by a memory issue with tf.Constant. Has anyone run into this problem?

Answer

I had the same issue, but when trying to load and run inference from a C++ application using the C API. After a lot of twiddling and testing, the culprit appeared to be the frozen graph and freeze_graph.py itself. It's probably a bug of some kind. There are actually multiple issue reports on GitHub's TF repo, but they were closed due to inactivity, e.g. here and here. I guess apparent bugs in model freezing aren't a priority.

In my case the model's .pb file was around 500 MB, and it consumed around 10 GB of RAM while running a session. Not only did it occupy an insane amount of RAM, it was also orders of magnitude slower that way.

When I switched to loading just a SavedModel directory, everything went back to normal. I'm not sure how to achieve that in Python, but for C code I replaced the TF_GraphImportGraphDef() call with TF_LoadSessionFromSavedModel().
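In Python, the equivalent switch would be to export a SavedModel directory and load it with the TF 1.x loader API instead of importing a frozen GraphDef. A minimal sketch on a toy graph, assuming the TF 1.x-style API (the tensor names `input`/`output` and the temp directory are illustrative, not from the question):

```python
import os
import tempfile

import numpy as np
import tensorflow.compat.v1 as tf  # TF 1.x-style API; also works on TF 2 installs

tf.disable_eager_execution()

export_dir = os.path.join(tempfile.mkdtemp(), "toy_saved_model")

# Export a toy graph as a SavedModel (stand-in for the real seq2seq model).
with tf.Graph().as_default():
    x = tf.placeholder(tf.float32, [None, 4], name="input")
    w = tf.Variable(tf.ones([4, 2]), name="w")
    y = tf.matmul(x, w, name="output")
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        tf.saved_model.simple_save(sess, export_dir,
                                   inputs={"input": x}, outputs={"output": y})

# Load the SavedModel directory instead of importing a frozen GraphDef.
with tf.Graph().as_default():
    with tf.Session() as sess:
        tf.saved_model.loader.load(
            sess, [tf.saved_model.tag_constants.SERVING], export_dir)
        out = sess.run("output:0", {"input:0": np.ones([1, 4], np.float32)})
```

The key difference from the frozen-graph path is that the SavedModel keeps the weights in checkpoint variables restored by the loader, rather than baking them into the GraphDef as Const nodes.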

I used TF v1.14.0. I built the library with Bazel myself rather than using the stock version. I could provide some details here and there if anybody is interested; I'm just not sure where to start, since I went through many trials and errors.
