Memory Leak in TensorFlow Google Cloud ML Training

Problem Description

I've been trying the TensorFlow tutorial scripts on Google Cloud ML. In particular I've used the cifar10 CNN tutorial scripts at https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10.

When I run this training script in Google Cloud ML, there is a memory leak of around 0.5% per hour.

I have not made any changes to the scripts other than packaging them into the required GCP format (as described in https://cloud.google.com/ml-engine/docs/how-tos/packaging-trainer) and setting the data location to the storage bucket containing the .bin data files.

If I run locally i.e. not in Google Cloud, and use TCMALLOC, by setting LD_PRELOAD="/usr/lib/libtcmalloc.so", the memory leak is resolved. However, I do not have this option with Google Cloud ML.
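For reference, the local tcmalloc workaround can be sketched as a small wrapper script. The path /usr/lib/libtcmalloc.so comes from the question; the other candidate paths are assumptions (they vary by distro, e.g. Debian/Ubuntu install under /usr/lib/x86_64-linux-gnu via the gperftools packages), and the script prints the command rather than launching training directly:

```shell
# Locate a tcmalloc shared library, since the filename varies by distro.
find_tcmalloc() {
  for p in /usr/lib/libtcmalloc.so \
           /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 \
           /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4; do
    if [ -f "$p" ]; then
      echo "$p"
      return 0
    fi
  done
  return 1
}

if TCMALLOC=$(find_tcmalloc); then
  # Preload tcmalloc only for the training process, not the whole shell.
  echo "LD_PRELOAD=$TCMALLOC python cifar10_multi_gpu_train.py --num_gpus=4"
else
  echo "libtcmalloc not found; install gperftools first" >&2
fi
```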

What could be causing the leak, and what can I do to fix this? Why aren't other users noticing the same problem? Although the leak is small, it is big enough to cause my training sessions to run out of memory and fail, when I run against my own data for several days. The leak happens regardless of the number of GPUs I use.
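One way to confirm and quantify a slow leak like the ~0.5% per hour figure above is to log resident memory from inside the training process. A minimal stdlib-only sketch, not part of the original scripts (Linux reports ru_maxrss in kilobytes):

```python
import resource

def rss_mb():
    """Peak resident set size of this process, in MB (Linux reports KB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

def growth_percent(start_mb, now_mb):
    """Percentage growth relative to the starting measurement."""
    return 100.0 * (now_mb - start_mb) / start_mb

# Record a baseline once, then log periodically inside the training loop:
start = rss_mb()
print(f"RSS {rss_mb():.1f} MB, growth {growth_percent(start, rss_mb()):.2f}%")
```

Since ru_maxrss is a peak value it only ever grows, which is exactly the signal that matters here: steady growth over hours of steady-state training points to a leak.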

The gcloud command I used is:

gcloud ml-engine jobs submit training cifar10_job \
  --job-dir gs://tfoutput/joboutput \
  --package-path trainer \
  --module-name=trainer.cifar10_multi_gpu_train \
  --region europe-west1 \
  --staging-bucket gs://tfoutput \
  --scale-tier CUSTOM \
  --config config.yml \
  --runtime-version 1.0 \
  -- --num_gpus=4

The config file (config.yml) is:

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu

Any help appreciated, thanks.

Recommended Answer

We recommend using this version of the code:

github.com/tensorflow/models/pull/1538

It has performance benefits: by running for less time, you're less prone to OOMs.

That, of course, may not be the permanent fix; however, according to our testing, TensorFlow 1.2 appears to address the issue. TensorFlow 1.2 will be available soon on CloudML Engine. If you continue to have problems, please let us know.
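Once 1.2 is available, pinning it is just a change to the --runtime-version flag already present in the question's command. This is a sketch, not a tested submission: the value 1.2 assumes the rollout mentioned above has happened, and the bucket paths and module name are copied from the question:

```shell
gcloud ml-engine jobs submit training cifar10_job \
  --job-dir gs://tfoutput/joboutput \
  --package-path trainer \
  --module-name=trainer.cifar10_multi_gpu_train \
  --region europe-west1 \
  --staging-bucket gs://tfoutput \
  --scale-tier CUSTOM \
  --config config.yml \
  --runtime-version 1.2 \
  -- --num_gpus=4
```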
