Memory Leak in TensorFlow Google Cloud ML Training

Problem Description

I've been trying the TensorFlow tutorial scripts on Google Cloud ML. In particular I've used the cifar10 CNN tutorial scripts at https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10.

When I run this training script in Google Cloud ML, there is a memory leak of around 0.5% per hour.

I have not made any changes to the scripts other than packaging them into the required GCP format (as described in https://cloud.google.com/ml-engine/docs/how-tos/packaging-trainer) and setting the data location to the storage bucket containing the .bin data files.

If I run locally i.e. not in Google Cloud, and use TCMALLOC, by setting LD_PRELOAD="/usr/lib/libtcmalloc.so", the memory leak is resolved. However, I do not have this option with Google Cloud ML.
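For reference, the local tcmalloc workaround can be sketched as a small wrapper script. The path /usr/lib/libtcmalloc.so comes from the question; the other candidate paths are assumptions (they vary by distro, e.g. Debian/Ubuntu install under /usr/lib/x86_64-linux-gnu via the gperftools packages), and the script prints the command rather than launching training directly:

```shell
# Locate a tcmalloc shared library, since the filename varies by distro.
find_tcmalloc() {
  for p in /usr/lib/libtcmalloc.so \
           /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 \
           /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4; do
    if [ -f "$p" ]; then
      echo "$p"
      return 0
    fi
  done
  return 1
}

if TCMALLOC=$(find_tcmalloc); then
  # Preload tcmalloc only for the training process, not the whole shell.
  echo "LD_PRELOAD=$TCMALLOC python cifar10_multi_gpu_train.py --num_gpus=4"
else
  echo "libtcmalloc not found; install gperftools first" >&2
fi
```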

What could be causing the leak, and what can I do to fix this? Why aren't other users noticing the same problem? Although the leak is small, it is big enough to cause my training sessions to run out of memory and fail, when I run against my own data for several days. The leak happens regardless of the number of GPUs I use.
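One way to confirm and quantify a slow leak like the ~0.5% per hour figure above is to log resident memory from inside the training process. A minimal stdlib-only sketch, not part of the original scripts (Linux reports ru_maxrss in kilobytes):

```python
import resource

def rss_mb():
    """Peak resident set size of this process, in MB (Linux reports KB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

def growth_percent(start_mb, now_mb):
    """Percentage growth relative to the starting measurement."""
    return 100.0 * (now_mb - start_mb) / start_mb

# Record a baseline once, then log periodically inside the training loop:
start = rss_mb()
print(f"RSS {rss_mb():.1f} MB, growth {growth_percent(start, rss_mb()):.2f}%")
```

Since ru_maxrss is a peak value it only ever grows, which is exactly the signal that matters here: steady growth over hours of steady-state training points to a leak.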

The gcloud command I used is:

gcloud ml-engine jobs submit training cifar10_job \
  --job-dir gs://tfoutput/joboutput \
  --package-path trainer \
  --module-name=trainer.cifar10_multi_gpu_train \
  --region europe-west1 \
  --staging-bucket gs://tfoutput \
  --scale-tier CUSTOM \
  --config config.yml \
  --runtime-version 1.0 \
  -- --num_gpus=4

The config file (config.yml) is:

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu

Any help appreciated, thanks.

Recommended Answer

We recommend using this version of the code:

github.com/tensorflow/models/pull/1538

It has performance benefits: by running for less time, you're less prone to OOMs.

That, of course, may not be the permanent fix; however, according to our testing, TensorFlow 1.2 appears to address the issue. TensorFlow 1.2 will be available soon on CloudML Engine. If you continue to have problems, please let us know.
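Once 1.2 is available, pinning it is just a change to the --runtime-version flag already present in the question's command. This is a sketch, not a tested submission: the value 1.2 assumes the rollout mentioned above has happened, and the bucket paths and module name are copied from the question:

```shell
gcloud ml-engine jobs submit training cifar10_job \
  --job-dir gs://tfoutput/joboutput \
  --package-path trainer \
  --module-name=trainer.cifar10_multi_gpu_train \
  --region europe-west1 \
  --staging-bucket gs://tfoutput \
  --scale-tier CUSTOM \
  --config config.yml \
  --runtime-version 1.2 \
  -- --num_gpus=4
```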
