GPU PoolAllocator explodes the CPU memory

Problem description

I made a TensorFlow model with relatively common operations (apart from a couple of tf.where calls and some index handling), but I call it with widely varying input shapes (many undefined tensor shapes in the model).
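To make the setup concrete, here is a minimal sketch of that kind of graph (not the actual model; the placeholder and tensor names are made up): both input dimensions are left undefined, a tf.where call extracts indices, and every run feeds a differently shaped batch.

# Minimal sketch (assumed TF 1.x, hypothetical names): undefined shapes + tf.where,
# fed with inputs whose shape changes on every step.
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, None], name="x")  # both dims undefined
mask = x > 0.0
positive_indices = tf.where(mask)              # indices of the True elements
positive_values = tf.gather_nd(x, positive_indices)
loss = tf.reduce_sum(positive_values)

with tf.Session() as sess:
    for step in range(3):
        # Feed a differently shaped batch every step.
        batch = np.random.randn(np.random.randint(2, 10),
                                np.random.randint(2, 10)).astype(np.float32)
        print(sess.run(loss, feed_dict={x: batch}))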

Everything works fine on the CPU. But when I use the GPU, the RAM usage (the CPU memory, not the GPU memory) steadily increases until it fills the machine's 256 GB and the process kills itself.

During the process, I get the usual messages:

2017-03-17 16:42:22.366601: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 18347 get requests, put_count=18345 evicted_count=1000 eviction_rate=0.0545108 and unsatisfied allocation rate=0.0763068
2017-03-17 16:42:22.366680: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 4385 to 4823

As far as I understand, this is the pool allocator for some DMA memory for the GPU. The problem is that it never seems to be satisfied with the eviction rate it gets, and it never stops allocating more space for itself.

Is this normal behavior? Are there ways to control this? Right now, I cannot train a model for longer than an hour before running out of memory.

Note: I use the nightly build of TF because of some bug fixes necessary for my current model to run. Also, no operations are added during training because I called tf.get_default_graph().finalize().

EDIT: I tried running with tcmalloc instead of malloc. It did not help. I also used the memory profiler, and it does not report a memory leak: tcmalloc's reported usage stabilizes at 500 MB, even though the memory usage shown in top is far higher and the program eventually runs OOM. So why does the tcmalloc profiler not agree with the memory usage I see in top?

EDIT 2: I recompiled TF with changed hardcoded params to make it "work". See here.

Solution

This specific problem was solved some time ago by the TF team when they changed their memory allocator (see the corresponding issue on GitHub).

If you encounter memory growth during training, a common mistake is that nodes are being added to the graph during training (TF is not numpy; unless you use eager execution, every such call grows the graph). Make sure to call graph.finalize() before your training loop to ensure no nodes are added during the training process; this catches many memory-growth issues.
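For illustration, here is a minimal TF 1.x-style sketch (an assumed setup, not the asker's model) of finalizing the default graph before the training loop; any attempt to add a node afterwards raises a RuntimeError instead of silently growing the graph and the memory.

# Minimal sketch (assumed TF 1.x): freeze the graph before training.
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 1])
y = tf.placeholder(tf.float32, shape=[None, 1])
pred = tf.layers.dense(x, 1)
loss = tf.reduce_mean(tf.square(pred - y))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
init = tf.global_variables_initializer()

graph = tf.get_default_graph()
graph.finalize()  # no new ops can be added past this point

with tf.Session(graph=graph) as sess:
    sess.run(init)
    for step in range(100):
        batch_x = np.random.randn(32, 1).astype(np.float32)
        batch_y = 2.0 * batch_x
        sess.run(train_op, feed_dict={x: batch_x, y: batch_y})
        # Calling e.g. tf.constant(0.0) inside this loop would now raise a
        # RuntimeError because the graph is finalized, instead of leaking ops.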
