How to interpret PoolAllocator messages in TensorFlow?

Question

While training a TensorFlow seq2seq model, I see the following messages:

W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 27282 get requests, put_count=9311 evicted_count=1000 eviction_rate=0.1074 and unsatisfied allocation rate=0.699032
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:239] Raising pool_size_limit_ from 100 to 110
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 13715 get requests, put_count=14458 evicted_count=10000 eviction_rate=0.691659 and unsatisfied allocation rate=0.675684
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:239] Raising pool_size_limit_ from 110 to 121
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 6965 get requests, put_count=6813 evicted_count=5000 eviction_rate=0.733891 and unsatisfied allocation rate=0.741421
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:239] Raising pool_size_limit_ from 133 to 146
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 44 get requests, put_count=9058 evicted_count=9000 eviction_rate=0.993597 and unsatisfied allocation rate=0
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 46 get requests, put_count=9062 evicted_count=9000 eviction_rate=0.993158 and unsatisfied allocation rate=0
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 4 get requests, put_count=1029 evicted_count=1000 eviction_rate=0.971817 and unsatisfied allocation rate=0
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 2 get requests, put_count=1030 evicted_count=1000 eviction_rate=0.970874 and unsatisfied allocation rate=0
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 44 get requests, put_count=6074 evicted_count=6000 eviction_rate=0.987817 and unsatisfied allocation rate=0
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 12 get requests, put_count=6045 evicted_count=6000 eviction_rate=0.992556 and unsatisfied allocation rate=0
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 2 get requests, put_count=1042 evicted_count=1000 eviction_rate=0.959693 and unsatisfied allocation rate=0
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 44 get requests, put_count=6093 evicted_count=6000 eviction_rate=0.984737 and unsatisfied allocation rate=0
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 4 get requests, put_count=1069 evicted_count=1000 eviction_rate=0.935454 and unsatisfied allocation rate=0
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 17722 get requests, put_count=9036 evicted_count=1000 eviction_rate=0.110668 and unsatisfied allocation rate=0.550615
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:239] Raising pool_size_limit_ from 792 to 871
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 6 get requests, put_count=1093 evicted_count=1000 eviction_rate=0.914913 and unsatisfied allocation rate=0
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 6 get requests, put_count=1101 evicted_count=1000 eviction_rate=0.908265 and unsatisfied allocation rate=0
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 3224 get requests, put_count=4684 evicted_count=2000 eviction_rate=0.426985 and unsatisfied allocation rate=0.200062
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:239] Raising pool_size_limit_ from 1158 to 1273
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 17794 get requests, put_count=17842 evicted_count=9000 eviction_rate=0.504428 and unsatisfied allocation rate=0.510228
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:239] Raising pool_size_limit_ from 1400 to 1540
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 31 get requests, put_count=1185 evicted_count=1000 eviction_rate=0.843882 and unsatisfied allocation rate=0
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 40 get requests, put_count=8209 evicted_count=8000 eviction_rate=0.97454 and unsatisfied allocation rate=0
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 0 get requests, put_count=2272 evicted_count=2000 eviction_rate=0.880282 and unsatisfied allocation rate=-nan
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 0 get requests, put_count=2362 evicted_count=2000 eviction_rate=0.84674 and unsatisfied allocation rate=-nan
W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] PoolAllocator: After 38 get requests, put_count=5436 evicted_count=5000 eviction_rate=0.919794 and unsatisfied allocation rate=0

What do these messages mean? Do they indicate a resource allocation problem? I am running on a Titan X GPU (3500+ CUDA cores, 12 GB of memory).

Solution

TensorFlow has multiple memory allocators, for memory that will be used in different ways. Their behavior has some adaptive aspects.

In your particular case, since you're using a GPU, there is a PoolAllocator for CPU memory that is pre-registered with the GPU for fast DMA. For example, a tensor that is expected to be transferred from CPU to GPU will be allocated from this pool.

A PoolAllocator tries to amortize the cost of calling a more expensive underlying allocator by keeping a pool of chunks that were allocated and then freed, so that they are eligible for immediate reuse. By default the pool grows slowly until the eviction rate drops below some constant. (The eviction rate is the proportion of free calls in which an unused chunk is returned from the pool to the underlying allocator so that the pool does not exceed its size limit.) The "Raising pool_size_limit_" lines in the log above show the pool size limit growing. Assuming your program reaches a steady state with some maximum collection of chunks that it needs, the pool will grow to accommodate that collection and then stop growing. The allocator behaves this way, rather than simply retaining every chunk ever allocated, so that sizes needed only rarely, or only during program startup, are less likely to be retained in the pool.
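The growth behavior described above can be imitated with a small toy model. This is a sketch under stated assumptions, not TensorFlow's implementation: the class `ToyPool`, the function `simulate`, and the threshold `TOLERABLE_RATE` are hypothetical. Only two details are read off the log in the question: the limit grows by about 10% each time (100 to 110 to 121), and the warnings appear at multiples of 1000 evictions.

```python
class ToyPool:
    """Toy model of a size-limited pool of free chunks (chunk sizes ignored)."""

    CHECK_INTERVAL = 1000   # evictions between resize checks (read off the log)
    TOLERABLE_RATE = 0.002  # assumed threshold, chosen only for illustration

    def __init__(self, limit=100):
        self.limit = limit
        self.free = 0            # chunks currently held in the pool
        self.history = [limit]   # every value the limit has taken
        self._reset_counters()

    def _reset_counters(self):
        self.gets = self.puts = self.evicted = self.unsatisfied = 0

    def get(self):
        """Request a chunk; served from the pool if one is free."""
        self.gets += 1
        if self.free:
            self.free -= 1
        else:
            self.unsatisfied += 1  # would fall through to the underlying allocator

    def put(self):
        """Free a chunk; keep it unless the pool is already at its limit."""
        self.puts += 1
        if self.free < self.limit:
            self.free += 1
            return
        self.evicted += 1  # chunk goes back to the underlying allocator
        if self.evicted % self.CHECK_INTERVAL == 0:
            self._maybe_grow()

    def _maybe_grow(self):
        eviction_rate = self.evicted / self.puts
        unsatisfied_rate = self.unsatisfied / self.gets if self.gets else 0.0
        if eviction_rate > self.TOLERABLE_RATE and unsatisfied_rate > self.TOLERABLE_RATE:
            self.limit = self.limit * 11 // 10  # grow by 10%, rounded down
            self.history.append(self.limit)
            self._reset_counters()


def simulate(working_set=300, steps=400):
    """Steady-state workload: each step takes `working_set` chunks, then frees them."""
    pool = ToyPool()
    for _ in range(steps):
        for _ in range(working_set):
            pool.get()
        for _ in range(working_set):
            pool.put()
    return pool


if __name__ == "__main__":
    print(simulate().history)
```

With a steady working set of 300 chunks, the limit in this model climbs in 10% steps and stops at the first value that covers the working set. After that point no chunk is evicted, so no further resize check is triggered.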

These messages should only be a cause for concern if you run out of memory. In such a case the log messages may help diagnose the problem. Note also that peak execution speed may only be attained after the memory pools have grown to the proper size.
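When the messages are used for diagnosis, it can help to pull the counters out of each line. The sketch below assumes the exact message format shown in the question; the helper name `parse_pool_line` and the dictionary keys are hypothetical. It also shows that the logged eviction_rate agrees with evicted_count divided by put_count, and it maps "-nan" to `None`, which in the log appears only on lines with 0 get requests.

```python
import re

# Pattern for the warning lines exactly as they appear in the question's log.
LINE_RE = re.compile(
    r"After (?P<gets>\d+) get requests, put_count=(?P<puts>\d+) "
    r"evicted_count=(?P<evicted>\d+) eviction_rate=(?P<rate>[\d.]+) "
    r"and unsatisfied allocation rate=(?P<unsat>-?nan|[\d.]+)"
)


def parse_pool_line(line):
    """Return the counters from one PoolAllocator warning, or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    unsat = match.group("unsat")
    return {
        "gets": int(match.group("gets")),
        "puts": int(match.group("puts")),
        "evicted": int(match.group("evicted")),
        "eviction_rate": float(match.group("rate")),
        # "-nan" is logged when there were no get requests to divide by
        "unsatisfied_rate": None if "nan" in unsat else float(unsat),
    }


if __name__ == "__main__":
    sample = (
        "W tensorflow/core/common_runtime/gpu/pool_allocator.cc:227] "
        "PoolAllocator: After 27282 get requests, put_count=9311 "
        "evicted_count=1000 eviction_rate=0.1074 and unsatisfied "
        "allocation rate=0.699032"
    )
    record = parse_pool_line(sample)
    print(record)
    # Recompute the eviction rate from the raw counters for comparison.
    print(record["evicted"] / record["puts"])
```

A high unsatisfied allocation rate on lines with many get requests means the pool is still too small for the workload, which matches the lines that are followed by "Raising pool_size_limit_" in the log.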
