Is there a way to use tensorflow map_fn on GPU?


Question

I have a tensor A with shape [a,n] and I need to perform an op my_op with another tensor B of shape [b,n] such that the resulting tensor C has shape [a,b].

In other words: for each subtensor in A (A[0], A[1], ..., A[a]) I need to perform an element-wise op with each subtensor in B.

So the resulting tensor would contain the following:

[ [ A[0] op B[0] , A[0] op B[1], ... , A[0] op B[b] ],
  [ A[1] op B[0] , A[1] op B[1], ... , A[1] op B[b] ],
  [ ...                                             ],
  [ A[a] op B[0] , A[a] op B[1], ... , A[a] op B[b] ] ]
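
As a concrete illustration of the shapes involved (assuming, just for this sketch, that op is an element-wise multiply followed by a sum over the last axis, which matches the elementwise_op I use below), the desired result can be written with plain NumPy broadcasting:

import numpy as np

a_size, b_size, n = 4, 3, 5          # small sizes, just to show the shapes
A = np.random.rand(a_size, n)        # shape [a, n]
B = np.random.rand(b_size, n)        # shape [b, n]

# [a, 1, n] * [1, b, n] broadcasts to [a, b, n]; reducing over the
# last axis gives the desired [a, b] result.
C = (A[:, None, :] * B[None, :, :]).sum(axis=-1)
print(C.shape)  # (4, 3)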

The only way I've been able to find that achieves this is through nested use of tf.map_fn. Thus:

import tensorflow as tf
import time
import numpy as np

a_size = 64
b_size = 256*256
n = 256
A = tf.placeholder(tf.float32, [a_size, n])
B = tf.placeholder(tf.float32, [b_size, n])

def elementwise_op(a, b):
    # Dot product of two length-n vectors.
    return tf.reduce_sum(tf.multiply(a, b))

def intermediate_op(sub_a, my_b):
    # Apply elementwise_op between one row of A and every row of B.
    sample_values = tf.map_fn(lambda x: elementwise_op(sub_a, x), my_b)
    return sample_values

# Outer map over the rows of A; the result has shape [a_size, b_size].
my_op = tf.map_fn(lambda x: intermediate_op(x, B), A)

with tf.Session() as sess:
    a = np.random.rand(a_size, n)
    b = np.random.rand(b_size, n)
    start_time = time.time()
    result = sess.run(my_op, feed_dict={A: a, B: b})
    print("exec time: ", time.time() - start_time)
    print(result.shape)

The code above runs fine; however, it does not use the GPU very well (only ~15% utilization, according to nvidia-smi). In fact, it runs an order of magnitude faster when using only the CPU (on my 12-core machine)! When run using the GPU, I see very low GPU utilization (~15%) and 100% on one of my CPU cores. When run on the CPU only, I see 100% utilization across all CPU cores.

Average timing of 5 CPU only runs: 11.33s

Average timing of 5 GPU runs: 111.88s

The above test was run using the official TensorFlow Docker images: tensorflow/tensorflow:latest-py3 (for CPU) and tensorflow/tensorflow:latest-gpu-py3 (for GPU).

My guess is that map_fn, via the Python lambda, is forcing data to be copied back and forth between the CPU and GPU at every iteration, and the nested nature of the op just makes it worse. The comments on an unanswered SO question here suggest that this is the case.

This article claims that the "lambda expression is the main reason of low GPU utilization."

So my question is: Is there a way to force map_fn to use the GPU? Or to avoid the Python lambda?

Alternatively, is there some other (perhaps more tensorflow-y) way to achieve the result described above, in order to get the graph to run on the GPU?

After running the profiler (I had to drastically reduce the size of the arrays to get the profiler to run at all, because it was eating up RAM like crazy), the following lines caught my attention:

node name            | output bytes            | total execution time       | accelerator execution time | cpu execution time
Mul                    1.02KB (22.23%, 0.29%),    195.07ms (85.00%, 13.06%),    5.29ms (100.00%, 25.79%),    189.78ms (84.79%, 12.89%)
Sum                    256B (21.41%, 0.07%),      241.48ms (69.08%, 16.17%),    6.01ms (74.21%, 29.29%),     235.47ms (69.01%, 15.99%)
TensorArrayScatterV3   512B (0.64%, 0.15%),       658.31ms (46.87%, 44.09%),    9.19ms (44.80%, 44.80%),     649.12ms (46.90%, 44.08%)

It looks like certain ops are being done mostly on the CPU, and only on one thread at that!
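
One way to confirm where each op actually lands (a minimal sketch, assuming the my_op, A, B, a and b from the snippet above are still in scope) is to enable device-placement logging:

import tensorflow as tf

# Assumes my_op, A, B, a, b from the earlier snippet are in scope.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(my_op, feed_dict={A: a, B: b})
# TensorFlow prints one line per op showing whether it was assigned
# to /device:GPU:0 or /device:CPU:0.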

Answer

The tf.map_fn() construct can be used with a function that runs ops on GPU. By default, TensorFlow will try to run as much of the function as possible on the GPU, and any GPU-incompatible ops will run on the CPU. In your program, the entire elementwise_op() function is built from GPU-compatible ops, so there should be no additional copying between CPU and GPU at each iteration.
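
As a side note, the specific elementwise_op in the question (an element-wise multiply followed by a reduce_sum over the last axis) is just a dot product, so the whole nested map_fn can also be collapsed into a single matrix multiplication, which hands the GPU one large kernel instead of many tiny ones. A minimal sketch of that vectorized form:

import tensorflow as tf
import numpy as np

a_size, b_size, n = 64, 256 * 256, 256
A = tf.placeholder(tf.float32, [a_size, n])
B = tf.placeholder(tf.float32, [b_size, n])

# C[i, j] = sum_k A[i, k] * B[j, k], i.e. a dot product of row i of A
# with row j of B, so the nested map_fn reduces to one matmul.
C = tf.matmul(A, B, transpose_b=True)   # shape [a_size, b_size]

with tf.Session() as sess:
    a = np.random.rand(a_size, n)
    b = np.random.rand(b_size, n)
    print(sess.run(C, feed_dict={A: a, B: b}).shape)  # (64, 65536)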

The cause of low GPU utilization is difficult to determine from a program fragment. For example, if A and B are relatively small, and you are feeding them from Python and then immediately fetching back the result, it is likely that the overhead of copying the initial data to and from the GPU would dominate. The best way to track this down is to use a GPU profiler, which you can get using tfprof or the NVIDIA Visual Profiler.
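
For example, a minimal sketch of capturing a single profiled step with tfprof (assuming the my_op, A, B, a and b from the question's snippet are in scope):

import tensorflow as tf

# Record a full trace for one sess.run call, then ask tfprof for
# per-op time and memory, split into accelerator vs. CPU time.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(my_op, feed_dict={A: a, B: b},
             options=run_options, run_metadata=run_metadata)

tf.profiler.profile(
    tf.get_default_graph(),
    run_meta=run_metadata,
    cmd='op',
    options=tf.profiler.ProfileOptionBuilder.time_and_memory())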

