Understanding device allocation, parallelism (tf.while_loop) and tf.function in TensorFlow


Question

I'm trying to understand parallelism on the GPU in TensorFlow, as I need to apply it to uglier graphs.

import tensorflow as tf
from datetime import datetime

with tf.device('/device:GPU:0'):
    var = tf.Variable(tf.ones([100000], dtype=tf.dtypes.float32), dtype=tf.dtypes.float32)

@tf.function
def foo():
    # parallel_iterations is the knob being tweaked: 10, 100, 1000, 10000.
    return tf.while_loop(c, b, [i], parallel_iterations=1000)      #tweak

@tf.function
def b(i):
    # Loop body: zero out the single element of var at index i, then advance i by 1.
    var.assign(tf.tensor_scatter_nd_update(var, tf.reshape(i, [-1,1]), tf.constant([0], dtype=tf.dtypes.float32)))
    return tf.add(i,1)

with tf.device('/device:GPU:0'):
    i = tf.constant(0)
    c = lambda i: tf.less(i,100000)

start = datetime.today()
with tf.device('/device:GPU:0'):
    foo()
print(datetime.today()-start)

In the code above, var is a tensor of length 100000 whose elements are updated as shown. When I change parallel_iterations between 10, 100, 1000, and 10000, there is hardly any time difference (all around 9.8 s), even though parallel_iterations is set explicitly.

I want these updates to happen in parallel on the GPU. How can I implement that?

Answer

First, notice that your tensor_scatter_nd_update only touches a single index per iteration, so you are essentially measuring the overhead of the loop itself.

I modified your code to use a much larger batch size. Running in Colab on a GPU, I needed batch = 10000 to hide the loop latency; anything below that measures (or pays for) the latency overhead.

Also, there is the question of whether var.assign(tensor_scatter_nd_update(...)) actually prevents the extra copy made by tensor_scatter_nd_update. Playing with the batch size shows that we are indeed not paying for extra copies, so the extra copy seems to be avoided very nicely.
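As an aside (my addition, not part of the original answer), tf.Variable also has an in-place scatter_nd_update method, which sidesteps the extra-copy question entirely because no full-size updated tensor is ever returned. A minimal sketch, with illustrative size and batch values:

import tensorflow as tf

size = 1000000
batch = 10000

with tf.device('/device:GPU:0'):
    var = tf.Variable(tf.ones([size], dtype=tf.float32))

@tf.function
def update_slice(start):
    idx = tf.reshape(tf.range(start, start + batch), [-1, 1])
    # scatter_nd_update mutates the variable's buffer directly instead of
    # returning a full updated tensor that then has to be assigned back.
    var.scatter_nd_update(idx, tf.zeros([batch], tf.float32))

update_slice(tf.constant(0))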

However, it turns out that in this case TensorFlow apparently just considers the iterations to be dependent on each other, so increasing parallel_iterations makes no difference (at least in my test). See this issue for further discussion of what TF does: https://github.com/tensorflow/tensorflow/issues/1984

TF only runs things in parallel if the operations are independent of each other.

By the way, an arbitrary scatter op isn't going to be very efficient on a GPU, but you still might be (should be) able to run several of them in parallel if TF considers them independent.
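For instance (a sketch of my own, not code from the answer), two scatter updates that touch different variables have no data dependency on each other, so the dataflow executor is not forced to order them; whether they actually overlap on a single GPU also depends on how kernels get assigned to streams:

import tensorflow as tf

size = 1000000

with tf.device('/device:GPU:0'):
    v1 = tf.Variable(tf.ones([size], dtype=tf.float32))
    v2 = tf.Variable(tf.ones([size], dtype=tf.float32))

@tf.function
def two_independent_updates(idx, updates):
    # The two updates write to different variables and neither consumes the
    # other's result, so TF does not have to serialize them for correctness.
    v1.scatter_nd_update(idx, updates)
    v2.scatter_nd_update(idx, updates)

idx = tf.reshape(tf.range(10000), [-1, 1])
two_independent_updates(idx, tf.zeros([10000], tf.float32))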

import tensorflow as tf
from datetime import datetime

size = 1000000
index_count = size
batch = 10000
iterations = 10

with tf.device('/device:GPU:0'):
    var = tf.Variable(tf.ones([size], dtype=tf.dtypes.float32), dtype=tf.dtypes.float32)
    indexes = tf.Variable(tf.range(index_count, dtype=tf.dtypes.int32), dtype=tf.dtypes.int32)
    var2 = tf.Variable(tf.range(index_count, dtype=tf.dtypes.float32), dtype=tf.dtypes.float32)

@tf.function
def foo():
    return tf.while_loop(c, b, [i], parallel_iterations=iterations)      #tweak

@tf.function
def b(i):
    # Update a whole batch of indices per iteration instead of a single element,
    # so each iteration does enough work to hide the loop overhead.
    var.assign(tf.tensor_scatter_nd_update(var, tf.reshape(indexes, [-1,1])[i:i+batch], var2[i:i+batch]))
    return tf.add(i, batch)

with tf.device('/device:GPU:0'):
    i = tf.constant(0)
    c = lambda i: tf.less(i, index_count)

start = datetime.today()
with tf.device('/device:GPU:0'):
    foo()
print(datetime.today()-start)
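One timing caveat (my addition, not part of the measurement above): the first call to a tf.function also pays a one-off tracing/compilation cost, so a warm-up call before starting the clock gives a cleaner measurement of the loop itself:

foo()                                   # warm-up: triggers tf.function tracing
start = datetime.today()
with tf.device('/device:GPU:0'):
    foo()
print(datetime.today()-start)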

