Fully-convolutional ResNets using TF-Slim run very slow

Problem description

I'm porting code that does pixel labeling (FCN-style), originally implemented in Caffe, to TensorFlow. I use Slim's implementation of ResNets (ResNet-101) with a stride of 16px and further upsample it with an up-convolutional layer to reach a final stride of 8px. I use batch_size=1 because the input images are of arbitrary size. The problem is that training is really slow: it processes 100 images in about 3.5 minutes, while my original Caffe implementation does it in 30 seconds on the same hardware (Tesla K40m). Here is a reduced version of my code:

import datetime as dt

import tensorflow as tf
import tensorflow.contrib.slim as slim
from tensorflow.contrib.slim.nets import resnet_v1

from MyDataset import MyDataset
from TrainParams import TrainParams

dataset = MyDataset()
train_param = TrainParams()

#tf.device('/gpu:0')

num_classes = 15

inputs = tf.placeholder(tf.float32, shape=[1, None, None, 3])

with slim.arg_scope(resnet_v1.resnet_arg_scope(False)):
    mean = tf.constant([123.68, 116.779, 103.939],
                       dtype=tf.float32, shape=[1, 1, 1, 3], name='img_mean')
    im_centered = inputs - mean
    net, end_points = resnet_v1.resnet_v1_101(im_centered,
                                              global_pool=False, output_stride=16)

    pred_upconv = slim.conv2d_transpose(net, num_classes,
                                        kernel_size = [3, 3],
                                        stride = 2,
                                        padding='SAME')

    targets = tf.placeholder(tf.float32, shape=[1, None, None, num_classes])

    loss = slim.losses.sigmoid_cross_entropy(pred_upconv, targets)


log_dir = 'logs/'

variables_to_restore = slim.get_variables_to_restore(include=["resnet_v1"])
restorer = tf.train.Saver(variables_to_restore)

with tf.Session() as sess:

  sess.run(tf.initialize_all_variables())
  sess.run(tf.initialize_local_variables())

  restorer.restore(sess, '/path/to/ResNet-101.ckpt')

  optimizer = tf.train.GradientDescentOptimizer(learning_rate=.001)
  train_step = optimizer.minimize(loss)
  t1 = dt.datetime.now()
  for it in range(10000):
      n1 = dt.datetime.now()
      batch = dataset.next_batch()  # my function that prepares a training batch
      sess.run(train_step, feed_dict={inputs: batch['inputs'],
                                      targets: batch['targets']})
      n2 = dt.datetime.now()
      # total_seconds() gives the full elapsed time; .microseconds alone would
      # drop whole seconds and under-report slow iterations
      elapsed_ms = (n2 - n1).total_seconds() * 1000
      print("iteration", it, "time", elapsed_ms)

I'm only learning the framework, and I put this code together in a couple of days, so I understand it may not be the nicest. As you can see, I also try to measure the actual time taken by the data-preparation code and the forward-backward passes of the network. Summed over 100 iterations this time is actually much smaller, only about 50 seconds, compared to the real runtime. I suspect there may be some thread/process synchronization going on that is not being measured, but I find it quite strange. The top command shows about 10 processes with the same name as the primary one, which were perhaps spawned by it. I also get warnings like this:

I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 1692 get requests, put_count=1316 evicted_count=1000 eviction_rate=0.759878 and unsatisfied allocation rate=0.87234
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:257] Raising pool_size_limit_ from 100 to 110

Could you perhaps point me in the right direction on how I can speed this up?

Thanks.

UPDATE. After more research I found that 'feeding' data can be slow compared to queues, so I re-implemented the code with a queue filled from a separate thread (see the sketch below): https://gist.github.com/eldar/0ecc058670be340b92e5a1044dc8a089, but the runtime is still about the same.
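For reference, the gist follows the standard legacy-TF queue-feeding pattern. Below is a minimal sketch of that pattern (not the exact code from the gist); dataset, num_classes and train_step refer to the same objects as in the code above, and the queue shapes are left unspecified because the images have arbitrary sizes:

import threading

import tensorflow as tf

# Queue holding (image, target) pairs; shapes are unspecified because the
# images have arbitrary sizes, so only dequeue() (not dequeue_many) can be used.
queue = tf.FIFOQueue(capacity=8, dtypes=[tf.float32, tf.float32])
inputs_ph = tf.placeholder(tf.float32, shape=[1, None, None, 3])
targets_ph = tf.placeholder(tf.float32, shape=[1, None, None, num_classes])
enqueue_op = queue.enqueue([inputs_ph, targets_ph])

inputs, targets = queue.dequeue()
inputs.set_shape([1, None, None, 3])             # restore rank/shape information
targets.set_shape([1, None, None, num_classes])
# ... build the ResNet, loss and train_step on `inputs`/`targets` as above ...

def feed_queue(sess, coord):
    # Runs in a background thread and keeps the queue filled.
    while not coord.should_stop():
        batch = dataset.next_batch()
        sess.run(enqueue_op, feed_dict={inputs_ph: batch['inputs'],
                                        targets_ph: batch['targets']})

coord = tf.train.Coordinator()
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    thread = threading.Thread(target=feed_queue, args=(sess, coord))
    thread.daemon = True
    thread.start()
    for it in range(10000):
        sess.run(train_step)   # dequeues a pair inside the graph, no feed_dict
    coord.request_stop()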

UPDATE 2. It looks like I figured out what the issue with the speed is. I train fully convolutionally, and my images have arbitrary sizes and aspect ratios. If I feed dummy random numpy tensors of a fixed size, training runs fast. If I generate input tensors of 10 predefined sizes, the first 10 iterations are slow, but then it speeds up. It looks like resizing all tensors at each iteration is not as efficient in TensorFlow as it is in Caffe. I will file a ticket on the project's GitHub.
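For anyone who wants to reproduce this check, the fixed-size experiment looks roughly like this (a sketch; the 512x512 size is arbitrary, and the targets are sized to the network's stride-8 output, assuming the output resolution is input/8 for this size):

import numpy as np

# Arbitrary fixed input size, chosen only for the experiment.
dummy_h, dummy_w = 512, 512
dummy_inputs = np.random.rand(1, dummy_h, dummy_w, 3).astype(np.float32)
# Targets sized to the stride-8 output of the network above.
dummy_targets = np.zeros((1, dummy_h // 8, dummy_w // 8, num_classes),
                         dtype=np.float32)

for it in range(100):
    # With a constant input shape, per-iteration time stays low after warm-up.
    sess.run(train_step, feed_dict={inputs: dummy_inputs,
                                    targets: dummy_targets})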

Answer

The issue was due to the input images having arbitrary sizes. TensorFlow has a cuDNN auto-tuning feature: at run time it profiles various convolution algorithms for each particular input size and picks the best one. In my case this was happening at every iteration.

The solution was to set the environment variable TF_CUDNN_USE_AUTOTUNE=0:

export TF_CUDNN_USE_AUTOTUNE=0
python myscript.py
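If you prefer to set it from inside the script, setting the variable before TensorFlow is imported should work as well (a sketch):

import os
os.environ['TF_CUDNN_USE_AUTOTUNE'] = '0'  # must be set before cuDNN is initialized

import tensorflow as tf  # import only after the variable is in the environment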

More details in this GitHub issue: https://github.com/tensorflow/tensorflow/issues/5048
