How much faster is NCHW compared to NHWC in TensorFlow/cuDNN?


Question

The official TensorFlow performance guide states:

Most TensorFlow operations used by a CNN support both NHWC and NCHW data format. On GPU, NCHW is faster. But on CPU, NHWC is sometimes faster.
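For reference, the layout is chosen per op via the data_format argument; a minimal sketch (shapes are arbitrary, and the NCHW path is generally only available on GPU builds):

    import tensorflow as tf

    # NHWC: batch, height, width, channels (TensorFlow's default layout).
    x_nhwc = tf.random.normal([8, 32, 32, 64])
    w = tf.random.normal([3, 3, 64, 128])  # filter: kh, kw, in_channels, out_channels
    y = tf.nn.conv2d(x_nhwc, w, strides=1, padding='SAME', data_format='NHWC')

    # NCHW: batch, channels, height, width (the layout cuDNN has historically preferred).
    x_nchw = tf.transpose(x_nhwc, perm=[0, 3, 1, 2])
    y = tf.nn.conv2d(x_nchw, w, strides=1, padding='SAME', data_format='NCHW')  # GPU only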

How much faster is NCHW compared to NHWC in TensorFlow/cuDNN for convolutions? Are there any references or benchmarks for this?
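I don't know of an official benchmark either, but it is easy to measure on your own hardware; a rough timing sketch along these lines (layer shape and iteration count are arbitrary choices, and the NCHW run needs a GPU):

    import time
    import tensorflow as tf

    def bench(data_format, iters=50):
        # An arbitrary "typical CNN layer" shape; adjust to your workload.
        shape = [32, 56, 56, 256] if data_format == 'NHWC' else [32, 256, 56, 56]
        x = tf.random.normal(shape)
        w = tf.random.normal([3, 3, 256, 256])

        @tf.function
        def conv(x):
            return tf.nn.conv2d(x, w, strides=1, padding='SAME',
                                data_format=data_format)

        conv(x)  # warm-up / trace
        start = time.perf_counter()
        for _ in range(iters):
            y = conv(x)
        _ = y.numpy()  # block until the GPU work has actually finished
        return (time.perf_counter() - start) / iters

    print('NHWC:', bench('NHWC'))
    print('NCHW:', bench('NCHW'))  # GPU-only for most TensorFlow builds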

Also, why is it faster? As I understand it (see here), TensorFlow for NHWC on GPU will internally always transpose to NCHW, call the cuDNN conv kernel for NCHW, and then transpose the result back. But why does it do that? The cuDNN conv kernel also works for NHWC. Maybe at some point they did the comparison and the cuDNN conv kernel for NHWC was very slow. But is that still up to date, and how big was the difference? What are the technical reasons that NHWC is so much slower? Or is the cuDNN kernel for this case just not well optimized?
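The transpose round-trip described above can be written out explicitly; this is only a sketch of the pattern, not TensorFlow's actual internal code:

    import tensorflow as tf

    def conv_nhwc_via_nchw(x_nhwc, w):
        """Run an NHWC convolution by transposing to NCHW and back,
        mimicking the layout round-trip described above."""
        x_nchw = tf.transpose(x_nhwc, perm=[0, 3, 1, 2])    # NHWC -> NCHW
        y_nchw = tf.nn.conv2d(x_nchw, w, strides=1, padding='SAME',
                              data_format='NCHW')           # cuDNN NCHW kernel
        return tf.transpose(y_nchw, perm=[0, 2, 3, 1])      # NCHW -> NHWC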

Answer

The reason is that most implementations of simple convolutions (not talking about Winograd or FFT here) end up doing some kind of simple matrix multiplication, which means that in their inner loop they multiply values from both tensors and sum the results.
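To make that inner multiply-accumulate concrete, here is a deliberately naive direct convolution in pure Python (NCHW layout, stride 1, no padding; an illustration of the loop structure, not anything you would run in practice):

    import numpy as np

    def naive_conv2d_nchw(x, w):
        """x: (N, C, H, W), w: (K, C, R, S) -> (N, K, H-R+1, W-S+1)."""
        N, C, H, W = x.shape
        K, _, R, S = w.shape
        out = np.zeros((N, K, H - R + 1, W - S + 1), dtype=x.dtype)
        for n in range(N):
            for k in range(K):
                for i in range(H - R + 1):
                    for j in range(W - S + 1):
                        acc = 0.0
                        # Inner loop: multiply values from both tensors and sum.
                        for c in range(C):
                            for r in range(R):
                                for s in range(S):
                                    acc += x[n, c, i + r, j + s] * w[k, c, r, s]
                        out[n, k, i, j] = acc
        return out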

On a CPU implementation using SSE or AVX optimization, it is faster to do this along the C dimension, because you multiply-add the values 4 by 4 or 8 by 8 and only do the reduction (summing your 4 or 8 accumulators) at the end, once you have gone over the whole C dimension.
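The layout difference is easy to see from the memory strides: in NHWC the C values of a fixed (n, h, w) position are contiguous, so the per-pixel reduction over C is a contiguous dot product that maps directly onto SSE/AVX lanes. A small NumPy illustration (NumPy is only used here to make the strides visible; the SIMD itself happens inside compiled kernels):

    import numpy as np

    x_nhwc = np.random.rand(1, 8, 8, 64).astype(np.float32)
    x_nchw = np.ascontiguousarray(x_nhwc.transpose(0, 3, 1, 2))

    # Element strides along the C axis for a fixed (n, h, w) position:
    print(x_nhwc.strides[-1] // 4)   # 1  -> channels sit next to each other (NHWC)
    print(x_nchw.strides[1] // 4)    # 64 -> channels are a whole H*W plane apart (NCHW)

    # In NHWC the reduction over C is a contiguous dot product:
    w = np.random.rand(64).astype(np.float32)   # one filter tap across all channels
    pixel = x_nhwc[0, 3, 5, :]                  # contiguous slice of length C
    acc = np.dot(pixel, w)                      # vectorizes cleanly over C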

On a GPU, however, a reduction across threads is a more costly operation (at least it was until Kepler introduced warp-level atomic operations), so historically it has been optimized so that each thread in a warp reads consecutive (in memory) HW values and does the accumulation over parts of C with a loop.
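Concretely, if consecutive threads of a warp each handle one consecutive spatial (HW) position, their loads are only coalesced when HW is the innermost dimension. A tiny sketch that just computes the flat element offsets touched by the first few threads (the thread mapping here is illustrative, not cuDNN's actual one):

    # Flat element offsets read by threads t = 0..7 of a warp, each assigned
    # one consecutive spatial position hw, for a fixed (n, c).
    C, H, W = 256, 56, 56

    def offset_nchw(n, c, hw):           # layout N, C, H, W
        return (n * C + c) * H * W + hw

    def offset_nhwc(n, c, hw):           # layout N, H, W, C
        return (n * H * W + hw) * C + c

    print([offset_nchw(0, 0, t) for t in range(8)])  # 0, 1, 2, ...     -> coalesced
    print([offset_nhwc(0, 0, t) for t in range(8)])  # 0, 256, 512, ... -> strided by C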

Note, though, that the latest NVIDIA cards (RTX) now have Tensor Cores that can process small blocks in one operation, including the reduction over a small portion of C, so on these cards it is actually faster to use NHWC (or hybrid NCHWC formats).
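For completeness, the usual way to hit those Tensor Core paths from TensorFlow is fp16/mixed precision with channels-last (NHWC) data; a minimal Keras sketch (whether cuDNN actually selects a Tensor Core NHWC kernel still depends on the cuDNN version and the layer shapes):

    import tensorflow as tf

    # fp16 compute with fp32 variables; Tensor Cores are used where cuDNN
    # has a matching kernel for the layer shapes.
    tf.keras.mixed_precision.set_global_policy('mixed_float16')

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu',
                               data_format='channels_last',   # i.e. NHWC
                               input_shape=(224, 224, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, dtype='float32'),  # keep the final output in fp32
    ])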

