PyTorch CUDA vs Numpy for arithmetic operations? Fastest?


Question

I performed element-wise multiplication using Torch with GPU support and Numpy using the functions below, and found that Numpy runs faster than Torch, which I suspect shouldn't be the case.

I want to know how to perform general arithmetic operations with Torch using the GPU.

Note: I ran these code snippets in a Google Colab notebook.

Defined the default tensor type to enable the global GPU flag:

torch.set_default_tensor_type(torch.cuda.FloatTensor if 
                              torch.cuda.is_available() else 
                              torch.FloatTensor)

Initialized the Torch variables:

x = torch.Tensor(200, 100)  # Is FloatTensor (uninitialized, on the default CUDA device)
y = torch.Tensor(200, 100)

Function in question:

def mul(d,f):
    g = torch.mul(d,f).cuda()  # I explicitly called cuda() which is not necessary
    return g

Calling the above function as %timeit mul(x, y)

returns:

The slowest run took 10.22 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 50.1 µs per loop

Now the same trial with Numpy,

using the same values from the Torch variables:

x_ = x.data.cpu().numpy()
y_ = y.data.cpu().numpy()


def mul_(d,f):
    g = d*f
    return g

%timeit mul_(x_,y_)

returns:

The slowest run took 12.10 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 7.73 µs per loop

I need some help understanding GPU-enabled Torch operations.

Answer

GPU operations have to additionally move memory to and from the GPU

The problem is that your GPU operation always has to put the input onto GPU memory and then retrieve the result from there, which is quite a costly operation.

NumPy, on the other hand, directly processes the data from the CPU/main memory, so there is almost no delay here. Additionally, your matrices are extremely small, so even in the best-case scenario, there should only be a minute difference.
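
To see how much of the cost is the transfer and launch overhead rather than the multiply itself, a rough sketch (the sizes and the time_gpu_mul helper here are purely illustrative) is to time tensors that already live on the GPU and call torch.cuda.synchronize(), since CUDA kernels launch asynchronously and would otherwise be timed before they finish:

import time
import torch

def time_gpu_mul(n, reps=100):
    # Tensors created directly on the GPU: no host<->device copies in the loop
    a = torch.rand(n, n, device="cuda")
    b = torch.rand(n, n, device="cuda")
    torch.cuda.synchronize()              # make sure setup is finished
    start = time.perf_counter()
    for _ in range(reps):
        c = a * b
    torch.cuda.synchronize()              # wait for the asynchronous kernels
    return (time.perf_counter() - start) / reps

print(time_gpu_mul(200))     # tiny matrix: mostly kernel-launch overhead
print(time_gpu_mul(5000))    # large matrix: the extra cores start to pay off

On typical hardware the small case stays dominated by launch overhead; the gap to NumPy only closes, and eventually reverses, as the tensors grow.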

This is also partially the reason why you use mini-batches when training on a GPU in neural networks: instead of having several extremely small operations, you now have "one big bulk" of numbers that you can process in parallel.
Also note that GPU clock speeds are generally way lower than CPU clocks, so the GPU only really shines because it has way more cores. If your matrix does not utilize all of them fully, you are also likely to see a faster result on your CPU.
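
To make the "one big bulk" point concrete, here is a small sketch (the shapes are arbitrary and the tensors are assumed to already be on the GPU): a Python loop of tiny multiplies launches one kernel per slice, while a single batched multiply processes the same numbers in one kernel.

import torch

xs = torch.rand(1000, 200, 100, device="cuda")   # 1000 small matrices kept on the GPU
ys = torch.rand(1000, 200, 100, device="cuda")

# Many extremely small operations: one kernel launch per 200x100 slice
out_loop = [xs[i] * ys[i] for i in range(xs.shape[0])]

# One "big bulk": a single element-wise kernel over the whole batch
out_batched = xs * ys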

TL;DR: If your matrix is big enough, you will eventually see a speed-up with CUDA over Numpy, even with the additional cost of the GPU transfer.
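
One way to check the crossover yourself is the sketch below (the size n is arbitrary, and where the break-even point lies depends entirely on your hardware): time NumPy on the CPU against Torch on the GPU, charging the host-to-device transfer to the GPU side and synchronizing before stopping the clock.

import time
import numpy as np
import torch

n = 10_000                                    # shrink this and the CPU side wins again
a_np = np.random.rand(n, n).astype(np.float32)
b_np = np.random.rand(n, n).astype(np.float32)

t0 = time.perf_counter()
c_np = a_np * b_np
cpu_time = time.perf_counter() - t0

t0 = time.perf_counter()
a_gpu = torch.from_numpy(a_np).cuda()         # host -> device transfer, charged to the GPU side
b_gpu = torch.from_numpy(b_np).cuda()
c_gpu = a_gpu * b_gpu
torch.cuda.synchronize()                      # wait for the asynchronous kernel to finish
gpu_time = time.perf_counter() - t0

print(f"NumPy (CPU): {cpu_time * 1e3:.1f} ms   Torch (CUDA): {gpu_time * 1e3:.1f} ms")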
