Mxnet - slow array copy to GPU
Question
My problem: How should I perform fast matrix multiplication in mxnet?
My concrete problem: array copy to GPU is slow. What can be done about it?
I create random arrays, copy them to the context, and then multiply.
import mxnet as mx
import mxnet.ndarray as nd
from mxnet import profiler
profiler.set_config(aggregate_stats=True)
ctx = mx.cpu()
# create arrays on CPU
profiler.set_state('run')
a = nd.random.uniform(-1, 1, shape=(10000, 10000), ctx=mx.cpu())
b = nd.random.uniform(-1, 1, shape=(10000, 10000), ctx=mx.cpu())
nd.waitall()
profiler.set_state('stop')
print(profiler.dumps(reset=True))
# copy arrays to the context
profiler.set_state('run')
a_ctx = a.as_in_context(ctx)
b_ctx = b.as_in_context(ctx)
nd.waitall()
profiler.set_state('stop')
print(profiler.dumps(reset=True))
# multiply arrays
profiler.set_state('run')
c = nd.dot(a_ctx, b_ctx)
nd.waitall()
profiler.set_state('stop')
print(profiler.dumps(reset=True))
In this code I perform everything on the CPU, so my times are (sec):
0.246
~=0
1.727
When I use ctx=mx.gpu(), the times are:
0.247
22.059
0.828
So the bottleneck is the copy from CPU to GPU. It's just ridiculously slow. What can be done about it?
Here is the detailed profiler output for this stage:
Device Storage
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
Memory: gpu/0 2 400000.0000 400000.0000 800000.0000 200000.0000
MXNET_C_API
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
MXImperativeInvokeEx 2 22059.0703 0.0360 22059.0352 11029.5352
MXNDArrayGetShape 2 0.0030 0.0000 0.0030 0.0015
MXNDArrayWaitAll 1 105.9830 105.9830 105.9830 105.9830
MXNDArrayCreateEx 2 0.0150 0.0060 0.0090 0.0075
MXNDArrayGetContext 2 0.0020 0.0000 0.0020 0.0010
MXNet C API Concurrency 22 0.0000 0.0000 0.0010 0.0005
MXNDArrayGetDType 2 0.0010 0.0000 0.0010 0.0005
MXNet C API Calls 11 0.0140 0.0040 0.0140 0.0050
operator
=================
Name Total Count Time (ms) Min Time (ms) Max Time (ms) Avg Time (ms)
---- ----------- --------- ------------- ------------- -------------
CopyCPU2GPU 4 318.4930 53.3060 105.9400 79.6233
Please tell me if more information is needed.
Answer
You can see from your profiling results that CopyCPU2GPU only takes 318 ms. The extra overhead of ~22 seconds is related to GPU-context initialization and memory allocation. If you simply run the GPU-copy code a second time in the same script, you should see a much faster result. You can modify your code like this:
# copy arrays to the context
a_ctx = a.as_in_context(ctx)
b_ctx = b.as_in_context(ctx)
nd.waitall()
profiler.set_state('run')
a_ctx = a.as_in_context(ctx)
b_ctx = b.as_in_context(ctx)
nd.waitall()
profiler.set_state('stop')
print(profiler.dumps(reset=True))
Another thing to consider is minimizing CPU->GPU memory copies. For example, in your specific case, you can create the random arrays directly on the GPU instead of the CPU:
a = nd.random.uniform(-1, 1, shape=(10000, 10000), ctx=ctx)
b = nd.random.uniform(-1, 1, shape=(10000, 10000), ctx=ctx)
CUDA memory allocation/deallocation requires some system synchronization, which makes it slow. All DL frameworks take memory management into their own hands by creating a buffer pool that reuses previously allocated buffers, and performing memory allocation/deallocation only when absolutely necessary. For example, TensorFlow by default allocates the entire GPU memory in a single allocation and internally assigns it to tensors. MXNet and PyTorch allocate when necessary, but keep released buffers in a pool so that they can be reused later.
This behavior of MXNet/PyTorch means that the very first call to create a tensor of a specific size will be slower. But if that tensor is released and a new tensor of a similar size is created, the memory comes from the pre-allocated buffer pool rather than from cudaMalloc. You can read about PyTorch's memory management here (https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management), which is somewhat similar to MXNet's.
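The buffer-pool idea above can be illustrated with a minimal sketch. This is not MXNet's or PyTorch's actual implementation; the `CachingAllocator` class and its methods are hypothetical names used only to show why the first allocation of a given size is slow and later ones are cheap:

```python
# Hypothetical sketch of a caching allocator: freed buffers are kept in a
# size-keyed pool and reused, so the slow device allocator (think cudaMalloc)
# is only hit the first time a given size is requested.

class CachingAllocator:
    def __init__(self, raw_alloc):
        self._raw_alloc = raw_alloc  # slow underlying allocator
        self._pool = {}              # size -> list of free buffers
        self.raw_calls = 0           # how often the slow path was taken

    def alloc(self, size):
        free_list = self._pool.get(size)
        if free_list:                # fast path: reuse a released buffer
            return free_list.pop()
        self.raw_calls += 1          # slow path: real device allocation
        return self._raw_alloc(size)

    def free(self, buf, size):
        # Keep the buffer in the pool instead of returning it to the device.
        self._pool.setdefault(size, []).append(buf)


allocator = CachingAllocator(raw_alloc=bytearray)

buf = allocator.alloc(1024)   # first request of this size: slow path
allocator.free(buf, 1024)
buf = allocator.alloc(1024)   # same size again: served from the pool
print(allocator.raw_calls)    # -> 1: the device allocator was called once
```

This is why your 22-second cost disappears on the second run: the context is already initialized and the buffers for these array sizes already sit in the pool.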