How to use CUDA pinned "zero-copy" memory for a memory mapped file?


Objective/Problem

In Python, I am looking for a fast way to read/write data from a memory mapped file to a GPU.

In a previous SO post [ Cupy OutOfMemoryError when trying to cupy.load larger dimension .npy files in memory map mode, but np.load works fine ], it was mentioned that this is possible using CUDA pinned "zero-copy" memory. Furthermore, it seems that this method was developed by this person [ cuda - Zero-copy memory, memory-mapped file ], though that person was working in C++.

My previous attempts have been with Cupy, but I am open to any CUDA methods.

What I have tried so far

I mentioned how I tried to use Cupy, which allows you to open numpy files in memory mapped mode.

import os
import numpy as np
import cupy

# Create 4 .npy files, ~4.5GB each (2,200,000 x 512 float32).
for i in range(4):
    numpyMemmap = np.memmap( 'reg.memmap'+str(i), dtype='float32', mode='w+', shape=( 2200000 , 512))
    np.save( 'reg.memmap'+str(i) , numpyMemmap )
    del numpyMemmap
    os.remove( 'reg.memmap'+str(i) )

# Check if they load correctly with np.load.
NPYmemmap = []
for i in range(4):
    NPYmemmap.append( np.load( 'reg.memmap'+str(i)+'.npy' , mmap_mode = 'r+' )  )
del NPYmemmap

# Eventually results in memory error. 
CPYmemmap = []
for i in range(4):
    print(i)
    CPYmemmap.append( cupy.load( 'reg.memmap'+str(i)+'.npy' , mmap_mode = 'r+' )  )

Result of what I have tried

My attempt resulted in an OutOfMemoryError.

It was mentioned that

it appears that cupy.load will require that the entire file fit first in host memory, then in device memory.

And it was also mentioned that

CuPy can't handle mmap memory. So, CuPy uses GPU memory directly by default. https://docs-cupy.chainer.org/en/stable/reference/generated/cupy.cuda.MemoryPool.html#cupy.cuda.MemoryPool.malloc You can change the default memory allocator if you want to use Unified Memory.

I tried using

cupy.cuda.set_allocator(cupy.cuda.MemoryPool(cupy.cuda.memory.malloc_managed).malloc)

But this didn't seem to make a difference. At the time of the error, my CPU RAM was at ~16 GB, but my GPU RAM was at 0.32 GB. I am using Google Colab, where my CPU RAM is 25 GB and GPU RAM is 12 GB. So it looks like after the entire file was hosted in host memory, it checked whether it could fit in device memory, and when it saw that it only had 12 of the required 16 GB, it threw an error (my best guess).

So, now I am trying to figure out a way to use pinned 'zero-copy' memory to handle a memory mapped file which would feed data to the GPU.

If important, the type of data I am trying to transfer is floating point arrays. Normally, for read-only data, binary files are loaded into GPU memory, but I am working with data that I am trying to both read and write at every step.

Solution

It appears to me that currently, cupy doesn't offer a pinned allocator that can be used in place of the usual device memory allocator, i.e. one that could be used as the backing for cupy.ndarray. If this is important to you, you might consider filing a cupy issue.

However, it seems like it may be possible to create one. This should be considered experimental code, and there are some issues associated with its use.

The basic idea is that we will replace cupy's default device memory allocator with our own, using cupy.cuda.set_allocator as was already suggested to you. We will need to provide our own replacement for the BaseMemory class that is used as the repository for cupy.cuda.memory.MemoryPointer. The key difference here is that we will use a pinned memory allocator instead of a device allocator. This is the gist of the PMemory class below.

A few other things to be aware of:

  • after doing what you need with pinned memory (allocations), you should probably revert the cupy allocator to its default value. Unfortunately, unlike cupy.cuda.set_allocator, I did not find a corresponding cupy.cuda.get_allocator, which strikes me as a deficiency in cupy that also seems worthy of filing a cupy issue. However, for this demonstration we will just revert to the None choice, which uses one of the default device memory allocators (not the pool allocator, however). A small helper for this set/revert pattern is sketched just after this list.
  • by providing this minimalistic pinned memory allocator, we are still suggesting to cupy that this is ordinary device memory. That means it's not directly accessible from host code (it is, actually, but cupy doesn't know that). Therefore, various operations (such as cupy.load) will create unneeded host allocations and unneeded copy operations. I think addressing this would require much more than just the small change I am suggesting. But at least for your test case, this additional overhead may be manageable. It appears that you want to load data from disk once and then leave it there; for that type of activity this should be manageable, especially since you are breaking it up into chunks. As we will see, handling four 5GB chunks will be too much for 25GB of host memory: we need host memory for the four 5GB chunks (which are actually pinned) plus space for one additional 5GB "overhead" buffer, so 25GB is not enough. But for demonstration purposes, if we reduce your buffer sizes to 4GB (5x4GB = 20GB), I think it may fit within your 25GB host RAM size.
  • Ordinary device memory associated with cupy's default device memory allocator has an association with a particular device. Pinned memory need not have such an association; however, our trivial replacement of BaseMemory with a lookalike class means that we are suggesting to cupy that this "device" memory, like all other ordinary device memory, has a specific device association. In a single-device setting such as yours, this distinction is meaningless. However, this isn't suitable for robust multi-device use of pinned memory. For that, again the suggestion would be a more robust change to cupy, perhaps by filing an issue.
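Regarding the first bullet: one way to keep the set-then-revert discipline from leaking into the rest of your program is a small context manager that installs an allocator for the duration of a with-block and falls back to the default on exit. The sketch below is my own illustration of that pattern, not part of cupy's API; the use_allocator name is made up, and the commented usage assumes the my_pinned_allocator function defined in the example that follows.

import contextlib
import cupy

@contextlib.contextmanager
def use_allocator(alloc):
    # Install a custom cupy allocator for the duration of a with-block.
    # Since cupy (as noted above) offers no get_allocator, we cannot restore
    # whatever allocator was previously active; we can only revert to the
    # default (None) on exit.
    cupy.cuda.set_allocator(alloc)
    try:
        yield
    finally:
        cupy.cuda.set_allocator(None)

# Hypothetical usage, with my_pinned_allocator from the example below:
# with use_allocator(my_pinned_allocator):
#     CPYmemmap = [cupy.load('reg.memmap'+str(i)+'.npy', mmap_mode='r+') for i in range(4)]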

Here's an example:

import os
import numpy as np
import cupy



class PMemory(cupy.cuda.memory.BaseMemory):
    # Stand-in for cupy's device-memory backing: allocates pinned (page-locked)
    # host memory via cudaHostAlloc instead of device memory.
    def __init__(self, size):
        self.size = size
        self.device_id = cupy.cuda.device.get_device_id()
        self.ptr = 0
        if size > 0:
            self.ptr = cupy.cuda.runtime.hostAlloc(size, 0)
    def __del__(self):
        if self.ptr:
            cupy.cuda.runtime.freeHost(self.ptr)

def my_pinned_allocator(bsize):
    # Hand cupy a MemoryPointer backed by PMemory, so ordinary cupy.ndarray
    # allocations land in pinned host memory.
    return cupy.cuda.memory.MemoryPointer(PMemory(bsize), 0)

cupy.cuda.set_allocator(my_pinned_allocator)

#Create 4 .npy files, ~4GB each
for i in range(4):
    print(i)
    numpyMemmap = np.memmap( 'reg.memmap'+str(i), dtype='float32', mode='w+', shape=( 10000000 , 100))
    np.save( 'reg.memmap'+str(i) , numpyMemmap )
    del numpyMemmap
    os.remove( 'reg.memmap'+str(i) )

# Check if they load correctly with np.load.
NPYmemmap = []
for i in range(4):
    print(i)
    NPYmemmap.append( np.load( 'reg.memmap'+str(i)+'.npy' , mmap_mode = 'r+' )  )
del NPYmemmap

# allocate pinned memory storage
CPYmemmap = []
for i in range(4):
    print(i)
    CPYmemmap.append( cupy.load( 'reg.memmap'+str(i)+'.npy' , mmap_mode = 'r+' )  )
cupy.cuda.set_allocator(None)

I haven't tested this in a setup with 25GB of host memory and these exact file sizes. But I have tested it with other file sizes that exceed the device memory of my GPU, and it seems to work.
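If you want a rough way to convince yourself that the arrays are not landing in device memory, one thing you could do (my own addition, not part of the original test) is query the runtime right after the cupy.load loop, before reverting the allocator; with the pinned allocator installed, the reported device usage should stay far below the arrays' total size.

# Rough sanity check (my addition): run right after the cupy.load loop above,
# before cupy.cuda.set_allocator(None).  With the pinned allocator installed,
# device memory usage should stay small even though the loaded arrays exceed
# the GPU's capacity.
free_b, total_b = cupy.cuda.runtime.memGetInfo()
print("device memory in use: %.2f GB" % ((total_b - free_b) / 1e9))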

Again, this is experimental code, not thoroughly tested; your mileage may vary, and it would be better to attain this functionality by filing cupy github issues. And, as I've mentioned previously, this sort of "device memory" will generally be much slower to access from device code than ordinary cupy device memory.

Finally, this is not really a "memory mapped file" as all the file contents will be loaded into host memory, and furthermore, this methodology "uses up" host memory. If you have 20GB of files to access, you will need more than 20GB of host memory. As long as you have those files "loaded", 20GB of host memory will be in use.
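One last side note: because PMemory allocates with cudaHostAlloc, the pointer cupy holds is also a valid host pointer (on systems with unified virtual addressing), so in principle the data can be viewed from the host without any copy. The snippet below is a speculative illustration of that point, not an officially supported cupy usage; it continues from the example above and assumes CPYmemmap[0] is a float32 array still backed by the PMemory allocator.

import ctypes

# Speculative illustration: alias the pinned "device" allocation as a NumPy
# array on the host, without copying.  Handle with care -- cupy is unaware
# that the host is touching this memory.
arr = CPYmemmap[0]                          # cupy.ndarray backed by PMemory
cbuf = (ctypes.c_float * int(arr.size)).from_address(arr.data.ptr)
host_view = np.frombuffer(cbuf, dtype=np.float32).reshape(arr.shape)
print(host_view[0, :5])                     # reads the pinned memory directly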
