How to pass data bigger than the VRAM size into the GPU?

Question

I am trying to pass more data into my GPU than I have VRAM, which results in the following error: CudaAPIError: Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

I created this code to recreate the problem:

from numba import cuda
import numpy as np


@cuda.jit()
def addingNumbers(big_array, big_array2, save_array):
    i = cuda.grid(1)
    if i < big_array.shape[0]:
        for j in range(big_array.shape[1]):
            save_array[i, j] = big_array[i, j] * big_array2[i, j]



big_array = np.random.random_sample((1000000, 500))
big_array2  = np.random.random_sample((1000000, 500))
save_array = np.zeros(shape=(1000000, 500))


arraysize = 1000000
threadsperblock = 64
blockspergrid = (arraysize + (threadsperblock - 1)) // threadsperblock


d_big_array = cuda.to_device(big_array)
d_big_array2 = cuda.to_device(big_array2)
d_save_array = cuda.to_device(save_array)

addingNumbers[blockspergrid, threadsperblock](d_big_array, d_big_array2, d_save_array)

save_array = d_save_array.copy_to_host()
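
As a rough check (my own arithmetic, not part of the original question), the three float64 arrays simply cannot all fit in typical VRAM at once:

import numpy as np

# footprint estimate for the arrays in the question
bytes_per_array = 1000000 * 500 * np.dtype(np.float64).itemsize  # 4,000,000,000 bytes, ~3.7 GiB
total_bytes = 3 * bytes_per_array                                # ~11.2 GiB for all three arrays
print(total_bytes / 2**30, "GiB")

So the three cuda.to_device calls ask for roughly 11 GiB of device memory in total, which exceeds the VRAM of most GPUs.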

Is there a way to dynamically pass data into the GPU to be able to handle more data than the VRAM can hold? If not, what would be the recommended way to manually pass all this data to the GPU? Is using dask_cuda an option, or something of that nature?

Answer

A well-written example of how to take a larger problem (i.e. dataset), break it into pieces, and handle the processing piece-wise in numba CUDA is here. In particular, the variant of interest is pricer_cuda_overlap.py. Unfortunately that example makes use of what I believe is deprecated random number generation functionality in accelerate.cuda.rand, so it's not directly runnable in today's numba (I think).

However for the purposes of the question here, the random number generation process is irrelevant, and so we can simply remove that without affecting the important observations. What follows then is a single file assembled from various pieces in various files in that example:

$ cat t45.py
#! /usr/bin/env python
"""
This version demonstrates copy-compute overlapping through multiple streams.
"""
from __future__ import print_function

import math
import sys

import numpy as np

from numba import cuda, jit

from math import sqrt, exp
from timeit import default_timer as timer
from collections import deque

StockPrice = 20.83
StrikePrice = 21.50
Volatility = 0.021  #  per year
InterestRate = 0.20

Maturity = 5. / 12.

NumPath = 500000
NumStep = 200

def driver(pricer, pinned=False):
    paths = np.zeros((NumPath, NumStep + 1), order='F')
    paths[:, 0] = StockPrice
    DT = Maturity / NumStep

    if pinned:
        from numba import cuda
        with cuda.pinned(paths):
            ts = timer()
            pricer(paths, DT, InterestRate, Volatility)
            te = timer()
    else:
        ts = timer()
        pricer(paths, DT, InterestRate, Volatility)
        te = timer()

    ST = paths[:, -1]
    PaidOff = np.maximum(paths[:, -1] - StrikePrice, 0)
    print('Result')
    fmt = '%20s: %s'
    print(fmt % ('stock price', np.mean(ST)))
    print(fmt % ('standard error', np.std(ST) / sqrt(NumPath)))
    print(fmt % ('paid off', np.mean(PaidOff)))
    optionprice = np.mean(PaidOff) * exp(-InterestRate * Maturity)
    print(fmt % ('option price', optionprice))

    print('Performance')
    NumCompute = NumPath * NumStep
    print(fmt % ('Mstep/second', '%.2f' % (NumCompute / (te - ts) / 1e6)))
    print(fmt % ('time elapsed', '%.3fs' % (te - ts)))

class MM(object):
    """Memory Manager

    Maintain a freelist of device memory for reuse.
    """
    def __init__(self, shape, dtype, prealloc):
        self.device = cuda.get_current_device()
        self.freelist = deque()
        self.events = {}
        for i in range(prealloc):
            gpumem = cuda.device_array(shape=shape, dtype=dtype)
            self.freelist.append(gpumem)
            self.events[gpumem] = cuda.event(timing=False)

    def get(self, stream=0):
        assert self.freelist
        gpumem = self.freelist.popleft()
        evnt = self.events[gpumem]
        if not evnt.query(): # not ready?
            # querying is faster than waiting
            evnt.wait(stream=stream) # future works must wait
        return gpumem

    def free(self, gpumem, stream=0):
        evnt = self.events[gpumem]
        evnt.record(stream=stream)
        self.freelist.append(gpumem)


if sys.version_info[0] == 2:
    range = xrange

@jit('void(double[:], double[:], double, double, double, double[:])',
     target='cuda')
def cu_step(last, paths, dt, c0, c1, normdist):
    i = cuda.grid(1)
    if i >= paths.shape[0]:
        return
    noise = normdist[i]
    paths[i] = last[i] * math.exp(c0 * dt + c1 * noise)

def monte_carlo_pricer(paths, dt, interest, volatility):
    n = paths.shape[0]
    num_streams = 2

    part_width = int(math.ceil(float(n) / num_streams))
    partitions = [(0, part_width)]
    for i in range(1, num_streams):
        begin, end = partitions[i - 1]
        begin, end = end, min(end + (end - begin), n)
        partitions.append((begin, end))
    partlens = [end - begin for begin, end in partitions]

    mm = MM(shape=part_width, dtype=np.double, prealloc=10 * num_streams)

    device = cuda.get_current_device()
    blksz = device.MAX_THREADS_PER_BLOCK
    gridszlist = [int(math.ceil(float(partlen) / blksz))
                  for partlen in partlens]

    strmlist = [cuda.stream() for _ in range(num_streams)]

    # Allocate device side array - in original example this would be initialized with random numbers
    d_normlist = [cuda.device_array(partlen, dtype=np.double, stream=strm)
                  for partlen, strm in zip(partlens, strmlist)]

    c0 = interest - 0.5 * volatility ** 2
    c1 = volatility * math.sqrt(dt)

    # Configure the kernel
    # Similar to CUDA-C: cu_monte_carlo_pricer<<<gridsz, blksz, 0, stream>>>
    steplist = [cu_step[gridsz, blksz, strm]
                for gridsz, strm in zip(gridszlist, strmlist)]

    d_lastlist = [cuda.to_device(paths[s:e, 0], to=mm.get(stream=strm))
                  for (s, e), strm in zip(partitions, strmlist)]

    for j in range(1, paths.shape[1]):

        d_pathslist = [cuda.to_device(paths[s:e, j], stream=strm,
                                      to=mm.get(stream=strm))
                       for (s, e), strm in zip(partitions, strmlist)]

        for step, args in zip(steplist, zip(d_lastlist, d_pathslist, d_normlist)):
            d_last, d_paths, d_norm = args
            step(d_last, d_paths, dt, c0, c1, d_norm)

        for d_paths, strm, (s, e) in zip(d_pathslist, strmlist, partitions):
            d_paths.copy_to_host(paths[s:e, j], stream=strm)
            mm.free(d_paths, stream=strm)
        d_lastlist = d_pathslist

    for strm in strmlist:
        strm.synchronize()

if __name__ == '__main__':
    driver(monte_carlo_pricer, pinned=True)
$ python t45.py
Result
         stock price: 22.6720614385
      standard error: 0.0
            paid off: 1.17206143849
        option price: 1.07834858009
Performance
        Mstep/second: 336.40
        time elapsed: 0.297s
$

There's a lot going on in this example, and the general topic of how to write a pipelined/overlapped code in CUDA would be an entire answer by itself, so I will just cover the highlights. The general topic is well covered in this blog post, albeit with CUDA C++ in view rather than numba CUDA (python). However, there is a 1:1 correspondence between most items of interest in numba CUDA and their equivalent counterparts in CUDA C++. Therefore I will assume that basic concepts like CUDA streams, and how they are used to arrange asynchronous concurrent activity, are understood.
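
For reference, the basic numba stream API that the example builds on can be seen in the following minimal sketch (the array size, block size, and the scale kernel are arbitrary choices of mine, not taken from the answer): it queues an asynchronous host-to-device copy, a kernel launch, and a device-to-host copy on a single stream:

from numba import cuda
import numpy as np

@cuda.jit
def scale(x, out):
    i = cuda.grid(1)
    if i < x.shape[0]:
        out[i] = 2.0 * x[i]

a = cuda.pinned_array(1024, dtype=np.float64)    # pinned host memory enables asynchronous copies
a[:] = np.arange(1024)
result = cuda.pinned_array(1024, dtype=np.float64)

stream = cuda.stream()
d_a = cuda.to_device(a, stream=stream)           # H2D copy queued on the stream
d_out = cuda.device_array_like(a, stream=stream)
scale[4, 256, stream](d_a, d_out)                # kernel launch queued on the same stream
d_out.copy_to_host(result, stream=stream)        # D2H copy queued on the same stream
stream.synchronize()                             # block until all queued work has finished

The pricer example above simply does this for several streams at once, interleaving the submissions so that copies on one stream can overlap with kernels on another.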

So what is this example doing? I'll focus mostly on the CUDA aspects.

  • with a view toward overlap of copy and compute operations, the input data (paths) is converted to CUDA pinned memory on the host
  • with a view towards handling the work in chunks, a memory manager (MM) is defined, which will allow chunk allocations of device memory to be reused as the processing proceeds.
  • python lists are created to represent the sequence of chunk processing. There is a list that defines the start and end of each chunk or partition. There is a list that defines the sequence of cuda streams to be used. There is a list of data array partitions that the CUDA kernel will use.
  • then, with these lists, there is an issuance of work in "depth-first-order". For each stream, the data (chunks) necessary for that stream are transferred to the device (queued for transfer), the kernel that will process that data is launched (queued), and the transfer that will send the results from that chunk back to host memory is queued. This process is repeated in the for j loop in monte_carlo_pricer for the number of steps (paths.shape[1]). A stripped-down sketch applying this chunking idea to the question's kernel follows this list.
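
To tie this back to the question's kernel, here is a deliberately stripped-down sketch of the same chunking idea (my own illustration, not part of the original answer): it copies and processes 100,000-row slices one at a time, and omits the memory manager, pinned memory, and stream overlap of the full example, so transfers and compute will serialize:

from numba import cuda
import numpy as np
import math

@cuda.jit
def multiply_rows(a, b, out):
    i = cuda.grid(1)
    if i < a.shape[0]:
        for j in range(a.shape[1]):
            out[i, j] = a[i, j] * b[i, j]

rows, cols = 1000000, 500
big_array = np.random.random_sample((rows, cols))
big_array2 = np.random.random_sample((rows, cols))
save_array = np.zeros((rows, cols))

chunk = 100000            # rows per chunk; chosen so each chunk's arrays fit comfortably in VRAM
threadsperblock = 64

for start in range(0, rows, chunk):
    end = min(start + chunk, rows)
    blockspergrid = math.ceil((end - start) / threadsperblock)
    d_a = cuda.to_device(big_array[start:end])      # only this chunk lives on the device
    d_b = cuda.to_device(big_array2[start:end])
    d_out = cuda.device_array((end - start, cols))
    multiply_rows[blockspergrid, threadsperblock](d_a, d_b, d_out)
    d_out.copy_to_host(save_array[start:end])       # copy this chunk's result back to the host

With 100,000-row chunks, each iteration keeps roughly 1.2 GB on the device instead of the ~11 GiB the original code requested; adding pinned host memory and multiple streams, as the pricer example does, would additionally let the copies overlap with the kernels.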

When I run the above code using a profiler, we can see a timeline that looks like this:

[profiler timeline image]

In this particular case, I am running this on a Quadro K2000, which is an old, small GPU that has only one copy engine. Therefore we see in the profile that at most 1 copy operation is overlapped with CUDA kernel activity, and no copy operations are overlapped with other copy operations. However, if I ran this on a device with 2 copy engines, I would expect that a tighter/denser timeline would be possible, with 2 copy operations and a compute operation overlapping at the same time, for maximum throughput. To achieve this, the number of streams in use (num_streams) would also have to be increased to at least 3.

The code here is not guaranteed to be defect free. It is provided for demonstration purposes. Use it at your own risk.
