Understanding Thrust (CUDA) memory usage


Question

I'm using the cuda/thrust library to do some Monte Carlo simulations. This works very well up to a certain number of simulations, where I get a bad_alloc exception. This seems alright, because an increasing number of simulations in my code means handling increasingly large device_vectors, so I expect this kind of exception to show up at some point.

What I'd like to do now is to set an upper limit on this number of simulations based on the available memory on my GPU. Then, I could split the workload in bunches of simulations.
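To make this concrete, the kind of batching I have in mind looks like the sketch below. Note that bytesPerSim and the 80% safety margin are placeholders I made up, not measured values:

#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>

// Hypothetical sketch: derive a batch size from the free memory reported by
// cudaMemGetInfo. bytesPerSim stands in for what one simulation really
// allocates, and the safety factor is a guess, not a measured value.
size_t maxSimsPerBatch(size_t bytesPerSim, double safetyFactor = 0.8)
{
    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);
    // Don't plan on using all of the nominal free memory: the context and
    // allocation padding eat into it.
    size_t budget = static_cast<size_t>(freeMem * safetyFactor);
    return std::max<size_t>(1, budget / bytesPerSim);
}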

So I've been trying to size my problem before launching my set of simulations. Unfortunately, when I'm trying to understand the way the memory is managed with simple examples I get surprising results.

Here is an example of code I have been testing:

#include <iostream>

#include <cuda.h>
#include <thrust/system_error.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <cuda_profiler_api.h>

int main() 
{
    size_t freeMem, totalMem;

    cudaDeviceReset();
    cudaSetDevice(0);

    // Baseline: free memory after context creation, before any allocation
    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << "Total Memory | Free Memory" << std::endl;
    std::cout << totalMem << ", " << freeMem << std::endl;

    thrust::device_vector<float> vec1k(1000, 0);

    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << totalMem << ", " << freeMem << std::endl;

    thrust::device_vector<float> vec100k(100000, 0);

    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << totalMem << ", " << freeMem << std::endl;

    thrust::device_vector<float> vec1M(1000000, 0);

    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << totalMem << ", " << freeMem << std::endl;

    return 0;
}

And here are the results I get:

Total Memory | Free Memory
2147483648, 2080542720
2147483648, 2079494144
2147483648, 2078445568
2147483648, 2074382336

So, basically,

  • the 1,000 element vector (plus everything else needed) uses 1,048,576 bytes
  • the 100,000 element vector also uses 1,048,576 bytes!
  • the 1,000,000 element vector uses 4,063,232 bytes.

I would have expected the memory usage to scale roughly with the number of elements, but going from 100,000 to 1,000,000 elements I get a "4x" where I expected a "10x", and the relationship doesn't hold at all between 1,000 and 100,000 elements.

So, my 2 questions are:

  • Can anyone help me understand those numbers?
  • If I can't estimate the proper amount of memory my code will use, then what would be a good strategy to ensure my program will fit in memory? (The only fallback I can think of is the catch-and-retry sketch below.)
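For reference, here is that fallback as a sketch: catch the bad_alloc and retry with a smaller batch. runBatch is a placeholder for whatever one batch of simulations actually does:

#include <algorithm>
#include <new>
#include <thrust/device_vector.h>

// Placeholder for one batch of simulations; the per-simulation footprint
// (1000 floats here) is made up for illustration.
void runBatch(size_t nSims)
{
    thrust::device_vector<float> state(nSims * 1000);
    // ... launch the simulations for this batch ...
}

// Keep halving the batch size until it fits. Thrust's allocation failure
// derives from std::bad_alloc, so this catch covers it.
void runAll(size_t totalSims)
{
    size_t batch = totalSims;
    size_t done = 0;
    while (done < totalSims) {
        size_t n = std::min(batch, totalSims - done);
        try {
            runBatch(n);
            done += n;
        } catch (const std::bad_alloc&) {
            if (batch == 1) throw;  // even one simulation doesn't fit
            batch /= 2;             // shrink and retry
        }
    }
}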

Edit

Following Mai Longdong's comment, I tried with two vectors, one of 262144 floats (4 bytes each) and the other of 262145. Unfortunately, things don't look like a straight "per 1 MB page" allocation:

  • size of the 1st vector (262144 floats): 1048576 bytes
  • size of the 2nd vector (262145 floats): 1179648 bytes

The delta between the two is 131072 bytes (128 KiB). Could the page size be variable? Does this make sense?

Solution

Thrust doesn't do anything magic with memory management; the default allocator is just cudaMalloc, and what you are seeing is the driver memory manager's page size selection algorithm at work. This isn't documented, and there is no indication that the behaviour is consistent across platforms and hardware versions.
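You can see that Thrust isn't involved by calling cudaMalloc (its default allocator) directly and watching the cudaMemGetInfo deltas; a minimal sketch, using request sizes close to the question's vectors:

#include <cuda_runtime.h>
#include <iostream>

// Allocate with raw cudaMalloc and print how much the reported free memory
// actually drops. The same page-size rounding shows up without Thrust.
int main()
{
    size_t freeBefore = 0, freeAfter = 0, total = 0;
    const size_t requests[] = { 4000, 400000, 4000000 }; // bytes requested

    for (size_t bytes : requests) {
        cudaMemGetInfo(&freeBefore, &total);
        void* p = 0;
        cudaMalloc(&p, bytes);
        cudaMemGetInfo(&freeAfter, &total);
        std::cout << "requested " << bytes << " bytes, free dropped by "
                  << (freeBefore - freeAfter) << " bytes" << std::endl;
        cudaFree(p);
    }
    return 0;
}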

That said, if I expand your code into something a bit more useful:

#include <iostream>
#include <vector>
#include <thrust/system_error.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>

void report_mem(size_t allocd, bool first=false)
{
    size_t freeMem, totalMem;
    cudaMemGetInfo(&freeMem, &totalMem);
    if (first)
        std::cout << "Allocated | Total Memory | Free Memory " << std::endl;
    std::cout << allocd << ", " << totalMem << ", " << freeMem << std::endl;
}

int main() 
{
    cudaSetDevice(0);

    report_mem(0, true);

    // Ten allocations at each of several power-of-two sizes
    std::vector<size_t> asizes;
    const int nallocs = 10;
    for(int i=0; i < nallocs; i++) asizes.push_back(1<<14);
    for(int i=0; i < nallocs; i++) asizes.push_back(1<<16);
    for(int i=0; i < nallocs; i++) asizes.push_back(1<<18);
    for(int i=0; i < nallocs; i++) asizes.push_back(1<<20);
    for(int i=0; i < nallocs; i++) asizes.push_back(1<<22);

    typedef thrust::device_vector<float> dvecf_t;
    // Keep every vector alive (deliberately never freed) so each new
    // allocation's effect on the free memory figure stays visible
    std::vector<dvecf_t*> allocs;
    auto it = asizes.begin();
    for(; it != asizes.end(); ++it) {
        dvecf_t* v = new dvecf_t(*it);
        allocs.push_back(v);
        report_mem(v->capacity() * sizeof(float));
    }
    return 0;
}

and run it on a compute 2.1 device on Windows 64 bit, I get this:

Allocated | Total Memory | Free Memory 
0, 1073741824, 1007849472
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
262144, 1073741824, 1005752320
262144, 1073741824, 1005752320
262144, 1073741824, 1005752320
262144, 1073741824, 1005752320
262144, 1073741824, 1004703744
262144, 1073741824, 1004703744
262144, 1073741824, 1004703744
262144, 1073741824, 1004703744
262144, 1073741824, 1003655168
262144, 1073741824, 1003655168
1048576, 1073741824, 1002606592
1048576, 1073741824, 1001558016
1048576, 1073741824, 1000509440
1048576, 1073741824, 999460864
1048576, 1073741824, 998412288
1048576, 1073741824, 997363712
1048576, 1073741824, 996315136
1048576, 1073741824, 995266560
1048576, 1073741824, 994217984
1048576, 1073741824, 993169408
4194304, 1073741824, 988975104
4194304, 1073741824, 984780800
4194304, 1073741824, 980586496
4194304, 1073741824, 976392192
4194304, 1073741824, 972197888
4194304, 1073741824, 968003584
4194304, 1073741824, 963809280
4194304, 1073741824, 959614976
4194304, 1073741824, 955420672
4194304, 1073741824, 951226368
16777216, 1073741824, 934449152
16777216, 1073741824, 917671936
16777216, 1073741824, 900894720
16777216, 1073741824, 884117504
16777216, 1073741824, 867340288
16777216, 1073741824, 850563072
16777216, 1073741824, 833785856
16777216, 1073741824, 817008640
16777216, 1073741824, 800231424

which I interpret as indicating that the allocation granularity is 1MiB (1048576 or 2^20 bytes) on the platform I tested this on. Your platform might be different.
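If you need to budget memory despite this, a defensive approach is to measure the granularity on your target platform and round every allocation request up to it before summing; a sketch, assuming the 1 MiB granularity measured above:

#include <cstddef>

// Estimate device footprint by rounding each request up to the allocation
// granularity. 1 MiB matches what was measured above, but it is a
// platform-specific observation: re-measure on your own hardware.
const size_t kGranularity = 1 << 20; // 1 MiB, measured, not guaranteed

size_t roundedFootprint(const size_t* requestBytes, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += ((requestBytes[i] + kGranularity - 1) / kGranularity)
                 * kGranularity;
    return total;
}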
