-ta=tesla:managed:cuda8 but cuMemAllocManaged returned error 2: Out of memory


Problem Description

I'm new to OpenACC. I like it very much so far, as I'm already familiar with OpenMP.

I have two 1080Ti cards, each with 9 GB of memory, and 128 GB of RAM. I'm trying a very basic test: allocate an array, initialize it, then sum it up in parallel. This works for 8 GB, but when I increase to 10 GB I get an out-of-memory error. My understanding was that with the unified memory of Pascal (which these cards are) and CUDA 8, I could allocate an array larger than the GPU's memory and the hardware would page it in and out on demand.
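
For reference, here is a minimal standalone sketch of what I understand that to mean at the CUDA runtime level (illustration only; the file name and error handling are my own, and this is not the test I'm actually compiling below):

/* Minimal sketch (illustration only): exercising managed-memory
 * oversubscription directly through the CUDA runtime.
 * Compile with something like: nvcc oversub.c -o oversub  (hypothetical name) */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
  size_t bytes = 10ULL * 1024ULL * 1024ULL * 1024ULL;  /* 10 GB, more than one card holds */
  size_t nelem = bytes / sizeof(float);
  float *a = NULL;

  /* cudaMallocManaged allocates unified (managed) memory; on Pascal with
   * CUDA 8 the driver migrates pages between host and device on demand. */
  cudaError_t err = cudaMallocManaged((void **)&a, bytes, cudaMemAttachGlobal);
  if (err != cudaSuccess) {
    printf("cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  /* Touch the memory from the host; pages migrate to wherever they are used. */
  for (size_t i = 0; i < nelem; ++i) {
    a[i] = 0.1f;
  }
  printf("Allocated and initialized %zu bytes of managed memory\n", bytes);

  cudaFree(a);
  return 0;
}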

Here's my full C code test:

$ cat firstAcc.c 

#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>

#define GB 10

int main()
{
  float *a;
  size_t n = GB*1024*1024*1024/sizeof(float);
  size_t s = n * sizeof(float);
  a = (float *)malloc(s);
  if (!a) { printf("Failed to malloc.\n"); return 1; }
  printf("Initializing ... ");
  for (int i = 0; i < n; ++i) {
    a[i] = 0.1f;
  }
  printf("done\n");
  float sum=0.0;
  #pragma acc loop reduction (+:sum)
  for (int i = 0; i < n; ++i) {
    sum+=a[i];
  }
  printf("Sum is %f\n", sum);
  free(a);
  return 0;
}

As per this article, I use:

$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo firstAcc.c
main:
 20, Loop not fused: function call before adjacent loop
     Generated vector simd code for the loop
 28, Loop not fused: function call before adjacent loop
     Generated vector simd code for the loop containing reductions
     Generated a prefetch instruction for the loop

I need to understand those messages, but for now I don't think they are relevant. Then I run it:

$ ./a.out
malloc: call to cuMemAllocManaged returned error 2: Out of memory
Aborted (core dumped)

This works fine if I change GB to 8. I expected 10 GB to work (despite the GPU card having 9 GB) thanks to the Pascal 1080Ti and CUDA 8.

Have I misunderstood, or what am I doing wrong? Thanks in advance.

$ pgcc -V
pgcc 17.4-0 64-bit target on x86-64 Linux -tp haswell 
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.

$ cat /usr/local/cuda-8.0/version.txt 
CUDA Version 8.0.61

Recommended Answer

I believe a problem is here:

size_t n = GB*1024*1024*1024/sizeof(float);

When I compile that line of code with g++, I get a warning about integer overflow. For some reason the PGI compiler does not warn, but the same badness is occurring under the hood. After the declarations of n and s, if I add a printout like this:

  size_t n = GB*1024*1024*1024/sizeof(float);
  size_t s = n * sizeof(float);
  printf("n = %lu, s = %lu\n", n, s);  // add this line

and compile with PGI 17.04 and run (on a P100 with 16 GB), I get output like this:

$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo m1.c
main:
     16, Loop not fused: function call before adjacent loop
         Generated vector simd code for the loop
     22, Loop not fused: function call before adjacent loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
$ ./a.out
n = 4611686017890516992, s = 18446744071562067968
malloc: call to cuMemAllocManaged returned error 2: Out of memory
Aborted
$

so it's evident that n and s are not what you intended. All of the constants in GB*1024*1024*1024 are of type int, so the product is evaluated in 32-bit arithmetic and overflows (on this platform it wraps to -2147483648); that negative value is then converted to an enormous size_t during the division by sizeof(float), which is exactly the bogus n printed above.
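
To see the arithmetic in isolation, here is a minimal standalone sketch (my own illustration, not part of the original program) that computes the element count both ways:

#include <stdio.h>

/* Minimal sketch: the same size expression computed in 32-bit int
 * arithmetic (overflows) and in 64-bit arithmetic (correct). */
int main(void)
{
  int gb = 10;

  /* All operands are int, so the product is computed in 32-bit arithmetic.
   * Signed overflow is formally undefined behavior; in practice it wraps
   * here, before the division ever converts the value to size_t. */
  size_t n_bad = gb * 1024 * 1024 * 1024 / sizeof(float);

  /* The ULL suffix (or a cast) forces 64-bit arithmetic throughout. */
  size_t n_good = (size_t)gb * 1024ULL * 1024ULL * 1024ULL / sizeof(float);

  printf("n_bad  = %zu\n", n_bad);   /* a huge bogus value, like the one above */
  printf("n_good = %zu\n", n_good);  /* 2684354560 elements for 10 GB of floats */
  return 0;
}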

We can fix this by marking all of those constants with ULL, and then things seem to work correctly for me:

$ cat m1.c
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>

#define GB 20ULL

int main()
{
  float *a;
  size_t n = GB*1024ULL*1024ULL*1024ULL/sizeof(float);
  size_t s = n * sizeof(float);
  printf("n = %lu, s = %lu\n", n, s);
  a = (float *)malloc(s);
  if (!a) { printf("Failed to malloc.\n"); return 1; }
  printf("Initializing ... ");
  for (int i = 0; i < n; ++i) {
    a[i] = 0.1f;
  }
  printf("done\n");
  double sum=0.0;
  #pragma acc loop reduction (+:sum)
  for (int i = 0; i < n; ++i) {
    sum+=a[i];
  }
  printf("Sum is %f\n", sum);
  free(a);
  return 0;
}
$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo m1.c
main:
     16, Loop not fused: function call before adjacent loop
         Generated vector simd code for the loop
     22, Loop not fused: function call before adjacent loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
$ ./a.out
n = 5368709120, s = 21474836480
Initializing ... done
Sum is 536870920.000000
$

Note that I've made another change above as well. I changed the sum accumulation variable from float to double. This is necessary to preserve somewhat "sensible" results when doing a very large reduction across very small quantities.
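
To illustrate why, here is a small standalone sketch (my own, separate from the fix above): a float accumulator effectively stops growing once the running sum is so large that adding 0.1 no longer changes it, while a double accumulator keeps accumulating.

#include <stdio.h>

/* Minimal sketch: summing many small values with a float accumulator
 * versus a double accumulator. */
int main(void)
{
  const long long N = 100000000LL;  /* 1e8 additions of 0.1f */
  float  sum_f = 0.0f;
  double sum_d = 0.0;

  for (long long i = 0; i < N; ++i) {
    sum_f += 0.1f;   /* stalls once 0.1 is below one ulp of the running sum */
    sum_d += 0.1f;   /* double retains enough precision to keep accumulating */
  }

  printf("float  accumulator: %f\n", sum_f);  /* far short of 1e7 */
  printf("double accumulator: %f\n", sum_d);  /* approximately 1e7 */
  return 0;
}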

And, as @MatColgrove pointed out in his answer, I missed a few other things as well.
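
Those other things aren't spelled out here, but two are visible in the listing above: the -Minfo output reports only host vector SIMD code (no Tesla kernels), because #pragma acc loop on its own is not a compute construct, and the int loop index cannot cover n once n exceeds INT_MAX. Here is a sketch of how the program might be corrected (my reading, not necessarily the exact fix from that answer):

#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>

#define GB 20ULL

/* Sketch only: use a real compute construct so the loop is offloaded,
 * and a 64-bit loop index since n exceeds INT_MAX. */
int main(void)
{
  size_t n = GB*1024ULL*1024ULL*1024ULL/sizeof(float);
  float *a = (float *)malloc(n * sizeof(float));
  if (!a) { printf("Failed to malloc.\n"); return 1; }

  for (size_t i = 0; i < n; ++i) {
    a[i] = 0.1f;
  }

  double sum = 0.0;
  /* parallel loop is a compute construct; acc loop alone is not, which is
   * why -Minfo above reported only host SIMD code for the reduction loop. */
  #pragma acc parallel loop reduction(+:sum)
  for (size_t i = 0; i < n; ++i) {
    sum += a[i];
  }
  printf("Sum is %f\n", sum);

  free(a);
  return 0;
}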

