Speed of Pascal CUDA8 1080Ti unified memory

Question

Thanks to the answers here yesterday, I think I now have a correct basic test of unified memory using a Pascal 1080Ti. It allocates a 50GB single-dimension array and adds it up. If I understand correctly, it should be memory bound, since the test is so simple (adding integers). However, it takes 24 seconds, which equates to about 2GB/s. When I run the CUDA8 bandwidthTest I see higher rates: 11.7GB/s pinned and 8.5GB/s pageable.

Is there any way to get the test to run faster than 24 seconds?

Here is the complete test code:

$ cat firstAcc.c 

#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>
#include <time.h>

#define GB 50

static double wallclock()
{
  double ans = 0;
  struct timespec tp;
  if (0==clock_gettime(CLOCK_REALTIME, &tp))
      ans = (double) tp.tv_sec + 1e-9 * (double) tp.tv_nsec;
  return ans;
}

int main()
{
  int *a;

  size_t n = (size_t)GB*1024*1024*1024/sizeof(int);
  size_t s = n * sizeof(int);
  printf("n = %lu, GB = %.3f\n", n, (double)s/(1024*1024*1024));
  a = (int *)malloc(s);
  if (!a) { printf("Failed to malloc.\n"); return 1; }

  setbuf(stdout, NULL);
  double t0 = wallclock();
  printf("Initializing ... ");
  for (long i = 0; i < n; ++i) {
    a[i] = i%7-3;
  }
  double t1 = wallclock();
  printf("done in %f (single CPU thread)\n", t1-t0);
  t0=t1;

  int sum = 0;
  #pragma acc parallel loop reduction (+:sum)
  for (long i = 0; i < n; ++i) {
    sum+=a[i];
  }
  t1 = wallclock();
  printf("Sum is %d and it took %f\n", sum, t1-t0);
  free(a);
  return 0;
}

I compile it as follows:

$ pgcc -fast -acc -ta=tesla:managed:cc60 -Minfo=accel firstAcc.c
main:
     40, Accelerator kernel generated
         Generating Tesla code
         40, Generating reduction(+:sum)
         41, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     40, Generating implicit copyin(a[:13421772800])

Then I run it twice:

$ ./a.out
n = 13421772800, GB = 50.000
Initializing ... done in 36.082607 (single CPU thread)
Sum is -5 and it took 23.902612
$ ./a.out
n = 13421772800, GB = 50.000
Initializing ... done in 36.001578 (single CPU thread)
Sum is -5 and it took 24.180615

The result (-5) is correct because I set up the data that way. The numbers are a repeating sequence of the 7 integers -3..+3, which cancel out when summed, apart from the 2 leftover elements at the end (-3 + -2 = -5).
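
As a quick arithmetic check on that remainder: 13,421,772,800 mod 7 = 2, so after all complete -3..+3 periods cancel, the last two elements correspond to i%7 = 0 and 1, i.e. the values -3 and -2, which sum to -5.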

The bandwidthTest (CUDA 8 samples/1_Utilities) result for pageable memory is:

$ ./bandwidthTest --memory=pageable
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 1080 Ti
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     8576.7

 Device to Host Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     11474.3

 Device to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     345412.1

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

I see that note. But what should I use instead? Do these measurements seem in the right ballpark?

Is there anything that can be done to make the test run in more like 6 seconds (50GB / 8.5GB/s) rather than 25s?

The result with --mode=shmoo actually shows pageable reaching a higher rate: 11GB/s.

$ ./bandwidthTest --memory=pageable --mode=shmoo
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX 1080 Ti
 Shmoo Mode

.................................................................................
 Host to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   1024                         160.3
   2048                         302.1
   3072                         439.2
   4096                         538.4
   5120                         604.6
   6144                         765.3
   7168                         875.0
   8192                         979.2
   9216                         1187.3
   10240                        1270.6
   11264                        1335.0
   12288                        1449.3
   13312                        1579.6
   14336                        1622.2
   15360                        1836.0
   16384                        1995.0
   17408                        2133.0
   18432                        2189.8
   19456                        2289.2
   20480                        2369.7
   22528                        2525.8
   24576                        2625.8
   26624                        2766.0
   28672                        2614.4
   30720                        2895.8
   32768                        3050.5
   34816                        3151.1
   36864                        3263.8
   38912                        3339.2
   40960                        3395.6
   43008                        3488.4
   45056                        3557.0
   47104                        3642.1
   49152                        3658.5
   51200                        3736.9
   61440                        4040.4
   71680                        4076.9
   81920                        4310.3
   92160                        4522.6
   102400                       4668.5
   204800                       5461.5
   307200                       5820.7
   409600                       6003.3
   512000                       6153.8
   614400                       6232.5
   716800                       6285.9
   819200                       6368.9
   921600                       6409.3
   1024000                      6442.5
   1126400                      6572.3
   2174976                      8239.3
   3223552                      9041.6
   4272128                      9524.2
   5320704                      9824.5
   6369280                      10065.2
   7417856                      10221.2
   8466432                      10355.7
   9515008                      10452.8
   10563584                     10553.9
   11612160                     10613.1
   12660736                     10680.3
   13709312                     10728.1
   14757888                     10763.8
   15806464                     10804.4
   16855040                     10838.1
   18952192                     10820.9
   21049344                     10949.4
   23146496                     10990.7
   25243648                     11021.6
   27340800                     11028.8
   29437952                     11083.2
   31535104                     11098.9
   33632256                     10993.3
   37826560                     10616.5
   42020864                     10375.5
   46215168                     10186.1
   50409472                     10085.4
   54603776                     10013.9
   58798080                     10004.8
   62992384                     9998.6
   67186688                     10006.4

Thanks.

$ pgcc -V
pgcc 17.4-0 64-bit target on x86-64 Linux -tp haswell 
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.

$ cat /usr/local/cuda-8.0/version.txt 
CUDA Version 8.0.61

Answer

The page faulting process is clearly more complicated than a pure copy of data. As a result, when you drive data to the GPU by page-faulting, it cannot compete performance-wise with a pure copy of the data.

Page faulting essentially introduces another kind of latency for the GPU to deal with. The GPU is a latency-hiding machine, but the programmer needs to give it the opportunity to hide that latency. This can be roughly described as exposing enough parallel work.

On the surface of it, you seem to have exposed a lot of parallel work (~13B elements in your dataset). But the work per byte or element retrieved is quite small, so the GPU still has only limited opportunity to hide the latency associated with page-faulting here. Stated another way, the GPU's instantaneous capacity for latency hiding is bounded by the maximum complement of threads that can be in flight on that GPU (upper bound: 2048 * # of SMs) and by the work exposed in each thread. Unfortunately, the work exposed in each thread in your example is trivially small: basically a single addition.
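
To put rough numbers on that bound: a GTX 1080 Ti has 28 SMs, so at most 2048 x 28 = 57,344 threads can be resident at any instant, and each of them has only a single integer add to perform per 4-byte element it touches, which leaves almost no computation available to cover the page-fault latency.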

One of the ways to help with GPU latency hiding is increasing the work per thread, and there are various techniques to do this. A good starting point would be to choose an algorithm (if possible) that has a high compute complexity. Matrix-matrix multiply is the classical example of large compute complexity per element of data.
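
For instance, multiplying two NxN matrices performs roughly 2N^3 arithmetic operations on only 3N^2 data elements, i.e. on the order of N operations per element moved, whereas the sum in this test performs a single add per element.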

The suggestion in this case would be to recognize that what you are trying to do is quite orderly, and therefore not that difficult to manage from a programming point of view: break the work up into pieces and manage the data transfers yourself, as sketched below. This allows you to achieve approximately the full host->device bandwidth of the link for the data-transfer operations, plus (to a very small extent for this example) overlap of copy and compute. For a problem as straightforward and easily decomposable as this one, it makes sense for the programmer not to use UM/oversubscription/page-faulting.
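
As a rough illustration of that partitioned approach, here is a minimal, untested sketch of summing the array in explicit chunks with plain OpenACC data clauses instead of the managed/page-faulting path. The 256 MB chunk size and the variable names are illustrative assumptions, and it presumes compiling without the managed suboption (e.g. -ta=tesla:cc60) so that the copyin clause actually performs an explicit transfer of each chunk:

#define CHUNK_BYTES (256UL*1024*1024)              /* 256 MB per transfer (arbitrary choice) */
#define CHUNK_ELEMS (CHUNK_BYTES/sizeof(int))

  long long total = 0;
  for (size_t off = 0; off < n; off += CHUNK_ELEMS) {
    size_t len = (n - off < CHUNK_ELEMS) ? (n - off) : CHUNK_ELEMS;
    int partial = 0;
    /* copyin moves just this chunk host->device before the kernel runs */
    #pragma acc parallel loop reduction(+:partial) copyin(a[off:len])
    for (long i = 0; i < len; ++i) {
      partial += a[off + i];
    }
    total += partial;
  }

Pinning the host buffer and adding async queues with double buffering would further allow one chunk to transfer while the previous one is being summed, which is the copy/compute overlap mentioned above; neither refinement is shown here.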

The place where this methodology (UM/oversubscription/page-faulting) may shine is, for example, an algorithm where it's difficult for the programmer to predict the access pattern ahead of time. Traversal of a large graph (one that cannot be resident in GPU memory all at once) might be an example. If you had a graph-traversal problem with a large amount of work per edge traversal, then the cost of page-faulting as you hop from node to node in the graph might not be a big deal, and the simplification of the programming effort (not having to manage the graph data movement explicitly) might be worth it.

Regarding prefetching, it's questionable whether it would be of much use here even if it were available. Prefetching still essentially depends on having something else to do while the prefetch request is in flight. With such a small amount of work per data item, it's not clear that a clever prefetching scheme would really provide much benefit for this example. One can imagine clever, complicated prefetching strategies, but the effort is probably better spent crafting a partitioned, explicit data-transfer scheme for a problem like this.
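
For reference, at the CUDA runtime level chunked prefetching would look roughly like the hypothetical helper below (cudaMemPrefetchAsync was introduced in CUDA 8). This is an untested sketch that assumes the buffer was allocated with cudaMallocManaged and that the program links against the CUDA runtime; and, per the above, it would still leave the kernel with only one add per element to overlap against the transfers:

#include <cuda_runtime.h>

/* Hypothetical helper (untested): queue prefetches of "a" to the given GPU in
   256 MB chunks on the default stream. Chunk size is an arbitrary choice. */
static void prefetch_chunks(const int *a, size_t n, int device)
{
  const size_t chunk = 256UL * 1024 * 1024 / sizeof(int);
  for (size_t off = 0; off < n; off += chunk) {
    size_t len = (n - off < chunk) ? (n - off) : chunk;
    cudaMemPrefetchAsync(a + off, len * sizeof(int), device, 0);
  }
}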
