Array size and copy performance


Question


I'm sure this has been answered before, but I can't find a good explanation.


I'm writing a graphics program where a part of the pipeline is copying voxel data to OpenCL page-locked (pinned) memory. I found that this copy procedure is a bottleneck and made some measurements on the performance of a simple std::copy. The data is floats, and every chunk of data that I want to copy is around 64 MB in size.


This is my original code, before any attempts at benchmarking:

std::copy(data, data+numVoxels, pinnedPointer_[_index]);


where data is a float pointer, numVoxels is an unsigned int, and pinnedPointer_[_index] is a float pointer referencing a pinned OpenCL buffer.


Since I got slow performance of that, I decided to try copying smaller parts of the data instead and see what kind of bandwidth I got. I used boost::cpu_timer for timing. I've tried running it for some time as well as averaging over a couple of hundred runs, getting similar results. Here is relevant code along with the results:

boost::timer::cpu_timer t;                                                    
unsigned int testNum = numVoxels;                                             
while (testNum > 2) {                                                         
  t.start();                                                                  
  std::copy(data, data+testNum, pinnedPointer_[_index]);                      
  t.stop();                                                                   
  boost::timer::cpu_times result = t.elapsed();                               
  double time = (double)result.wall / 1.0e9 ;                                 
  int size = testNum*sizeof(float);                                           
  double GB = (double)size / 1073741824.0;                                    
  // Print results  
  testNum /= 2;                                                               
}

Copied 67108864 bytes in 0.032683s, 1.912315 GB/s
Copied 33554432 bytes in 0.017193s, 1.817568 GB/s
Copied 16777216 bytes in 0.008586s, 1.819749 GB/s
Copied 8388608 bytes in 0.004227s, 1.848218 GB/s
Copied 4194304 bytes in 0.001886s, 2.071705 GB/s
Copied 2097152 bytes in 0.000819s, 2.383543 GB/s
Copied 1048576 bytes in 0.000290s, 3.366923 GB/s
Copied 524288 bytes in 0.000063s, 7.776913 GB/s
Copied 262144 bytes in 0.000016s, 15.741867 GB/s
Copied 131072 bytes in 0.000008s, 15.213149 GB/s
Copied 65536 bytes in 0.000004s, 14.374742 GB/s
Copied 32768 bytes in 0.000003s, 10.209962 GB/s
Copied 16384 bytes in 0.000001s, 10.344942 GB/s
Copied 8192 bytes in 0.000001s, 6.476566 GB/s
Copied 4096 bytes in 0.000001s, 4.999603 GB/s
Copied 2048 bytes in 0.000001s, 1.592111 GB/s
Copied 1024 bytes in 0.000001s, 1.600125 GB/s
Copied 512 bytes in 0.000001s, 0.843960 GB/s
Copied 256 bytes in 0.000001s, 0.210990 GB/s
Copied 128 bytes in 0.000001s, 0.098439 GB/s
Copied 64 bytes in 0.000001s, 0.049795 GB/s
Copied 32 bytes in 0.000001s, 0.049837 GB/s
Copied 16 bytes in 0.000001s, 0.023728 GB/s


There is a clear bandwidth peak at copying chunks of 65536-262144 bytes, and the bandwidth is very much higher than copying the full array (15 vs 2 GB/s).


Knowing this, I decided to try another thing and copied the full array, but using repeated calls to std::copy where each call just handled part of the array. Trying different chunk sizes, these are my results:

unsigned int testNum = numVoxels;                                             
unsigned int parts = 1;                                                       
while (sizeof(float)*testNum > 256) {                                         
  t.start();                                                                  
  for (unsigned int i=0; i<parts; ++i) {                                      
    std::copy(data+i*testNum, 
              data+(i+1)*testNum, 
              pinnedPointer_[_index]+i*testNum);
  }                                                                           
  t.stop();                                                                   
  boost::timer::cpu_times result = t.elapsed();                               
  double time = (double)result.wall / 1.0e9;                                  
  int size = testNum*sizeof(float);                                           
  double GB = parts*(double)size / 1073741824.0;                              
  // Print results
  parts *= 2;                                                                 
  testNum /= 2;                                                               
}      

Part size 67108864 bytes, copied 0.0625 GB in 0.0331298s, 1.88652 GB/s
Part size 33554432 bytes, copied 0.0625 GB in 0.0339876s, 1.83891 GB/s
Part size 16777216 bytes, copied 0.0625 GB in 0.0342558s, 1.82451 GB/s
Part size 8388608 bytes, copied 0.0625 GB in 0.0334264s, 1.86978 GB/s
Part size 4194304 bytes, copied 0.0625 GB in 0.0287896s, 2.17092 GB/s
Part size 2097152 bytes, copied 0.0625 GB in 0.0289941s, 2.15561 GB/s
Part size 1048576 bytes, copied 0.0625 GB in 0.0240215s, 2.60184 GB/s
Part size 524288 bytes, copied 0.0625 GB in 0.0184499s, 3.38756 GB/s
Part size 262144 bytes, copied 0.0625 GB in 0.0186002s, 3.36018 GB/s
Part size 131072 bytes, copied 0.0625 GB in 0.0185958s, 3.36097 GB/s
Part size 65536 bytes, copied 0.0625 GB in 0.0185735s, 3.365 GB/s
Part size 32768 bytes, copied 0.0625 GB in 0.0186523s, 3.35079 GB/s
Part size 16384 bytes, copied 0.0625 GB in 0.0187756s, 3.32879 GB/s
Part size 8192 bytes, copied 0.0625 GB in 0.0182212s, 3.43007 GB/s
Part size 4096 bytes, copied 0.0625 GB in 0.01825s, 3.42465 GB/s
Part size 2048 bytes, copied 0.0625 GB in 0.0181881s, 3.43631 GB/s
Part size 1024 bytes, copied 0.0625 GB in 0.0180842s, 3.45605 GB/s
Part size 512 bytes, copied 0.0625 GB in 0.0186669s, 3.34817 GB/s


It seems like decreasing the chunk size actually has a significant effect, but I still can't get anywhere near 15 GB/s.


I'm running 64-bit Ubuntu; the GCC optimization level doesn't make much difference.
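For anyone reproducing these measurements without Boost, the timing can be done with std::chrono instead of boost::timer::cpu_timer; a minimal sketch (the helper's name is my own, not from the question):

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>

// Time a single std::copy of `testNum` floats from src to dst and return
// the elapsed wall-clock time in seconds (std::chrono standing in for
// boost::timer::cpu_timer's wall field).
double time_copy(const float* src, float* dst, std::size_t testNum) {
    const auto start = std::chrono::steady_clock::now();
    std::copy(src, src + testNum, dst);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Bandwidth is then testNum * sizeof(float) / 1073741824.0 / seconds, in GB/s, as in the loops above.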

  1. Why does the array size affect the bandwidth in this way?
  2. Does the OpenCL pinned memory play a role here?
  3. What are good strategies for optimizing large array copies?

Answer


I'm pretty sure you are running into cache thrashing. When you fill the cache with data you have written, then the next time some data is needed, the cache has to read it from memory, but first it needs to find space for it in the cache. Because all the data (or at least a lot of it) is "dirty" from having been written to, it must be written out to RAM first. Then we write a new bit of data to the cache, which evicts another bit of dirty data (or something we read in earlier).


In assembler, we can overcome this by using a "non-temporal" move instruction, such as the SSE instruction movntps. This instruction will "avoid storing things in the cache".
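To illustrate, here is a minimal sketch of a streaming copy built on the intrinsic for movntps (_mm_stream_ps); it assumes an x86 target and a 16-byte-aligned destination, and the function name is mine, not from the answer:

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_stream_ps, _mm_sfence
#include <cstddef>

// Copy n floats (n a multiple of 4, dst 16-byte aligned) using
// non-temporal stores, so the written data bypasses the cache.
void copy_stream(const float* src, float* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        const __m128 v = _mm_loadu_ps(src + i);  // ordinary (cached) load
        _mm_stream_ps(dst + i, v);               // non-temporal store (movntps)
    }
    _mm_sfence();  // order the streamed stores before returning
}
```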


You can also get better performance by not mixing reads and writes: use a small buffer (a fixed-size array) of, say, 4-16 KB, copy data into that buffer, then write the buffer out to the new location. Again, ideally use non-temporal writes, as that will improve throughput even in this case; but just reading and then writing in blocks, rather than alternating one read with one write, will go much faster.

Something like this:

   float temp[2048]; 
   int left_to_do = numVoxels;
   int offset = 0;

   while (left_to_do > 0)
   {
      int block = std::min(left_to_do, (int)(sizeof(temp)/sizeof(temp[0]))); 
      std::copy(data+offset, data+offset+block, temp);                      
      std::copy(temp, temp+block, pinnedPointer_[_index]+offset);                      
      offset += block;
      left_to_do -= block;
   }


Try that, and see if it improves things. It may not...

Edit 2: I should explain that this is faster because you are re-using the same bit of cache to load data into every time, and by not mixing the reading and writing, we get better performance from the memory itself.
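For completeness, the pattern above can be packaged as a self-contained function (the name and staging-buffer size are illustrative):

```cpp
#include <algorithm>
#include <cstddef>

// Block-buffered copy: read a small block into a cache-resident staging
// buffer, then write it out, so reads and writes to main memory are not
// interleaved element by element.
void buffered_copy(const float* src, float* dst, std::size_t n) {
    float temp[2048];  // ~8 KB staging buffer, small enough to stay in cache
    constexpr std::size_t kBlock = sizeof(temp) / sizeof(temp[0]);
    std::size_t offset = 0;
    while (offset < n) {
        const std::size_t block = std::min(n - offset, kBlock);
        std::copy(src + offset, src + offset + block, temp);  // read phase
        std::copy(temp, temp + block, dst + offset);          // write phase
        offset += block;
    }
}
```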

