Speed up random memory access using prefetch


Problem Description



I am trying to speed up a single program by using prefetches. The purpose of my program is just for test. Here is what it does:

  1. It uses two int buffers of the same size
  2. It reads one-by-one all the values of the first buffer
  3. It reads the value at that index in the second buffer
  4. It sums all the values taken from the second buffer
  5. It does all the previous steps for bigger and bigger values
  6. At the end, I print the number of voluntary and involuntary CPU context switches (see the sketch just after this list)
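The code below does not actually show step 6; a minimal sketch of how it could be done on a POSIX system with getrusage (the helper name printContextSwitches is mine, not part of the original program):

#include <stdio.h>
#include <sys/resource.h>

/* Prints the voluntary and involuntary context switch counts of the
   current process, as reported by getrusage(). */
void printContextSwitches()
{
  struct rusage usage;
  if (getrusage(RUSAGE_SELF, &usage) == 0)
  {
    printf("Voluntary context switches:   %ld\n", usage.ru_nvcsw);
    printf("Involuntary context switches: %ld\n", usage.ru_nivcsw);
  }
}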

In the very first test, the values in the first buffer are simply their own indices (cf. function createIndexBuffer in the code just below).

It will be more clear in the code of my program:

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/time.h>

#define BUFFER_SIZE ((unsigned long) 4096 * 100000)


unsigned int randomUint()
{
  int value = rand() % UINT_MAX;
  return value;
}


unsigned int * createValueBuffer()
{
  unsigned int * valueBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
  for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
  {
    valueBuffer[i] = randomUint();
  }

  return (valueBuffer);
}


unsigned int * createIndexBuffer()
{
  unsigned int * indexBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
  for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
  {
    indexBuffer[i] = i;
  }

  return (indexBuffer);
}


unsigned long long computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer)
{
  unsigned long long sum = 0;

  for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
  {
    unsigned int index = indexBuffer[i];
    sum += valueBuffer[index];
  }

  return (sum);
}


unsigned int computeTimeInMicroSeconds()
{
  unsigned int * valueBuffer = createValueBuffer();
  unsigned int * indexBuffer = createIndexBuffer();

  struct timeval startTime, endTime;
  gettimeofday(&startTime, NULL);

  unsigned long long sum = computeSum(indexBuffer, valueBuffer);

  gettimeofday(&endTime, NULL);

  printf("Sum = %llu\n", sum);
  free(indexBuffer);
  free(valueBuffer);

  return ((endTime.tv_sec - startTime.tv_sec) * 1000 * 1000) + (endTime.tv_usec - startTime.tv_usec);

}


int main()
{
  printf("sizeof buffers = %ldMb\n", BUFFER_SIZE * sizeof(unsigned int) / (1024 * 1024));
  unsigned int timeInMicroSeconds = computeTimeInMicroSeconds();
  printf("Time: %u micro-seconds = %.3f seconds\n", timeInMicroSeconds, (double) timeInMicroSeconds / (1000 * 1000));
}

If I launch it, I get the following output:

$ gcc TestPrefetch.c -O3 -o TestPrefetch && ./TestPrefetch 
sizeof buffers = 1562Mb
Sum = 439813150288855829
Time: 201172 micro-seconds = 0.201 seconds

Quick and fast!!! According to my knowledge (I may be wrong), one of the reasons the program is so fast is that, as I access my two buffers sequentially, the data can be prefetched into the CPU cache.

We can make it more complex so that the data is (almost) never prefetched into the CPU cache. For example, we can just change the createIndexBuffer function to:

unsigned int * createIndexBuffer()
{
  unsigned int * indexBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
  for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
  {
    indexBuffer[i] = rand() % BUFFER_SIZE;
  }

  return (indexBuffer);
}

Let's try the program once again:

$ gcc TestPrefetch.c -O3 -o TestPrefetch && ./TestPrefetch 
sizeof buffers = 1562Mb
Sum = 439835307963131237
Time: 3730387 micro-seconds = 3.730 seconds

More than 18 times slower!!!

We now arrive at my problem. Given the new createIndexBuffer function, I would like to speed up the computeSum function using prefetching:

unsigned long long computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer)
{
  unsigned long long sum = 0;

  for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
  {
    __builtin_prefetch((char *) &indexBuffer[i + 1], 0, 0);
    unsigned int index = indexBuffer[i];
    sum += valueBuffer[index];
  }

  return (sum);
}

Of course, I also have to change my createIndexBuffer so that it allocates a buffer having one more element.

I relaunch my program: not better! As the prefetch may be slower than one "for" loop iteration, I might prefetch not one element ahead but two elements ahead:

    __builtin_prefetch((char *) &indexBuffer[i + 2], 0, 0);

Not better! Two loop iterations ahead? Not better. Three? I tried distances all the way up to 50 (!!!), but I cannot improve the performance of my function computeSum.
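For reference, rather than editing the offset by hand for each attempt, the distance can be made a compile-time parameter; a minimal sketch under that assumption (PREFETCH_DISTANCE is a name introduced here for illustration, and indexBuffer is assumed to be over-allocated by PREFETCH_DISTANCE elements):

#ifndef PREFETCH_DISTANCE
#define PREFETCH_DISTANCE 8   /* e.g. recompile with -DPREFETCH_DISTANCE=2, 8, 50, ... */
#endif

unsigned long long computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer)
{
  unsigned long long sum = 0;

  for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
  {
    /* Hint the index load PREFETCH_DISTANCE iterations ahead. */
    __builtin_prefetch((char *) &indexBuffer[i + PREFETCH_DISTANCE], 0, 0);
    unsigned int index = indexBuffer[i];
    sum += valueBuffer[index];
  }

  return (sum);
}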

Can anyone help me understand why? Thank you very much for your help.

Solution

I believe that the above code is already optimized automatically by the CPU, without any further room for manual optimization.

1. The main problem is that indexBuffer is accessed sequentially. The hardware prefetcher senses this and prefetches further values automatically, without any need to call prefetch manually. So, during iteration #i, the values indexBuffer[i+1], indexBuffer[i+2], ... are already in the cache. (By the way, there is no need to add an artificial element to the end of the array: memory access errors are silently ignored by prefetch instructions.)

What you really need to do is to prefetch valueBuffer instead:

__builtin_prefetch((char *) &valueBuffer[indexBuffer[i + 1]], 0, 0);
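In context, the loop would look roughly like the sketch below; the i + 1 < BUFFER_SIZE guard is added here because indexBuffer[i + 1] is read by an ordinary load (only the prefetch itself tolerates a bad address, as noted in point 1):

unsigned long long computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer)
{
  unsigned long long sum = 0;

  for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
  {
    if (i + 1 < BUFFER_SIZE)
    {
      /* Normal load of indexBuffer[i + 1], then a prefetch hint for the
         value it points to in valueBuffer. */
      __builtin_prefetch((char *) &valueBuffer[indexBuffer[i + 1]], 0, 0);
    }
    unsigned int index = indexBuffer[i];
    sum += valueBuffer[index];
  }

  return (sum);
}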

2. But adding this prefetch won't help either in such a simple scenario. The cost of accessing memory is hundreds of cycles, while an add instruction takes about one cycle. Your code already spends 99% of its time in memory accesses; a manual prefetch can at best hide that single cycle of computation, so the loop gets no better.

A manual prefetch would really work well if your math were much heavier (try it), for example using an expression with a large number of divisions that are not optimized out (20-30 cycles each), or calling some math function (log, sin).
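As an illustration (a hypothetical variant, not code from the question), a loop like the one below gives the prefetch something to hide behind, because each iteration now spends tens of cycles on integer division:

/* Hypothetical variant with heavier per-element math: the divisions cost
   tens of cycles per iteration, so the prefetched cache line has time to
   arrive before it is needed. */
unsigned long long computeSumHeavy(unsigned int * indexBuffer, unsigned int * valueBuffer)
{
  unsigned long long sum = 0;

  for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
  {
    if (i + 1 < BUFFER_SIZE)
      __builtin_prefetch((char *) &valueBuffer[indexBuffer[i + 1]], 0, 0);

    unsigned int index = indexBuffer[i];
    unsigned int value = valueBuffer[index];
    sum += value / (index + 1) + value % (index + 1);   /* deliberately expensive */
  }

  return (sum);
}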

3. But even that is not guaranteed to help. The dependency between loop iterations is very weak: it only goes through the sum variable. This allows the CPU to execute instructions speculatively: it may start fetching valueBuffer[i+1] concurrently while still executing the math for valueBuffer[i].
