Prefetching Examples?


Question

Can anyone give an example, or a link to one, that uses __builtin_prefetch in GCC (or just the asm instruction prefetcht0 in general) to gain a substantial performance advantage? In particular, I'd like the example to meet the following criteria:

  1. It is a simple, small, self-contained example.
  2. Removing the __builtin_prefetch instruction results in performance degradation.
  3. Replacing the __builtin_prefetch instruction with the corresponding memory access results in performance degradation.

That is, I want the shortest example that shows __builtin_prefetch performing an optimization that couldn't be achieved without it.

Recommended answer

Here's an actual piece of code that I've pulled out of a larger project. (Sorry, it's the shortest one I could find that had a noticeable speedup from prefetching.) This code performs a very large data transpose.

This example uses the SSE prefetch intrinsic, _mm_prefetch, which may compile to the same instruction that GCC emits for __builtin_prefetch.
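For readers on GCC or Clang, here is a minimal sketch of how the two spellings could be put behind one helper. The names prefetch_t0 and sum_with_prefetch are hypothetical and do not come from the original answer. The call __builtin_prefetch(p, 0, 3) requests a read prefetch with the highest temporal locality, which on x86 targets GCC typically lowers to prefetcht0, the same instruction that _mm_prefetch(p, _MM_HINT_T0) asks for. The summing loop and its distance of 16 elements only illustrate where the call goes; a sequential scan like this is not expected to get faster, since hardware prefetchers already handle it.

```cpp
#include <cassert>
#include <cstddef>

#if defined(_MSC_VER)
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0
#endif

// Hypothetical helper (not from the original answer): one spelling for both compilers.
inline void prefetch_t0(const void *p){
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(p, 0, 3);               // rw = 0 (read), locality = 3 (keep in all cache levels)
#elif defined(_MSC_VER)
    _mm_prefetch((const char*)p, _MM_HINT_T0);
#else
    (void)p;                                   // unknown compiler: do nothing, a prefetch is only a hint
#endif
}

// Illustration only: sums an array while hinting an element a fixed distance ahead.
inline double sum_with_prefetch(const double *data, std::size_t n){
    assert(data != nullptr || n == 0);
    const std::size_t distance = 16;           // assumed value, would need tuning per machine
    double total = 0.0;
    for (std::size_t i = 0; i < n; i++){
        if (i + distance < n)                  // keep the address expression inside the array
            prefetch_t0(data + i + distance);
        total += data[i];
    }
    return total;
}
```

Because a prefetch never changes program results, swapping prefetch_t0 in or out of a loop like this affects timing only.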

To run this example, you will need to compile it for x64 and have more than 4 GB of memory. You can run it with a smaller data size, but it will finish too quickly to time reliably.
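The 4 GB requirement follows from the sizes that main uses: a 4096 x 4096 matrix of units, 16 vectors per unit, and 16 bytes per vector (one __m128d). A quick sketch of that arithmetic follows; the helper name transpose_bytes is hypothetical, and the vector size is passed as a plain number rather than taken from sizeof(__m128d) so that the snippet compiles on any target.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes needed for a dimension x dimension matrix of units,
// where each unit holds `block` vectors of `vector_bytes` bytes.
inline std::uint64_t transpose_bytes(std::uint64_t dimension,
                                     std::uint64_t block,
                                     std::uint64_t vector_bytes){
    assert(dimension > 1 && block > 0);   // same preconditions that transpose_even states
    return block * dimension * dimension * vector_bytes;
}

// transpose_bytes(4096, 16, 16) is 2^12 * 2^12 * 2^4 * 2^4 = 2^32 = 4294967296 bytes.
```

That value matches the "bytes = 4294967296" line the program prints, and it is also why a 32-bit build cannot hold the buffer.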

#include <iostream>
using std::cout;
using std::endl;

#include <emmintrin.h>
#include <malloc.h>
#include <stdlib.h>     //  system(), exit()
#include <time.h>
#include <string.h>

#define ENABLE_PREFETCH


#define f_vector    __m128d
#define i_ptr       size_t
inline void swap_block(f_vector *A,f_vector *B,i_ptr L){
    //  To be super-optimized later.

    f_vector *stop = A + L;

    do{
        f_vector tmpA = *A;
        f_vector tmpB = *B;
        *A++ = tmpB;
        *B++ = tmpA;
    }while (A < stop);
}
void transpose_even(f_vector *T,i_ptr block,i_ptr x){
    //  Transposes T.
    //  T contains x columns and x rows.
    //  Each unit is of size (block * sizeof(f_vector)) bytes.

    //Conditions:
    //  - 0 < block
    //  - 1 < x

    i_ptr row_size = block * x;
    i_ptr iter_size = row_size + block;

    //  End of entire matrix.
    f_vector *stop_T = T + row_size * x;
    f_vector *end = stop_T - row_size;

    //  Iterate each row.
    f_vector *y_iter = T;
    do{
        //  Iterate each column.
        f_vector *ptr_x = y_iter + block;
        f_vector *ptr_y = y_iter + row_size;

        do{

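#ifdef ENABLE_PREFETCH
            //  Hint the start of the block that the next iteration will swap
            //  (one row further down the current column).
            _mm_prefetch((char*)(ptr_y + row_size),_MM_HINT_T0);
#endif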

            swap_block(ptr_x,ptr_y,block);

            ptr_x += block;
            ptr_y += row_size;
        }while (ptr_y < stop_T);

        y_iter += iter_size;
    }while (y_iter < end);
}
int main(){

    i_ptr dimension = 4096;
    i_ptr block = 16;

    i_ptr words = block * dimension * dimension;
    i_ptr bytes = words * sizeof(f_vector);

    cout << "bytes = " << bytes << endl;
//    system("pause");

    f_vector *T = (f_vector*)_mm_malloc(bytes,16);
    if (T == NULL){
        cout << "Memory Allocation Failure" << endl;
        system("pause");
        exit(1);
    }
    memset(T,0,bytes);

    //  Perform in-place data transpose
    cout << "Starting Data Transpose...   ";
    clock_t start = clock();
    transpose_even(T,block,dimension);
    clock_t end = clock();

    cout << "Done" << endl;
    cout << "Time: " << (double)(end - start) / CLOCKS_PER_SEC << " seconds" << endl;

    _mm_free(T);
    system("pause");
}
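Since the full-size run needs 4 GB, it can help to check the indexing on a tiny matrix first. The sketch below is not the author's code: it follows the same diagonal walk as transpose_even, but uses array indices instead of pointers, double in place of f_vector, and no prefetch (a prefetch is only a hint, so leaving it out does not change the result). The names transpose_even_scalar and transpose_check are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Same walk as transpose_even: for each diagonal unit, swap the units to its
// right with the units below it. `double` stands in for f_vector.
inline void transpose_even_scalar(double *T, std::size_t block, std::size_t x){
    assert(block > 0 && x > 1);
    const std::size_t row_size  = block * x;          // slots per row of units
    const std::size_t iter_size = row_size + block;   // step from one diagonal unit to the next
    const std::size_t total     = row_size * x;       // slots in the whole matrix

    for (std::size_t diag = 0; diag < total - row_size; diag += iter_size){
        std::size_t ix = diag + block;                // unit to the right of the diagonal
        std::size_t iy = diag + row_size;             // unit below the diagonal
        while (iy < total){
            std::swap_ranges(T + ix, T + ix + block, T + iy);
            ix += block;
            iy += row_size;
        }
    }
}

// Fills every slot with its own index, transposes, and checks that slot k of
// unit (r, c) now holds what slot k of unit (c, r) held before.
inline bool transpose_check(std::size_t block, std::size_t x){
    std::vector<double> m(block * x * x);
    for (std::size_t i = 0; i < m.size(); i++)
        m[i] = (double)i;

    transpose_even_scalar(m.data(), block, x);

    for (std::size_t r = 0; r < x; r++)
        for (std::size_t c = 0; c < x; c++)
            for (std::size_t k = 0; k < block; k++)
                if (m[(r * x + c) * block + k] != (double)((c * x + r) * block + k))
                    return false;
    return true;
}
```

Calling transpose_check(16, 8), for example, exercises the same block size as the benchmark on an 8 x 8 grid of units.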

When I run it with ENABLE_PREFETCH defined, this is the output:

bytes = 4294967296
Starting Data Transpose...   Done
Time: 0.725 seconds
Press any key to continue . . .

When I run it with ENABLE_PREFETCH undefined, this is the output:

bytes = 4294967296
Starting Data Transpose...   Done
Time: 0.822 seconds
Press any key to continue . . .

So prefetching gives a speedup of about 13%.

Here are some more results (times in seconds):

Operating System: Windows 7 Professional/Ultimate
Compiler: Visual Studio 2010 SP1
Compile Mode: x64 Release

Intel Core i7 860 @ 2.8 GHz, 8 GB DDR3 @ 1333 MHz
Prefetch   : 0.868
No Prefetch: 0.960

Intel Core i7 920 @ 3.5 GHz, 12 GB DDR3 @ 1333 MHz
Prefetch   : 0.725
No Prefetch: 0.822

Intel Core i7 2600K @ 4.6 GHz, 16 GB DDR3 @ 1333 MHz
Prefetch   : 0.718
No Prefetch: 0.796

2 x Intel Xeon X5482 @ 3.2 GHz, 64 GB DDR2 @ 800 MHz
Prefetch   : 2.273
No Prefetch: 2.666
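To compare the machines on equal footing, the same ratio behind the 13% figure can be computed for each pair of times. A minimal sketch, with the hypothetical helper name speedup_percent:

```cpp
#include <cassert>

// Hypothetical helper: how much longer, in percent, the run without prefetching
// takes compared with the run that prefetches.
inline double speedup_percent(double prefetch_seconds, double no_prefetch_seconds){
    assert(prefetch_seconds > 0.0);
    return (no_prefetch_seconds / prefetch_seconds - 1.0) * 100.0;
}
```

Applied to the four machines listed, this gives roughly 10.6% (Core i7 860), 13.4% (Core i7 920), 10.9% (Core i7 2600K), and 17.3% (dual Xeon X5482).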
