Why does _mm_stream_ps produce L1/LL cache misses?


Question

I'm trying to optimize a computation-intensive algorithm and am kind of stuck on a cache problem. I have a huge buffer which is written to occasionally and at random, and read only once at the end of the application. Obviously, writing into the buffer produces lots of cache misses and, besides, pollutes the caches, which are afterwards needed again for the computation. I tried to use non-temporal move intrinsics, but the cache misses (reported by valgrind and supported by runtime measurements) still occur. However, to further investigate non-temporal moves, I wrote a little test program, which you can see below. Sequential access, large buffer, only writes.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <smmintrin.h>

void tim(const char *name, void (*func)()) {
    struct timespec t1, t2;
    clock_gettime(CLOCK_REALTIME, &t1);
    func();
    clock_gettime(CLOCK_REALTIME, &t2);
    printf("%s : %f s.\n", name, (t2.tv_sec - t1.tv_sec) + (float) (t2.tv_nsec - t1.tv_nsec) / 1000000000);
}

const int CACHE_LINE = 64;
const int FACTOR = 1024;
float *arr;
int length;

void func1() {
    for(int i = 0; i < length; i++) {
        arr[i] = 5.0f;
    }
}

void func2() {
    for(int i = 0; i < length; i += 4) {
        arr[i] = 5.0f;
        arr[i+1] = 5.0f;
        arr[i+2] = 5.0f;
        arr[i+3] = 5.0f;
    }
}

void func3() {
    __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);
    for(int i = 0; i < length; i += 4) {
        _mm_stream_ps(&arr[i], buf);
    }
}

void func4() {
    __m128 buf = _mm_setr_ps(5.0f, 5.0f, 5.0f, 5.0f);
    for(int i = 0; i < length; i += 16) {
        _mm_stream_ps(&arr[i], buf);
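        /* note: the following three stores use the constant indices 4, 8 and 12
           (presumably i+4, i+8 and i+12 were intended), so every iteration
           rewrites the same few positions near the start of the array */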
        _mm_stream_ps(&arr[4], buf);
        _mm_stream_ps(&arr[8], buf);
        _mm_stream_ps(&arr[12], buf);
    }
}

int main() {
    length = CACHE_LINE * FACTOR * FACTOR;

    arr = malloc(length * sizeof(float));
    tim("func1", func1);
    free(arr);

    arr = malloc(length * sizeof(float));
    tim("func2", func2);
    free(arr);

    arr = malloc(length * sizeof(float));
    tim("func3", func3);
    free(arr);

    arr = malloc(length * sizeof(float));
    tim("func4", func4);
    free(arr);

    return 0;
}

Function 1 is the naive approach; function 2 uses loop unrolling. Function 3 uses movntps, which in fact was emitted in the assembly, at least when I checked the -O0 output. In function 4 I tried to issue several movntps instructions at once to help the CPU do its write combining. I compiled the code with gcc -g -lrt -std=gnu99 -OX -msse4.1 test.c, where X is one of [0..3]. The results are... interesting, to say the least:

-O0
func1 : 0.407794 s.
func2 : 0.320891 s.
func3 : 0.161100 s.
func4 : 0.401755 s.
-O1
func1 : 0.194339 s.
func2 : 0.182536 s.
func3 : 0.101712 s.
func4 : 0.383367 s.
-O2
func1 : 0.108488 s.
func2 : 0.088826 s.
func3 : 0.101377 s.
func4 : 0.384106 s.
-O3
func1 : 0.078406 s.
func2 : 0.084927 s.
func3 : 0.102301 s.
func4 : 0.383366 s.

As you can see, _mm_stream_ps is a little faster than the others when the program is not optimized by gcc, but then significantly fails its purpose once gcc optimization is turned on. Valgrind still reports lots of cache write misses.

So, the questions are: why do those (L1+LL) cache misses still occur even though I'm using NTA streaming instructions? And why is func4 in particular so slow?! Can someone explain/speculate what is happening here?

Solution

  1. Your benchmark probably measures mostly memory-allocation performance, not just write performance. Your OS may allocate memory pages not in malloc but on first touch, inside your func* functions. The OS may also shuffle memory around after a large amount of it has been allocated, so any benchmark performed right after a memory allocation may be unreliable.
  2. Your code has an aliasing problem: the compiler cannot guarantee that the array's pointer does not change while the array is being filled, so it always has to load the value of arr from memory instead of keeping it in a register. This can cost some performance. The easiest way to avoid the aliasing is to copy arr and length into local variables and use only the local variables to fill the array. There is plenty of well-known advice to avoid global variables; aliasing is one of the reasons why.
  3. _mm_stream_ps works better if the array is aligned to 64 bytes. In your code no alignment is guaranteed (in practice, malloc aligns it to 16 bytes). This optimization is noticeable only for short arrays.
  4. It is a good idea to call _mm_mfence after you have finished with _mm_stream_ps. This is needed for correctness, not for performance. A sketch that applies all four suggestions follows this list.
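
Pulling the four points together, here is a minimal sketch (my illustration, not code from the question or the answer) of how the streaming benchmark could be restructured. The name func3_fixed is made up, and posix_memalign assumes a POSIX system; with -std=gnu99 it is declared by stdlib.h.

#include <stdlib.h>
#include <string.h>
#include <smmintrin.h>

/* (2) take arr and length as parameters: with locals instead of globals
   the compiler can keep them in registers, with no aliasing reloads */
void func3_fixed(float *arr, int length) {
    __m128 buf = _mm_set1_ps(5.0f);
    for (int i = 0; i < length; i += 4) {
        _mm_stream_ps(&arr[i], buf);
    }
    _mm_mfence();  /* (4) fence once after the non-temporal stores */
}

int main(void) {
    int length = 64 * 1024 * 1024;
    float *arr;

    /* (3) 64-byte alignment instead of the 16 bytes malloc guarantees */
    if (posix_memalign((void **)&arr, 64, length * sizeof(float)) != 0)
        return 1;

    /* (1) touch every page once before timing, so the benchmark does not
       measure the OS faulting pages in on first write */
    memset(arr, 0, length * sizeof(float));

    func3_fixed(arr, length);  /* wrap this call in tim() to measure */

    free(arr);
    return 0;
}

memset is just the simplest way to force the first touch of every page before timing; writing one byte per page would also do. To measure, wrap the func3_fixed call in the tim() helper from the question (which still needs -lrt on older glibc).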
