Why does my 8M L3 cache not provide any benefit for arrays larger than 1M?


Question

I was inspired by this question to write a simple program to test my machine's memory bandwidth in each cache level:

Why vectorizing the loop does not have performance improvement: http://stackoverflow.com/questions/18159455/why-vectorizing-the-loop-does-not-have-performance-improvement

My code uses memset to write to a buffer (or buffers) over and over and measures the speed. It also saves the address of every buffer to print at the end. Here's the listing:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE_KB {8, 16, 24, 28, 32, 36, 40, 48, 64, 128, 256, 384, 512, 768, 1024, 1025, 2048, 4096, 8192, 16384, 200000}
#define TESTMEM 10000000000 // Approximate, in bytes
#define BUFFERS 1

double timer(void)
{
    struct timeval ts;
    double ans;

    gettimeofday(&ts, NULL);
    ans = ts.tv_sec + ts.tv_usec*1.0e-6;

    return ans;
}

int main(int argc, char **argv)
{
    double *x[BUFFERS];
    double t1, t2;
    int kbsizes[] = SIZE_KB;
    double bandwidth[sizeof(kbsizes)/sizeof(int)];
    int iterations[sizeof(kbsizes)/sizeof(int)];
    double *address[sizeof(kbsizes)/sizeof(int)][BUFFERS];
    int i, j, k;

    for (k = 0; k < sizeof(kbsizes)/sizeof(int); k++)
        iterations[k] = TESTMEM/(kbsizes[k]*1024);

    for (k = 0; k < sizeof(kbsizes)/sizeof(int); k++)
    {
        // Allocate
        for (j = 0; j < BUFFERS; j++)
        {
            x[j] = (double *) malloc(kbsizes[k]*1024);
            address[k][j] = x[j];
            memset(x[j], 0, kbsizes[k]*1024);
        }

        // Measure
        t1 = timer();
        for (i = 0; i < iterations[k]; i++)
        {
            for (j = 0; j < BUFFERS; j++)
                memset(x[j], 0xff, kbsizes[k]*1024);
        }
        t2 = timer();
        bandwidth[k] = (BUFFERS*kbsizes[k]*iterations[k])/1024.0/1024.0/(t2-t1);

        // Free
        for (j = 0; j < BUFFERS; j++)
            free(x[j]);
    }

    printf("TESTMEM = %ld\n", TESTMEM);
    printf("BUFFERS = %d\n", BUFFERS);
    printf("Size (kB)\tBandwidth (GB/s)\tIterations\tAddresses\n");
    for (k = 0; k < sizeof(kbsizes)/sizeof(int); k++)
    {
        printf("%7d\t\t%.2f\t\t\t%d\t\t%x", kbsizes[k], bandwidth[k], iterations[k], address[k][0]);
        for (j = 1; j < BUFFERS; j++)
            printf(", %x", address[k][j]);
        printf("\n");
    }

    return 0;
}

And the results (with BUFFERS = 1):

TESTMEM = 10000000000
BUFFERS = 1
Size (kB)   Bandwidth (GB/s)    Iterations  Addresses
      8     52.79               1220703     90b010
     16     56.48               610351      90b010
     24     57.01               406901      90b010
     28     57.13               348772      90b010
     32     45.40               305175      90b010
     36     38.11               271267      90b010
     40     38.02               244140      90b010
     48     38.12               203450      90b010
     64     37.51               152587      90b010
    128     36.89               76293       90b010
    256     35.58               38146       d760f010
    384     31.01               25431       d75ef010
    512     26.79               19073       d75cf010
    768     26.20               12715       d758f010
   1024     26.20               9536        d754f010
   1025     18.30               9527        90b010
   2048     18.29               4768        d744f010
   4096     18.29               2384        d724f010
   8192     18.31               1192        d6e4f010
  16384     18.31               596         d664f010
 200000     18.32               48          cb2ff010

I can easily see the effect of the 32K L1 cache and 256K L2 cache. What I don't understand is why performance drops suddenly after the size of the memset buffer exceeds 1M. My L3 cache is supposed to be 8M. It happens so suddenly too, not tapered at all like when the L1 and L2 cache size was exceeded.

My processor is the Intel i7 3700. The details of the L3 cache from /sys/devices/system/cpu/cpu0/cache are:

level = 3
coherency_line_size = 64
number_of_sets = 8192
physical_line_partition = 1
shared_cpu_list = 0-7
shared_cpu_map = ff
size = 8192K
type = Unified
ways_of_associativity = 16

I thought I would try using multiple buffers - call memset on 2 buffers of 1M each and see if performance would drop. With BUFFERS = 2, I get:

TESTMEM = 10000000000
BUFFERS = 2
Size (kB)   Bandwidth (GB/s)    Iterations  Addresses
      8     54.15               1220703     e59010, e5b020
     16     51.52               610351      e59010, e5d020
     24     38.94               406901      e59010, e5f020
     28     38.53               348772      e59010, e60020
     32     38.31               305175      e59010, e61020
     36     38.29               271267      e59010, e62020
     40     38.29               244140      e59010, e63020
     48     37.46               203450      e59010, e65020
     64     36.93               152587      e59010, e69020
    128     35.67               76293       e59010, 63769010
    256     27.21               38146       63724010, 636e3010
    384     26.26               25431       63704010, 636a3010
    512     26.19               19073       636e4010, 63663010
    768     26.20               12715       636a4010, 635e3010
   1024     26.16               9536        63664010, 63563010
   1025     18.29               9527        e59010, f59420
   2048     18.23               4768        63564010, 63363010
   4096     18.27               2384        63364010, 62f63010
   8192     18.29               1192        62f64010, 62763010
  16384     18.31               596         62764010, 61763010
 200000     18.31               48          57414010, 4b0c3010

It appears that both 1M buffers stay in the L3 cache. But try to increase the size of either buffer ever so slightly and the performance drops.

I've been compiling with -O3. It doesn't make much difference (except possibly unrolling the loops over BUFFERS). I tried with -O0 and it's the same except for the L1 speeds. gcc version is 4.9.1.
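For reference, a compile line consistent with this would be something like the following (the source file name here is an assumption):

gcc -O3 membench.c -o a.out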

To summarize, I have a 2-part question:


  1. Why does my 8 MB L3 cache not provide any benefit for blocks of memory larger than 1 MB?

  2. Why is the drop in performance so sudden?


As suggested by Gabriel Southern, I ran my code with perf using BUFFERS=1 with only one buffer size at a time. This was the full command:

perf stat -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses -r 100 ./a.out 2> perfout.txt

The -r means that perf will run a.out 100 times and return the average statistics.

Output of perf with #define SIZE_KB {1024}:

 Performance counter stats for './a.out' (100 runs):

         1,508,798 dTLB-loads                                                    ( +-  0.02% )
                 0 dTLB-load-misses          #    0.00% of all dTLB cache hits 
       625,967,550 dTLB-stores                                                   ( +-  0.00% )
             1,503 dTLB-store-misses                                             ( +-  0.79% )

       0.360471583 seconds time elapsed                                          ( +-  0.79% )

With #define SIZE_KB {1025}:

 Performance counter stats for './a.out' (100 runs):

         1,670,402 dTLB-loads                                                    ( +-  0.09% )
                 0 dTLB-load-misses          #    0.00% of all dTLB cache hits 
       626,099,850 dTLB-stores                                                   ( +-  0.00% )
             2,115 dTLB-store-misses                                             ( +-  2.19% )

       0.503913416 seconds time elapsed                                          ( +-  0.06% )

So there do seem to be more dTLB misses with the 1025K buffer. However, with a buffer of this size the program makes about 9500 calls to memset, so it is still less than 1 miss per memset call.
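(Taking the counters above at face value: 10,000,000,000 bytes / (1025 * 1024 bytes) ≈ 9,527 calls to memset per run, and 2,115 dTLB store misses / 9,527 calls ≈ 0.22 misses per call, which is far too few to account for the jump from about 0.36 s to 0.50 s.)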

Answer

Short Answer:

Your version of memset starts using non-temporal stores when initializing a region of memory larger than 1 MB. As a result the CPU does not store these lines in its cache, even though your L3 cache is larger than 1 MB. Consequently the performance is limited by the available memory bandwidth in the system for buffer values larger than 1 MB.
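One way to see this effect directly, independent of glibc, is to fill the same buffer once with ordinary cached stores and once with non-temporal streaming stores. The following is a minimal sketch using SSE2 intrinsics (not the author's benchmark and not the glibc code); timing each fill function with the same kind of loop as in the program above should show the cached version running at roughly cache speed for a 1 MB buffer, while the streaming version is held to DRAM bandwidth:

/* Build (assumed file name): gcc -O2 nt_sketch.c -o nt_sketch */
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_store_si128, _mm_stream_si128 */
#include <stdlib.h>     /* posix_memalign, free, size_t */

/* Fill with ordinary 16-byte stores: the written lines land in the cache. */
static void fill_cached(void *buf, size_t bytes, char value)
{
    __m128i v = _mm_set1_epi8(value);
    char *p = (char *)buf;
    size_t i;

    for (i = 0; i + 16 <= bytes; i += 16)
        _mm_store_si128((__m128i *)(p + i), v);
}

/* Fill with non-temporal stores: the written lines bypass the cache hierarchy. */
static void fill_streaming(void *buf, size_t bytes, char value)
{
    __m128i v = _mm_set1_epi8(value);
    char *p = (char *)buf;
    size_t i;

    for (i = 0; i + 16 <= bytes; i += 16)
        _mm_stream_si128((__m128i *)(p + i), v);
    _mm_sfence();  /* order the streaming stores before anything that follows */
}

int main(void)
{
    size_t bytes = 1024 * 1024;  /* 1 MB: fits easily in an 8 MB L3 */
    void *buf;

    /* Both store variants require 16-byte alignment. */
    if (posix_memalign(&buf, 16, bytes))
        return 1;

    fill_cached(buf, bytes, (char)0xff);     /* time this loop ... */
    fill_streaming(buf, bytes, (char)0xff);  /* ... against this one */

    free(buf);
    return 0;
}

This mirrors the distinction between the movdqa and movntd loops shown in the perf output below.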

I tested the code you provided on several different systems and initially focused on investigating the TLB because I thought that there might be thrashing in the 2nd level TLB. However, none of the data I collected confirmed that hypothesis.

Some of the systems that I tested used Arch Linux which has the latest version of glibc, while others used Ubuntu 10.04 which uses an older version of eglibc. I was able to reproduce the behavior described in the question when using a statically linked binary when testing with multiple different CPU architectures. The behavior that I focused on was a significant difference in runtime between when SIZE_KB was 1024 and when it was 1025. The performance difference is explained by a change in the code executed for the slow and fast versions.

I used perf record and perf annotate to collect a trace of the executing assembly code to see what the hot code path was. The code is displayed below using the following format:

percentage of execution time | address | instruction
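(The exact invocation is not shown above; a plausible one, assuming the same ./a.out binary, would be:)

perf record ./a.out
perf report      # identify the hot symbol (memset or one of its variants)
perf annotate    # view the annotated assembly for that symbol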

I've copied the hot loop from the shorter version; it omits most of the addresses and has a line connecting the loop back edge to the loop header.

For the version compiled on Arch Linux the hot loop was (for both 1024 and 1025 sizes):

  2.35 │a0:┌─+movdqa %xmm8,(%rcx)
 54.90 │   │  movdqa %xmm8,0x10(%rcx)
 32.85 │   │  movdqa %xmm8,0x20(%rcx)
  1.73 │   │  movdqa %xmm8,0x30(%rcx)
  8.11 │   │  add    $0x40,%rcx      
  0.03 │   │  cmp    %rcx,%rdx       
       │   └──jne    a0

For the Ubuntu 10.04 binary the hot loop when running with a size of 1024 was:

       │a00:┌─+lea    -0x80(%r8),%r8
  0.01 │    │  cmp    $0x80,%r8     
  5.33 │    │  movdqa %xmm0,(%rdi)  
  4.67 │    │  movdqa %xmm0,0x10(%rdi)
  6.69 │    │  movdqa %xmm0,0x20(%rdi)
 31.23 │    │  movdqa %xmm0,0x30(%rdi)
 18.35 │    │  movdqa %xmm0,0x40(%rdi)
  0.27 │    │  movdqa %xmm0,0x50(%rdi)
  3.24 │    │  movdqa %xmm0,0x60(%rdi)
 16.36 │    │  movdqa %xmm0,0x70(%rdi)
 13.76 │    │  lea    0x80(%rdi),%rdi 
       │    └──jge    a00    

For the Ubuntu 10.04 version running with a buffer size of 1025 the hot loop was:

       │a60:┌─+lea    -0x80(%r8),%r8  
  0.15 │    │  cmp    $0x80,%r8       
  1.36 │    │  movntd %xmm0,(%rdi)    
  0.24 │    │  movntd %xmm0,0x10(%rdi)
  1.49 │    │  movntd %xmm0,0x20(%rdi)
 44.89 │    │  movntd %xmm0,0x30(%rdi)
  5.46 │    │  movntd %xmm0,0x40(%rdi)
  0.02 │    │  movntd %xmm0,0x50(%rdi)
  0.74 │    │  movntd %xmm0,0x60(%rdi)
 40.14 │    │  movntd %xmm0,0x70(%rdi)
  5.50 │    │  lea    0x80(%rdi),%rdi 
       │    └──jge    a60

The key difference here is that the slower version was using movntd (non-temporal store) instructions while the faster versions used movdqa instructions. The Intel Software Developer's Manual says the following about non-temporal stores:

For WC memory type in particular, the processor never appears to read the data into the cache hierarchy. Instead, the non-temporal hint may be implemented by loading a temporary internal buffer with the equivalent of an aligned cache line without filling this data to the cache.

So this seems to explain the behavior where regions written by memset that are larger than 1 MB do not end up in the cache. The next question is why there is a difference between the Ubuntu 10.04 system and the Arch Linux system, and why 1 MB is selected as the cutoff point. To investigate that I looked at the glibc source code:

Looking at the glibc git repo at sysdeps/x86_64/memset.S, the first commit I found interesting was b2b671b677d92429a3d41bf451668f476aa267ed (https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=commit;h=b2b671b677d92429a3d41bf451668f476aa267ed).

The commit description is:

Faster memset on x64

This implementation speed up memset in several ways. First is avoiding expensive computed jump. Second is using fact that arguments of memset are most of time aligned to 8 bytes.

Benchmark results on: kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile_result27_04_13.tar.bz2

And the website referenced has some interesting profiling data.

The diff of the commit (https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=blobdiff;f=sysdeps/x86_64/memset.S;h=6c69f4b442cdbe9a3397272f94296e228b884323;hp=b393efe4457a2861b28318a015c0f41943f390ae;hb=b2b671b677d92429a3d41bf451668f476aa267ed;hpb=2d48b41c8fa610067c4d664ac2339ae6ca43e78c) shows that the code for memset is simplified a lot and the non-temporal stores are removed. This matches what the profiled code from Arch Linux shows.

Looking at the older code (https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=blob;f=sysdeps/x86_64/memset.S;h=b393efe4457a2861b28318a015c0f41943f390ae;hb=80f844c9d898f97e8c9cf7f2571fa1eca46acd46#l865), I saw that the choice of whether to use non-temporal stores appeared to make use of a value described as "the largest cache size":

L(byte32sse2_pre):

    mov    __x86_shared_cache_size(%rip),%r9d  # The largest cache size
    cmp    %r9,%r8
    ja     L(sse2_nt_move_pre)

The code that computes this value is sysdeps/x86_64/cacheinfo.c (https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=blob;f=sysdeps/x86_64/cacheinfo.c;hb=568035b7874a099087b77f7bba3e36a1173787b0).

Although it looks like there is code for calculating the actual shared cache size, the default value is also 1 MB (see cacheinfo.c line 517: https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=blob;f=sysdeps/x86_64/cacheinfo.c;hb=568035b7874a099087b77f7bba3e36a1173787b0#l517):

long int __x86_64_shared_cache_size attribute_hidden = 1024 * 1024;

So I suspect that either the default value is being used, or there is some other reason that the code is selecting 1 MB as the cutoff point.
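As a side note, one way to check what cache sizes glibc detects on a given machine is to query the glibc-specific sysconf values. This is only a sketch and only a proxy for the internal __x86_shared_cache_size variable, but it should reflect similar detection logic:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* These _SC_* names are glibc extensions, not standard POSIX. */
    printf("L1d: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L2:  %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3:  %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}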

In either case the overall answer to your question appears to be that the version of memset on your system is using non-temporal stores when setting a region of memory larger than 1 MB.
