CPU cache critical stride test giving unexpected results based on access type

Problem Description

Inspired by this recent question on SO and the answers given, which made me feel very ignorant, I decided I'd spend some time to learn more about CPU caching and wrote a small program to verify whether I am getting this whole thing right (most likely not, I'm afraid). I'll first write down the assumptions that underlie my expectations, so you could possibly stop me here if those are wrong. Based on what I've read, in general:


  1. An n-way associative cache is divided into s sets, each containing n lines, each line having a fixed size L;
  2. Each main memory address A can be mapped into any of the n cache lines of one set;
  3. The set into which address A is mapped can be found by splitting the address space into slots each of the size of one cache line, then computing the index of A's slot (I = A / L), and finally performing a modulo operation to map the index into the target set T (T = I % s), as sketched in the snippet right after this list;
  4. A cache read miss causes a higher delay than a cache write miss, because the CPU is less likely to stall and stay idle while waiting for the main memory line to be fetched.
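
As a quick illustration of assumption 3, the mapping can be written as a small helper (a sketch of my own, with illustrative names; it is not part of the test program shown further below):

#include <cstddef>
#include <cstdint>

// Set index for address A, given a line size of L bytes and s sets.
size_t setIndexFor(uintptr_t A, size_t L, size_t s)
{
    size_t I = A / L;  // index of the cache-line-sized slot containing A
    return I % s;      // target set T = I % s
}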

My first question is: are those assumptions correct?

Assuming they are, I tried to play a bit with these concepts so I could actually see them having a concrete impact on a program. I wrote a simple test that allocates a memory buffer of B bytes and repeatedly accesses locations of that buffer with fixed increments of a given step from the beginning of the buffer (meaning that if B is 14 and the step is 3, I repeatedly visit only locations 0, 3, 6, 9, and 12 - and the same is true if B is 13, 14, or 15):

int index = 0;
for (int i = 0; i < REPS; i++)
{
    index += STEP;
    if (index >= B) { index = 0; }
    buffer[index] = ...; // Do something here!
}

Given the above assumptions, my expectations were that:


  1. When setting STEP equal to the critical stride (i.e. the size of a cache line times the number of sets in the cache, or L * s), performance should be significantly worse than when STEP is set to, for instance, (L * s) + 1, because we would be accessing only memory locations that get mapped into the same set, forcing a cache line to be evicted more frequently from that set and resulting in a higher rate of cache misses;
  2. When STEP is equal to the critical stride, performance should not be affected by the size B of the buffer, as long as this is not too small (otherwise too few locations would be visited and there would be fewer cache misses); otherwise, the performance should be affected by B, because with a bigger buffer we are more likely to access locations that get mapped into different sets (especially if STEP is not a multiple of 2);
  3. The performance loss should be worse when reading from and writing to each buffer location than when only writing to those locations: writing to a memory location should not require waiting for the corresponding line to be fetched, so the fact of accessing memory locations that map into the same set (again, by using the critical stride as STEP) should have a minor impact.

So I used RightMark Memory Analyzer to find out the parameters of my L1 CPU data cache, tuned up the sizes in my program, and tried it out. This is how I wrote the main cycle (onlyWriteToCache is a flag that can be set from command line):

    ...
    for (int i = 0; i < REPS; i++)
    {
        ...
        if (onlyWriteToCache)
        {
            buffer[index] = (char)(index % 255);
        }
        else
        {
            buffer[index] = (char)(buffer[index] % 255);
        }
    }


What I found was that:

  • Expectations 1) and 2) were confirmed;
  • Expectation 3) was not confirmed.

This fact strikes me, and makes me think there is something I did not get quite right. When B is 256 MB and STEP is equal to the critical stride, the test (compiled with -O3 on GCC 4.7.1) shows that:


  • The write-only version of the cycle suffers from an average ~6x performance loss (6.234s vs 1.078s);
  • The read-write version of the cycle suffers from an average ~1.3x performance loss (6.671s vs 5.25s).

So my second question is: why this difference? I would expect the performance loss to be higher when reading and writing than when only writing.

For the sake of completeness, below is the program I wrote for doing the tests, where the constants reflect the hardware parameters of my machine: the size of the L1 8-way associative data cache is 32 KB and the size L of each cache line is 64 bytes, which gives a total of 64 sets (the CPU has a separate L1 8-way instruction cache of the same size and with identical line size).
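
For reference, plugging these parameters into the assumptions above gives the following figures (my own back-of-the-envelope check, not taken from the original post); they match the SET SIZE, NUMBER OF SETS and critical STEP SIZE values the program prints:

    number of sets   s = 32 KB / (64 B per line × 8 lines per set) = 64
    critical stride  L × s = 64 B × 64 sets = 4096 bytes (incidentally, exactly one 4 KB page)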

#include <iostream>
#include <ctime>
#include <cstdlib>
#include <iterator>
#include <algorithm>

using namespace std;

// Auxiliary functions

constexpr int pow(int base, int exp)
{
    return ((exp == 0) ? 1 : base * pow(base, exp - 1));
}

int main(int argc, char* argv[])
{
    //======================================================================
    // Define behavior from command-line arguments
    //======================================================================

    bool useCriticalStep = false;
    bool onlyWriteToCache = true;
    size_t BUFFER_SIZE = pow(2, 28);
    size_t REPS = pow(2, 27);

    if (argc > 0)
    {
        for (int i = 1; i < argc; i++)
        {
            string option = argv[i];
            if (option == "-c")
            {
                useCriticalStep = true;
            }
            else if (option == "-r")
            {
                onlyWriteToCache = false;
            }
            else if (option[1] == 's')
            {
                string encodedSizeInMB = option.substr(2);
                size_t sizeInMB = atoi(encodedSizeInMB.c_str());
                BUFFER_SIZE = sizeInMB * pow(2, 20);
            }
            else if (option[1] == 'f')
            {
                string encodedNumOfReps = option.substr(2);
                size_t millionsOfReps = atoi(encodedNumOfReps.c_str());
                REPS = millionsOfReps * pow(10, 6);
            }
        }
    }

    //======================================================================
    // Machine parameters
    //======================================================================

    constexpr int CACHE_SIZE = pow(2, 15);
    constexpr int CACHE_LINE_SIZE = 64;
    constexpr int CACHE_LINES_PER_SET = 8;
    constexpr int SET_SIZE = CACHE_LINE_SIZE * CACHE_LINES_PER_SET;
    constexpr int NUM_OF_SETS = CACHE_SIZE / SET_SIZE;

    //======================================================================
    // Print out the machine parameters
    //======================================================================

    cout << "CACHE SIZE: " << CACHE_SIZE / 1024 << " KB" << endl;
    cout << "CACHE LINE SIZE: " << CACHE_LINE_SIZE << " bytes" << endl;
    cout << "CACHE LINES PER SET: " << CACHE_LINES_PER_SET << endl;
    cout << "SET SIZE: " << SET_SIZE << " bytes" << endl;
    cout << "NUMBER OF SETS: " << NUM_OF_SETS << endl;

    fill_n(ostream_iterator<char>(cout), 30, '='); cout << endl;

    //======================================================================
    // Test parameters
    //======================================================================

    const int STEP = NUM_OF_SETS * CACHE_LINE_SIZE + (useCriticalStep ? 0 : 1);

    //======================================================================
    // Print out the test parameters
    //======================================================================

    cout << "BUFFER SIZE: " << BUFFER_SIZE / pow(2, 20) << " MB" << endl;
    cout << "STEP SIZE: " << STEP << " bytes" << endl;
    cout << "NUMBER OF REPS: " << REPS << endl;

    fill_n(ostream_iterator<char>(cout), 30, '='); cout << endl;

    //======================================================================
    // Start the test
    //======================================================================

    char* buffer = new char[BUFFER_SIZE];

    clock_t t1 = clock();

    int index = 0;
    for (size_t i = 0; i < REPS; i++)
    {
        index += STEP;
        if (index >= BUFFER_SIZE)
        {
            index = 0;
        }

        if (onlyWriteToCache)
        {
            buffer[index] = (char)(index % 255);
        }
        else
        {
            buffer[index] = (char)(buffer[index] % 255);
        }
    }

    clock_t t2 = clock();

    //======================================================================
    // Print the execution time (in clock ticks) and cleanup resources
    //======================================================================

    float executionTime = (float)(t2 - t1) / CLOCKS_PER_SEC;
    cout << "EXECUTION TIME: " << executionTime << "s" << endl;

    delete[] buffer;
}

Thank you in advance if you managed to read through this long question.

Recommended Answer

With regard to your expectation number 3, you are right. It is as you might expect. Please check "What Every Programmer Should Know About Memory" for more details. It's an excellent series of articles explaining the memory hierarchy.

So why is it hard to confirm number 3? There are two main reasons: one is memory allocation and the other is virtual-to-physical address translation.

Memory allocation

There is no strict guarantee about what the actual physical address of an allocated memory region will be. When you want to test CPU caches, I always recommend using posix_memalign to force the allocation to a specific boundary. Otherwise you will probably see some weird behavior.
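
For example, the buffer in the test program could be allocated like this instead of with new (a minimal sketch of mine, assuming a 4096-byte page boundary; it is not taken from the original code):

#include <cstdlib>  // posix_memalign, free

// Request BUFFER_SIZE bytes aligned to a 4096-byte page boundary instead of
// relying on whatever alignment `new char[]` happens to return.
char* buffer = nullptr;
if (posix_memalign(reinterpret_cast<void**>(&buffer), 4096, BUFFER_SIZE) != 0)
{
    return 1;  // allocation failed
}
// ... run the timed loop on buffer ...
free(buffer);  // memory from posix_memalign is released with free(), not delete[]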

Address translation

How address translation works is nicely explained in the article I mentioned. To verify your assumption, you have to try to pinpoint the expected behavior. The easiest way to do this is as follows:

Experiment

Allocate a set of k large memory regions (something like 512 MB each) in the form of int arrays and align them all to the 4096-byte page boundary. Now iterate over all elements of the memory regions, incrementally adding more of the k regions to your experiment. Measure the time and normalize it by the number of elements read.

The code could look like this:

#define N 10000000  // elements read per region in each pass

// data[0..k-1] are the k page-aligned int arrays described above.
for (size_t i = 0; i < k; ++i)
{
    size_t sum = 0;
    clock_t t1 = clock();

    // Sweep the first N elements of regions 0..i in lock-step.
    for (size_t j = 0; j < N; ++j)
    {
        for (size_t u = 0; u <= i; ++u)
        {
            sum += data[u][j];
        }
    }

    clock_t t2 = clock();

    // Normalize by the number of elements read; printing `sum` also keeps the
    // compiler from optimizing the loops away.
    double perElement = (double)(t2 - t1) / CLOCKS_PER_SEC / (N * (i + 1.0));
    cout << (i + 1) << " regions: " << perElement << " s/element (sum = " << sum << ")" << endl;
}

So what will happen? All of the large memory regions are aligned to 4 KB and, based on the previous assumptions, all elements in the same "row" (the same index across regions) map into the same cache set. When the number of memory regions accessed in the loop is larger than the associativity of the cache, every access will incur a cache miss and the average processing time per element will increase.

Update

How writes are handled depends on how the cache line is used and on the CPU. Modern CPUs apply the MESI protocol to writes to cache lines to make sure that all parties have the same view of the memory (cache coherency). Typically, before a cache line can be written it must first be read, and it is written back later. Whether you notice the write-back or not depends on how you access the data: if you re-read the cache line, you will probably not notice a difference.

However, while the programmer typically has no influence on how data is stored in the CPU caches, there is one exception when writing. It is possible to perform so-called streaming writes that do not pollute the cache but are instead written directly to memory. These writes are also called non-temporal writes.
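
As an illustration (a sketch under the assumption of an x86/x86-64 CPU with SSE2; this code is not from the original answer), such a store can be issued with the _mm_stream_si32 intrinsic:

#include <emmintrin.h>  // SSE2 intrinsics: _mm_stream_si32, _mm_sfence
#include <cstddef>

// Fill dst[0..count) with non-temporal (MOVNTI) stores that bypass the cache.
void streaming_fill(int* dst, int value, size_t count)
{
    for (size_t i = 0; i < count; ++i)
    {
        _mm_stream_si32(dst + i, value);  // written directly to memory, no cache line allocated
    }
    _mm_sfence();  // make the streaming stores globally visible before continuing
}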
