Why is `_mm_stream_si128` much slower than `_mm_storeu_si128` on Skylake-Xeon when writing parts of 2 cache lines? But less effect on Haswell

Problem description

I have code that looks like this (simple load, modify, store) (I've simplified it to make it more readable):

__asm__ __volatile__ ( "vzeroupper" : : : );
while(...) {
  __m128i in = _mm_loadu_si128(inptr);
  __m128i out = in; // real code does more than this, but I've simplified it
  _mm_stream_si128(outptr,out);
  inptr  += 12;
  outptr += 16;
}

This code runs about 5 times faster on our older Haswell hardware compared to our newer Skylake machines. For example, if the while loop runs about 16e9 iterations, it takes 14 seconds on Haswell and 70 seconds on Skylake.

We upgraded to the latest microcode on the Skylake, and also stuck in vzeroupper instructions to avoid any AVX issues. Neither fix had any effect.

outptr is aligned to 16 bytes, so the stream instruction should be writing to aligned addresses (I put in checks to verify this). inptr is not aligned, by design. Commenting out the loads has no effect; the limiting instructions are the stores. outptr and inptr point to different memory regions, with no overlap.
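
For reference, the kind of address check mentioned above can be written as in the following sketch; the is_aligned helper is hypothetical and not part of the original code:

#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if p is aligned to `alignment` bytes.
static inline bool is_aligned(const void* p, std::size_t alignment)
{
    return (reinterpret_cast<std::uintptr_t>(p) % alignment) == 0;
}

// e.g. before the loop:
// assert(is_aligned(outptr, 16));  // _mm_stream_si128 requires a 16-byte-aligned destination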

If I replace the _mm_stream_si128 with _mm_storeu_si128, the code runs way faster on both machines, about 2.9 seconds.

So the two questions are:

1) why is there such a big difference between Haswell and Skylake when writing using the _mm_stream_si128 intrinsic?

2) why does the _mm_storeu_si128 run 5x faster than the streaming equivalent?

I'm a newbie when it comes to intrinsics.

Addendum - test case

Here is the entire test case: https://godbolt.org/z/toM2lB

Here is a summary of the benchmarks I took on two different processors, E5-2680 v3 (Haswell) and 8180 (Skylake).

// icpc -std=c++14  -msse4.2 -O3 -DNDEBUG ../mre.cpp  -o mre
// The following benchmark times were observed on an Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
// and Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
// The command line was
//    perf stat ./mre 100000
//
//   STORER               time (seconds)
//                     E5-2680   8180
// ---------------------------------------------------
//   _mm_stream_si128     1.65   7.29
//   _mm_storeu_si128     0.41   0.40

The ratio of stream to store is 4x or 18x, respectively.

I'm relying on the default new allocator to align my data to 16 bytes. I'm getting lucky here that it is aligned. I have tested that this is true, and in my production application I use an aligned allocator to make absolutely sure it is, as well as checks on the address, but I left that out of the example because I don't think it matters.

Second edit - 64B aligned output

The comment from @Mystical made me check that the outputs were all cache aligned. The writes to the Tile structures are done in 64-B chunks, but the Tiles themselves were not 64-B aligned (only 16-B aligned).

So I changed my test code like this:

#if 0
    std::vector<Tile> tiles(outputPixels/32);
#else
    std::vector<Tile, boost::alignment::aligned_allocator<Tile,64>> tiles(outputPixels/32);
#endif

Now the numbers are quite different:

//   STORER               time (seconds)
//                     E5-2680   8180
// ---------------------------------------------------
//   _mm_stream_si128     0.19   0.48
//   _mm_storeu_si128     0.25   0.52

So everything is much faster. But the Skylake is still slower than Haswell by a factor of 2.
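
As an aside (not part of the original test), an alternative to the custom aligned allocator is to over-align the element type itself. With C++17's aligned operator new, std::vector's default allocator then returns 64-byte-aligned storage; the test above was built with -std=c++14, where this is not guaranteed. The Tile layout below is a placeholder, not the real structure:

#include <cstdint>
#include <vector>

// Sketch only: a stand-in Tile that is 64 bytes in size and 64-byte aligned.
// Under C++17 or later, std::vector<Tile> allocates over-aligned storage for it.
struct alignas(64) Tile {
    std::uint16_t pixels[32];   // placeholder layout
};

// usage, assuming outputPixels as in the original test case:
// std::vector<Tile> tiles(outputPixels / 32);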

Third edit - purposely misaligned

I tried the test suggested by @HaidBrais. I purposely allocated my vector class aligned to 64 bytes, then added 16 bytes or 32 bytes inside the allocator such that the allocation was either 16 Byte or 32 Byte aligned, but NOT 64 byte aligned. I also increased the number of loops to 1,000,000, and ran the test 3 times and picked the smallest time.

perf stat ./mre1  1000000

To reiterate, an alignment of 2^N means it is NOT aligned to 2^(N+1) or 2^(N+2).
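
The "alignment" column in the table below can be double-checked from the pointer itself; here is a small hypothetical helper (not part of the test case) that reports the largest power of two dividing an address:

#include <cstdint>

// Effective alignment of a pointer: isolate the lowest set bit of the address,
// e.g. an address ending in 0x...30 reports 16, one ending in 0x...40 reports 64.
static inline std::uintptr_t effective_alignment(const void* p)
{
    std::uintptr_t a = reinterpret_cast<std::uintptr_t>(p);
    return a & (~a + 1);   // equivalent to a & -a without signed negation
}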

//   STORER               alignment time (seconds)
//                        byte  E5-2680   8180
// ---------------------------------------------------
//   _mm_storeu_si128     16       3.15   2.69
//   _mm_storeu_si128     32       3.16   2.60
//   _mm_storeu_si128     64       1.72   1.71
//   _mm_stream_si128     16      14.31  72.14 
//   _mm_stream_si128     32      14.44  72.09 
//   _mm_stream_si128     64       1.43   3.38

So it is clear that cache alignment gives the best results, but _mm_stream_si128 is better only on the 2680 processor and suffers some sort of penalty on the 8180 that I can't explain.

For future use, here is the misaligned allocator I used (I did not templatize the misalignment; you'll have to edit the 32 and change it to 0 or 16 as needed):

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

template <class T>
struct Mallocator {
  typedef T value_type;

  Mallocator() = default;
  template <class U> constexpr Mallocator(const Mallocator<U>&) noexcept {}

  T* allocate(std::size_t n) {
    if (n > std::size_t(-1) / sizeof(T)) throw std::bad_alloc();
    // Allocate one extra element so there is room for the deliberate offset.
    uint8_t* p1 = static_cast<uint8_t*>(aligned_alloc(64, (n + 1) * sizeof(T)));
    if (!p1) throw std::bad_alloc();
    p1 += 32; // misalign on purpose
    return reinterpret_cast<T*>(p1);
  }

  void deallocate(T* p, std::size_t) noexcept {
    uint8_t* p1 = reinterpret_cast<uint8_t*>(p);
    p1 -= 32; // undo the deliberate offset before freeing
    std::free(p1);
  }
};

template <class T, class U>
bool operator==(const Mallocator<T>&, const Mallocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const Mallocator<T>&, const Mallocator<U>&) { return false; }

...

std::vector<Tile, Mallocator<Tile>> tiles(outputPixels/32);

Answer

The simplified code doesn't really show the actual structure of your benchmark. I don't think the simplified code will exhibit the slowness you've mentioned.

The actual loop from your godbolt code is:

        while (count > 0)
        {
            // std::cout << std::hex << (void*) ptr << " " << (void*) tile <<std::endl;
            __m128i value0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 0 * diffBytes));
            __m128i value1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 1 * diffBytes));
            __m128i value2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 2 * diffBytes));
            __m128i value3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 3 * diffBytes));

            __m128i tileVal0 = value0;
            __m128i tileVal1 = value1;
            __m128i tileVal2 = value2;
            __m128i tileVal3 = value3;

            STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 0), tileVal0);
            STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 1), tileVal1);
            STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 2), tileVal2);
            STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 3), tileVal3);

            ptr    += diffBytes * 4;
            count  -= diffBytes * 4;
            tile   += diffPixels * 4;
            ipixel += diffPixels * 4;
            if (ipixel == 32)
            {
                // go to next tile
                ipixel = 0;
                tileIter++;
                tile = reinterpret_cast<uint16_t*>(tileIter->pixels);
            }
        }

Note the if (ipixel == 32) part. This jumps to a different tile every time ipixel reaches 32. Since diffPixels is 8, this happens every iteration. Hence you are only making 4 streaming stores (64 bytes) per tile. Unless each tile happens to be 64-byte aligned, which is unlikely to happen by chance and cannot be relied on, this means that each tile's 64 bytes of writes only partially cover two different cache lines. That's a known anti-pattern for streaming stores: for effective use of streaming stores you need to write out a full cache line.
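
To make the fix concrete, here is a minimal sketch (assuming the destination is 64-byte aligned, which is the condition the answer is pointing at, and not code from the question): the four 16-byte NT stores of each cache line are issued back-to-back so every line is written in full.

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Sketch: copy `lines` cache lines from src to a 64-byte-aligned dst, producing
// each destination line completely before moving to the next one.
static void stream_full_lines(const std::uint8_t* src, std::uint8_t* dst, std::size_t lines)
{
    for (std::size_t i = 0; i < lines; ++i, src += 64, dst += 64) {
        __m128i v0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 0));
        __m128i v1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 16));
        __m128i v2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 32));
        __m128i v3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 48));
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + 0),  v0);   // dst is 64-byte aligned,
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + 16), v1);   // so all four stores hit
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + 32), v2);   // the same cache line and
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + 48), v3);   // fill it completely
    }
    _mm_sfence();   // order the NT stores before subsequent normal stores
}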

On to the performance differences: streaming stores have widely varying performance on different hardware. These stores always occupy a line fill buffer for some time, but how long varies: on lots of client chips it seems to only occupy a buffer for about the L3 latency. I.e., once the streaming store reaches the L3 it can be handed off (the L3 will track the rest of the work) and the LFB can be freed on the core. Server chips often have much longer latency. Especially multi-socket hosts.

Evidently, the performance of NT stores is worse on the SKX box, and much worse for partial line writes. The overall worse performance is probably related to the redesign of the L3 cache.
