Why is std::fill(0) slower than std::fill(1)?


Problem description

I have observed on a system that std::fill on a large std::vector<int> was significantly and consistently slower when setting a constant value 0 compared to a constant value 1 or a dynamic value:

5.8 GiB/s vs 7.5 GiB/s

However, the results are different for smaller data sizes, where fill(0) is faster:

With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):

This raises the secondary question, why the peak bandwidth of fill(1) is so much lower.

The test system for this was a dual socket Intel Xeon CPU E5-2680 v3 set at 2.5 GHz (via /sys/cpufreq) with 8x16 GiB DDR4-2133. I tested with GCC 6.1.0 (-O3) and Intel compiler 17.0.1 (-fast); both get identical results. GOMP_CPU_AFFINITY=0,12,1,13,2,14,3,15,4,16,5,17,6,18,7,19,8,20,9,21,10,22,11,23 was set. Stream/add/24 threads gets 85 GiB/s on the system.

I was able to reproduce this effect on a different Haswell dual socket server system, but not any other architecture. For example on Sandy Bridge EP, memory performance is identical, while in cache fill(0) is much faster.

Here is the code to reproduce:

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <vector>

using value = int;
using vector = std::vector<value>;

constexpr size_t write_size = 8ll * 1024 * 1024 * 1024;
constexpr size_t max_data_size = 4ll * 1024 * 1024 * 1024;

void __attribute__((noinline)) fill0(vector& v) {
    std::fill(v.begin(), v.end(), 0);
}

void __attribute__((noinline)) fill1(vector& v) {
    std::fill(v.begin(), v.end(), 1);
}

void bench(size_t data_size, int nthreads) {
#pragma omp parallel num_threads(nthreads)
    {
        vector v(data_size / (sizeof(value) * nthreads));
        auto repeat = write_size / data_size;
#pragma omp barrier
        auto t0 = omp_get_wtime();
        for (auto r = 0; r < repeat; r++)
            fill0(v);
#pragma omp barrier
        auto t1 = omp_get_wtime();
        for (auto r = 0; r < repeat; r++)
            fill1(v);
#pragma omp barrier
        auto t2 = omp_get_wtime();
#pragma omp master
        std::cout << data_size << ", " << nthreads << ", " << write_size / (t1 - t0) << ", "
                  << write_size / (t2 - t1) << "\n";
    }
}

int main(int argc, const char* argv[]) {
    std::cout << "size,nthreads,fill0,fill1\n";
    for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
        bench(bytes, 1);
    }
    for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
        bench(bytes, omp_get_max_threads());
    }
    for (int nthreads = 1; nthreads <= omp_get_max_threads(); nthreads++) {
        bench(max_data_size, nthreads);
    }
}

Presented results compiled with g++ fillbench.cpp -O3 -o fillbench_gcc -fopenmp.

Solution

From your question + the compiler-generated asm from your answer:

  • fill(0) is an ERMSB rep stosb which will use 256b stores in an optimized microcoded loop. (Works best if the buffer is aligned, probably to at least 32B or maybe 64B).
  • fill(1) is a simple 128-bit movaps vector store loop. Only one store can execute per core clock cycle regardless of width, up to 256b AVX. So 128b stores can only fill half of Haswell's L1D cache write bandwidth. This is why fill(0) is about 2x as fast for buffers up to ~32kiB. Compile with -march=haswell or -march=native to fix that (see the 256-bit sketch after this list).

    Haswell can just barely keep up with the loop overhead, but it can still run 1 store per clock even though it's not unrolled at all. But with 4 fused-domain uops per clock, that's a lot of filler taking up space in the out-of-order window. Some unrolling would maybe let TLB misses start resolving farther ahead of where stores are happening, since there is more throughput for store-address uops than for store-data. Unrolling might help make up the rest of the difference between ERMSB and this vector loop for buffers that fit in L1D. (A comment on the question says that -march=native only helped fill(1) for L1.)
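
For illustration, here is a minimal sketch (my own, not code from the question or answer) of the kind of 256-bit store loop that -march=haswell makes possible for fill(1); the function name fill1_avx and the alignment/size assumptions are hypothetical:

#include <immintrin.h>
#include <cstddef>

// Hypothetical 256-bit version of fill(1). Assumes p is 32-byte aligned and
// n is a multiple of 8; requires AVX (e.g. compile with -march=haswell).
void fill1_avx(int* p, std::size_t n) {
    const __m256i ones = _mm256_set1_epi32(1);
    for (std::size_t i = 0; i < n; i += 8)
        _mm256_store_si256(reinterpret_cast<__m256i*>(p + i), ones);  // one 32-byte aligned store per iteration
}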

Note that rep movsd (which could be used to implement fill(1) for int elements) will probably perform the same as rep stosb on Haswell. Although the official documentation only guarantees that ERMSB gives fast rep stosb (but not rep stosd), actual CPUs that support ERMSB use similarly efficient microcode for rep stosd. There is some doubt about IvyBridge, where maybe only b is fast. See @BeeOnRope's excellent ERMSB answer for updates on this.
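
As an illustration of that point (a sketch of my own, not something either compiler emitted here), fill(1) for int could be expressed as rep stosd via GNU inline asm; the helper name fill1_rep_stos is hypothetical:

#include <cstddef>

// Hypothetical rep stosd fill(1): eax holds the value, rdi the destination,
// rcx the count of 4-byte elements ("stosl" is the AT&T name for stosd).
void fill1_rep_stos(int* p, std::size_t n) {
    int value = 1;
    asm volatile("rep stosl"
                 : "+D"(p), "+c"(n)   // rdi and rcx are read and updated by the instruction
                 : "a"(value)
                 : "memory");
}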

gcc has some x86 tuning options for string ops (like -mstringop-strategy=alg and -mmemset-strategy=strategy), but IDK if any of them will get it to actually emit rep movsd for fill(1). Probably not, since I assume the code starts out as a loop, rather than a memset.


With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):

A normal movaps store to a cold cache line triggers a Read For Ownership (RFO). A lot of real DRAM bandwidth is spent on reading cache lines from memory when movaps writes the first 16 bytes. ERMSB stores use a no-RFO protocol for its stores, so the memory controllers are only writing. (Except for miscellaneous reads, like page tables if any page-walks miss even in L3 cache, and maybe some load misses in interrupt handlers or whatever).

@BeeOnRope explains in comments that the difference between regular RFO stores and the RFO-avoiding protocol used by ERMSB has downsides for some ranges of buffer sizes on server CPUs where there's high latency in the uncore/L3 cache. See also the linked ERMSB answer for more about RFO vs non-RFO, and the high latency of the uncore (L3/memory) in many-core Intel CPUs being a problem for single-core bandwidth.


movntps (_mm_stream_ps()) stores are weakly-ordered, so they can bypass the cache and go straight to memory a whole cache-line at a time without ever reading the cache line into L1D. movntps avoids RFOs, like rep stos does. (rep stos stores can reorder with each other, but not outside the boundaries of the instruction.)
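
A minimal sketch of such an NT-store fill for an int buffer (my own illustration, assuming 16-byte alignment and a length that is a multiple of 4): _mm_stream_si128 (movntdq) is the integer counterpart of movntps, and the trailing sfence is needed before the NT stores can be relied on to be globally visible.

#include <emmintrin.h>   // SSE2: _mm_stream_si128, _mm_set1_epi32, _mm_sfence
#include <cstddef>

// Hypothetical non-temporal fill(1): whole cache lines go straight to memory, no RFO.
void fill1_stream(int* p, std::size_t n) {
    const __m128i ones = _mm_set1_epi32(1);
    for (std::size_t i = 0; i < n; i += 4)
        _mm_stream_si128(reinterpret_cast<__m128i*>(p + i), ones);  // movntdq
    _mm_sfence();   // order the NT stores before any later ordinary stores
}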

Your movntps results in your updated answer are surprising.
For a single thread with large buffers, your results are movnt >> regular RFO > ERMSB. So that's really weird that the two non-RFO methods are on opposite sides of the plain old stores, and that ERMSB is so far from optimal. I don't currently have an explanation for that. (edits welcome with an explanation + good evidence).

As we expected, movnt allows multiple threads to achieve high aggregate store bandwidth, like ERMSB. movnt always goes straight into line-fill buffers and then memory, so it is much slower for buffer sizes that fit in cache. One 128b vector per clock is enough to easily saturate a single core's no-RFO bandwidth to DRAM. Probably vmovntps ymm (256b) is only a measurable advantage over vmovntps xmm (128b) when storing the results of a CPU-bound AVX 256b-vectorized computation (i.e. only when it saves the trouble of unpacking to 128b).

movnti bandwidth is low because storing in 4B chunks bottlenecks on 1 store uop per clock adding data to the line fill buffers, not on sending those line-full buffers to DRAM (until you have enough threads to saturate memory bandwidth).
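
For comparison, a hedged sketch of the movnti variant just described (the name fill1_movnti is mine): each iteration is a single store uop carrying only 4 bytes, which is exactly the per-clock bottleneck mentioned above.

#include <emmintrin.h>   // SSE2: _mm_stream_si32
#include <cstddef>

// Hypothetical movnti fill(1): one 4-byte non-temporal store per iteration.
void fill1_movnti(int* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        _mm_stream_si32(p + i, 1);   // movnti dword
    _mm_sfence();
}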


@osgx posted some interesting links in comments.

See also other stuff in the tag wiki.
