When will a program benefit from prefetch and non-temporal load/store?


Question

I ran this test:

    for (i32 i = 0; i < 0x800000; ++i)
    {
        // Hopefully this can disable hardware prefetch
        i32 k = (i * 997 & 0x7FFFFF) * 0x40;

        _mm_prefetch(data + ((i + 1) * 997 & 0x7FFFFF) * 0x40, _MM_HINT_NTA);

        for (i32 j = 0; j < 0x40; j += 0x10)
        {
            //__m128 v = _mm_castsi128_ps(_mm_stream_load_si128((__m128i *)(data + k + j)));
            __m128 v = _mm_load_ps((float *)(data + k + j));

            // a_single_chain_computation: one dependent chain on v, e.g. v = _mm_mul_ps(v, v);

            //_mm_stream_ps((float *)(data2 + k + j), v);
            _mm_store_ps((float *)(data2 + k + j), v);
        }
    }

The results are weird:

  1. No matter how long a_single_chain_computation takes, the load latency is not hidden.
  2. What's more, the total time saved by prefetching grows as I add more computation. (With a single v = _mm_mul_ps(v, v), prefetching saves about 0.60 - 0.57 = 0.03 s. With 16 of them, it saves about 1.1 - 0.75 = 0.35 s. Why?)
  3. Non-temporal loads/stores degrade performance with or without prefetching. (I can understand the load part, but why the stores, too?)

Recommended answer

You need to separate two different things here (which unfortunately have similar names):

  • Non-temporal prefetching: this prefetches the line, but marks it as the least recently used one when it fills the caches, so it is first in line for eviction the next time that set is used. That leaves you enough time to actually use it (unless you're very unlucky), but it doesn't waste more than a single way of the set, since the next prefetch to come along just replaces it. By the way, regarding your comments above: every prefetch pollutes the L3 cache; since the L3 is inclusive, you can't avoid that.

  • Non-temporal (streaming) loads/stores: these also avoid polluting the caches, but through a completely different mechanism: the accesses are treated as uncacheable (and write-combining). This does carry a performance penalty even if you really never need these lines again, since a cacheable write has the luxury of staying buffered in the cache until it is evicted, so it doesn't have to be written out right away. An uncacheable write does, and in some scenarios that can eat into your memory bandwidth. On the other hand, you get the benefit of write combining and weak ordering, which may give you an edge in several cases. The bottom line is that you should use it only when it helps; don't assume it magically improves performance (nothing does that nowadays).
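
To make the second case concrete, here is a minimal sketch of where streaming stores are usually tried: a large, write-only destination where every cache line is written in full. The helper name scale_stream and its alignment requirements are assumptions for illustration, not code from the question. Whether it actually beats plain _mm_store_ps has to be measured on your machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_stream_ps, _mm_sfence */

/* Hypothetical helper: dst[i] = src[i] * factor, written with non-temporal stores.
 * Assumes src and dst are 16-byte aligned and n is a multiple of 4. */
static void scale_stream(float *dst, const float *src, size_t n, float factor)
{
    assert(((uintptr_t)dst & 15) == 0 && ((uintptr_t)src & 15) == 0);
    assert(n % 4 == 0);

    const __m128 f = _mm_set1_ps(factor);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_mul_ps(_mm_load_ps(src + i), f);
        /* Goes through the write-combining buffers instead of the cache. */
        _mm_stream_ps(dst + i, v);
    }
    /* Streaming stores are weakly ordered; fence before other code relies on dst. */
    _mm_sfence();
}
```

Writing whole lines sequentially, as this loop does, is what lets write combining do its job. The question's loop stores to scattered lines of a buffer that might otherwise stay partly cached, which is a less favorable case.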

Regarding your questions:

  1. Your prefetching should work, but it isn't issued early enough to make an impact. Try replacing i+1 with a larger number. It would even be interesting to do a sweep and see how many elements in advance you should prefetch.

  2. I'd guess this is the same as 1: with 16 muls, each iteration is long enough for the prefetch to complete in time.

  3. As I said, your stores won't get the benefit of being buffered in the lower-level caches, and have to be flushed to memory. That's the downside of streaming stores. It's implementation-specific, of course, so it might improve, but at the moment it's not always effective.
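
The sweep suggested in point 1 could look roughly like this. It is a sketch under assumptions: walk_with_prefetch and its dist parameter are hypothetical names, the access pattern copies the question's (i * 997 & mask) * 0x40 indexing, and the buffer is assumed to be 16-byte aligned with a power-of-two number of 64-byte lines. Unsigned indices are used so that the multiplication wraps in a well-defined way.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_prefetch, _mm_load_ps, _mm_add_ps */

/* Hypothetical helper: visits `lines` 64-byte lines of `data` in scattered order,
 * prefetching the line needed `dist` iterations ahead. `lines` must be a power
 * of two. Returns the sum of all floats so the loads cannot be optimized away. */
static float walk_with_prefetch(const char *data, uint32_t lines, uint32_t dist)
{
    const uint32_t mask = lines - 1;
    __m128 acc = _mm_setzero_ps();

    for (uint32_t i = 0; i < lines; ++i) {
        size_t k     = (size_t)((i * 997u) & mask) * 64;
        size_t ahead = (size_t)(((i + dist) * 997u) & mask) * 64;

        /* The mask keeps the prefetch address inside the buffer even near the end. */
        _mm_prefetch(data + ahead, _MM_HINT_NTA);

        for (size_t j = 0; j < 64; j += 16)
            acc = _mm_add_ps(acc, _mm_load_ps((const float *)(data + k + j)));
    }

    float out[4];
    _mm_storeu_ps(out, acc);
    return out[0] + out[1] + out[2] + out[3];
}
```

Timing this for dist = 1, 2, 4, 8, ... on a buffer much larger than the last-level cache should show where the benefit levels off. That point depends on how much work each iteration does, which is consistent with the 16-mul case benefiting more from i+1 than the single-mul case.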
