How to properly use prefetch instructions?

Question

I am trying to vectorize a loop computing the dot product of large float vectors. I am computing it in parallel, utilizing the fact that the CPU has a large number of XMM registers, like this:

#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, _mm_mul_ps

__m128 *A, *B;  // both must be pointers ("__m128* A, B;" would make B a value)
__m128 dot0 = _mm_set_ps1(0), dot1 = _mm_set_ps1(0),
       dot2 = _mm_set_ps1(0), dot3 = _mm_set_ps1(0);  // zero all four accumulators
for(size_t i=0; i<1048576; i+=4) {
    dot0 = _mm_add_ps( dot0, _mm_mul_ps( A[i+0], B[i+0]));
    dot1 = _mm_add_ps( dot1, _mm_mul_ps( A[i+1], B[i+1]));
    dot2 = _mm_add_ps( dot2, _mm_mul_ps( A[i+2], B[i+2]));
    dot3 = _mm_add_ps( dot3, _mm_mul_ps( A[i+3], B[i+3]));
}
... // add dots, then shuffle/hadd result.

I heard that using prefetch instructions could help speed things up, as they could fetch further data "in the background" while doing muls and adds on data that is in cache. However, I failed to find examples and explanations on how to use _mm_prefetch(): when, with what addresses, and with what hints. Could you assist with this?

Answer

The short answer, for perfectly linear streaming loops like yours, is probably: don't use them at all; let the hardware prefetchers do the work.

Still, it's possible that you can speed things up with software prefetching, and here is the theory and some detail if you want to try...

Basically you call _mm_prefetch() on an address you'll need at some point in the future. It is similar in some respects to loading a value from memory and doing nothing with it: both bring the line into the L1 cache[2], but the prefetch intrinsic, which under the covers emits a specific prefetch instruction, has some advantages which make it suitable for prefetching.
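
For reference, here is what the intrinsic looks like in use. The four temporal-hint constants below are the standard ones from xmmintrin.h; the per-level behavior noted in the comments is typical of Intel implementations and can vary by microarchitecture, and warm_line is just a hypothetical wrapper name:

#include <xmmintrin.h>

// The hint selects how "temporal" the data is treated as:
//   _MM_HINT_T0  - prefetch into all cache levels
//   _MM_HINT_T1  - prefetch into L2 and beyond, typically bypassing L1
//   _MM_HINT_T2  - prefetch into L3 and beyond (microarchitecture-dependent)
//   _MM_HINT_NTA - non-temporal: minimize pollution of the caches
static inline void warm_line(const void *p) {
    _mm_prefetch((const char *)p, _MM_HINT_T0);  // hint must be a compile-time constant
}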

It works at cache-line granularity[1]: you only need to issue one prefetch for each cache line; more is just a waste. That means that in general, you should try to unroll your loop enough so that you issue only one prefetch per cache line. In the case of 16-byte __m128 values, that means unrolling at least by 4 (which you've done, so you are good there).
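
Concretely, assuming the usual 64-byte cache line (true of essentially all current x86 parts), the arithmetic looks like this (with xmmintrin.h included as above):

// 64-byte line / 16-byte __m128 = 4 vectors per cache line, so an unroll
// factor of 4 advances exactly one line per iteration, and one prefetch
// per stream per iteration is enough.
enum { CACHE_LINE = 64, VECS_PER_LINE = CACHE_LINE / sizeof(__m128) };  // == 4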

Then simply prefetch each of your access streams some distance PF_DIST ahead of the current calculation, something like:

for(size_t i=0; i<1048576; i+=4) {
    dot0 = _mm_add_ps( dot0, _mm_mul_ps( A[i+0], B[i+0]));
    dot1 = _mm_add_ps( dot1, _mm_mul_ps( A[i+1], B[i+1]));
    dot2 = _mm_add_ps( dot2, _mm_mul_ps( A[i+2], B[i+2]));
    dot3 = _mm_add_ps( dot3, _mm_mul_ps( A[i+3], B[i+3]));
    // _mm_prefetch takes const char* on most compilers, hence the casts
    _mm_prefetch((const char*)(A + i + PF_A_DIST), HINT_A);
    _mm_prefetch((const char*)(B + i + PF_B_DIST), HINT_B);
}

Here PF_[A|B]_DIST is the distance to prefetch ahead of the current iteration and HINT_ is the temporal hint to use. Rather than trying to calculate the right distance values from first principles, I would simply determine good values of PF_[A|B]_DIST experimentally[4]. To reduce the search space, you can start by setting them both equal, since logically a similar distance is likely to be ideal. You might find that prefetching only one of the two streams is ideal.
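
A minimal sketch of such an experiment, under some simplifying assumptions (clock() timing, a single run per distance, zero-filled inputs, and a single accumulator for brevity; a real harness would add warm-up runs and repetitions):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <xmmintrin.h>

#define N 1048576  /* __m128 elements per array */

/* Dot-product kernel with a runtime-variable prefetch distance (in __m128
   elements). Prefetching past the end is harmless: PREFETCH never faults. */
static float dot_kernel(const __m128 *A, const __m128 *B, size_t pf_dist) {
    __m128 dot = _mm_set_ps1(0.0f);
    for (size_t i = 0; i < N; i += 4) {
        dot = _mm_add_ps(dot, _mm_mul_ps(A[i+0], B[i+0]));
        dot = _mm_add_ps(dot, _mm_mul_ps(A[i+1], B[i+1]));
        dot = _mm_add_ps(dot, _mm_mul_ps(A[i+2], B[i+2]));
        dot = _mm_add_ps(dot, _mm_mul_ps(A[i+3], B[i+3]));
        _mm_prefetch((const char *)(A + i + pf_dist), _MM_HINT_T0);
        _mm_prefetch((const char *)(B + i + pf_dist), _MM_HINT_T0);
    }
    float lanes[4];
    _mm_storeu_ps(lanes, dot);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

int main(void) {
    __m128 *A = aligned_alloc(64, N * sizeof *A);
    __m128 *B = aligned_alloc(64, N * sizeof *B);
    memset(A, 0, N * sizeof *A);  /* zero-fill avoids NaN/denormal noise */
    memset(B, 0, N * sizeof *B);
    size_t best = 0;
    double best_t = 1e30;
    for (size_t d = 0; d <= 64; d += 4) {  /* sweep in cache-line steps */
        clock_t t0 = clock();
        volatile float sink = dot_kernel(A, B, d);  /* volatile: keep the call */
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        (void)sink;
        printf("PF_DIST=%3zu: %.4f s\n", d, t);
        if (t < best_t) { best_t = t; best = d; }
    }
    printf("best PF_DIST = %zu\n", best);
    free(A); free(B);
    return 0;
}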

It is very important to note that the ideal PF_DIST depends on the hardware configuration: not just the CPU model, but also the memory configuration, including details such as the snooping mode for multi-socket systems. For example, the best value could be wildly different on client and server chips of the same CPU family. So you should run your tuning experiment on the actual hardware you are targeting, as much as possible. If you target a variety of hardware, you can test on all of it and hopefully find a value that's good everywhere, or even consider compile-time or runtime dispatching depending on CPU type (not always enough, as above) or based on a runtime test. Now just relying on hardware prefetching is starting to sound a lot better, isn't it?
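
One way to do the runtime-test variant is to run the sweep above once at startup and cache the winner; a sketch, where calibrate_pf_dist() is a hypothetical stand-in for that sweep:

#include <stddef.h>

/* Hypothetical: runs the PF_DIST sweep from above and returns the winner. */
size_t calibrate_pf_dist(void);

/* Measure once on first use, then reuse the cached result.
   (Not thread-safe as written; use pthread_once/std::call_once for real.) */
static size_t pf_dist_for_this_machine(void) {
    static size_t cached = (size_t)-1;
    if (cached == (size_t)-1)
        cached = calibrate_pf_dist();
    return cached;
}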

You can use the same approach to find the best HINT, since the search space is small (only 4 values to try) - but here you should be aware that the difference between the hints (particularly _MM_HINT_NTA) might only show up as a performance difference in code that runs after this loop, since they affect how much data unrelated to this kernel remains in the cache.
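
One wrinkle when sweeping hints: the hint argument is encoded in the instruction, so it must be a compile-time constant. That means compiling one kernel variant per hint, or dispatching through something like this sketch (the 0-3 numbering is my own, for the experiment only):

#include <xmmintrin.h>

/* Map a runtime hint index onto the four compile-time hint constants. */
static inline void prefetch_with(const void *p, int hint) {
    switch (hint) {
    case 0:  _mm_prefetch((const char *)p, _MM_HINT_NTA); break;
    case 1:  _mm_prefetch((const char *)p, _MM_HINT_T2);  break;
    case 2:  _mm_prefetch((const char *)p, _MM_HINT_T1);  break;
    default: _mm_prefetch((const char *)p, _MM_HINT_T0);  break;
    }
}

In the hot loop the hint index is loop-invariant, so the branch predicts essentially perfectly; for the final build you would bake in the winning constant instead.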

You might also find that this prefetching doesn't help at all, since your access patterns are perfectly linear and likely to be handled well by the L2 stream prefetchers. Still, there are some additional, more hardcore things you could try or consider:

  • You might investigate whether prefetching only at the start of 4K page boundaries helps[3]. This will complicate your loop structure: you'll probably need a nested loop to separate the "near edge of page" and "deep inside the page" cases in order to only issue the prefetches near page boundaries. You'll also want to make your input arrays page-aligned, or else it gets even more complicated.
  • You can try disabling some or all of the hardware prefetchers. This is usually terrible for overall performance, but on a highly tuned load with software prefetching, you might see better performance by eliminating interference from hardware prefetching. Selectively disabling prefetchers also gives you a key tool to help understand what's going on, even if you ultimately leave them all enabled.
  • Make sure you are using huge pages, since for large contiguous blocks like this they are ideal (see the madvise sketch after this list).
  • There are problems with prefetching at the beginning and end of your main calculation loop: at the start, you'll miss prefetching the data at the start of each array (within the initial PF_DIST window), and at the end, you'll issue extra prefetches up to PF_DIST beyond the end of your array. At best these waste fetch and instruction bandwidth, but they may also cause (ultimately discarded) page faults, which may affect performance. You can fix both with special intro and outro loops that handle these cases (see the prologue/epilogue sketch after this list).
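
Here is a minimal prologue/epilogue sketch of that last point. It uses a single accumulator for brevity (keep four, as in the main text, for instruction-level parallelism), assumes N and PF_DIST are multiples of 4, and hard-codes _MM_HINT_T0:

#include <stddef.h>
#include <xmmintrin.h>

static __m128 dot_with_edges(const __m128 *A, const __m128 *B,
                             size_t N, size_t PF_DIST) {
    __m128 dot = _mm_set_ps1(0.0f);
    /* Prologue: warm the initial PF_DIST window of both streams,
       one prefetch per 64-byte line (4 __m128 elements). */
    for (size_t i = 0; i < PF_DIST; i += 4) {
        _mm_prefetch((const char *)(A + i), _MM_HINT_T0);
        _mm_prefetch((const char *)(B + i), _MM_HINT_T0);
    }
    size_t i = 0;
    /* Main loop: prefetch PF_DIST ahead, stopping before the arrays end. */
    for (; i + PF_DIST < N; i += 4) {
        dot = _mm_add_ps(dot, _mm_mul_ps(A[i+0], B[i+0]));
        dot = _mm_add_ps(dot, _mm_mul_ps(A[i+1], B[i+1]));
        dot = _mm_add_ps(dot, _mm_mul_ps(A[i+2], B[i+2]));
        dot = _mm_add_ps(dot, _mm_mul_ps(A[i+3], B[i+3]));
        _mm_prefetch((const char *)(A + i + PF_DIST), _MM_HINT_T0);
        _mm_prefetch((const char *)(B + i + PF_DIST), _MM_HINT_T0);
    }
    /* Epilogue: the last PF_DIST elements are already prefetched. */
    for (; i < N; i += 4) {
        dot = _mm_add_ps(dot, _mm_mul_ps(A[i+0], B[i+0]));
        dot = _mm_add_ps(dot, _mm_mul_ps(A[i+1], B[i+1]));
        dot = _mm_add_ps(dot, _mm_mul_ps(A[i+2], B[i+2]));
        dot = _mm_add_ps(dot, _mm_mul_ps(A[i+3], B[i+3]));
    }
    return dot;  /* caller does the horizontal add */
}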
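
And for the huge-pages bullet: on Linux, one low-effort option is transparent huge pages via madvise. This is a Linux-specific sketch, and only a hint the kernel may ignore; mmap with MAP_HUGETLB is the explicit alternative:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

/* Ask the kernel to back an existing allocation with transparent huge
   pages. Works best when buf is 2 MiB-aligned and len is large. */
static void request_huge_pages(void *buf, size_t len) {
    madvise(buf, len, MADV_HUGEPAGE);  /* hint only; may be ignored */
}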

I also highly recommend the 5-part blog post Optimizing AMD Opteron Memory Bandwidth, which describes optimizing a problem very similar to yours, and which covers prefetching in some detail (it gave a large boost). Now this is totally different hardware (AMD Opteron) which likely behaves differently to more recent hardware (and especially to Intel hardware if that's what you're using) - but the process of improvement is key and the author is an expert in the field.

[1] It may actually work at something like 2-cache-line granularity, depending on how it interacts with the adjacent-cache-line prefetcher(s). In this case, you may be able to get away with issuing half the number of prefetches: one every 128 bytes.
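
If that applies on your target (an assumption to verify by measurement), the main loop could be unrolled to 8 vectors (128 bytes) per iteration with a single prefetch pair:

for(size_t i=0; i<1048576; i+=8) {
    dot0 = _mm_add_ps( dot0, _mm_mul_ps( A[i+0], B[i+0]));
    dot1 = _mm_add_ps( dot1, _mm_mul_ps( A[i+1], B[i+1]));
    dot2 = _mm_add_ps( dot2, _mm_mul_ps( A[i+2], B[i+2]));
    dot3 = _mm_add_ps( dot3, _mm_mul_ps( A[i+3], B[i+3]));
    dot0 = _mm_add_ps( dot0, _mm_mul_ps( A[i+4], B[i+4]));
    dot1 = _mm_add_ps( dot1, _mm_mul_ps( A[i+5], B[i+5]));
    dot2 = _mm_add_ps( dot2, _mm_mul_ps( A[i+6], B[i+6]));
    dot3 = _mm_add_ps( dot3, _mm_mul_ps( A[i+7], B[i+7]));
    // One prefetch pair now covers 128 bytes (two adjacent lines).
    _mm_prefetch((const char*)(A + i + PF_A_DIST), HINT_A);
    _mm_prefetch((const char*)(B + i + PF_B_DIST), HINT_B);
}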

[2] In the case of software prefetch, you can also select some other level of cache, using the temporal hint.

[3] There is some indication that even with perfect streaming loads, and despite the presence of "next-page prefetchers" in modern Intel hardware, page boundaries are still a barrier to hardware prefetching that can be partially alleviated by software prefetching. Maybe that's because software prefetch serves as a stronger hint that "yes, I'm going to read into this page", or because software prefetch works at the virtual address level and necessarily involves the translation machinery, while L2 prefetching works at the physical level.

[4] Note that the "units" of the PF_DIST value are sizeof(__m128), i.e., 16 bytes, due to the way I calculated the addresses.
