Understanding `_mm_prefetch`

Problem description

The answer to What are _mm_prefetch() locality hints? goes into detail on what each hint means.

My question is: which one do I WANT?

I work on a function that is called repeatedly, billions of times, with an int parameter among others. The first thing I do is look up a cached value using that parameter (its low 32 bits) as a key into a 4 GB cache. Based on the algorithm the function is called from, I know that most often the key will be doubled (shifted left by 1 bit) from one call to the next, so I am doing:

#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T2

int foo(int key) {
  uint8_t value = cache[key];
  // Hint the entry the next call is expected to need (key doubled).
  _mm_prefetch((const char *)&cache[key * 2], _MM_HINT_T2);
  // ...

The goal is to have this value in a processor cache by the next call to this function.

I am looking for confirmation on my understanding of two points:

  1. The call to _mm_prefetch is not going to delay the processing of the instructions immediately following it.
  2. There is no penalty for pre-fetching the wrong location, just a lost benefit compared to guessing it right (see the sketch after this list).
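
To illustrate point 2, here is a minimal sketch of my own (not part of the original question): the prefetch instructions are only hints and do not raise page faults, so a mispredicted prefetch runs without any visible error; the only cost is a potentially useless cache fill and whatever it evicted to make room.

#include <cstdio>
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T2

static unsigned char table[4096];

int main() {
  // A useful prefetch: this line is read immediately afterwards.
  _mm_prefetch((const char *)&table[0], _MM_HINT_T2);

  // A wasted prefetch: this line is never touched again. The instruction
  // is only a hint, so nothing faults; the cost is at most a pointless
  // cache fill and whatever line it displaced.
  _mm_prefetch((const char *)&table[2048], _MM_HINT_T2);

  std::printf("%u\n", (unsigned)table[0]);
  return 0;
}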

That function uses a lookup table of 128 128-bit values (2 KB total). Is there a way to "force" it to be cached? The index into that lookup table is incremented sequentially; should I pre-fetch those entries too? Should I use another hint, to point at another level of cache? What is the best strategy here?
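
One way to read the "force it to be cached" idea is sketched below; this is my own illustration, with lut as a hypothetical stand-in for the real table. The 2 KB table covers only 32 cache lines of 64 bytes, so it can be walked once with _MM_HINT_T0 before the hot loop, or hinted a few entries ahead of the sequentially increasing index. Whether either variant actually pays off is exactly the kind of thing the answer below suggests measuring.

#include <cstddef>
#include <emmintrin.h>  // __m128i
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

// Hypothetical stand-in for the 128 x 128-bit lookup table (2 KB).
alignas(64) static __m128i lut[128];

// Touch every cache line of the table with a T0 hint before the hot loop.
// 2 KB / 64-byte lines = 32 prefetches; cheap, but no guarantee the lines
// stay resident once the 4 GB cache lookups start evicting things.
void warm_lut() {
  const char *p = (const char *)lut;
  for (std::size_t off = 0; off < sizeof(lut); off += 64)
    _mm_prefetch(p + off, _MM_HINT_T0);
}

// Alternative: while walking the table sequentially, hint a few entries
// ahead of the current index (the distance of 4 is arbitrary, worth tuning).
__m128i read_entry(std::size_t i) {
  _mm_prefetch((const char *)&lut[(i + 4) & 127], _MM_HINT_T0);
  return lut[i & 127];
}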

Solution

As I noted in the comments, there's some risk to prefetching the wrong address: a useful line may be evicted from the cache, potentially causing a later miss.

That said:

_mm_prefetch compiles into the PREFETCHn instruction. I looked up the instruction in the AMD64 Architecture Programmer's Manual published by AMD. (Note that all of this information is necessarily chipset specific; you may need to find your CPU's docs).

AMD says (my emphasis):

The operation of this instruction is implementation-dependent. The processor implementation can ignore or change this instruction. The size of the cache line also depends on the implementation, with a minimum size of 32 bytes. AMD processors alias PREFETCH1 and PREFETCH2 to PREFETCH0.

What that appears to mean is that if you're running on an AMD processor, the hint is ignored and the memory is loaded into all levels of the cache, unless the hint is NTA (Non-Temporal Access), which attempts to load the memory with minimal cache pollution.
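
For context, here is what the hint argument of the intrinsic selects at the instruction level; this is my own summary, the cache-level comments follow Intel's documentation, and, as the quote above says, the actual placement is implementation-dependent.

#include <xmmintrin.h>  // _mm_prefetch and the _MM_HINT_* constants

void show_hints(const char *p) {
  _mm_prefetch(p, _MM_HINT_T0);   // prefetcht0:  temporal data, all cache levels
  _mm_prefetch(p, _MM_HINT_T1);   // prefetcht1:  L2 and higher (typically skips L1)
  _mm_prefetch(p, _MM_HINT_T2);   // prefetcht2:  L3 and higher (aliased to T0 on AMD, per the quote)
  _mm_prefetch(p, _MM_HINT_NTA);  // prefetchnta: non-temporal, minimize cache pollution
}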

Here's the full page for the instruction.

I think in the end, the guidance is what the other answer says: brainstorm, implement, test, and measure. You're on the bleeding edge of perf here, and there's not going to be a one-size-fits-all answer.
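
In that spirit, here is the minimal shape of such a measurement, as a sketch of my own: the 64 MB table, the iteration count, and the mostly-doubling key pattern are placeholders for the real workload. Run the same loop with and without the prefetch and compare the times; for the real 4 GB table you would also want multiple runs and a warmed-up state.

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T2

int main() {
  // Placeholder table: far smaller than the real 4 GB cache, so the numbers
  // only demonstrate the shape of the experiment, not the final answer.
  std::vector<std::uint8_t> cache(std::size_t(1) << 26);  // 64 MB
  const std::uint32_t mask = std::uint32_t(cache.size() - 1);

  for (int use_prefetch = 0; use_prefetch <= 1; ++use_prefetch) {
    std::uint64_t sum = 0;
    std::uint32_t key = 1;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 50000000; ++i) {
      sum += cache[key & mask];
      if (use_prefetch)
        _mm_prefetch((const char *)&cache[(key * 2) & mask], _MM_HINT_T2);
      key = key * 2 + (i & 1);  // mostly-doubling key, as in the question
    }
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("prefetch=%d  sum=%llu  time=%lld ms\n", use_prefetch,
                (unsigned long long)sum, (long long)ms);
  }
  return 0;
}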

Another resource that may help you is Agner Fog's Optimization manuals, which will help you optimize for your specific CPU.
