Scenarios when software prefetching manual instructions are reasonable


Problem Description





I have read that for Intel x86 and x86-64, gcc provides special prefetching instructions:

#include <xmmintrin.h>
enum _mm_hint
{
    _MM_HINT_T0 = 3,
    _MM_HINT_T1 = 2,
    _MM_HINT_T2 = 1,
    _MM_HINT_NTA = 0
};
void _mm_prefetch(void *p, enum _mm_hint h);

Programs can use the _mm_prefetch intrinsic on any pointer in the program. The different hints to be used with it are implementation-defined, but each hint is generally understood to have its own meaning.

_MM_HINT_T0 fetches data to all levels of the cache for inclusive caches and to the lowest level cache for exclusive caches

_MM_HINT_T1 hint pulls the data into L2 and not into L1d. If there is an L3 cache, the _MM_HINT_T2 hint can do something similar for it

_MM_HINT_NTA allows telling the processor to treat the prefetched cache line specially

So can someone describe examples of when this instruction is used?

And how to properly choose the hint?

Solution

The idea of prefetching is based upon these facts:

  • Accessing memory is very expensive the first time.
    The first time a memory address1 is accessed, it must be fetched from memory; it is then stored in the cache hierarchy2.
  • Accessing memory is inherently asynchronous.
    The CPU doesn't need any resources from the core to perform the lengthiest part of a load/store3, so it can easily be done in parallel with other tasks4.

Thanks to the above, it makes sense to try a load before the data is actually needed, so that when the code actually needs it, it won't have to wait.
It is worth noting that the CPU can go pretty far ahead on its own when looking for something to do, but not arbitrarily far; so sometimes it needs the help of the programmer to perform optimally.
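As a minimal sketch of that idea (the array, the lookahead distance of 16, and the T0 hint are illustrative assumptions; for a purely sequential scan like this the hardware prefetcher would usually do the job by itself):

#include <stddef.h>
#include <xmmintrin.h>

#define PREFETCH_AHEAD 16  /* tuning knob: should roughly cover the miss latency */

long sum_with_prefetch(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)  /* start the fetch well before the data is used */
            _mm_prefetch((const char *)&a[i + PREFETCH_AHEAD], _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}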

The cache hierarchy is, by its very nature, an aspect of the micro-architecture, not the architecture (read: the ISA). Intel and AMD cannot give strong guarantees about what these instructions do.
Furthermore, using them correctly is not easy, as the programmer must have a clear picture of how many cycles each access can take. Finally, the latest CPUs are getting better and better at hiding and lowering memory latency.
So in general prefetching is a job for the skilled assembly programmer.

That said, the only likely scenario is one where the timing of a piece of code must be consistent at every invocation.
For example, if you know that an interrupt handler always updates some state and must run as fast as possible, it is worth prefetching the state variable when setting up the hardware that uses such an interrupt.
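A sketch of that scenario; struct device_state, state and enable_irq() are hypothetical placeholders for the real driver code, and only the prefetch call is the point:

#include <xmmintrin.h>

struct device_state { unsigned long last_timestamp; long count; };
extern struct device_state state;   /* hypothetical: touched first by the handler */
extern void enable_irq(int irq);    /* hypothetical: the real setup call */

void arm_measurement(int irq)
{
    /* Warm the handler's state right before arming the interrupt,
       so the first interrupt does not pay the cold-miss latency. */
    _mm_prefetch((const char *)&state, _MM_HINT_T0);
    enable_irq(irq);
}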

Regarding the different levels of prefetching, my understanding is that the different levels (L1 - L4) correspond to different amounts of sharing and pollution.

For example, prefetcht0 is good if the thread/core that executes the instruction is the same one that will read the variable.
However, this will take a line in all the caches, eventually evicting other, possibly useful, lines. You can use it, for example, when you know that you will surely need the data shortly.
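For instance (a sketch, assuming a simple singly linked list), prefetching the next node with _MM_HINT_T0 while working on the current one overlaps the pointer-chasing miss with useful work:

#include <xmmintrin.h>

struct node { struct node *next; long value; };

long list_sum(const struct node *n)
{
    long sum = 0;
    while (n) {
        if (n->next)  /* start fetching the next node early */
            _mm_prefetch((const char *)n->next, _MM_HINT_T0);
        sum += n->value;  /* this work overlaps with the prefetch */
        n = n->next;
    }
    return sum;
}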

prefetcht1 is good for making the data quickly available to all cores, or to a group of cores (depending on how L2 is shared), without polluting L1.
You can use it if you know that you may need the data, or that you'll need it after finishing another task (one that takes priority in using the cache).
This is not as fast as having the data in L1 but much better than having it in memory.
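A sketch of that pattern; do_other_work() and consume() are hypothetical names, and 64 is assumed to be the cache-line size:

#include <stddef.h>
#include <xmmintrin.h>

extern void do_other_work(void);                   /* hypothetical: the task that owns L1 now */
extern void consume(const char *buf, size_t len);  /* hypothetical: the later consumer */

void stage_then_consume(const char *next_buf, size_t len)
{
    /* Stage the buffer into L2 (T1) so the other task keeps L1 for itself. */
    for (size_t off = 0; off < len; off += 64)
        _mm_prefetch(next_buf + off, _MM_HINT_T1);
    do_other_work();
    consume(next_buf, len);  /* finds its data in L2 instead of RAM */
}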

prefetcht2 can be used to take out most of the memory-access latency, since it moves the data into the L3 cache.
It doesn't pollute L1 or L2, and the L3 is shared among cores, so it's good for data used by rare (but possible) code paths or for preparing data for other cores.
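For example (a sketch; the rare-path table is a hypothetical name), data for an unlikely branch can be staged into L3 without disturbing the hot path's L1/L2:

#include <stddef.h>
#include <xmmintrin.h>

extern const char error_table[4096];  /* hypothetical data for a rare path */

void prepare_error_path(void)
{
    /* Pull the table into L3 only: reachable quickly if the rare path
       fires, while L1 and L2 stay free for the hot path. */
    for (size_t off = 0; off < sizeof error_table; off += 64)
        _mm_prefetch(error_table + off, _MM_HINT_T2);
}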

prefetchnta is the easiest to understand: it is a non-temporal move. It avoids creating an entry in every cache level for data that is accessed only once.
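A sketch of the streaming case it is meant for, where every element is touched exactly once (the lookahead of 16 is again just a tuning assumption):

#include <stddef.h>
#include <xmmintrin.h>

double stream_sum(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)  /* hint the line as non-temporal before it is needed */
            _mm_prefetch((const char *)&a[i + 16], _MM_HINT_NTA);
        sum += a[i];     /* each element is read exactly once */
    }
    return sum;
}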

prefetchw/prefetchwt1 are like the others but make the line Exclusive and invalidate the copies of the line held by other cores.
Basically, this makes writing faster, since the line is already in the optimal state of the MESI protocol (used for cache coherence).
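Note that <xmmintrin.h> defines no hint value for the write-intent variants; with GCC or Clang one way to ask for them is __builtin_prefetch with its read/write argument set to 1, though whether this actually lowers to prefetchw depends on the target CPU flags. A sketch:

void increment(long *counter)
{
    /* rw = 1 requests write intent; locality 3 means keep in all levels. */
    __builtin_prefetch(counter, 1, 3);
    /* ... unrelated work could go here while the line is fetched ... */
    (*counter)++;  /* the store finds the line already owned */
}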

Finally, a prefetch can be done incrementally, first by moving into L3 and then by moving into L1 (just for the threads that need it).
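A sketch of that incremental pattern (the block layout and the 64-byte line size are assumptions):

#include <stddef.h>
#include <xmmintrin.h>

void staged_prefetch(const char *block, size_t len, size_t hot_off)
{
    /* Step 1: early and cheap, stage the whole block into the shared L3. */
    for (size_t off = 0; off < len; off += 64)
        _mm_prefetch(block + off, _MM_HINT_T2);

    /* ... other work runs here ... */

    /* Step 2: just before use, the consuming thread promotes only
       the line it actually needs into its private L1. */
    _mm_prefetch(block + hot_off, _MM_HINT_T0);
}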

In short, each instruction lets you decide on the compromise between pollution, sharing, and speed of access.
Since these all require keeping track of the use of the cache very carefully (you need to know that it's not worth creating an entry in L1 but it is in L2), their use is limited to very specific environments.
In a modern OS it's not really possible to keep track of the cache: you can issue a prefetch only to find that your quantum has expired and your program replaced by another one that evicts the just-loaded line.


As for a concrete example I'm a bit out of ideas.
In the past, I had to measure the timing of some external event as consistently as possible.
I used an interrupt to periodically monitor the event; in that case I prefetched the variables needed by the interrupt handler, thereby eliminating the latency of the first access.

Another, unorthodox, use of prefetching is to move data into the cache.
This is useful if you want to test the cache system, or to unmap a device from memory while relying on the cache to keep the data around a bit longer.
In this case moving to L3 is enough; not all CPUs have an L3, so we may need to move to L2 instead.
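For the cache-testing use, a rough sketch (the names are assumptions; a serious measurement would serialize with cpuid/rdtscp and repeat many times): prefetch a line, give it time to arrive, then time a load. A small cycle count means the line was indeed cached:

#include <x86intrin.h>  /* __rdtsc, _mm_lfence; also pulls in _mm_prefetch */

unsigned long long timed_load(const long *p)
{
    _mm_prefetch((const char *)p, _MM_HINT_T0);
    for (volatile int i = 0; i < 1000; i++)
        ;                                    /* crude delay: let the prefetch land */
    _mm_lfence();                            /* fence the timed region */
    unsigned long long t0 = __rdtsc();
    long v = *(volatile const long *)p;      /* the access being timed */
    _mm_lfence();
    unsigned long long t1 = __rdtsc();
    (void)v;
    return t1 - t0;                          /* small => served from cache */
}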

I understand these examples are not very good, though.


1 Actually the granularity is "cache lines" not "addresses".
2 Which I assume you are familiar with. Put shortly: at present it goes from L1 to L3/L4. L3/L4 is shared among cores. L1 is always private per core and is shared by the core's threads; L2 is usually like L1, but some models may share an L2 across pairs of cores.
3 The lengthiest part is the data transfer from RAM. Computing the address and initializing the transaction take up resources (store-buffer slots and TLB entries, for example).
4 However, any resource used to access memory can become a critical issue, as pointed out by @Leeor and as proved by the Linux kernel developer.
