Do current x86 architectures support non-temporal loads (from "normal" memory)?

Problem description

I am aware of multiple questions on this topic, but I haven't seen any clear answers or benchmark measurements. I therefore created a simple program that works with two arrays of integers. The first array a is very large (64 MB) and the second array b is small enough to fit into the L1 cache. The program iterates over a and adds its elements to the corresponding elements of b in a modular sense (when the end of b is reached, the program starts again from its beginning). The measured numbers of L1 cache misses for different sizes of b are as follows:

[figure: L1 cache misses vs. size of b]

The measurements were made on a Xeon E5-2680 v3, a Haswell CPU with a 32 kiB L1 data cache. Therefore, in all cases, b fit into the L1 cache. However, the number of misses grew considerably once the memory footprint of b reached about 16 kiB. This might be expected, since at that point the loads of both a and b start evicting cache lines from the beginning of b.

There is absolutely no reason to keep the elements of a in the cache; they are used only once. I therefore ran a program variant with non-temporal loads of the a data, but the number of misses did not change. I also ran a variant with non-temporal prefetching of the a data, but still with the very same results.

My benchmark code is as follows (the variant with non-temporal prefetching is not shown; see the sketch after the listing):

#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <immintrin.h>
#include <papi.h>

int main(int argc, char* argv[])
{
   uint64_t* a;
   const uint64_t a_bytes = 64 * 1024 * 1024;
   const uint64_t a_count = a_bytes / sizeof(uint64_t);
   posix_memalign((void**)(&a), 64, a_bytes);

   uint64_t* b;
   const uint64_t b_bytes = atol(argv[1]) * 1024;
   const uint64_t b_count = b_bytes / sizeof(uint64_t);
   posix_memalign((void**)(&b), 64, b_bytes);

   // initialize a with non-temporal stores so that this pass itself
   // does not pollute the caches
   __m256i ones = _mm256_set1_epi64x(1UL);
   for (long i = 0; i < a_count; i += 4)
       _mm256_stream_si256((__m256i*)(a + i), ones);

   // load b into L1 cache
   for (long i = 0; i < b_count; i++)
       b[i] = 0;

   int papi_events[1] = { PAPI_L1_DCM };
   long long papi_values[1];
   PAPI_start_counters(papi_events, 1);

   uint64_t* a_ptr = a;
   const uint64_t* a_ptr_end = a + a_count;
   uint64_t* b_ptr = b;
   const uint64_t* b_ptr_end = b + b_count;

   // measured loop: stream through a once while repeatedly updating b in place
   while (a_ptr < a_ptr_end) {
#ifndef NTLOAD
      __m256i aa = _mm256_load_si256((__m256i*)a_ptr);
#else
      __m256i aa = _mm256_stream_load_si256((__m256i*)a_ptr);
#endif
      __m256i bb = _mm256_load_si256((__m256i*)b_ptr);
      bb = _mm256_add_epi64(aa, bb);
      _mm256_store_si256((__m256i*)b_ptr, bb);

      a_ptr += 4;
      b_ptr += 4;
      if (b_ptr >= b_ptr_end)
         b_ptr = b;
   }

   PAPI_stop_counters(papi_values, 1);
   std::cout << "L1 cache misses: " << papi_values[0] << std::endl;

   free(a);
   free(b);
}
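
The variant with non-temporal prefetching is not shown above; it differed only in the measured loop. A minimal sketch of what such a loop can look like (PF_DIST is a hypothetical prefetch distance in elements that would need hand tuning):

   const long PF_DIST = 512; // hypothetical prefetch distance, to be tuned
   while (a_ptr < a_ptr_end) {
      // hint that a cache line ahead of us should be fetched non-temporally
      _mm_prefetch((const char*)(a_ptr + PF_DIST), _MM_HINT_NTA);
      __m256i aa = _mm256_load_si256((__m256i*)a_ptr);
      __m256i bb = _mm256_load_si256((__m256i*)b_ptr);
      bb = _mm256_add_epi64(aa, bb);
      _mm256_store_si256((__m256i*)b_ptr, bb);

      a_ptr += 4;
      b_ptr += 4;
      if (b_ptr >= b_ptr_end)
         b_ptr = b;
   }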

What I wonder is whether CPU vendors support, or are going to support, non-temporal loads / prefetching, or any other way to label some data as not to be kept in cache (e.g., tagging them as least recently used). There are situations, e.g., in HPC, where similar scenarios are common in practice. For example, in sparse iterative linear solvers / eigensolvers, the matrix data are usually very large (larger than the cache capacities), but the vectors are sometimes small enough to fit into the L3 or even the L2 cache. Then, we would like to keep them there at all costs. Unfortunately, loading the matrix data can evict cache lines, especially those of the x vector, even though in each solver iteration the matrix elements are used only once and there is no reason to keep them in cache after they have been processed.
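
For illustration, here is a minimal CSR sparse matrix-vector product sketch (the names are mine, not from any particular solver): the val, col_idx, and row_ptr arrays are streamed through exactly once per solver iteration, while x (and y) are the small, reused vectors one would like to keep cached.

   // y = A * x for a CSR matrix A: val/col_idx/row_ptr are each touched once,
   // while x is reused across rows and is the data worth keeping in cache
   void spmv_csr(const double* val, const int* col_idx, const int* row_ptr,
                 int n_rows, const double* x, double* y)
   {
      for (int i = 0; i < n_rows; i++) {
         double sum = 0.0;
         for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col_idx[j]];
         y[i] = sum;
      }
   }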

Update

I just did a similar experiment on an Intel Xeon Phi (KNC), measuring runtime instead of L1 misses (I haven't found a way to measure those reliably; PAPI and VTune gave weird metrics). The results are here:

[figure: runtime vs. size of b for ordinary loads, eviction-hint loads, and manual eviction]

The orange curve represents ordinary loads and has the expected shape. The blue curve represents loads with the so-called eviction hint (EH) set in the instruction prefix, and the gray curve represents the case where each cache line of a was manually evicted; both of these tricks, enabled by KNC, obviously worked as we wanted for b over 16 kiB. The code of the measured loop is as follows:

while (a_ptr < a_ptr_end) {
#ifdef NTLOAD
   __m512i aa = _mm512_extload_epi64((__m512i*)a_ptr,
      _MM_UPCONV_EPI64_NONE, _MM_BROADCAST64_NONE, _MM_HINT_NT);
#else
   __m512i aa = _mm512_load_epi64((__m512i*)a_ptr);
#endif
   __m512i bb = _mm512_load_epi64((__m512i*)b_ptr);
   bb = _mm512_or_epi64(aa, bb);
   _mm512_store_epi64((__m512i*)b_ptr, bb);

#ifdef EVICT
   _mm_clevict(a_ptr, _MM_HINT_T0);
#endif

   a_ptr += 8;
   b_ptr += 8;
   if (b_ptr >= b_ptr_end)
       b_ptr = b;
}

Update 2

On Xeon Phi, icpc generated a prefetch for a_ptr in the normal-load variant (orange curve):

400e93:       62 d1 78 08 18 4c 24    vprefetch0 [r12+0x80]

When I manually (by hex-editing the executable) modified this to:

400e93:       62 d1 78 08 18 44 24    vprefetchnta [r12+0x80]

I got the desired results, even better than the blue/gray curves. However, I was not able to force the compiler to generate non-temporal prefetching for me, even by using #pragma prefetch a_ptr:_MM_HINT_NTA before the loop :(

Answer

To answer the title question specifically:

Yes, recent1 mainstream Intel CPUs support non-temporal loads on normal2 memory, but only "indirectly" via non-temporal prefetch instructions, rather than directly via non-temporal load instructions such as movntdqa. This is in contrast to non-temporal stores, where you can just use the corresponding non-temporal store instructions3 directly.

The basic idea is that you issue a prefetchnta for the cache line before any normal loads, and then issue the loads as normal. If the line wasn't already in the cache, it will be loaded in a non-temporal fashion. The exact meaning of non-temporal fashion depends on the architecture, but the general pattern is that the line is loaded into at least the L1 and perhaps some higher cache levels. Indeed, for the prefetch to be of any use it needs to cause the line to be loaded into at least some cache level for consumption by a later load. The line may also be treated specially in the cache, for example by flagging it as high priority for eviction or by restricting the ways in which it can be placed.
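
In terms of intrinsics, a minimal sketch of the pattern (how far ahead to issue the prefetch is an illustrative choice, not part of the pattern itself):

   // issued some iterations before the ordinary load of the same address p:
   _mm_prefetch((const char*)p, _MM_HINT_NTA);       // line is fetched non-temporally
   // ... later ...
   __m256i v = _mm256_load_si256((const __m256i*)p); // ordinary load finds it cached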

The upshot of all this is that while non-temporal loads are supported in a sense, they are really only partly non-temporal, unlike stores, where you really leave no trace of the line in any of the cache levels. Non-temporal loads will cause some cache pollution, but generally less than regular loads. The exact details are architecture-specific; I've included some details below for modern Intel (you can find a slightly longer writeup in this answer).

Based on the tests in this answer, it seems that the behavior of prefetchnta on Skylake is to fetch normally into the L1 cache, to skip the L2 entirely, and to fetch in a limited way into the L3 cache (probably into only 1 or 2 ways, so the total amount of L3 available to nta prefetches is limited).

This was tested on Skylake client, but I believe this basic behavior probably extends backwards to Sandy Bridge and earlier (based on wording in the Intel optimization guide), and also forwards to Kaby Lake and later architectures based on Skylake client. So unless you are using a Skylake-SP or Skylake-X part, or an extremely old CPU, this is probably the behavior you can expect from prefetchnta.

The only recent Intel chip known to have different behavior is Skylake server (used in Skylake-X, Skylake-SP and a few other lines). This has a considerably changed L2 and L3 architecture, and the L3 is no longer inclusive of the much larger L2. For this chip, it seems that prefetchnta skips both the L2 and L3 caches, so on this architecture cache pollution is limited to the L1.

This behavior was reported by user Mysticial in a comment. The downside, as pointed out in those comments, is that this makes prefetchnta much more brittle: if you get the prefetch distance or timing wrong (which is especially easy when hyperthreading is involved and the sibling core is active) and the data gets evicted from L1 before you use it, you go all the way back to main memory rather than to the L3 as on earlier architectures.

1 Recent here probably means anything from the last decade or so, but I don't mean to imply that earlier hardware didn't support non-temporal prefetch: support possibly goes right back to the introduction of prefetchnta, but I don't have the hardware to check that and can't find an existing reliable source of information on it.

2 Normal here just means WB (writeback) memory, which is what an application is dealing with the overwhelming majority of the time.

3 Specifically, the NT store instructions are movnti for general-purpose registers and the movntd* and movntp* families for SIMD registers.
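
For completeness, a small sketch using the corresponding intrinsics (these compile to movnti and vmovntdq respectively; the sfence is there because NT stores are weakly ordered):

   #include <immintrin.h>

   void nt_store_example(long long* p, __m256i* q, __m256i v)
   {
      _mm_stream_si64(p, 42);    // movnti: NT store from a general-purpose register
      _mm256_stream_si256(q, v); // vmovntdq: NT store from a SIMD register
      _mm_sfence();              // order the weakly-ordered NT stores
   }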
