How do non-temporal instructions work?

Problem description

I'm reading the PDF of What Every Programmer Should Know About Memory by Ulrich Drepper. At the beginning of part 6 there's a code fragment:

#include <emmintrin.h>
void setbytes(char *p, int c)
{
    __m128i i = _mm_set_epi8(c, c, c, c,
    c, c, c, c,
    c, c, c, c,
    c, c, c, c);
    _mm_stream_si128((__m128i *)&p[0], i);
    _mm_stream_si128((__m128i *)&p[16], i);
    _mm_stream_si128((__m128i *)&p[32], i);
    _mm_stream_si128((__m128i *)&p[48], i);
}

With such a comment right below it:

Assuming the pointer p is appropriately aligned, a call to this function will set all bytes of the addressed cache line to c. The write-combining logic will see the four generated movntdq instructions and only issue the write command for the memory once the last instruction has been executed. To summarize, this code sequence not only avoids reading the cache line before it is written, it also avoids polluting the cache with data which might not be needed soon.

What bugs me is that the comment on the function says it "will set all bytes of the addressed cache line to c", but from what I understand of stream intrinsics they bypass the caches: there is neither cache reading nor cache writing. How would this code access any cache line? The second bolded fragment says something similar, that the function "avoids reading the cache line before it is written". As stated above, I don't see how and when the caches are written to. Also, does any write to cache need to be preceded by a cache write? Could someone clarify this issue for me?

Solution

When you write to memory, the cache line you are writing to must first be loaded into the caches, because you may be writing only part of that line.

When you write to memory, stores are grouped in store buffers. Typically once the buffer is full, it will be flushed to the caches/memory. Note that the number of store buffers is typically small (~4). Consecutive writes to addresses will use the same store buffer.

Streaming reads and writes with non-temporal hints are typically used to reduce cache pollution (often with WC, i.e. write-combining, memory). The idea is that a small set of cache lines is reserved on the CPU for these instructions to use. Instead of being loaded into the main caches, a cache line is loaded into this smaller cache.

The comment assumes the following behavior (I could not find any reference confirming that the hardware actually does this; one would need to measure it or find a solid source, and it could vary from hardware to hardware): once the CPU sees that the store buffer is full and that it is aligned to a cache line, it flushes it directly to memory, since the non-temporal write bypasses the main cache.

The only way this would work is if the merging of the store buffer with the actual cache line written happens once it is flushed. This is a fair assumption.

Note that if the cache line written is already in the main caches, the above method will also update them.
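As a minimal sketch of the full-line case described above, the function below writes one complete cache line with four streaming stores and then issues a store fence. It is an illustration under stated assumptions, not Drepper's code: the 64-byte line size (`CACHE_LINE`) and the name `set_line_nt` are hypothetical choices, and whether the hardware really combines the four stores into a single memory write cannot be observed from the C code itself.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache-line size; typical on x86, not guaranteed */

/* Hypothetical helper: set one full, aligned cache line to the byte value c. */
void set_line_nt(char *p, int c)
{
    /* movntdq needs 16-byte alignment; we ask for line alignment so that
       the four stores cover exactly one cache line. */
    assert(((uintptr_t)p % CACHE_LINE) == 0);

    __m128i v = _mm_set1_epi8((char)c);   /* broadcast c to all 16 bytes */

    _mm_stream_si128((__m128i *)(p +  0), v);
    _mm_stream_si128((__m128i *)(p + 16), v);
    _mm_stream_si128((__m128i *)(p + 32), v);
    _mm_stream_si128((__m128i *)(p + 48), v);

    /* Non-temporal stores are weakly ordered; the fence orders them
       before later stores, e.g. a flag that publishes the data. */
    _mm_sfence();
}
```

The four stores are issued back to back and cover the line completely, which is the situation in which the write-combining logic has a chance to send the line to memory without reading it first.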

If regular memory writes were used instead of non-temporal writes, the store buffer flushing would also update the main caches. It is entirely possible that this scenario would also avoid reading the original cache line in memory.

If only part of a cache line is written with non-temporal writes, the line will presumably need to be fetched from main memory (or from the main cache, if present). That could be terribly slow if we have not read the cache line ahead of time with a regular read or a non-temporal read (which would place it into our separate cache).
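One way to stay out of the partial-line case is to use streaming stores only for whole cache lines and ordinary stores for the unaligned head and tail of a buffer. The sketch below assumes a 64-byte line (`LINE`) and uses a hypothetical function name, `fill_nt`; it shows the structure only and makes no claim about how much faster, if at all, it is on a given CPU.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

#define LINE 64  /* assumed cache-line size */

/* Hypothetical helper: fill n bytes at p with the byte value c. */
void fill_nt(char *p, size_t n, int c)
{
    assert(p != NULL || n == 0);
    char *end = p + n;

    /* Head: ordinary byte stores up to the first line boundary. */
    while (p < end && ((uintptr_t)p % LINE) != 0)
        *p++ = (char)c;

    /* Body: only complete, aligned lines get streaming stores. */
    __m128i v = _mm_set1_epi8((char)c);
    while ((size_t)(end - p) >= LINE) {
        _mm_stream_si128((__m128i *)(p +  0), v);
        _mm_stream_si128((__m128i *)(p + 16), v);
        _mm_stream_si128((__m128i *)(p + 32), v);
        _mm_stream_si128((__m128i *)(p + 48), v);
        p += LINE;
    }

    /* Tail: ordinary byte stores for the remaining partial line. */
    while (p < end)
        *p++ = (char)c;

    _mm_sfence();  /* order the streaming stores before later stores */
}
```

The head and tail go through the regular cached path, so at most two lines per call are written partially, and none of them with a non-temporal store.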

Typically the non-temporal cache size is on the order of 4-8 cache lines.

To summarize, the last instruction kicks off the write because it also happens to fill up the store buffer. The store buffer flush can avoid reading the cache line being written because the hardware knows the store buffer is contiguous and aligned to a cache line. The non-temporal write hint only serves to avoid populating the main cache with our written cache line, if and only if it wasn't already in the main caches.
