SSE: unaligned load and store that crosses page boundary

Question

I read somewhere that before performing an unaligned load or store near a page boundary (e.g. using the _mm_loadu_si128 / _mm_storeu_si128 intrinsics), code should first check whether the whole vector (in this case 16 bytes) belongs to the same page, and switch to non-vector instructions if it does not. I understand that this is needed to prevent a core dump if the next page does not belong to the process.

But what if both pages belong to the process (e.g. they are part of one buffer, and I know the size of that buffer)? I wrote a small test program that performed an unaligned load and store crossing a page boundary, and it did not crash. Do I always have to check for a page boundary in such a case, or is it enough to make sure I do not overflow the buffer?

Env: Linux, x86_64, gcc

Recommended answer

Page-line splits (accesses that cross a page boundary) are bad for performance, but they don't affect the correctness of unaligned accesses. When you know the length ahead of time, it is enough to make sure you don't read or write past the end of the buffer.
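
As a minimal sketch of the known-length case, the function below counts occurrences of a byte using unaligned 16-byte loads and never checks for page boundaries. The name count_byte is hypothetical, and __builtin_popcount assumes GCC or Clang. The only safety condition is the loop bound: a vector load is issued only when all 16 bytes lie inside the buffer, and a scalar loop handles the tail.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts how many bytes in buf[0..len) equal `value`. */
size_t count_byte(const uint8_t *buf, size_t len, uint8_t value)
{
    const __m128i needle = _mm_set1_epi8((char)value);
    size_t count = 0;
    size_t i = 0;

    /* Vector loop: runs only while a full 16-byte load fits in the buffer.
       The load may cross a page boundary; that is fine because both pages
       belong to the same buffer. */
    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, needle));
        count += (size_t)__builtin_popcount((unsigned)mask);
    }

    /* Scalar tail: fewer than 16 bytes remain, so no vector load here. */
    for (; i < len; i++)
        count += (buf[i] == value);

    return count;
}
```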

For correctness, you often need to worry about it when implementing something like strlen, where your loop stops when you find a sentinel value. That value could be at any position within your vector, so just doing 16-byte unaligned loads will read past the end of the array. If the terminating 0 is in the last byte of one page, the next page is not readable, and your current-position pointer is unaligned, then a load that includes the 0 byte will also include bytes from the unreadable page, so it will fault.
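
The check itself is cheap. Here is a sketch, assuming 4 KiB pages (the usual x86-64 Linux value; sysconf(_SC_PAGESIZE) reports the real one at runtime). The names PAGE_SIZE_ASSUMED and access_crosses_page are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. Must be a power of two for the mask below. */
#define PAGE_SIZE_ASSUMED 4096u

/* Hypothetical helper: true if the nbytes starting at p span two pages.
   Expects 0 < nbytes <= PAGE_SIZE_ASSUMED. */
bool access_crosses_page(const void *p, size_t nbytes)
{
    uintptr_t offset = (uintptr_t)p & (PAGE_SIZE_ASSUMED - 1);
    return offset + nbytes > PAGE_SIZE_ASSUMED;
}
```

For a 16-byte vector this is true only when the offset within the page is greater than 4080, so the slow path is rare.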

One solution is to use scalar code until your pointer is aligned, then load aligned vectors. An aligned load always comes entirely from one page, and also from one cache line. So even though you will read some bytes past the end of the string, you are guaranteed not to fault. Valgrind might be unhappy about it, but standard library strlen implementations use this technique.
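
A rough sketch of that approach, not the code of any particular libc: strlen_aligned is a hypothetical name, and __builtin_ctz assumes GCC or Clang. Note that reading past the terminator is outside what the C standard defines; the sketch relies on the hardware guarantee described above, which is also why tools like Valgrind may complain.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical strlen: scalar bytes until p is 16-byte aligned,
   then aligned 16-byte loads that can never straddle a page. */
size_t strlen_aligned(const char *s)
{
    const char *p = s;

    /* Scalar prologue: at most 15 iterations. */
    while (((uintptr_t)p & 15) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    const __m128i zero = _mm_setzero_si128();
    for (;;) {
        /* Aligned load: may read bytes after the terminator, but only
           from the same 16-byte block, hence the same page. */
        __m128i v = _mm_load_si128((const __m128i *)p);
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask != 0)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
        p += 16;
    }
}
```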

Instead of going scalar until the pointer is aligned, you could do one unaligned vector load from the start of the string (as long as that won't cross a page-line), and then do aligned loads. The first aligned load will overlap the first unaligned load, but that's totally fine for a function like strlen that doesn't care if it sees the same data twice.
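
Sketched in the same style, assuming 4 KiB pages and GCC/Clang builtins (strlen_overlap is a hypothetical name): one unaligned load covers the start of the string when it cannot cross a page, and the loop then continues from the next 16-byte boundary. The scalar fallback for the page-crossing case is my choice for brevity, not something the approach requires.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical strlen: one unaligned load first, then aligned loads. */
size_t strlen_overlap(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    const char *p = s;

    /* Assumption: 4096-byte pages. */
    if (((uintptr_t)s & 4095) <= 4096 - 16) {
        /* All 16 bytes are in the page that holds s[0], so no fault. */
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask != 0)
            return (size_t)__builtin_ctz(mask);
        /* Next 16-byte boundary above s. The aligned load from here
           re-reads some bytes already checked, which is harmless. */
        p = (const char *)(((uintptr_t)s + 16) & ~(uintptr_t)15);
    } else {
        /* Within 15 bytes of the page end: go scalar until aligned. */
        while (((uintptr_t)p & 15) != 0) {
            if (*p == '\0')
                return (size_t)(p - s);
            p++;
        }
    }

    for (;;) {
        __m128i v = _mm_load_si128((const __m128i *)p);
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask != 0)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
        p += 16;
    }
}
```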

It might be worth avoiding page-line splits for performance reasons. Even if you know your src pointer is misaligned, it's often faster to let the hardware handle cache-line splits. But before Skylake, page splits have an extra ~100 cycles of latency. (Down to 5 cycles in Skylake; see http://stackoverflow.com/questions/37361145/deoptimizing-a-program-for-the-pipeline-in-intel-sandybridge-family-cpus/37362225#37362225.) If you have multiple pointers that can be aligned differently relative to each other, you can't always just use a prologue to align your src. (e.g. c[i] = a[i] + b[i], where c is aligned but b isn't.)

In that case, it might be worth using a branch to do aligned loads from before and after the page split, and combine them with palignr.
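
A sketch of that idea, under several assumptions: palignr takes its byte shift as an immediate, so this version is specialized for a pointer misaligned by exactly 4 bytes (in a loop the misalignment is constant, so you would pick one specialized version up front). It also assumes 4 KiB pages and SSSE3 via the GCC/Clang target attribute. The function name is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 loads */
#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 (palignr) */
#include <stdint.h>

/* Hypothetical helper: loads 16 bytes from p, where p is known to be
   4 bytes past a 16-byte boundary. Assumes 4096-byte pages. */
__attribute__((target("ssse3")))
__m128i load16_misaligned_by_4(const uint8_t *p)
{
    uintptr_t page_offset = (uintptr_t)p & 4095;

    if (page_offset > 4096 - 16) {
        /* The load would split a page: do two aligned loads, one on
           each side of the boundary, and stitch them together. These
           read a few bytes outside [p, p+16), but only from pages that
           the requested bytes already touch. */
        const __m128i *lo_ptr = (const __m128i *)(p - 4);
        __m128i lo = _mm_load_si128(lo_ptr);
        __m128i hi = _mm_load_si128(lo_ptr + 1);
        /* Concatenate hi:lo and shift right by 4 bytes. */
        return _mm_alignr_epi8(hi, lo, 4);
    }

    /* Common case: let the hardware handle the unaligned load. */
    return _mm_loadu_si128((const __m128i *)p);
}
```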

A branch mispredict (~15 cycles) is cheaper than the page-split latency, but it delays everything (not just the load). So the branch might not be worth it either, depending on the hardware and the ratio of computation to memory access.

If you're writing a function that is usually called with aligned pointers, it makes sense to just use unaligned load/store instructions. Any prologue to detect misalignment is just extra overhead for the already-aligned case, and on modern hardware (Nehalem and newer), unaligned loads on addresses that turn out to be aligned at runtime have identical performance to aligned load instructions. (But you need AVX for unaligned loads to fold into other instructions as memory operands, e.g. vpxor xmm0, xmm1, [rsi].)

By adding code to handle misaligned inputs, you're slowing down the common aligned case to speed up the uncommon misaligned case. Fast hardware support for unaligned loads/stores lets software just leave that to the hardware for the few cases where it does happen.
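
For the c[i] = a[i] + b[i] case, "leave it to the hardware" could look like the sketch below: unaligned loads and stores everywhere, no alignment prologue, and a scalar tail so nothing touches memory outside the arrays. The name add_arrays is hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics for float vectors */
#include <stddef.h>

/* Hypothetical helper: c[i] = a[i] + b[i] for i in [0, n).
   Works for any alignment of a, b and c. */
void add_arrays(float *c, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    /* 4 floats per iteration; runs only while a full vector fits. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }

    /* Scalar tail for the last 0 to 3 elements. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```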

(If misaligned inputs are common, then it is worth using a prologue to align your input pointer, especially if you're using AVX. With 64-byte cache lines, sequential 32-byte AVX loads from a misaligned pointer will split a cache line on every other load.)
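
The "every other load" arithmetic can be checked with a small counting sketch, assuming 64-byte cache lines. The names are hypothetical, and the function only does address arithmetic; it never touches memory.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. */
#define CACHE_LINE_ASSUMED 64u

/* Hypothetical helper: of nloads back-to-back loads of `width` bytes
   starting at address `start`, how many span two cache lines? */
size_t count_line_splits(uintptr_t start, size_t width, size_t nloads)
{
    size_t splits = 0;
    for (size_t i = 0; i < nloads; i++) {
        uintptr_t first = start + i * width;   /* first byte loaded */
        uintptr_t last = first + width - 1;    /* last byte loaded */
        if (first / CACHE_LINE_ASSUMED != last / CACHE_LINE_ASSUMED)
            splits++;
    }
    return splits;
}
```

With start = 16 and width = 32, the loads begin at offsets 16, 48, 80, 112, and so on; those starting at 48 and 112 straddle a 64-byte boundary, so count_line_splits(16, 32, 8) gives 4, while a 32-byte-aligned start gives 0.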

See Agner Fog's Optimizing Assembly guide for more info, and other links in the x86 tag wiki.
