Is the SSE unaligned load intrinsic any slower than the aligned load intrinsic on x86_64 Intel CPUs?


Question


I'm considering changing some high-performance code that currently requires 16-byte-aligned arrays and uses _mm_load_ps, to relax the alignment constraint and use _mm_loadu_ps instead. There are a lot of myths about the performance implications of memory alignment for SSE instructions, so I made a small test case of what should be a memory-bandwidth-bound loop. Using either the aligned or unaligned load intrinsic, it runs 100 iterations through a large array, summing the elements with SSE intrinsics. The source code is here: https://gist.github.com/rmcgibbo/7689820


The results on a 64 bit Macbook Pro with a Sandy Bridge Core i5 are below. Lower numbers indicate faster performance. As I read the results, I see basically no performance penalty from using _mm_loadu_ps on unaligned memory.


I find this surprising. Is this a fair test / justified conclusion? On what hardware platforms is there a difference?

$ gcc -O3 -msse aligned_vs_unaligned_load.c  && ./a.out  200000000
Array Size: 762.939 MB
Trial 1
_mm_load_ps with aligned memory:    0.175311
_mm_loadu_ps with aligned memory:   0.169709
_mm_loadu_ps with unaligned memory: 0.169904
Trial 2
_mm_load_ps with aligned memory:    0.169025
_mm_loadu_ps with aligned memory:   0.191656
_mm_loadu_ps with unaligned memory: 0.177688
Trial 3
_mm_load_ps with aligned memory:    0.182507
_mm_loadu_ps with aligned memory:   0.175914
_mm_loadu_ps with unaligned memory: 0.173419
Trial 4
_mm_load_ps with aligned memory:    0.181997
_mm_loadu_ps with aligned memory:   0.172688
_mm_loadu_ps with unaligned memory: 0.179133
Trial 5
_mm_load_ps with aligned memory:    0.180817
_mm_loadu_ps with aligned memory:   0.172168
_mm_loadu_ps with unaligned memory: 0.181852

Answer


You have a lot of noise in your results. I re-ran this on a Xeon E3-1230 V2 @ 3.30GHz running Debian 7, doing 12 runs (discarding the first to account for virtual memory noise) over a 200,000,000-element array, with 10 iterations of the inner i loop within the benchmark functions, explicit noinline for the functions you provided, and each of your three benchmarks run in isolation: https://gist.github.com/creichen/7690369


This was with gcc 4.7.2.


The noinline ensured that the first benchmark wasn't optimised out.

The exact invocation was

./a.out 200000000 10 12 $n

for $n from 0 to 2.

The results are as follows:

load_ps aligned

min:    0.040655
median: 0.040656
max:    0.040658


loadu_ps aligned

min:    0.040653
median: 0.040655
max:    0.040657


loadu_ps unaligned

min:    0.042349
median: 0.042351
max:    0.042352


As you can see, these are some very tight bounds that show that loadu_ps is slower on unaligned access (slowdown of about 5%) but not on aligned access. Clearly on that particular machine loadu_ps pays no penalty on aligned memory access.


Looking at the assembly, the only difference between the load_ps and loadu_ps versions is that the latter includes a movups instruction, re-orders some other instructions to compensate, and uses slightly different register names. The latter is probably completely irrelevant and the former can get optimised out during microcode translation.


Now, it's hard to tell (without being an Intel engineer with access to more detailed information) whether/how the movups instruction gets optimised out, but considering that the CPU silicon would pay little penalty for simply using the aligned data path if the lower bits in the load address are zero and the unaligned data path otherwise, that seems plausible to me.


I tried the same on my Core i7 laptop and got very similar results.


In conclusion, I would say that yes, you do pay a penalty for unaligned memory access, but it is small enough that it can get swamped by other effects. In the runs you reported there seems to be enough noise to allow for the hypothesis that it is slower for you too (note that you should ignore the first run, since your very first trial will pay a price for warming up the page table and caches.)
