Relationship between SSE vectorization and memory alignment


Question

Why do we need aligned memory for SSE/AVX?

One of the answers I often get is that aligned memory loads are much faster than unaligned memory loads. Why, then, is an aligned load so much faster than an unaligned one?

Recommended answer

This is not specific to SSE (or even x86). On most architectures, loads and stores need to be naturally aligned; otherwise they either (a) generate an exception or (b) need two or more cycles plus some fix-up in order to handle the misaligned load/store transparently. On x86, (b) is true for data types smaller than 16 bytes, but (a) is true for SSE data types unless you explicitly use the misaligned versions of the load/store instructions, which can handle misaligned data.
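To make the two kinds of load concrete, here is a minimal C sketch using the SSE intrinsics from `<xmmintrin.h>`: `_mm_load_ps` is the aligned load (normally compiled to `movaps`) and `_mm_loadu_ps` is the misaligned-tolerant load (`movups`). The helper names `is_aligned16` and `sum4` are made up for this illustration and are not part of the original answer.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_load_ps, _mm_loadu_ps, _mm_storeu_ps */

/* Hypothetical helper: returns 1 if p lies on a 16-byte boundary,
   the natural alignment of a 128-bit SSE vector. */
static int is_aligned16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: sums the four floats starting at p.
   The aligned load is used only when the address allows it. */
static float sum4(const float *p) {
    __m128 v = is_aligned16(p)
        ? _mm_load_ps(p)    /* aligned load: may fault on a misaligned address */
        : _mm_loadu_ps(p);  /* unaligned load: accepts any address */
    float out[4];
    _mm_storeu_ps(out, v);  /* out has no alignment guarantee, so store unaligned */
    return out[0] + out[1] + out[2] + out[3];
}
```

Calling `sum4(buf)` on a 16-byte-aligned array takes the aligned path, while `sum4(buf + 1)` starts 4 bytes past the boundary and must take the unaligned path. Passing `buf + 1` straight to `_mm_load_ps` is case (a) above: it can raise a fault at run time.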

You might wonder: why not just use the misaligned versions of these SSE load/store instructions regardless of alignment? The answer is that these instructions are typically much slower than their aligned counterparts, because they generally behave as per (b) above, which typically makes them 2x or more slower. The exception is recent Intel CPUs such as the Core i7, where the penalty is much smaller but not insignificant.
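One common way to sidestep the penalty is to allocate the data on a 16-byte boundary in the first place, so the aligned instructions can be used throughout. The sketch below assumes a C11 library that provides `aligned_alloc`; the function names `alloc_floats16` and `add_aligned` are hypothetical and chosen only for this example.

```c
#include <stddef.h>
#include <stdlib.h>     /* aligned_alloc, free (C11) */
#include <xmmintrin.h>  /* _mm_load_ps, _mm_store_ps, _mm_add_ps */

/* Hypothetical helper: allocates room for n floats on a 16-byte boundary.
   aligned_alloc expects the size to be a multiple of the alignment,
   so the byte count is rounded up to the next multiple of 16. */
static float *alloc_floats16(size_t n) {
    size_t bytes = (n * sizeof(float) + 15u) & ~(size_t)15u;
    return (float *)aligned_alloc(16, bytes);
}

/* Hypothetical helper: dst[i] = a[i] + b[i] for i in [0, n).
   Assumes dst, a and b are all 16-byte aligned. */
static void add_aligned(float *dst, const float *a, const float *b, size_t n) {
    size_t i = 0;
    /* Each step advances 4 floats = 16 bytes, so every address stays aligned. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(dst + i, _mm_add_ps(va, vb));
    }
    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Buffers obtained from `alloc_floats16` are released with the ordinary `free`. If the pointers come from somewhere that does not guarantee alignment, the aligned intrinsics here would need to be swapped for `_mm_loadu_ps` and `_mm_storeu_ps`.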
