What's the actual effect of successful unaligned accesses on x86?

Problem description

I always hear that unaligned accesses are bad because they will either cause runtime errors and crash the program or slow memory accesses down. However, I can't find any actual data on how much they will slow things down.

Suppose I'm on x86 and have some (as yet unknown) share of unaligned accesses. What's the worst slowdown actually possible, and how do I estimate it without eliminating all unaligned accesses and comparing the run times of the two versions of the code?

Solution

It depends on the instruction(s). For most x86 SSE load/store instructions (excluding the unaligned variants), a misaligned address causes a fault, which means it will probably crash your program or lead to lots of round trips to your exception handler (in which case almost all, or all, of the performance is lost). The unaligned load/store variants run at double the number of cycles, IIRC, because they perform partial reads/writes, so two are required to complete the operation (unless you are lucky and the data is in cache, which greatly reduces the penalty).
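To make the distinction between the aligned and unaligned SSE variants concrete, here is a minimal sketch in C using SSE2 intrinsics. The helper names (`is_aligned_16`, `load_128`, `sum4_i32`) are made up for illustration. `_mm_load_si128` requires a 16-byte-aligned address and faults otherwise, while `_mm_loadu_si128` accepts any address. The sketch assumes an x86 or x86-64 compiler with SSE2 available.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (x86 / x86-64 only) */
#include <stdint.h>

/* Hypothetical helper: nonzero if p sits on a 16-byte boundary. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: pick the aligned load only when it is safe.
 * The aligned form faults on a misaligned address, so the unaligned
 * form is the fallback for everything else. */
__m128i load_128(const void *p)
{
    if (is_aligned_16(p))
        return _mm_load_si128((const __m128i *)p);
    return _mm_loadu_si128((const __m128i *)p);
}

/* Hypothetical helper: sum the four 32-bit lanes found at p,
 * whatever the alignment of p. */
int32_t sum4_i32(const void *p)
{
    int32_t lanes[4];
    /* Unaligned store, because lanes is only guaranteed 4-byte alignment. */
    _mm_storeu_si128((__m128i *)lanes, load_128(p));
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

The explicit branch is only there to show both paths. In real code it is usually simpler to call the unaligned variant wherever alignment cannot be proven.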

For general x86 load/store instructions, the penalty is speed, as more cycles are required to do the read or write. Misalignment may also affect caching, leading to cache line splits and cache boundary straddling. It also prevents atomicity of reads and writes (which is guaranteed for all aligned reads/writes on x86; barriers and propagation are a separate matter, but using a LOCK'ed instruction on unaligned data may cause an exception or greatly increase the already massive penalty the bus lock incurs), which is a no-no for concurrent programming.
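One rough way to estimate how many accesses are exposed to the cache-line-split penalty, without rewriting the program, is to count how many of them straddle a line boundary. The sketch below is an illustration under stated assumptions: the 64-byte line size is assumed (it is common on x86 but not guaranteed), and the function names `crosses_cache_line` and `count_line_splits` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines; adjust for the target CPU. */
#define CACHE_LINE_SIZE 64u

/* Nonzero if the access [addr, addr + size) touches two cache lines. */
int crosses_cache_line(uintptr_t addr, size_t size)
{
    if (size == 0)
        return 0;
    /* Compare the line index of the first and the last byte touched. */
    return (addr / CACHE_LINE_SIZE) != ((addr + size - 1) / CACHE_LINE_SIZE);
}

/* Count how many of n accesses of `size` bytes, starting at `base`
 * and spaced `stride` bytes apart, straddle a cache-line boundary. */
size_t count_line_splits(uintptr_t base, size_t stride, size_t size, size_t n)
{
    size_t splits = 0;
    for (size_t i = 0; i < n; ++i)
        splits += (size_t)crosses_cache_line(base + i * stride, size);
    return splits;
}
```

For example, 8-byte accesses at stride 8 starting one byte past a line boundary split once every eight accesses, while the same pattern starting on the boundary never splits. Multiplying the split count by a per-split cost taken from a reference table gives only an upper-bound style estimate, not a measurement.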

Intel's x86 and x64 optimization manual goes into great detail about each of the aforementioned problems, their side effects, and how to remedy them.

Agner Fog's optimization manuals should have the exact numbers you are looking for in terms of raw cycle throughput.
