Are word-aligned loads faster than unaligned loads on x64 processors?

Question

Are loads of variables that are aligned on word boundaries faster than unaligned loads on x86-64 (Intel/AMD 64-bit) processors?

A colleague of mine argues that unaligned loads are slow and should be avoided. He cites the padding of items to word boundaries in structs as proof that unaligned loads are slow. Example:

#include <stdint.h>

struct A {
  char a;      /* 1 byte, followed by padding so that b is aligned */
  uint64_t b;  /* 8 bytes */
};

The struct A usually has a size of 16 bytes.
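The padding can be inspected directly with sizeof and offsetof. The following is a minimal sketch, not part of the original question; the helper names padding_before_b and check_layout are made up for illustration, and the 16-byte expectation is only asserted when the compiler reports an x86-64 target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct A {
  char a;
  uint64_t b;
};

/* Number of padding bytes the compiler placed between a and b. */
size_t padding_before_b(void) {
  return offsetof(struct A, b) - sizeof(char);
}

/* Hypothetical self-check of the layout claim made in the question. */
void check_layout(void) {
  /* The struct must at least hold both members. */
  assert(sizeof(struct A) >= sizeof(char) + sizeof(uint64_t));
#if defined(__x86_64__) || defined(_M_X64)
  /* On the common x86-64 ABIs, b sits at offset 8 and the struct is 16 bytes. */
  assert(offsetof(struct A, b) == 8);
  assert(sizeof(struct A) == 16);
#endif
}
```

Note that this only shows that the compiler aligns members by default. It says nothing about how fast an unaligned load actually is.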

On the other hand, the documentation of the Snappy compressor states that Snappy assumes that "unaligned 32- and 64-bit loads and stores are cheap". According to the source code, this is true of Intel 32-bit and 64-bit processors.
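For reference, one common portable way to express an unaligned load or store in C is to go through memcpy, which compilers for x86-64 typically turn into a single move instruction. This is a sketch of that general idiom, not a copy of Snappy's code; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reads a 64-bit value from any address, aligned or not. */
uint64_t load_u64_unaligned(const void *p) {
  uint64_t v;
  memcpy(&v, p, sizeof v);
  return v;
}

/* Writes a 64-bit value to any address, aligned or not. */
void store_u64_unaligned(void *p, uint64_t v) {
  memcpy(p, &v, sizeof v);
}

/* Hypothetical self-check: store and reload at an odd offset. */
void check_roundtrip(void) {
  unsigned char buf[16] = {0};
  const uint64_t value = 0x1122334455667788ULL;
  /* buf + 1 is misaligned for uint64_t whenever buf is 8-byte aligned. */
  store_u64_unaligned(buf + 1, value);
  assert(load_u64_unaligned(buf + 1) == value);
  /* The bytes on either side of the 8-byte window are untouched. */
  assert(buf[0] == 0);
  assert(buf[9] == 0);
}
```

Casting the pointer to uint64_t * and dereferencing it would be undefined behavior in C for a misaligned address, even on hardware that tolerates it, which is why the memcpy form is used here.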

So: what is the truth here? Are unaligned loads slower, and if so, by how much? Under which circumstances?

Recommended answer

A Random Guy On The Internet I've found says that on the 486, an aligned 32-bit access takes one cycle. An unaligned 32-bit access that spans quads but stays within the same cache line takes four cycles. An unaligned access that spans multiple cache lines can take an extra six to twelve cycles.

Given that an unaligned access requires accessing multiple quads of memory, pretty much by definition, I'm not at all surprised by this. I'd imagine that better caching performance on modern processors makes the cost a little less bad, but it's still something to be avoided.
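Whether a given access falls into one of the slower cases comes down to whether its byte range straddles a block boundary. Here is a small sketch of that check; the helper name spans_blocks is made up, a "quad" is taken to mean a 4-byte aligned block, and the 64-byte cache line size used in the comments is an assumption that holds for typical current x86-64 parts but should be verified for the target CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the byte range [addr, addr + size) touches more than one
   aligned block of `block` bytes, 0 otherwise. Requires size > 0 and
   block > 0. */
int spans_blocks(uintptr_t addr, size_t size, size_t block) {
  assert(size > 0 && block > 0);
  uintptr_t first = addr / block;              /* block holding the first byte */
  uintptr_t last = (addr + size - 1) / block;  /* block holding the last byte */
  return first != last;
}

/* Examples for a 4-byte load:
   spans_blocks(8, 4, 4)   -> aligned, stays in one quad
   spans_blocks(10, 4, 4)  -> crosses a quad boundary
   spans_blocks(10, 4, 64) -> still inside one (assumed) 64-byte cache line
   spans_blocks(62, 4, 64) -> crosses a cache line boundary */
```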

(Incidentally, if your code has any pretensions to portability... ia32 and descendants are pretty much the only modern architectures that support unaligned accesses at all. ARM, for example, can vary between throwing an exception, emulating the access in software, or just loading the wrong value, depending on the OS!)

Update: Here's someone who actually went and measured it. On his hardware he reckons unaligned access to be half as fast as aligned. Go try it for yourself...
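A rough starting point for such a measurement might look like the sketch below. It is not the benchmark referred to above; the function names are hypothetical, clock() is a coarse timer, and an optimizing compiler may vectorize or fold the loops, so the generated code should be inspected before trusting any numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Sums `count` consecutive 64-bit values starting at base + offset.
   With offset 0 and an 8-byte aligned base the loads are aligned;
   with offset 1..7 every load is misaligned. */
uint64_t sum_u64_at(const unsigned char *base, size_t offset, size_t count) {
  uint64_t total = 0;
  for (size_t i = 0; i < count; i++) {
    uint64_t v;
    memcpy(&v, base + offset + i * sizeof v, sizeof v);
    total += v;
  }
  return total;
}

/* Runs `reps` passes and returns the elapsed CPU time in seconds.
   The accumulated sum is written to *out so the work has a visible result.
   The buffer must hold at least offset + count * 8 bytes. */
double time_loads(const unsigned char *base, size_t offset, size_t count,
                  int reps, uint64_t *out) {
  assert(reps > 0);
  uint64_t total = 0;
  clock_t start = clock();
  for (int r = 0; r < reps; r++) {
    total += sum_u64_at(base, offset, count);
  }
  clock_t end = clock();
  *out = total;
  return (double)(end - start) / CLOCKS_PER_SEC;
}
```

To compare, call time_loads on the same large buffer once with offset 0 and once with offset 1, using a buffer much larger than the cache to include memory effects or a small one to isolate the load unit itself.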
