Are word-aligned loads faster than unaligned loads on x64 processors?


Question

Are loads of variables that are aligned on word boundaries faster than unaligned load operations on x86/64 (Intel/AMD 64-bit) processors?

A colleague of mine argues that unaligned loads are slow and should be avoided. He cites the padding of items to word boundaries in structs as proof that unaligned loads are slow. Example:

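#include <stdint.h>

struct A {
  char a;      // 1 byte, typically followed by 7 bytes of padding
  uint64_t b;  // 8 bytes, placed on an 8-byte boundary
};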

The struct A usually has a size of 16 bytes.
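The 16-byte figure can be checked directly. The following is a minimal sketch, assuming a typical 64-bit ABI in which `uint64_t` has 8-byte alignment; the helper names `size_of_A` and `padding_after_a` are hypothetical and exist only for illustration. Note that it shows what the compiler does, not why: the padding follows from the ABI's alignment rules and does not by itself say how slow an unaligned load is.

```c
#include <stddef.h>
#include <stdint.h>

struct A {
  char a;
  uint64_t b;
};

/* Hypothetical helper: total size of struct A, including padding. */
size_t size_of_A(void) {
  return sizeof(struct A);
}

/* Hypothetical helper: number of padding bytes between a and b. */
size_t padding_after_a(void) {
  return offsetof(struct A, b) - sizeof(char);
}
```

On such an ABI, `size_of_A()` would be expected to return 16 and `padding_after_a()` to return 7; on a platform with 4-byte alignment for `uint64_t` the numbers would differ.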

On the other hand, the documentation of the Snappy compressor states that it assumes that "unaligned 32- and 64-bit loads and stores are cheap". According to the source code, this is true of Intel 32- and 64-bit processors.
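For readers who have not seen what such an unaligned load looks like in code, here is a minimal sketch. It is not Snappy's actual implementation; the names `load_u64_unaligned` and `store_u64_unaligned` are hypothetical. It uses `memcpy`, which is well-defined in C for any alignment, and which optimizing compilers on x86-64 typically reduce to a single move instruction.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 64-bit value from any byte address. */
uint64_t load_u64_unaligned(const void *p) {
  uint64_t v;
  assert(p != NULL);
  memcpy(&v, p, sizeof v); /* valid regardless of the alignment of p */
  return v;
}

/* Hypothetical helper: write a 64-bit value to any byte address. */
void store_u64_unaligned(void *p, uint64_t v) {
  assert(p != NULL);
  memcpy(p, &v, sizeof v);
}
```

Casting a misaligned `char *` to `uint64_t *` and dereferencing it may appear to work on x86, but it is undefined behavior in C, which is one reason the `memcpy` form is often preferred.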

So: what is the truth here? Are unaligned loads slower, and if so, by how much? Under which circumstances?

Recommended answer

A Random Guy On The Internet I've found says that, on the 486, an aligned 32-bit access takes one cycle. An unaligned 32-bit access that spans quads but stays within the same cache line takes four cycles. An unaligned access that spans multiple cache lines can take an extra six to twelve cycles.
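The three cases above (aligned, spanning quads within one cache line, spanning cache lines) can be told apart with simple address arithmetic. This is a minimal sketch; the function name `crosses_boundary` is hypothetical, and the cache-line size is a parameter because it varies by processor (64 bytes is common on current x86-64 parts, which is an assumption rather than something stated in the text).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: does an access of `size` bytes starting at `addr`
 * touch more than one `boundary`-sized block (e.g. 4 for a quad,
 * 64 for an assumed cache-line size)? */
bool crosses_boundary(uintptr_t addr, size_t size, size_t boundary) {
  assert(size > 0 && boundary > 0);
  uintptr_t first_block = addr / boundary;
  uintptr_t last_block = (addr + size - 1) / boundary;
  return first_block != last_block;
}
```

For example, a 4-byte access at address 2 crosses a 4-byte boundary but not a 64-byte one, while a 4-byte access at address 62 crosses both.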

Given that an unaligned access requires accessing multiple quads of memory, pretty much by definition, I'm not at all surprised by this. I'd imagine that better caching performance on modern processors makes the cost a little less bad, but it's still something to be avoided.

(Incidentally, if your code has any pretensions to portability... ia32 and its descendants are pretty much the only modern architectures that support unaligned accesses at all. ARM, for example, can vary between throwing an exception, emulating the access in software, or just loading the wrong value, depending on the OS!)

Update: Here's someone who actually went and measured it. On his hardware he reckons unaligned access to be half as fast as aligned. Go try it for yourself...
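If you do want to try it yourself, a measurement could be structured roughly as follows. This is only a sketch under stated assumptions, not the benchmark referred to above: the names `sum_u64_from` and `time_sum` are hypothetical, timing uses the coarse standard `clock()`, and an optimizing compiler may simplify the repeated passes, so any numbers it produces should be treated with caution.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: sum `count` 64-bit values read from
 * base + offset, spaced 8 bytes apart. With offset 0 and an aligned
 * base every load is aligned; with offset 1 every load is unaligned.
 * The buffer must hold at least offset + count * 8 bytes. */
uint64_t sum_u64_from(const unsigned char *base, size_t offset, size_t count) {
  uint64_t total = 0;
  for (size_t i = 0; i < count; i++) {
    uint64_t v;
    memcpy(&v, base + offset + i * sizeof v, sizeof v);
    total += v;
  }
  return total;
}

/* Hypothetical helper: run `reps` passes and return the CPU time in
 * seconds. The accumulated sum is written to *sink so that the result
 * is observable by the caller. */
double time_sum(const unsigned char *base, size_t offset, size_t count,
                int reps, uint64_t *sink) {
  uint64_t total = 0;
  clock_t start = clock();
  for (int r = 0; r < reps; r++) {
    total += sum_u64_from(base, offset, count);
  }
  clock_t end = clock();
  *sink = total;
  return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A plausible way to use it is to allocate a large buffer aligned to the cache-line size, then compare `time_sum` for offset 0 against offset 1, using a buffer much larger than the cache in one run and a small one in another, since the answer may depend on whether loads cross cache lines and on where the data lives.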
