Why and where is align 16 used for SSE alignment of instructions?


Question

I am reading the Modern x86 Assembly Language book from Apress. For the 64-bit SSE programming examples, the author puts align 16 at a particular point in the code. For example:

    .code
ImageUint8ToFloat_ proc frame
_CreateFrame U2F_,0,64               ; helper macros to create prolog
_SaveXmmRegs xmm10,xmm11,xmm12,xmm13 ; helper macros to create prolog

_EndProlog  ; helper macros to create prolog

...

shrd r8d,
pxor xmm5,xmm5

align 16  ; Why this is here ?
@@:
movdqa xmm0,xmmword ptr [rdx]
movdqa xmm10,xmmword ptr [rdx+16]

movdqa xmm2,xmm0
punpcklbw xmm0,xmm5
punpckhbw xmm2,xmm5
movdqa xmm1,xmm0
movdqa xmm3,xmm2

...

The author explains that align 16 is necessary because we are using SSE, so that the instructions themselves are aligned. That's fine. My question is why the author chose to put align 16 at that particular location. As a programmer, how should I decide on the correct location for align 16? Why not earlier or later?

Recommended answer

It's not necessary. It is occasionally beneficial.

Modern processors fetch code in aligned blocks of 16 bytes (or maybe 32, sort of; AMD does weird things). If you jump to a target near the end of such a block, most of that fetch is wasted, and in that cycle you decode only one instruction, or possibly none. That's a big waste, so it's better to jump to the start of a block.
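
To make the waste concrete, here is a minimal sketch in C of the arithmetic behind that argument. It assumes the simplified model described above (naturally aligned fetch blocks of a fixed power-of-two size); the function name `useful_fetch_bytes` is hypothetical, and real front ends are more complicated than this.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model (an assumption, not a description of any specific CPU):
 * code is fetched in naturally aligned blocks of block_size bytes.
 * Returns how many bytes of the first fetched block lie at or after the
 * jump target, i.e. how many bytes of that fetch can hold useful code. */
static unsigned useful_fetch_bytes(uint64_t target, unsigned block_size)
{
    /* block_size must be a nonzero power of two for the mask to work. */
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);

    unsigned offset = (unsigned)(target & (block_size - 1)); /* position inside the block */
    return block_size - offset;
}
```

Under this model, a loop label at an address ending in 0x0 gets a full 16 useful bytes from the first fetch, while a label at an address ending in 0xF gets only 1 byte, which cannot hold a typical SSE instruction in full.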

That doesn't always matter, for example when the code is served from the loop buffer or the µop cache (if the processor has one). On processors with a µop cache, just about any loop typically fits in it. On processors older than Sandy Bridge, it was fairly easy to write a loop that didn't fit in the loop buffer, which made fetch throughput important. Even when a loop did fit in the loop buffer, alignment still helped on Core 2, because misalignment effectively makes the loop buffer smaller there (it is based on 16-byte blocks of code, cached after predecoding). There are some more weird details, but they all concern ancient µarchs, so I'll skip them. The point is that on µarchs like Nehalem and older, you should often align loops.

Though it's not entirely clear from the fragment, it looks like they've aligned the label that the loop jumps back to, so the directive is aligning the loop. That's not important on modern µarchs.
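
As a rough illustration of what aligning the loop label costs, the sketch below computes how many padding bytes an assembler would have to emit in front of a label to move it to the next multiple of the alignment. The helper name `padding_for_alignment` and the example addresses are hypothetical; the actual address of the `@@:` label in the book's code is not known from the fragment.

```c
#include <assert.h>
#include <stdint.h>

/* Number of padding bytes needed so that the next item starts at a multiple
 * of `alignment`. Returns 0 when `address` is already aligned.
 * `alignment` must be a nonzero power of two (16 for `align 16`). */
static unsigned padding_for_alignment(uint64_t address, unsigned alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    uint64_t mask = alignment - 1;
    /* (alignment - offset) wrapped back to 0 when the offset is already 0. */
    return (unsigned)((alignment - (address & mask)) & mask);
}
```

The padding sits before the label, so it is passed through once when execution falls into the loop from the code above, and never again on the backward jumps. That is one reason to place the directive directly in front of the loop's target label rather than somewhere inside the loop body.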
