Rotation or Shifting with x86/x64 Assembly


Question

I have a function that I'm writing in assembly, and I want to be sure which approach will give me the best throughput.

I have a 64-bit value in RAX and I need to get the most significant byte and perform some operations on it. What is the best way of going about this?

shr  rax, 56    ; This will get me the most significant byte in al.

However, is this more efficient than...

rol  rax, 8
and  rax, r12   ; I already have the value 255 in r12

The reason I'm asking is that on some architectures, the time a shift takes is a function of the shift count. If I recall correctly, on the 680x0 chips it was 6 + 2n cycles, where n was the shift count. I don't think this is true on x86 architectures, but I'm not sure, so some enlightenment would be appreciated. (I understand about latency.)

Or is there an easy way to swap bits 0-31 of RAX with bits 32-63 without rotating or shifting? Something like what swap did on the 680x0.

Recommended answer

According to the instruction tables at http://agner.org/optimize/, rol with an immediate count is a single-uop/m-op instruction with 1 cycle latency on Intel (Pentium M to Haswell) and AMD (K8 to Steamroller). Throughput ranges from one per clock to three per clock, depending on the microarchitecture.
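To make the two options from the question concrete, here is a minimal sketch in C (not the asker's assembly) showing that they produce the same value. The function names top_byte_shift and top_byte_rotate are hypothetical, chosen only for this illustration. The sketch says nothing about timing; for that, the instruction tables remain the reference.

```c
#include <stdint.h>

/* Mirrors `shr rax, 56`: the most significant byte lands in the low 8 bits
   and every other bit is zero. */
static inline uint64_t top_byte_shift(uint64_t x) {
    return x >> 56;
}

/* Mirrors `rol rax, 8` followed by `and rax, r12` with r12 = 255:
   rotate the top byte around into the low 8 bits, then mask off the rest. */
static inline uint64_t top_byte_rotate(uint64_t x) {
    uint64_t rotated = (x << 8) | (x >> 56);  /* rotate left by 8 */
    return rotated & 0xFF;
}
```

Both functions return the same result for every input. The shift version needs one operation where the rotate version needs two, so the rotate-and-mask form is unlikely to be faster when the rotate itself costs about the same as a shift.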

Rotate with a variable count (rol r, cl) is slower on Intel and the same speed on AMD.
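As an illustration of a variable-count rotate, and of the half-swap the question asks about, here is a hedged C sketch. The names rotl64 and swap_halves are hypothetical. Rotating a 64-bit value by 32 exchanges its low and high 32-bit halves, which is the closest equivalent to the 680x0 swap shown here. Whether a compiler turns rotl64 into a single rol is an assumption about the compiler, not something the C code guarantees.

```c
#include <stdint.h>

/* Rotate left by n bits. The count is masked to 0..63 so that the
   expression never shifts by 64, which would be undefined behavior in C. */
static inline uint64_t rotl64(uint64_t x, unsigned n) {
    n &= 63;
    return (x << n) | (x >> ((64 - n) & 63));
}

/* Exchange bits 0-31 with bits 32-63: a rotate by a constant 32. */
static inline uint64_t swap_halves(uint64_t x) {
    return rotl64(x, 32);
}
```

With a constant count of 32, this corresponds to the immediate-count case described above (rol rax, 32), not to the slower variable-count form.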

Obviously, read Agner Fog's guides if you're asking this kind of question, since there's more to high performance than single instructions taken alone.

If you're doing this on multiple data items, you could use vector shuffles on 16-byte (xmm registers with SSE) or 32-byte (ymm registers with AVX) chunks at once. pshufd xmm, xmm, imm lets you pick any input dword for each output dword, so you can broadcast as well as do a plain shuffle.
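A rough sketch of the pshufd idea in C, using the SSE2 intrinsic _mm_shuffle_epi32 (which maps to pshufd). The helper name swap_halves_x2 is hypothetical, and the sketch assumes an x86 target with SSE2 available. It swaps the 32-bit halves of two 64-bit values in one shuffle.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics, including _mm_shuffle_epi32 */

/* Swap the low and high 32-bit halves of two 64-bit values at once. */
static inline void swap_halves_x2(const uint64_t in[2], uint64_t out[2]) {
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* Immediate 0xB1 = binary 10 11 00 01. Each 2-bit field, lowest first,
       selects the input dword for output dwords 0, 1, 2, 3: here 1, 0, 3, 2.
       That exchanges dwords 0<->1 and 2<->3. */
    __m128i swapped = _mm_shuffle_epi32(v, 0xB1);
    _mm_storeu_si128((__m128i *)out, swapped);
}
```

For example, passing {0x0123456789ABCDEF, 0x1111111122222222} yields {0x89ABCDEF01234567, 0x2222222211111111}. Other immediates give other selections, such as 0x00 to broadcast dword 0 to all four positions.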
