是按位轮换比目前英特尔CPU轮班慢? [英] Are bitwise rotations slower than shifts on current Intel CPU?

查看:204
本文介绍了是按位轮换比目前英特尔CPU轮班慢?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我很好奇,如果 java.lang.Integer.rotateLeft 被使用旋转指令优化,并写了一个标杆吧。结果是不确定的:它比两班快很多,但不是单一慢一点。所以我重写了它在C ++和有大约相同的结果。当通过 G ++ -S -Wall -O3 编译我可以看到的生成的汇编。我的CPU是英特尔酷睿i5。

借助基准是很长,肯定不是最好的一块code,但我不认为这是坏了。或者是什么?根据文档的旋转需要一个周期,就像转变。任何人都可以解释的结果?

 旋转:6860
转变:5100


前两个答案都是错的。 GCC和Java的JIT知道旋转的指令并使用它们。关于GCC看到上面的链接,关于Java的看到我的的Java基准及其结果

 基准NS线运行
   旋转3.48 ====================
NonRotate 5.05 ==============================
    移2.16 ============


解决方案

我不知道gcc和java的JIT有能力认识到SHIFT和OR运算的序列可以减少到一个循环指令,很有意思的。

g ++编译器解开你的循环和使用 SHIFT立即旋转立即的说明(因为你移动和旋转常数值)。

下面是在时光平移循环展开的情况下重复了6个指令序列:

  MOVQ%RAX,RBX%
salq $ 13%RBX
leaq(RBP%,%RBX),%RBX
MOVQ%RDI,RBP%
SARQ $ 27%RBP
xorq%RBX,RDX%

下面是在TimeRotate循环展开的情况下重复了6个指令序列:

  MOVQ%的RDX,%RBX
rorq $ 45%RBX
leaq(RBP%,%RBX),%RBX
MOVQ%R8,RBP%
rorq $ 49%RBP
xorq%RBX,R9%

它们的主要区别在使用salq / SARQ为 SHIFT 和rorq为旋转让你在正确的想知道为什么时间是不同的。

答案就在的Sandy Bridge(您的酷睿i5处理器)的微架构和深在的英特尔®64和IA-32处理器架构优化参考手册
最新的是订单号:2012年4月248966-026

SHIFT 指令有1个周期的延迟是否使用 1 运算code或通过即时。该处理器可调度 - 它可以从端口0 端口1 ,为此有0.5周期吞吐量调度和退休两个每个周期SHIFT立即的说明。在旋转指令需要三个微操作是否需要条件标志的结果(它们不是由GCC生成的code)和两个微操作如果不是(在你的情况下,于是两个微操作)。在旋转指令,但是,只能从端口1 派出因此具有1个周期的吞吐量 - 处理器可以调度和退休只有一个每个周期ROTATE立即

我已经复制了相关的图像及以下部分。

3.5.1.5按位轮换

按位旋转可以选择数在CL指定轮流注册,一
立即数和1位。一般情况下,通过直接的旋转和旋转
注册指令是由1位比旋转慢。 1指令的旋转有
相同的延迟作为移位。
组装/编译器编码规则35.(ML撞击,L通用)避免旋转
通过即时指令寄存器或旋转。如果可能的话,请更换
通过1个指令旋转。

在英特尔微架构code名称的Sandy Bridge,ROL /通过直接ROR有1
循环可以通过,SHLD / SHRD使用由相同的寄存器的源和目的地
一个立即数有0.5周期吞吐量1个周期的延迟。在ROL / ROR
章,将imm8指令有两个微操作与1周期为旋转延迟
注册的结果,如果使用2个周期的标志。
在英特尔微架构code名称的Ivy Bridge的ROL / ROR章,将imm8指令立即大于1,一个微操作有一个周期的延迟时
溢出标志结果被使用。当眼前的是一,对溢出的依赖
通过后续指令ROL / ROR的标志结果可以看到ROL / ROR指令
与两周期的延迟。

2.4.4.2执行单元和问题端口

目前每个周期内,芯可以分派μops到一个或多个四个问题端口。在
微架构的水平,存储操作被进一步分为两个部分:店
数据和存储地址操作。这四个端口通过它μops分派
为执行单元和装载和存储操作如图2-6所示。一些
端口可以​​调度每个时钟两μops。这些执行单元标双
速度。

端口0。在周期的前半部分,端口0可以分派任何一个浮点
移动μop(浮点堆栈动,浮点交换或浮点
存储数据)或者一个算术逻辑运算单元(ALU)μop(算术,逻辑,分支机构或商店
数据)。在周期的后半期,可以调度种相似ALUμop。

端口1。在周期的前半部分,端口1可以分派任何一个浮点
执行(除移动所有浮点运算,所有SIMD操作)μop或
一个单个普通速度的整数(乘法,移位和旋转)μop或者一个ALU(算术)
μop。在周期的后半期,可以调度种相似ALUμop。

端口2。此端口支持每个周期负荷运行的调度。

端口3。此端口支持每个周期存储地址操作的调度。

总发行带宽范围可以从每个周期的零到六μops。每个管道
包含几个执行单元。的μops被分派到对应于正确类型的操作的流水线。例如,整数算术逻辑单元
和浮点执行单元(加法器,乘法器和除法器)可以共享一个
管道。

I was curious if java.lang.Integer.rotateLeft gets optimized by using a rotation instruction and wrote a benchmark for it. The results were inconclusive: It was much faster than two shifts but a bit slower than a single one. So I rewrote it in C++ and got about the same results. When compiling via g++ -S -Wall -O3 I can see the instruction in the generated assembler. My CPU is Intel Core i5.

The benchmark is quite long and surely not the nicest piece of code, but I don't think it's broken. Or is it? According to the documentation the rotations take one cycle, just like shifts. Can anybody explain the results?

rotations:  6860
shift:      5100


The first two answers are wrong. Both gcc and java's JIT know the rotation instructions and use them. Concerning gcc see the link above, concerning java see my java benchmark and its results

benchmark   ns linear runtime
   Rotate 3.48 ====================
NonRotate 5.05 ==============================
    Shift 2.16 ============

解决方案

I did not know that gcc and the java jit were capable of recognizing that a sequence of SHIFT and OR operators can be reduced to a ROTATE instruction, very interesting.

The g++ compiler unrolls your loops and uses SHIFT immediate and ROTATE immediate instructions (since you shift and rotate by constant values).

Here's the six instruction sequence that is repeated in the TimeShift loop unroll case:

movq    %rax, %rbx
salq    $13, %rbx
leaq    (%rbp,%rbx), %rbx
movq    %rdi, %rbp
sarq    $27, %rbp
xorq    %rbx, %rdx

Here's the six instruction sequence that is repeated in the TimeRotate loop unroll case:

movq    %rdx, %rbx
rorq    $45, %rbx
leaq    (%rbp,%rbx), %rbx
movq    %r8, %rbp
rorq    $49, %rbp
xorq    %rbx, %r9

They differ mainly in the use of salq/sarq for SHIFT and rorq for ROTATE so you are correct in wondering why the timing differs.

The answer lies deep in the micro-architecture of Sandy Bridge (your Core i5 processor) and is found in INTEL® 64 and IA-32 Processor Architectures Optimization Reference Manual The latest is Order Number: 248966-026 April 2012

The SHIFT instruction has 1 cycle latency whether you use the by 1 opcode or by immediate. It can dispatch from either Port 0 or Port 1 and for this reason has a 0.5 cycle throughput - the processor can dispatch and retire two SHIFT immediate instructions per cycle. The ROTATE instruction needs three micro-ops if the results of the condition flags are needed (they aren't in the code generated by gcc) and two micro-ops if not (so two micro-ops in your case). The ROTATE instruction, however, can only be dispatched from Port 1 and therefore has a 1 cycle throughput - the processor can dispatch and retire only one ROTATE immediate per cycle.

I've copied the relevant image and section below.

3.5.1.5 Bitwise Rotation

Bitwise rotation can choose between rotate with count specified in the CL register, an immediate constant and by 1 bit. Generally, The rotate by immediate and rotate by register instructions are slower than rotate by 1 bit. The rotate by 1 instruction has the same latency as a shift. Assembly/Compiler Coding Rule 35. (ML impact, L generality) Avoid ROTATE by register or ROTATE by immediate instructions. If possible, replace with a ROTATE by 1 instruction. In Intel microarchitecture code name Sandy Bridge, ROL/ROR by immediate has 1- cycle throughput, SHLD/SHRD using the same register as source and destination by an immediate constant has 1-cycle latency with 0.5 cycle throughput. The "ROL/ROR reg, imm8" instruction has two micro-ops with the latency of 1-cycle for the rotate register result and 2-cycles for the flags, if used. In Intel microarchitecture code name Ivy Bridge, The "ROL/ROR reg, imm8" instruction with immediate greater than 1, is one micro-op with one-cycle latency when the overflow flag result is used. When the immediate is one, dependency on the overflow flag result of ROL/ROR by a subsequent instruction will see the ROL/ROR instruction with two-cycle latency.

2.4.4.2 Execution Units and Issue Ports

At each cycle, the core may dispatch µops to one or more of four issue ports. At the microarchitecture level, store operations are further divided into two parts: store data and store address operations. The four ports through which μops are dispatched to execution units and to load and store operations are shown in Figure 2-6. Some ports can dispatch two µops per clock. Those execution units are marked Double Speed.

Port 0. In the first half of the cycle, port 0 can dispatch either one floating-point move µop (a floating-point stack move, floating-point exchange or floating-point store data) or one arithmetic logical unit (ALU) µop (arithmetic, logic, branch or store data). In the second half of the cycle, it can dispatch one similar ALU µop.

Port 1. In the first half of the cycle, port 1 can dispatch either one floating-point execution (all floating-point operations except moves, all SIMD operations) µop or one normal-speed integer (multiply, shift and rotate) µop or one ALU (arithmetic) µop. In the second half of the cycle, it can dispatch one similar ALU µop.

Port 2. This port supports the dispatch of one load operation per cycle.

Port 3. This port supports the dispatch of one store address operation per cycle.

The total issue bandwidth can range from zero to six µops per cycle. Each pipeline contains several execution units. The µops are dispatched to the pipeline that corresponds to the correct type of operation. For example, an integer arithmetic logic unit and the floating-point execution units (adder, multiplier, and divider) can share a pipeline.

这篇关于是按位轮换比目前英特尔CPU轮班慢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆