Why is it faster to print a number in binary using arithmetic instead of _bittest?
The purpose of the next two code sections is to print a number in binary.
The first does this with two instructions (_bittest), while the second does it with pure arithmetic instructions, three instructions in total.
The first code section:
#include <intrin.h>
#include <stdio.h>
#include <Windows.h>
long num = 78002;
int main()
{
unsigned char bits[32];
long nBit;
LARGE_INTEGER a, b, f;
QueryPerformanceCounter(&a);
for (size_t i = 0; i < 100000000; i++)
{
for (nBit = 0; nBit < 31; nBit++)
{
bits[nBit] = _bittest(&num, nBit);
}
}
QueryPerformanceCounter(&b);
QueryPerformanceFrequency(&f);
printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);
printf_s("Binary representation:\n");
while (nBit--)
{
if (bits[nBit])
printf_s("1");
else
printf_s("0");
}
return 0;
}
The inner loop is compiled to the bt and setb instructions.
The second code section:
#include <intrin.h>
#include <stdio.h>
#include <Windows.h>
long num = 78002;
int main()
{
unsigned char bits[32];
long nBit;
LARGE_INTEGER a, b, f;
QueryPerformanceCounter(&a);
for (size_t i = 0; i < 100000000; i++)
{
long curBit = 1;
for (nBit = 0; nBit < 31; nBit++)
{
bits[nBit] = (num&curBit) >> nBit;
curBit <<= 1;
}
}
QueryPerformanceCounter(&b);
QueryPerformanceFrequency(&f);
printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);
printf_s("Binary representation:\n");
while (nBit--)
{
if (bits[nBit])
printf_s("1");
else
printf_s("0");
}
return 0;
}
The inner loop compiles to and, add (as shift left), and sar.
The second code section runs three times faster than the first one.
Why do three CPU instructions run faster than two?
I'm assuming you're using x86-64 MSVC CL19 (or something that makes similar code).
_bittest is slower because MSVC does a horrible job and keeps the value in memory, and bt [mem], reg is much slower than bt reg,reg. This is a compiler missed-optimization. It happens even if you make num a local variable instead of a global, even when the initializer is still constant!

I included some perf analysis for Intel Sandybridge-family CPUs because they're common; you didn't say which CPU you have, and yes, it matters: bt [mem], reg has one-per-3-cycle throughput on Ryzen, but one-per-5-cycle throughput on Haswell. And other perf characteristics differ...

(For just looking at the asm, it's usually a good idea to make a function with args to get code the compiler can't do constant-propagation on. It can't in this case because it doesn't know if anything modifies num before main runs, because it's not static.)
Your instruction-counting didn't include the whole loop so your counts are wrong, but more importantly you didn't consider the different costs of different instructions. (See Agner Fog's instruction tables and optimization manual.)
This is your whole inner loop with the _bittest intrinsic, with uop counts for Haswell / Skylake:
for (nBit = 0; nBit < 31; nBit++) {
bits[nBit] = _bittest(&num, nBit);
//bits[nBit] = (bool)(num & (1UL << nBit)); // much more efficient
}
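As a side note, the commented-out shift-and-mask form can be written as a standalone helper. This is a hedged sketch (the function name and the 32-bit loop bound are mine, not from the question); it keeps num in a register, so the compiler has no reason to emit bt [mem], reg:

```c
/* Sketch: extract bit nBit with a mask instead of _bittest.
 * Compilers turn this into plain shift/and on a register. */
static void unpack_bits_mask(unsigned long num, unsigned char bits[32])
{
    for (int nBit = 0; nBit < 32; nBit++)
        bits[nBit] = (num & (1UL << nBit)) ? 1 : 0; /* LSB first */
}
```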
Asm output from MSVC CL19 -Ox on the Godbolt compiler explorer:
$LL7@main:
bt DWORD PTR num, ebx ; 10 uops (microcoded), one per 5 cycle throughput
lea rcx, QWORD PTR [rcx+1] ; 1 uop
setb al ; 1 uop
inc ebx ; 1 uop
mov BYTE PTR [rcx-1], al ; 1 uop (micro-fused store-address and store-data)
cmp ebx, 31
jb SHORT $LL7@main ; 1 uop (macro-fused with cmp)
That's 15 fused-domain uops, so it can issue (at 4 per clock) in 3.75 cycles. But that's not the bottleneck: Agner Fog's testing found that bt [mem], reg has a throughput of one per 5 clock cycles.
IDK why it's 3x slower than your other loop. Maybe the other ALU instructions compete for the same port as the bt, or a data dependency it's part of causes a problem, or just being a micro-coded instruction is a problem, or maybe the outer loop is less efficient?
Anyway, using bt [mem], reg instead of bt reg, reg is a major missed optimization. This loop would have been faster than your other loop with a 1 uop, 1c latency, 2-per-clock-throughput bt r9d, ebx.
The inner loop compiles to and, add (as shift left), and sar.
Huh? Those are the instructions MSVC associates with the curBit <<= 1; source line (even though that line is fully implemented by the add self,self, and the variable-count arithmetic right shift is part of a different line).
But the whole loop is this clunky mess:
long curBit = 1;
for (nBit = 0; nBit < 31; nBit++) {
bits[nBit] = (num&curBit) >> nBit;
curBit <<= 1;
}
$LL18@main: # MSVC CL19 -Ox
mov ecx, ebx ; 1 uop
lea r8, QWORD PTR [r8+1] ; 1 uop pointer-increment for bits
mov eax, r9d ; 1 uop. r9d holds num
inc ebx ; 1 uop
and eax, edx ; 1 uop
# MSVC says all the rest of these instructions are from curBit <<= 1; but they're obviously not.
add edx, edx ; 1 uop
sar eax, cl ; 3 uops (variable-count shifts suck)
mov BYTE PTR [r8-1], al ; 1 uop (micro-fused)
cmp ebx, 31
jb SHORT $LL18@main ; 1 uop (macro-fused with cmp)
So this is 11 fused-domain uops, and takes 2.75 clock cycles per iteration to issue from the front-end.
I don't see any loop-carried dep chains longer than that front-end bottleneck, so it probably runs about that fast.
Copying ebx to ecx every iteration instead of just using ecx as the loop counter (nBit) is an obvious missed optimization. The shift count is needed in cl for a variable-count shift (unless you enable BMI2 instructions, if MSVC can even do that).
There are major missed optimizations here (in the "fast" version), so you should probably write your source differently to hand-hold your compiler into making less bad code. It implements the source fairly literally instead of transforming it into something the CPU can do efficiently, or using bt reg,reg / setc.
How to do this fast in asm or with intrinsics
Use SSE2 / AVX. Get the right byte (containing the corresponding bit) into each byte element of a vector, and PANDN (to invert your vector) with a mask that has the right bit set for that element. PCMPEQB against zero. That gives you 0 / -1. To get ASCII digits, use _mm_sub_epi8(set1('0'), mask) to subtract 0 or -1 from (i.e. add 0 or 1 to) ASCII '0', conditionally turning it into '1'.
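A minimal SSE2 sketch of that recipe, for the low 16 bits (call it again with num >> 16 for the upper half). The unpack sequence, the mask constant, and the function name are my assumptions; the PANDN / PCMPEQB / PSUB steps follow the description above:

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Sketch: write ASCII '0'/'1' for bits 0..15 of num into out[0..15],
 * LSB first, using the pandn / pcmpeqb / psubb trick described above. */
static void bits16_to_ascii(uint32_t num, char out[16])
{
    /* Broadcast byte 0 of num into lanes 0-7 and byte 1 into lanes 8-15. */
    __m128i v = _mm_cvtsi32_si128((int)(num & 0xFFFF));
    v = _mm_unpacklo_epi8(v, v);   /* b0 b0 b1 b1 ...          */
    v = _mm_unpacklo_epi16(v, v);  /* b0 b0 b0 b0 b1 b1 b1 b1  */
    v = _mm_unpacklo_epi32(v, v);  /* b0 x8, then b1 x8        */

    /* Each lane's mask byte selects the bit that lane represents. */
    const __m128i mask = _mm_set_epi8(
        (char)0x80, 0x40, 0x20, 0x10, 8, 4, 2, 1,
        (char)0x80, 0x40, 0x20, 0x10, 8, 4, 2, 1);

    __m128i inv = _mm_andnot_si128(v, mask);                  /* 0 where bit set */
    __m128i m   = _mm_cmpeq_epi8(inv, _mm_setzero_si128());   /* -1 where bit set */
    __m128i asc = _mm_sub_epi8(_mm_set1_epi8('0'), m);        /* '0' - (-1) = '1' */
    _mm_storeu_si128((__m128i *)out, asc);
}
```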
The first step of this (getting a vector of 0 / -1 from a bitmask) is covered in How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?, and in:
- Fastest way to unpack 32 bits to a 32 byte SIMD vector (has a 128b version). Without SSSE3 (pshufb), I think punpcklbw / punpcklwd (and maybe pshufd) is what you need to repeat each byte of num 8 times and make two 16-byte vectors.
- is there an inverse instruction to the movemask instruction in intel avx2?
In scalar code, this is one way that runs at 1 bit->byte per clock. There are probably ways to do better without using SSE2 (storing multiple bytes at once to get around the 1 store per clock bottleneck that exists on all current CPUs), but why bother? Just use SSE2.
mov eax, [num]
lea rdi, [rsp + xxx] ; bits[]
.loop:
shr eax, 1 ; constant-count shift is efficient (1 uop). CF = last bit shifted out
setc [rdi] ; 2 uops, but just as efficient as setc reg / mov [mem], reg
shr eax, 1
setc [rdi+1]
add rdi, 2
cmp rdi, end_pointer ; compare against an end pointer instead of a separate counter.
jb .loop
Unrolled by two to avoid bottlenecking on the front-end, so this can run at 1 bit per clock.
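In C, the same scalar idea looks like this hedged sketch (shift one bit out at a time and store it; the names are illustrative, not from the post). A compiler may or may not emit shr/setc for it, but it has the same one-store-per-bit structure:

```c
#include <stdint.h>

/* Sketch: peel off the low bit each iteration, the C analogue of the
 * shr / setc loop above. */
static void unpack_bits_shift(uint32_t num, unsigned char bits[32])
{
    for (int i = 0; i < 32; i++) {
        bits[i] = (unsigned char)(num & 1); /* the bit shr would shift into CF */
        num >>= 1;                          /* like shr eax, 1 */
    }
}
```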