Why is it faster to print a number in binary using arithmetic instead of _bittest?


Problem Description

The purpose of the next two code sections is to print a number in binary.
The first one does this with two instructions (_bittest), while the second does it with pure arithmetic instructions, i.e. three instructions.
The first code section:

#include <intrin.h>
#include <stdio.h>  
#include <Windows.h>

long num = 78002;
int main()
{
    unsigned char bits[32];
    long nBit;
    LARGE_INTEGER a, b, f;
    QueryPerformanceCounter(&a);
    for (size_t i = 0; i < 100000000; i++)
    {
        for (nBit = 0; nBit < 31; nBit++)
        {
            bits[nBit] = _bittest(&num, nBit);
        }
    }
    QueryPerformanceCounter(&b);
    QueryPerformanceFrequency(&f);
    printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);

    printf_s("Binary representation:\n");
    while (nBit--)
    {
        if (bits[nBit])
            printf_s("1");
        else
            printf_s("0");
    }
    return 0;
}

The inner loop compiles to the bt and setb instructions. The second code section:

#include <intrin.h>
#include <stdio.h>  
#include <Windows.h>
long num = 78002;
int main()
{
    unsigned char bits[32];
    long nBit;

    LARGE_INTEGER a, b, f;
    QueryPerformanceCounter(&a);
    for (size_t i = 0; i < 100000000; i++)
    {
        long curBit = 1;
        for (nBit = 0; nBit < 31; nBit++)
        {
            bits[nBit] = (num&curBit) >> nBit;
            curBit <<= 1;
        }
    }
    QueryPerformanceCounter(&b);
    QueryPerformanceFrequency(&f);
    printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);

    printf_s("Binary representation:\n");
    while (nBit--)
    {
        if (bits[nBit])
            printf_s("1");
        else
            printf_s("0");
    }
    return 0;
}

The inner loop compiles to and, add (as a left shift) and sar.
The second code section runs three times faster than the first one.


Why do three CPU instructions run faster than two?

Solution

I'm assuming you're using x86-64 MSVC CL19 (or something that makes similar code).

_bittest is slower because MSVC does a horrible job and keeps the value in memory, and bt [mem], reg is much slower than bt reg, reg. This is a compiler missed-optimization. It happens even if you make num a local variable instead of a global, even when the initializer is still a constant!

I included some perf analysis for Intel Sandybridge-family CPUs because they're common; you didn't say which CPU, and yes, it matters: bt [mem], reg has one per 3 cycle throughput on Ryzen, one per 5 cycle throughput on Haswell. And other perf characteristics differ...

(For just looking at the asm, it's usually a good idea to write a function with args, to get code the compiler can't do constant-propagation on. It can't in this case because it doesn't know whether anything modifies num before main runs, because it's not static.)
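
For instance, a minimal harness along these lines (the function name and signature are my own, not from the original) exposes the loop's codegen on Godbolt without any constant-propagation:

    #include <intrin.h>

    // Hypothetical harness: num arrives as an argument, so the compiler
    // can't constant-propagate its value into the loop body.
    void dump_bits(long num, unsigned char bits[32])
    {
        for (long nBit = 0; nBit < 31; nBit++)
            bits[nBit] = _bittest(&num, nBit);
    }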


Your instruction-counting didn't include the whole loop, so your counts are wrong, but more importantly you didn't consider the different costs of different instructions. (See Agner Fog's instruction tables and optimization manual.)

This is your whole inner loop with the _bittest intrinsic, with uop counts for Haswell/Skylake:

    for (nBit = 0; nBit < 31; nBit++) {
        bits[nBit] = _bittest(&num, nBit);
        //bits[nBit] = (bool)(num & (1UL << nBit));   // much more efficient
    }

Asm output from MSVC CL19 -Ox, on the Godbolt compiler explorer:

$LL7@main:
    bt       DWORD PTR num, ebx          ; 10 uops (microcoded), one per 5 cycle throughput
    lea      rcx, QWORD PTR [rcx+1]      ; 1 uop
    setb     al                          ; 1 uop
    inc      ebx                         ; 1 uop
    mov      BYTE PTR [rcx-1], al        ; 1 uop (micro-fused store-address and store-data)
    cmp      ebx, 31
    jb       SHORT $LL7@main             ; 1 uop (macro-fused with cmp)

That's 15 fused-domain uops, so it can issue (at 4 per clock) in 3.75 cycles. But that's not the bottleneck: Agner Fog's testing found that bt [mem], reg has a throughput of one per 5 clock cycles.

IDK why it's 3x slower than your other loop. Maybe the other ALU instructions compete for the same port as bt, or a data dependency it's part of causes a problem, or just being a micro-coded instruction is a problem, or maybe the outer loop is less efficient?

Anyway, using bt [mem], reg instead of bt reg, reg is a major missed optimization. With a 1-uop, 1c-latency, 2-per-clock-throughput bt r9d, ebx, this loop would have been faster than your other one.


The inner loop compiles to and, add (as a left shift) and sar.

Huh? Those are just the instructions MSVC associates with the curBit <<= 1; source line (even though that line is fully implemented by the add self,self, and the variable-count arithmetic right shift is part of a different line).

But the whole loop is this clunky mess:

    long curBit = 1;
    for (nBit = 0; nBit < 31; nBit++)  {
        bits[nBit] = (num&curBit) >> nBit;
        curBit <<= 1;
    }

$LL18@main:               # MSVC CL19  -Ox
    mov      ecx, ebx                  ; 1 uop
    lea      r8, QWORD PTR [r8+1]      ; 1 uop   pointer-increment for bits
    mov      eax, r9d                  ; 1 uop.  r9d holds num
    inc      ebx                       ; 1 uop
    and      eax, edx                  ; 1 uop
       # MSVC says all the rest of these instructions are from curBit <<= 1; but they're obviously not.
    add      edx, edx                  ; 1 uop
    sar      eax, cl                   ; 3 uops (variable-count shifts suck)
    mov      BYTE PTR [r8-1], al       ; 1 uop (micro-fused)
    cmp      ebx, 31
    jb       SHORT $LL18@main         ; 1 uop (macro-fused with cmp)

So this is 11 fused-domain uops, and it takes 2.75 clock cycles per iteration to issue from the front-end.

I don't see any loop-carried dep chains longer than that front-end bottleneck, so it probably runs about that fast.

Copying ebx to ecx every iteration instead of just using ecx as the loop counter (nBit) is an obvious missed optimization. The shift count has to be in cl for a variable-count shift (unless you enable BMI2 instructions, if MSVC can even do that).

There are major missed optimizations here (in the "fast" version), so you should probably write your source differently to hand-hold your compiler into making less-bad code. It implements the expression fairly literally instead of transforming it into something the CPU can do efficiently, or using bt reg, reg / setc.
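
For example, a rewrite along these lines (my sketch, not code from the original answer) works on a register copy and uses only shift-by-1 and AND, so every operation in the loop body is a cheap 1-uop instruction and no shift count ever has to be in cl:

    void dump_bits_arith(long num, unsigned char bits[32])
    {
        unsigned long tmp = (unsigned long)num;  // work on a register copy
        for (long nBit = 0; nBit < 31; nBit++)
        {
            bits[nBit] = tmp & 1;  // isolate the low bit: a 1-uop and
            tmp >>= 1;             // constant-count shift: a 1-uop shr
        }
    }

This is essentially the scalar shr / setc idea from the asm loop at the end of this answer, expressed in C.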


How to do this fast in asm or with intrinsics

Use SSE2/AVX. Get the right byte (the one containing the corresponding bit) into each byte element of a vector, and PANDN (to invert your vector) with a mask that has the right bit set for each element. Then PCMPEQB against zero. That gives you 0 / -1. To get ASCII digits, use _mm_sub_epi8(set1('0'), mask) to subtract 0 or -1 from ASCII '0' (i.e. add 0 or 1), conditionally turning it into '1'.

The first step of this (getting a vector of 0/-1 from a bitmask) is the fastest way to unpack 32 bits into a 32-byte SIMD vector (128b version). Without SSSE3 (pshufb), I think punpcklbw/punpcklwd (and maybe pshufd) is what you need to repeat each byte of num 8 times and make two 16-byte vectors.
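
As a concrete sketch of those two paragraphs (my own code, not from the original answer; the helper name is made up and it handles only the low 16 bits), using just the SSE2 instructions named above:

    #include <emmintrin.h>  // SSE2 intrinsics

    // Hypothetical helper: expand the low 16 bits of x into 16 ASCII
    // '0'/'1' bytes, bit 0 landing in out[0].
    static void bits16_to_ascii(unsigned x, char out[16])
    {
        __m128i v = _mm_cvtsi32_si128((int)x);   // bits live in the low 2 bytes
        v = _mm_unpacklo_epi8(v, v);             // each byte repeated 2x
        v = _mm_unpacklo_epi16(v, v);            // ... 4x
        v = _mm_unpacklo_epi32(v, v);            // ... 8x: byte0 x8, then byte1 x8
        const __m128i bit = _mm_set_epi8(        // one distinct bit per element
            (char)0x80, 0x40, 0x20, 0x10, 8, 4, 2, 1,
            (char)0x80, 0x40, 0x20, 0x10, 8, 4, 2, 1);
        __m128i inv   = _mm_andnot_si128(v, bit);                 // PANDN: zero where the bit was set
        __m128i mask  = _mm_cmpeq_epi8(inv, _mm_setzero_si128()); // -1 where set, 0 where clear
        __m128i ascii = _mm_sub_epi8(_mm_set1_epi8('0'), mask);   // '0' - (-1) = '1'
        _mm_storeu_si128((__m128i *)out, ascii);
    }

A full 32-bit num would need two such 16-byte vectors (or one 32-byte vector with AVX2).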

  • Is there an inverse instruction to the movemask instruction in intel avx2?
  • In scalar code, here is one way that runs at 1 bit->byte per clock. There are probably ways to do better without SSE2 (storing multiple bytes at once to get around the 1-store-per-clock bottleneck that exists on all current CPUs), but why bother? Just use SSE2.

      mov    eax, [num]
      lea    rdi, [rsp + xxx]  ; bits[]
    .loop:
        shr   eax, 1     ; constant-count shift is efficient (1 uop).  CF = last bit shifted out
        setc  [rdi]      ; 2 uops, but just as efficient as setc reg / mov [mem], reg
    
        shr   eax, 1
        setc  [rdi+1]
    
        add   rdi, 2
    cmp   rdi, end_pointer  ; compare against another register instead of a separate counter.
        jb   .loop
    

Unrolled by two to avoid bottlenecking on the front-end, so this can run at 1 bit per clock.
