最快64位的人口数量(汉明权重) [英] Fastest 64-bit population count (Hamming weight)

查看:595
本文介绍了最快64位的人口数量(汉明权重)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我不得不计算汉明权重为64位数据的比较快continious流量和使用 POPCNT 汇编指令引发了我的异常OM我的英特尔酷睿i7- 4650U。

I had to calculate the Hamming weight for a quite fast continious flow of 64-bit data and using the popcnt assembly instruction throws me a exception om my Intel Core i7-4650U.

我检查我的圣经黑客的喜悦和扫描网络各种算法(它是一堆在那里,因为他们开始在解决计算的诞生这个问题)。

I checked my bible Hacker's delight and scanned the web for all kinds of algorithms (it's a bunch out there since they started tackling this 'problem' at the birth of computing).

我度过了周末,与我自己的一些想法玩耍,并与这些算法,在那里我几乎在我可以进出CPU的移动数据的速度走了过来。

I spent the weekend playing around with some ideas of my own and came up with these algorithms, where I'm almost at the speed I can move data in and out of the CPU.

    //64-bit popcnt using BMI2
_popcnt_bmi2:
        mov         (%rdi),%r11
        pext        %r11,%r11,%r11
        not         %r11
        tzcnt       %r11,%r11
        mov         %r11,(%rdx)
        add         $8h,%rdi
        add         $8h,%rdx
        dec         %rsi
        jnz         _popcnt_bmi2
        ret

在上面的code我用 PEXT (BMI2)即输入的数据是使用本身作为面具。然后所有位现有将(本身重新)折叠开始在结果寄存器中的至少显著位。然后,我需要计算倒塌的比特数,所以我所有的反转位,然后使用 tzcnt 来算的,现在的零数。我认为这是一个相当不错的主意。

In the above code I use pext (BMI2) where the incoming data is using itself as the mask. Then all bits existing will collapse starting with the least significant bit in the result register (itself again). Then I need to calculate the number of collapsed bits so I invert all bits then use tzcnt to count the number of, now zeroes. I thought it was a quite nice idea.

然后我也尝试了AVX2方式:

Then I also tried a AVX2 approach:

//64-bit popcnt using AVX2
_popcnt_avx2:
        vmovdqa     (%rcx),%ymm2
        add         $20h,%rcx
        vmovdqa     (%rcx),%ymm3
        add         $20h,%rcx
        vmovdqa     (%rcx),%ymm4
popcnt_avx2_loop:
        vmovdqa     (%rdi),%ymm0
        vpand       %ymm0, %ymm2, %ymm1
        vpandn      %ymm0, %ymm2, %ymm0
        vpsrld      $4h,%ymm0, %ymm0
        vpshufb     %ymm1, %ymm3, %ymm1
        vpshufb     %ymm0, %ymm3, %ymm0
        vpaddb      %ymm1,%ymm0,%ymm0       //popcnt (8-bits)
        vpsadbw     %ymm0,%ymm4,%ymm0       //popcnt (64-bits)
        vmovdqa     %ymm0,(%rdx)
        add         $20h,%rdi
        add         $20h,%rdx
        dec         %rsi
        jnz         popcnt_avx2_loop

在我读32个字节的AVX2话,那么屏蔽掉半字节( ymm2 ),然后使用 ymm3 为位计数半字节查找表。然后我加入的结果,8位的,然后我用凝聚了超 vpsadbw 8个字节增加至64位的值( ymm4 = 0)。

In the AVX2 case I read 32 bytes, then mask out the nibbles (ymm2), then I use ymm3 as a look up table for bit counting the nibbles. Then I add the results to 8-bit's, and then I use the super condensed vpsadbw to add 8 bytes to a 64-bit value (ymm4 = 0).

任何人有更快的东西了他们sleves?

Anyone got something faster up their sleves?

编辑:

有故障的 POPCNT ,是由于对我在code做了一个错误,该函数的作品奥姆我的英特尔酷睿i7-4650U。请参阅下面显示的板凳结果我的职务。

The failing POPCNT was due to to a error I made in my code, that function works om my Intel Core i7-4650U. Please see my post below displaying the bench results.

推荐答案

OK得出的结论,这是没有试图成为智能的想法,我坐在板凳上:

OK came to the conclusion that it was no idea of trying to be 'smart', I benched:

内置内在popcount: _mm_popcnt_u64

the built in intrinsic popcount: _mm_popcnt_u64

bmi2: __ tzcnt_u64(〜_pext_u64(数据[I],数据[I]));
对三名汇编函数

bmi2:__tzcnt_u64(~_pext_u64(data[i],data[i])); against three assembler functions

POPCNT,bmi2和AVX2。

popcnt, bmi2 and avx2.

他们都在,你可以在移动存储器和速度用完我的:

They all run at the speed you can move memory in and out of my:

cat /proc/cpuinfo

- 英特尔(R)至强(R)CPU E3-1275 V3 @ 3.50GHz

-Intel(R) Xeon(R) CPU E3-1275 v3 @ 3.50GHz

FYI:

main.c中:

// Hamming weight bench

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>
#include <smmintrin.h>
#include <immintrin.h>
#include <x86intrin.h>
#include <math.h>

#define DISPLAY_HEIGHT  4
#define DISPLAY_WIDTH   32
#define NUM_DATA_OBJECTS  40000000
#define ITTERATIONS 20

// The source data (+32 to avoid the quantization out of memory problem)
__attribute__ ((aligned(32))) static long long unsigned data[NUM_DATA_OBJECTS+32]={};
__attribute__ ((aligned(32))) static long long unsigned data_out[NUM_DATA_OBJECTS+32]={};
__attribute__ ((aligned(32))) static unsigned char k1[32*3]={
    0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,
    0x00,0x01,0x01,0x02,0x01,0x02,0x02,0x03,0x01,0x02,0x02,0x03,0x02,0x03,0x03,0x04,0x00,0x01,0x01,0x02,0x01,0x02,0x02,0x03,0x01,0x02,0x02,0x03,0x02,0x03,0x03,0x04,
    0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00
};


extern "C" {
void popcnt_popcnt(long long unsigned[],unsigned int,long long unsigned[]);
void popcnt_bmi2(long long unsigned[],unsigned int,long long unsigned[]);
void popcnt_avx2(long long unsigned[],unsigned int,long long unsigned[],unsigned char[]);
}

void populate_data()
{
    for(unsigned int i = 0; i < NUM_DATA_OBJECTS; i++)
    {
        data[i] = rand();
    }
}

void display_source_data()
{
    printf ("\r\nData in(start):\r\n");
    for (unsigned int j = 0; j < DISPLAY_HEIGHT; j++)
    {
        for (unsigned int i = 0; i < DISPLAY_WIDTH; i++)
        {
            printf ("0x%02llux,",data[i+(j*DISPLAY_WIDTH)]);
        }
        printf ("\r\n");
    }
}

void bench_popcnt()
{
    for(unsigned int i = 0; i < NUM_DATA_OBJECTS; i++)
    {
        data_out[i] = _mm_popcnt_u64(data[i]);
    }
}

void bench_move_data_memcpy()
{
    memcpy(data_out,data,NUM_DATA_OBJECTS*8);
}

// __tzcnt64 ??
void bench_bmi2()
{
    for(unsigned int i = 0; i < NUM_DATA_OBJECTS; i++)
    {
        data_out[i]=__tzcnt_u64(~_pext_u64(data[i],data[i]));
    }
}

void display_dest_data()
{
    printf ("\r\nData out:\r\n");
    for (unsigned int j = 0; j < DISPLAY_HEIGHT; j++)
    {
        for (unsigned int i = 0; i < DISPLAY_WIDTH; i++)
        {
            printf ("0x%02llux,",data_out[i+(j*DISPLAY_WIDTH)]);
        }
        printf ("\r\n");
    }
}


int main() {
    struct timeval t0;
    struct timeval t1;
    long elapsed[ITTERATIONS]={0};
    long avrg=0;

    for (unsigned int i = 0; i < ITTERATIONS; i++)
    {
        populate_data();
        //    display_source_data();
        gettimeofday(&t0, 0);
        bench_move_data_memcpy();
        gettimeofday(&t1, 0);
        elapsed[i]= (((t1.tv_sec-t0.tv_sec)*1000000 + t1.tv_usec-t0.tv_usec)/1000);
        printf ("Time_to_move_data_without_processing: %ld\n",elapsed[i]);
    }

    avrg=0;
    for (unsigned int i = 1; i < ITTERATIONS; i++){
        avrg+=elapsed[i];
    }
    printf ("Average time_to_move_data: %ld\n",avrg/(ITTERATIONS-1));

    //display_dest_data();

    for (unsigned int i = 0; i < ITTERATIONS; i++)
    {
        populate_data();
        //    display_source_data();
        gettimeofday(&t0, 0);
        bench_popcnt();
        gettimeofday(&t1, 0);
        elapsed[i] = ((t1.tv_sec-t0.tv_sec)*1000000 + t1.tv_usec-t0.tv_usec)/1000;
        printf ("popcnt: %ld\n",elapsed[i]);
    }

    avrg=0;
    for (unsigned int i = 1; i < ITTERATIONS; i++){
        avrg+=elapsed[i];
    }
    printf ("Average popcnt: %ld\n",avrg/(ITTERATIONS-1));

    //display_dest_data();

    for (unsigned int i = 0; i < ITTERATIONS; i++)
    {
        populate_data();
        //    display_source_data();
        gettimeofday(&t0, 0);
        bench_bmi2();
        gettimeofday(&t1, 0);
        elapsed[i] = ((t1.tv_sec-t0.tv_sec)*1000000 + t1.tv_usec-t0.tv_usec)/1000;
        printf ("bmi2: %ld\n",elapsed[i]);
    }

    avrg=0;
    for (unsigned int i = 1; i < ITTERATIONS; i++){
        avrg+=elapsed[i];
    }
    printf ("Average bmi2: %ld\n",avrg/(ITTERATIONS-1));

    //display_dest_data();


    printf ("Now test the assembler functions\n");

    for (unsigned int i = 0; i < ITTERATIONS; i++)
    {
        populate_data();
        //    display_source_data();
        gettimeofday(&t0, 0);
        popcnt_popcnt(data,NUM_DATA_OBJECTS,data_out);
        gettimeofday(&t1, 0);
        elapsed[i] = ((t1.tv_sec-t0.tv_sec)*1000000 + t1.tv_usec-t0.tv_usec)/1000;
        printf ("popcnt_asm: %ld\n",elapsed[i]);
    }

    avrg=0;
    for (unsigned int i = 1; i < ITTERATIONS; i++){
        avrg+=elapsed[i];
    }
    printf ("Average popcnt_asm: %ld\n",avrg/(ITTERATIONS-1));

    //display_dest_data();

    for (unsigned int i = 0; i < ITTERATIONS; i++)
    {
        populate_data();
        //    display_source_data();
        gettimeofday(&t0, 0);
        popcnt_bmi2(data,NUM_DATA_OBJECTS,data_out);
        gettimeofday(&t1, 0);
        elapsed[i] = ((t1.tv_sec-t0.tv_sec)*1000000 + t1.tv_usec-t0.tv_usec)/1000;
        printf ("bmi2_asm: %ld\n",elapsed[i]);
    }

    avrg=0;
    for (unsigned int i = 1; i < ITTERATIONS; i++){
        avrg+=elapsed[i];
    }
    printf ("Average bmi2_asm: %ld\n",avrg/(ITTERATIONS-1));

    //display_dest_data();

    for (unsigned int i = 0; i < ITTERATIONS; i++)
    {
        populate_data();
        //    display_source_data();
        gettimeofday(&t0, 0);
        popcnt_avx2(data,(unsigned int)ceil((NUM_DATA_OBJECTS*8)/32.0),data_out,k1);
        gettimeofday(&t1, 0);
        elapsed[i] = ((t1.tv_sec-t0.tv_sec)*1000000 + t1.tv_usec-t0.tv_usec)/1000;
        printf ("avx2_asm: %ld\n",elapsed[i]);
    }

    avrg=0;
    for (unsigned int i = 1; i < ITTERATIONS; i++){
        avrg+=elapsed[i];
    }
    printf ("Average avx2_asm: %ld\n",avrg/(ITTERATIONS-1));

    //display_dest_data();

    return 0;
}

该engine.s

The engine.s

//
//  avx2_bmi2_popcnt bench
//

.global popcnt_bmi2 , popcnt_avx2, popcnt_popcnt
.align 2

//64-bit popcnt using the built-in popcnt instruction
popcnt_popcnt:
        popcntq     (%rdi), %r11
        mov         %r11,(%rdx)
        add         $8,%rdi
        add         $8,%rdx
        dec         %rsi
        jnz         popcnt_popcnt
        ret

//64-bit popcnt using BMI2
popcnt_bmi2:
        mov         (%rdi),%r11
        pextq       %r11,%r11,%r11
        not         %r11
        tzcnt       %r11,%r11
        mov         %r11,(%rdx)
        add         $8,%rdi
        add         $8,%rdx
        dec         %rsi
        jnz         popcnt_bmi2
        ret

//64-bit popcnt using AVX2
popcnt_avx2:
        vmovdqa     (%rcx),%ymm2
        add         $0x20,%rcx
        vmovdqa     (%rcx),%ymm3
        add         $0x20,%rcx
        vmovdqa     (%rcx),%ymm4
popcnt_avx2_loop:
        vmovdqa     (%rdi),%ymm0
        vpand       %ymm0, %ymm2, %ymm1
        vpandn      %ymm0, %ymm2, %ymm0
        vpsrld      $4,%ymm0, %ymm0
        vpshufb     %ymm1, %ymm3, %ymm1
        vpshufb     %ymm0, %ymm3, %ymm0
        vpaddb      %ymm1,%ymm0,%ymm0
        vpsadbw     %ymm0,%ymm4,%ymm0
        vmovdqa     %ymm0,(%rdx)
        add         $0x20,%rdi
        add         $0x20,%rdx
        dec         %rsi
        jnz         popcnt_avx2_loop
        ret

编译来源:

G ++ -march =本地-mavx -mpopcnt -O3的main.c engine.s

在CPU设置为性能:

CPU频率设定-g性能

运行替补:

须藤CHRT -r 10 ./a.out

结果:

平均time_to_move_data:61

平均POPCNT:61

平均bmi2:61

现在测试汇编函数

平均popcnt_asm:61

平均bmi2_asm:61

平均avx2_asm:61

这篇关于最快64位的人口数量(汉明权重)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆