What is the fastest way to return the positions of all set bits in a 64-bit integer?


Question


I need a fast way to get the positions of all one bits in a 64-bit integer. For example, given x = 123703, I'd like to fill an array idx[] = {0, 1, 2, 4, 5, 8, 9, 13, 14, 15, 16}. We can assume we know the number of bits a priori. This will be called 10^12 - 10^15 times, so speed is of the essence. The fastest answer I've come up with so far is the following monstrosity, which uses each byte of the 64-bit integer as an index into tables that give the number of bits set in that byte and the positions of the ones:

int64_t x;            // this is the input
unsigned char idx[K]; // this is the array of K bits that are set
unsigned char *dst=idx, *src;
unsigned char zero, one, two, three, four, five;  // these hold the 0th-5th bytes
zero  =  x & 0x0000000000FFUL;
one   = (x & 0x00000000FF00UL) >> 8;
two   = (x & 0x000000FF0000UL) >> 16;
three = (x & 0x0000FF000000UL) >> 24;
four  = (x & 0x00FF00000000UL) >> 32;
five  = (x & 0xFF0000000000UL) >> 40;
src=tab0+tabofs[zero ]; COPY(dst, src, n[zero ]);
src=tab1+tabofs[one  ]; COPY(dst, src, n[one  ]);
src=tab2+tabofs[two  ]; COPY(dst, src, n[two  ]);
src=tab3+tabofs[three]; COPY(dst, src, n[three]);
src=tab4+tabofs[four ]; COPY(dst, src, n[four ]);
src=tab5+tabofs[five ]; COPY(dst, src, n[five ]);

where COPY is a switch statement to copy up to 8 bytes, n is an array of the number of bits set in a byte, and tabofs gives the offset into tabX, which holds the positions of the set bits in the X-th byte. This is about 3x faster than unrolled loop-based methods with __builtin_ctz() on my Xeon E5-2609. (See below.) I am currently iterating x in lexicographical order for a given number of bits set.
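For reference, here is a minimal sketch of how such per-byte tables could be precomputed. The build_tables name and the fixed 8-entry stride are my assumptions, since the original table-building code is not shown; only tab0 and tab1 are spelled out, and the remaining tables follow the same pattern with larger offsets.

static unsigned char n[256];        /* number of bits set in each byte value     */
static unsigned int  tabofs[256];   /* offset of each byte value's entry in tabX */
static unsigned char tab0[256 * 8]; /* set-bit positions for byte 0 (bits 0-7)   */
static unsigned char tab1[256 * 8]; /* set-bit positions for byte 1 (bits 8-15)  */
/* tab2..tab5 are built the same way with offsets 16, 24, 32 and 40 */

static void build_tables(void) {
    for (int b = 0; b < 256; b++) {
        tabofs[b] = 8 * b;   /* fixed stride of 8 slots per byte value, so   */
        n[b] = 0;            /* 8-byte block copies never run off the table  */
        for (int bit = 0; bit < 8; bit++) {
            if (b & (1 << bit)) {
                tab0[tabofs[b] + n[b]] = (unsigned char)bit;
                tab1[tabofs[b] + n[b]] = (unsigned char)(bit + 8);
                n[b]++;
            }
        }
    }
}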

Is there a better way?

EDIT: Added an example (that I have subsequently fixed). Full code is available here: http://pastebin.com/79X8XL2P . Note: GCC with -O2 seems to optimize it away, but Intel's compiler (which I used to compose it) doesn't...
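One common guard against that kind of dead-code elimination (my own sketch, not part of the pastebin code) is to fold every extracted index into a volatile sink, so the compiler cannot prove the extraction unused:

#include <stdint.h>

volatile uint64_t sink;   /* stores to this cannot be optimized away */

/* Hypothetical helper: call this on each produced idx[] so the extraction
   feeding it is kept alive by the optimizer. */
static inline void consume(const unsigned char *idx, int k) {
    uint64_t acc = 0;
    for (int i = 0; i < k; i++)
        acc += idx[i];
    sink += acc;
}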

Also, let me give some additional background to address some of the comments below. The goal is to perform a statistical test on every possible subset of K variables out of a universe of N possible explanatory variables; the specific target right now is N=41, but I can see some projects needing N up to 45-50. The test basically involves factorizing the corresponding data submatrix. In pseudocode, something like this:

double doTest(double *data, int64_t model) {
  int nidx, idx[];
  double submatrix[][];
  nidx = getIndices(model, idx);  // get the locations of ones in model
  // copy data into submatrix
  for(int i=0; i<nidx; i++) {
    for(int j=0; j<nidx; j++) {
      submatrix[i][j] = data[idx[i]][idx[j]];
    }
  }
  factorize(submatrix, nidx);
  return the_answer;
}

I coded up a version of this for an Intel Phi board that should complete the N=41 case in about 15 days, of which ~5-10% of the time is spent in a naive getIndices() so right off the bat a faster version could save a day or more. I'm working on an implementation for NVidia Kepler too, but unfortunately the problem I have (ludicrous numbers of small matrix operations) is not ideally suited to the hardware (ludicrously large matrix operations). That said, this paper presents a solution that seems to achieve hundreds of GFLOPS/s on matrices of my size by aggressively unrolling loops and performing the entire factorization in registers, with the caveat that the dimensions of the matrix be defined at compile-time. (This loop unrolling should help reduce overhead and improve vectorization in the Phi version too, so getIndices() will become more important!) So now I'm thinking my kernel should look more like:

double *data;  // move data to GPU/Phi once into shared memory
template<unsigned int K> double doTestUnrolled(int *idx) {
  double submatrix[K][K];
  // copy data into submatrix
  #pragma unroll
  for(int i=0; i<K; i++) {
    #pragma unroll
    for(int j=0; j<K; j++) {
      submatrix[i][j] = data[idx[i]][idx[j]];
    }
  }
  factorizeUnrolled<K>(submatrix);
  return the_answer;
}

The Phi version solves each model in a cilk_for loop from model=0 to 2^N (or, rather, a subset for testing), but now in order to batch work for the GPU and amortize the kernel launch overhead I have to iterate model numbers in lexicographical order for each of K=1 to 41 bits set (as doynax noted).
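For completeness, one standard way to visit all words with exactly K bits set in increasing order is Gosper's hack (HAKMEM item 175). The sketch below, with a function name of my choosing, matches the lexicographic iteration described above, though whether the pastebin code uses this exact formulation is not shown here.

#include <stdint.h>

/* Gosper's hack: return the next integer above v with the same number of
   set bits. Start from (1ULL << K) - 1 and stop once a bit reaches
   position N. */
static inline uint64_t next_same_popcount(uint64_t v) {
    uint64_t c = v & -v;              /* lowest set bit                    */
    uint64_t r = v + c;               /* carry it into the next run of 0s  */
    return r | (((v ^ r) >> 2) / c);  /* re-deposit the displaced low bits */
}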

EDIT 2: Now that vacation is over, here are some results on my Xeon E5-2602 using icc version 15. The code that I used to benchmark is here: http://pastebin.com/XvrGQUat. I perform the bit extraction on integers that have exactly K bits set, so there is some overhead for the lexicographic iteration measured in the "Base" column in the table below. These are performed 2^30 times with N=48 (repeating as necessary).

"CTZ" is a loop that uses the the gcc intrinsic __builtin_ctzll to get the lowest order bit set:

for(int i=0; i<K; i++) {
    idx[i] = __builtin_ctzll(tmp);
    lb = tmp & -tmp;    // get lowest bit
    tmp ^= lb;      // remove lowest bit from tmp
} 

Mark is Mark's branchless for loop:

for(int i=0; i<K; i++) {
    *dst = i;
    dst += x & 1;
    x >>= 1;
} 

Tab1 is my original table-based code with the following copy macro:

#define COPY(d, s, n) \
switch(n) { \
case 8: *(d++) = *(s++); \
case 7: *(d++) = *(s++); \
case 6: *(d++) = *(s++); \
case 5: *(d++) = *(s++); \
case 4: *(d++) = *(s++); \
case 3: *(d++) = *(s++); \
case 2: *(d++) = *(s++); \
case 1: *(d++) = *(s++); \
case 0: break;        \
}

Tab2 is the same code as Tab1, but the copy macro just moves 8 bytes as a single copy (taking ideas from doynax and Lưu Vĩnh Phúc... but note this does not ensure alignment):

#define COPY2(d, s, n) { *((uint64_t *)d) = *((uint64_t *)s); d+=n; }
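If the unaligned, type-punning store in COPY2 is a concern, a fixed-size memcpy expresses the same single 8-byte move while staying well-defined; this COPY3 variant is my own sketch, not one of the benchmarked versions:

#include <string.h>

/* Same trick as COPY2, but the fixed 8-byte memcpy avoids the unaligned
   uint64_t cast; optimizing compilers turn it into one 8-byte load/store. */
#define COPY3(d, s, n) do { memcpy((d), (s), 8); (d) += (n); } while (0)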

Here are the results. I guess my initial claim that Tab1 is 3x faster than CTZ only holds for large K (where I was testing). Mark's loop is faster than my original code, but getting rid of the branches via the COPY2 macro takes the cake for K > 8.

 K    Base    CTZ   Mark   Tab1   Tab2
001  4.97s  6.42s  6.66s 18.23s 12.77s
002  4.95s  8.49s  7.28s 19.50s 12.33s
004  4.95s  9.83s  8.68s 19.74s 11.92s
006  4.95s 16.86s  9.53s 20.48s 11.66s
008  4.95s 19.21s 13.87s 20.77s 11.92s
010  4.95s 21.53s 13.09s 21.02s 11.28s
015  4.95s 32.64s 17.75s 23.30s 10.98s
020  4.99s 42.00s 21.75s 27.15s 10.96s
030  5.00s 100.64s 35.48s 35.84s 11.07s
040  5.01s 131.96s 44.55s 44.51s 11.58s

Solution

I believe the key to performance here is to focus on the larger problem rather than on micro-optimizing the extraction of bit positions out of a random integer.

Judging by your sample code and the previous SO question, you are enumerating all words with K bits set in order, and extracting the bit indices out of these. This greatly simplifies matters.

If so, then instead of rebuilding the bit positions each iteration, try directly incrementing the positions in the bit array. Half of the time this will involve a single loop iteration and increment.

Something along these lines:

#include <stdio.h>   /* printf, putchar */
#include <stddef.h>  /* size_t */

void process(const unsigned int *bits, size_t num);  /* defined below */

// Walk through all len-bit words with num bits set, in order
void enumerate(size_t num, size_t len) {
    size_t i;
    unsigned int bitpos[64 + 1];

    // Seed with the lowest word plus a sentinel
    for(i = 0; i < num; ++i)
        bitpos[i] = i;
    bitpos[i] = 0;

    // Here goes the main loop
    do {
        // Do something with the resulting data
        process(bitpos, num);

        // Increment the least-significant series of consecutive bits
        for(i = 0; bitpos[i + 1] == bitpos[i] + 1; ++i)
            bitpos[i] = i;
    // Stop on reaching the top
    } while(++bitpos[i] != len);
}

// Test function
void process(const unsigned int *bits, size_t num) {
    do
        printf("%d ", bits[--num]);
    while(num);
    putchar('\n');
}

Not particularly optimized but you get the general idea.
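As a quick check (my own example, not part of the answer), enumerating 3 bits out of 5 visits all C(5,3) = 10 words, printing each index set from highest to lowest position:

int main(void) {
    /* prints "2 1 0", "3 1 0", "3 2 0", "3 2 1", "4 1 0", ... "4 3 2" */
    enumerate(3, 5);
    return 0;
}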
