使用AVX而不是AVX2，通过许多64位位掩码分别计算每个位的位置 [英] Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2

查看：137 发布时间：2020/5/21 20:28:50 c optimization x86 x86-64 simd

本文介绍了使用AVX而不是AVX2，通过许多64位位掩码分别计算每个位的位置的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

(相关:

(Related: How to quickly count bits into separate bins in a series of ints on Sandy Bridge? is an earlier duplicate of this, with some different answers. Editor's note: the answers here are probably better.

此外，这是类似问题的AVX2版本，整个行中的许多容器都比一个uint64_t宽得多:

Also, an AVX2 version of a similar problem, with many bins for a whole row of bits much wider than one uint64_t: Improve column population count algorithm)

我正在C语言中的一个项目中，我需要遍历数百万个掩码(ulong类型(64位))，并更新基于64个短整数(uint16)的数组(称为target)在一个简单的规则上:

I am working on a project in C where I need to go through tens of millions of masks (of type ulong (64-bit)) and update an array (called target) of 64 short integers (uint16) based on a simple rule:

// for any given mask, do the following loop
for (i = 0; i < 64; i++) {
    if (mask & (1ull << i)) {
        target[i]++
    }
}

问题是我需要在数以百万计的蒙版上执行上述循环，并且我需要在不到一秒钟的时间内完成.想知道是否有任何方法可以加快它的速度，例如使用某种表示上述循环的特殊汇编指令.

The problem is that I need do the above loops on tens of millions of masks and I need to finish in less than a second. Wonder if there are any way to speed it up, like using some sort special assembly instruction that represents the above loop.

当前，我在ubuntu 14.04(i7-2670QM，支持AVX，不支持AVX2)上使用gcc 4.8.4来编译并运行以下代码，耗时约2秒钟.很想让它在200毫秒内运行.

Currently I use gcc 4.8.4 on ubuntu 14.04 (i7-2670QM, supporting AVX, not AVX2) to compile and run the following code and took about 2 seconds. Would love to make it run under 200ms.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}
unsigned int target[64];

int main(int argc, char *argv[]) {
    int i, j;
    unsigned long x = 123;
    unsigned long m = 1;
    char *p = malloc(8 * 10000000);
    if (!p) {
        printf("failed to allocate\n");
        exit(0);
    }
    memset(p, 0xff, 80000000);
    printf("p=%p\n", p);
    unsigned long *pLong = (unsigned long*)p;
    double start = getTS();
    for (j = 0; j < 10000000; j++) {
        m = 1;
        for (i = 0; i < 64; i++) {
            if ((pLong[j] & m) == m) {
                target[i]++;
            }
            m = (m << 1);
        }
    }
    printf("took %f secs\n", getTS() - start);
    return 0;
}

提前谢谢！

推荐答案

在我的系统上，一个装有clang-900.0.39.2 -O3的4年旧MacBook(2.7 GHz英特尔酷睿i5)，您的代码运行时间为500毫秒.

On my system, a 4 year old MacBook (2.7 GHz intel core i5) with clang-900.0.39.2 -O3, your code runs in 500ms.

只需将内部测试更改为if ((pLong[j] & m) != 0)即可节省30％的时间，运行时间为350ms.

Just changing the inner test to if ((pLong[j] & m) != 0) saves 30%, running in 350ms.

无需测试即可将内部简化为target[i] += (pLong[j] >> i) & 1;，将其降低到280ms.

Further simplifying the inner part to target[i] += (pLong[j] >> i) & 1; without a test brings it down to 280ms.

进一步的改进似乎需要更高级的技术，例如将这些位拆成8个ulong的块，并并行添加，一次处理255个ulong.

Further improvements seem to require more advanced techniques such as unpacking the bits into blocks of 8 ulongs and adding those in parallel, handling 255 ulongs at a time.

这是使用此方法的改进版本.它在我的系统上以45毫秒的速度运行.

Here is an improved version using this method. it runs in 45ms on my system.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

int main(int argc, char *argv[]) {
    unsigned int target[64] = { 0 };
    unsigned long *pLong = malloc(sizeof(*pLong) * 10000000);
    int i, j;

    if (!pLong) {
        printf("failed to allocate\n");
        exit(1);
    }
    memset(pLong, 0xff, sizeof(*pLong) * 10000000);
    printf("p=%p\n", (void*)pLong);
    double start = getTS();
    uint64_t inflate[256];
    for (i = 0; i < 256; i++) {
        uint64_t x = i;
        x = (x | (x << 28));
        x = (x | (x << 14));
        inflate[i] = (x | (x <<  7)) & 0x0101010101010101ULL;
    }
    for (j = 0; j < 10000000 / 255 * 255; j += 255) {
        uint64_t b[8] = { 0 };
        for (int k = 0; k < 255; k++) {
            uint64_t u = pLong[j + k];
            for (int kk = 0; kk < 8; kk++, u >>= 8)
                b[kk] += inflate[u & 255];
        }
        for (i = 0; i < 64; i++)
            target[i] += (b[i / 8] >> ((i % 8) * 8)) & 255;
    }
    for (; j < 10000000; j++) {
        uint64_t m = 1;
        for (i = 0; i < 64; i++) {
            target[i] += (pLong[j] >> i) & 1;
            m <<= 1;
        }
    }
    printf("target = {");
    for (i = 0; i < 64; i++)
        printf(" %d", target[i]);
    printf(" }\n");
    printf("took %f secs\n", getTS() - start);
    return 0;
}

在答案中调查并解释了将字节扩展为64位长的技术: https://stackoverflow.com /a/55059914/4593267 .我将target数组和inflate数组都设置为局部变量，并打印结果以确保编译器不会优化计算.在生产版本中，您将分别计算inflate数组.

The technique for inflating a byte to a 64-bit long are investigated and explained in the answer: https://stackoverflow.com/a/55059914/4593267 . I made the target array a local variable, as well as the inflate array, and I print the results to ensure the compiler will not optimize the computations away. In a production version you would compute the inflate array separately.

直接使用SIMD可能会带来进一步的改进，但会降低可移植性和可读性.这种优化通常最好交给编译器，因为它可以为目标体系结构生成特定的代码.除非性能至关重要，并且基准测试证明这是一个瓶颈，否则我将始终支持通用解决方案.

Using SIMD directly might provide further improvements at the expense of portability and readability. This kind of optimisation is often better left to the compiler as it can generate specific code for the target architecture. Unless performance is critical and benchmarking proves this to be a bottleneck, I would always favor a generic solution.

njuffa提供的另一种解决方案无需预先计算的阵列即可提供类似的性能.根据您的编译器和硬件的具体情况，它可能会更快.

A different solution by njuffa provides similar performance without the need for a precomputed array. Depending on your compiler and hardware specifics, it might be faster.

这篇关于使用AVX而不是AVX2，通过许多64位位掩码分别计算每个位的位置的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

使用AVX而不是AVX2，通过许多64位位掩码分别计算每个位的位置 [英] Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

使用AVX而不是AVX2，通过许多64位位掩码分别计算每个位的位置 [英] Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭