在许多 64 位位掩码上分别计算每个位位置,使用 AVX 而不是 AVX2 [英] Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2

查看:35
本文介绍了在许多 64 位位掩码上分别计算每个位位置,使用 AVX 而不是 AVX2的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

(相关:如何在 Sandy Bridge 上的一系列整数中快速将位计数到单独的 bin 中? 是此问题的早期副本,有一些不同的答案.编者注:这里的答案可能更好.

(Related: How to quickly count bits into separate bins in a series of ints on Sandy Bridge? is an earlier duplicate of this, with some different answers. Editor's note: the answers here are probably better.

此外,还有一个类似问题的 AVX2 版本,一整排位的多个 bin 比一个宽得多 uint64_t:改进列人口计数算法)

Also, an AVX2 version of a similar problem, with many bins for a whole row of bits much wider than one uint64_t: Improve column population count algorithm)

我正在用 C 语言处理一个项目,我需要通过数千万个掩码(ulong 类型(64 位))并更新一个 64 短的数组(称为 target)基于简单规则的整数 (uint16):

I am working on a project in C where I need to go through tens of millions of masks (of type ulong (64-bit)) and update an array (called target) of 64 short integers (uint16) based on a simple rule:

// for any given mask, do the following loop
for (i = 0; i < 64; i++) {
    if (mask & (1ull << i)) {
        target[i]++
    }
}

问题是我需要在数千万个面具上进行上述循环,而且我需要在不到一秒的时间内完成.想知道是否有任何方法可以加快速度,例如使用某种表示上述循环的特殊汇编指令.

The problem is that I need do the above loops on tens of millions of masks and I need to finish in less than a second. Wonder if there are any way to speed it up, like using some sort special assembly instruction that represents the above loop.

目前我在 ubuntu 14.04(i7-2670QM,支持 AVX,不支持 AVX2)上使用 gcc 4.8.4 编译并运行以下代码,耗时约 2 秒.希望让它在 200 毫秒内运行.

Currently I use gcc 4.8.4 on ubuntu 14.04 (i7-2670QM, supporting AVX, not AVX2) to compile and run the following code and took about 2 seconds. Would love to make it run under 200ms.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}
unsigned int target[64];

int main(int argc, char *argv[]) {
    int i, j;
    unsigned long x = 123;
    unsigned long m = 1;
    char *p = malloc(8 * 10000000);
    if (!p) {
        printf("failed to allocate
");
        exit(0);
    }
    memset(p, 0xff, 80000000);
    printf("p=%p
", p);
    unsigned long *pLong = (unsigned long*)p;
    double start = getTS();
    for (j = 0; j < 10000000; j++) {
        m = 1;
        for (i = 0; i < 64; i++) {
            if ((pLong[j] & m) == m) {
                target[i]++;
            }
            m = (m << 1);
        }
    }
    printf("took %f secs
", getTS() - start);
    return 0;
}

提前致谢!

推荐答案

在我的系统上,使用 clang-900.0.39.2 -O3 的 4 年旧 MacBook(2.7 GHz intel core i5),你的代码运行时间为 500 毫秒.

On my system, a 4 year old MacBook (2.7 GHz intel core i5) with clang-900.0.39.2 -O3, your code runs in 500ms.

只需将内部测试更改为 if ((pLong[j] & m) != 0) 节省 30%,运行时间为 350 毫秒.

Just changing the inner test to if ((pLong[j] & m) != 0) saves 30%, running in 350ms.

将内部进一步简化为target[i] += (pLong[j] >> i) &1; 没有经过测试将其降低到 280 毫秒.

Further simplifying the inner part to target[i] += (pLong[j] >> i) & 1; without a test brings it down to 280ms.

进一步的改进似乎需要更先进的技术,例如将位解包为 8 个 ulong 的块并并行添加,一次处理 255 个 ulong.

Further improvements seem to require more advanced techniques such as unpacking the bits into blocks of 8 ulongs and adding those in parallel, handling 255 ulongs at a time.

这是使用此方法的改进版本.它在我的系统上运行 45 毫秒.

Here is an improved version using this method. it runs in 45ms on my system.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

int main(int argc, char *argv[]) {
    unsigned int target[64] = { 0 };
    unsigned long *pLong = malloc(sizeof(*pLong) * 10000000);
    int i, j;

    if (!pLong) {
        printf("failed to allocate
");
        exit(1);
    }
    memset(pLong, 0xff, sizeof(*pLong) * 10000000);
    printf("p=%p
", (void*)pLong);
    double start = getTS();
    uint64_t inflate[256];
    for (i = 0; i < 256; i++) {
        uint64_t x = i;
        x = (x | (x << 28));
        x = (x | (x << 14));
        inflate[i] = (x | (x <<  7)) & 0x0101010101010101ULL;
    }
    for (j = 0; j < 10000000 / 255 * 255; j += 255) {
        uint64_t b[8] = { 0 };
        for (int k = 0; k < 255; k++) {
            uint64_t u = pLong[j + k];
            for (int kk = 0; kk < 8; kk++, u >>= 8)
                b[kk] += inflate[u & 255];
        }
        for (i = 0; i < 64; i++)
            target[i] += (b[i / 8] >> ((i % 8) * 8)) & 255;
    }
    for (; j < 10000000; j++) {
        uint64_t m = 1;
        for (i = 0; i < 64; i++) {
            target[i] += (pLong[j] >> i) & 1;
            m <<= 1;
        }
    }
    printf("target = {");
    for (i = 0; i < 64; i++)
        printf(" %d", target[i]);
    printf(" }
");
    printf("took %f secs
", getTS() - start);
    return 0;
}

在答案中调查并解释了将字节膨胀为 64 位长的技术:https://stackoverflow.com/a/55059914/4593267 .我将 target 数组和 inflate 数组设为局部变量,并打印结果以确保编译器不会优化计算.在生产版本中,您将单独计算 inflate 数组.

The technique for inflating a byte to a 64-bit long are investigated and explained in the answer: https://stackoverflow.com/a/55059914/4593267 . I made the target array a local variable, as well as the inflate array, and I print the results to ensure the compiler will not optimize the computations away. In a production version you would compute the inflate array separately.

直接使用 SIMD 可能会以牺牲可移植性和可读性为代价提供进一步的改进.这种优化通常最好留给编译器,因为它可以为目标架构生成特定的代码.除非性能至关重要并且基准测试证明这是一个瓶颈,否则我总是倾向于通用解决方案.

Using SIMD directly might provide further improvements at the expense of portability and readability. This kind of optimisation is often better left to the compiler as it can generate specific code for the target architecture. Unless performance is critical and benchmarking proves this to be a bottleneck, I would always favor a generic solution.

njuffa 的不同解决方案提供了类似的性能,而无需预先计算的数组.根据您的编译器和硬件特性,它可能会更快.

A different solution by njuffa provides similar performance without the need for a precomputed array. Depending on your compiler and hardware specifics, it might be faster.

这篇关于在许多 64 位位掩码上分别计算每个位位置,使用 AVX 而不是 AVX2的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆