How to effectively apply bitwise operations to (large) packed bit vectors?


Question

I want to implement

void bitwise_and(
    char*       __restrict__  result,
    const char* __restrict__  lhs,
    const char* __restrict__  rhs,
    size_t                    length);


or maybe a bitwise_or(), bitwise_xor() or any other bitwise operation. Obviously it's not about the algorithm, just the implementation details: alignment, loading the largest possible element from memory, cache-awareness, using SIMD instructions, etc.


I'm sure this has (more than one) fast existing implementation, but I would guess most library implementations would require some fancy container, e.g. std::bitset or boost::dynamic_bitset, which I've googled - but I obviously don't want to spend the time constructing one of those.


So what do I do? Copy-paste from an existing library? Find a library which can 'wrap' a raw packed-bits array in memory with a nice object? Roll my own implementation anyway?

Notes:

  • I'm mostly interested in C++ code, but I certainly don't mind a plain C approach.
  • Obviously, making copies of the input arrays is out of the question - that would probably nearly double the execution time.
  • I intentionally did not template the bitwise operator, in case there's some specific optimization for OR, or for AND etc.
  • Bonus points for discussing operations on multiple vectors at once, e.g. V_out = V_1 bitwise-and V_2 bitwise-and V_3 etc.
  • I noted this article comparing library implementations, but it's from 5 years ago. I can't ask which library to use since that would violate SO policy I guess...
  • If it helps you any, assume it's uint64_ts rather than chars (that doesn't really matter - if the char array is unaligned we can just treat the heading and trailing chars separately).

Answer


This answer is going to assume you want the fastest possible way and are happy to use platform-specific things. Your optimising compiler may be able to produce similar code to the below from plain C, but in my experience across a few compilers, something as specific as this is still best hand-written.


Obviously, like all optimisation tasks, never assume anything is better/worse: measure, measure, measure.


If you could lock down your architecture to x86 with at least SSE3, you would do:

#include <emmintrin.h>  // SSE intrinsics: _mm_loadu_si128, _mm_and_si128, ...
#include <cstddef>      // size_t

void bitwise_and(
    char*       result,
    const char* lhs,
    const char* rhs,
    size_t      length)
{
    while(length >= 16)
    {
        // Load in 16byte registers
        auto lhsReg = _mm_loadu_si128((const __m128i*)lhs);
        auto rhsReg = _mm_loadu_si128((const __m128i*)rhs);

        // do the op
        auto res = _mm_and_si128(lhsReg, rhsReg);

        // save off again
        _mm_storeu_si128((__m128i*)result, res);

        // book keeping
        length -= 16;
        result += 16;
        lhs += 16;
        rhs += 16;
    }

    // do the tail end. Assuming that the array is large the
    // most that the following code can be run is 15 times so I'm not
    // bothering to optimise. You could do it in 64 bit then 32 bit
    // then 16 bit then char chunks if you wanted...
    while (length)
    {
        *result = *lhs & *rhs;
        length -= 1;
        result += 1;
        lhs += 1;
        rhs += 1;
    }
}
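For comparison, if you can't commit to SSE at all, the question's last note (work in uint64_t-sized chunks) can be sketched portably. This is my own illustrative fallback (the name `bitwise_and_portable` is hypothetical, not from the answer); memcpy is used to sidestep alignment and aliasing issues, and compilers turn it into plain 64-bit loads/stores:

```cpp
#include <cstddef>   // size_t
#include <cstdint>   // uint64_t
#include <cstring>   // memcpy

// Portable fallback: AND 8-byte chunks, then finish byte-by-byte.
void bitwise_and_portable(char*       result,
                          const char* lhs,
                          const char* rhs,
                          size_t      length)
{
    while (length >= 8)
    {
        uint64_t a, b;
        std::memcpy(&a, lhs, 8);   // compiles to a single 64-bit load
        std::memcpy(&b, rhs, 8);
        uint64_t r = a & b;
        std::memcpy(result, &r, 8);
        length -= 8; result += 8; lhs += 8; rhs += 8;
    }
    while (length)                 // tail: at most 7 bytes
    {
        *result++ = *lhs++ & *rhs++;
        --length;
    }
}
```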


This compiles to roughly 10 asm instructions per 16 bytes (plus the tail handling and a little overhead).


The great thing about using intrinsics like this (over hand-rolled asm) is that the compiler is still free to do additional optimisations (such as loop unrolling) on top of what you write. It also handles register allocation.


If you could guarantee aligned data, you could save an asm instruction: use _mm_load_si128 instead, and the compiler will be clever enough to avoid a second load, using the memory location as a direct operand to the 'pand'.
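A sketch of that aligned variant, assuming (but not checking) that all three pointers are 16-byte aligned, e.g. from aligned_alloc or an alignas(16) buffer; the function name is mine, not from the answer:

```cpp
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_and_si128, _mm_store_si128
#include <cstddef>      // size_t

// Precondition (assumed, not checked): result, lhs, rhs are 16-byte aligned.
void bitwise_and_aligned(char*       result,
                         const char* lhs,
                         const char* rhs,
                         size_t      length)
{
    while (length >= 16)
    {
        // Aligned loads (movdqa); the rhs load typically folds into pand
        // as a direct memory operand.
        auto lhsReg = _mm_load_si128((const __m128i*)lhs);
        auto rhsReg = _mm_load_si128((const __m128i*)rhs);
        _mm_store_si128((__m128i*)result, _mm_and_si128(lhsReg, rhsReg));
        length -= 16; result += 16; lhs += 16; rhs += 16;
    }
    while (length)  // tail: fewer than 16 bytes
    {
        *result++ = *lhs++ & *rhs++;
        --length;
    }
}
```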


If you could guarantee AVX2+, then you could use the 256-bit version and process 32 bytes with the same ~10 asm instructions per iteration.
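The 256-bit main loop might look like the following sketch. The `__attribute__((target("avx2")))` annotation and `__builtin_cpu_supports` check are GCC/Clang-specific assumptions on my part (they let this compile without -mavx2 and let the caller dispatch at runtime); the answer itself does not specify a dispatch mechanism:

```cpp
#include <immintrin.h>  // AVX2: _mm256_loadu_si256, _mm256_and_si256, ...
#include <cstddef>      // size_t

// Sketch of the AVX2 loop: 32 bytes per iteration. The caller must verify
// CPU support first, e.g. with __builtin_cpu_supports("avx2").
__attribute__((target("avx2")))
void bitwise_and_avx2(char*       result,
                      const char* lhs,
                      const char* rhs,
                      size_t      length)
{
    while (length >= 32)
    {
        auto l = _mm256_loadu_si256((const __m256i*)lhs);
        auto r = _mm256_loadu_si256((const __m256i*)rhs);
        _mm256_storeu_si256((__m256i*)result, _mm256_and_si256(l, r));
        length -= 32; result += 32; lhs += 32; rhs += 32;
    }
    while (length)  // tail: fewer than 32 bytes
    {
        *result++ = *lhs++ & *rhs++;
        --length;
    }
}
```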


On ARM there are similar NEON instructions.


If you wanted to do multiple ops, just add the relevant intrinsic in the middle and it will add one asm instruction per 16 bytes.
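For the question's bonus case (V_out = V_1 bitwise-and V_2 bitwise-and V_3), that means one extra _mm_and_si128 per iteration, and every byte is touched exactly once. A sketch (the name `bitwise_and3` is mine):

```cpp
#include <emmintrin.h>  // SSE2: _mm_loadu_si128, _mm_and_si128, _mm_storeu_si128
#include <cstddef>      // size_t

// result = v1 & v2 & v3 in a single pass; the second pand is the only
// extra cost per 16-byte chunk compared to the two-input version.
void bitwise_and3(char*       result,
                  const char* v1,
                  const char* v2,
                  const char* v3,
                  size_t      length)
{
    while (length >= 16)
    {
        auto r = _mm_and_si128(_mm_loadu_si128((const __m128i*)v1),
                               _mm_loadu_si128((const __m128i*)v2));
        r = _mm_and_si128(r, _mm_loadu_si128((const __m128i*)v3));
        _mm_storeu_si128((__m128i*)result, r);
        length -= 16; result += 16; v1 += 16; v2 += 16; v3 += 16;
    }
    while (length)  // tail: fewer than 16 bytes
    {
        *result++ = *v1++ & *v2++ & *v3++;
        --length;
    }
}
```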


I'm pretty sure that with a decent processor you don't need any additional cache control.
