Fast search and replace some nibble in int [c; microoptimisation]

Question

This is a variant of the question "Fast search of some nibbles in two ints at same offset (C, microoptimisation)" (http://stackoverflow.com/questions/5187816/fast-search-of-some-nibbles-in-two-ints-at-same-offset-c-microoptimisation), with a different task:

The task is to find a predefined nibble in an int32 and replace it with another nibble. For example, the nibble to search for is 0x5 and the nibble to replace it with is 0xe:

int:   0x3d542753 (input)
           ^   ^
output:0x3dE427E3 (output int)

There can be other pairs of search nibble and replacement nibble (known at compile time).

I profiled my program, and this part is one of the hottest spots (gprof shows that 75% of the time is spent in this function). It is also called a very large number of times (confirmed with gcov). It sits in the 3rd or 4th level of a set of nested loops, with an estimated run count of (n^3)*(2^n) for n = 18..24.

My current code is slow (I have rewritten it as a function here, but it is really the body of a loop):

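/* GCC rejects an attribute placed after the declarator of a function definition, so it goes first. */
static inline __attribute__((always_inline)) uint32_t nibble_replace (uint32_t A)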
{
  int i;
  uint32_t mask = 0xf;
  uint32_t search = 0x5;
  uint32_t replace = 0xe;
  for(i=0;i<8;i++) {
    if( (A&mask) == search ) 
        A = (A & (~mask) )   // clean i-th nibble
           | replace;        // and replace
    mask <<= 4; search <<= 4; replace <<= 4;
  }
  return A;
}

Is it possible to rewrite this function and macro in a parallel way, using some bit-logic magic? The magic is something like (t-0x11111111)&(~t)&0x88888888, and it could possibly be used with SSE*. See the accepted answer of the linked question to get a feeling for the kind of magic needed.
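As an illustration of the kind of magic meant here, the following is a minimal sketch of the classic zero-detection trick applied to nibbles. The function names has_zero_nibble, has_nibble and check_has_nibble are made up for this example and do not come from the linked question. The result only says whether some nibble matches: bits above the lowest matching nibble can be set by borrow propagation, so it cannot be used directly as a per-nibble mask for the replacement.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero iff at least one 4-bit nibble of t is zero.
   Only the zero/nonzero outcome is reliable; individual flag bits
   above the lowest zero nibble may be false positives. */
static inline uint32_t has_zero_nibble(uint32_t t)
{
    return (t - 0x11111111u) & ~t & 0x88888888u;
}

/* Nonzero iff at least one nibble of t equals n (n in 0..15).
   XOR with n repeated in every nibble turns matches into zero nibbles. */
static inline uint32_t has_nibble(uint32_t t, uint32_t n)
{
    return has_zero_nibble(t ^ (0x11111111u * n));
}

/* Small self-check using the input from the question. */
void check_has_nibble(void)
{
    assert(has_nibble(0x3d542753u, 0x5) != 0);  /* contains two 5s */
    assert(has_nibble(0x3d542753u, 0xE) == 0);  /* contains no E  */
}
```

A test like this could be used to skip words that contain no matching nibble at all, although whether that pays off depends on how often matches occur.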

My compiler is GCC 4.5.2 and the CPU is an Intel Core2 Solo in 32-bit mode (x86) or, in the near future, in 64-bit mode (x86-64).

Recommended answer

This seemed like a fun question, so I wrote a solution without looking at the other answers. On my system it runs about 4.9x as fast as the original loop, and it is also somewhat faster than DigitalRoss's solution (about 25%).

static inline uint32_t nibble_replace_2(uint32_t x)
{
    uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
    uint32_t y = (~(ONES * SEARCH)) ^ x;
    y &= y >> 2;
    y &= y >> 1;
    y &= ONES;
    y *= 15; /* This is faster than y |= y << 1; y |= y << 2; */
    return x ^ (((SEARCH ^ REPLACE) * ONES) & y);
}

I would explain how it works, but... I think explaining it spoils the fun.
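For readers who would rather see the trick unpacked, the sketch below repeats the same steps with comments and checks them against a loop modeled on the one in the question. The names nibble_replace_ref, nibble_replace_steps and check_nibble_replace are invented for this illustration, and the comments are one reading of the code rather than the author's own explanation.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: the straightforward per-nibble loop from the question. */
static uint32_t nibble_replace_ref(uint32_t a)
{
    uint32_t mask = 0xf, search = 0x5, replace = 0xe;
    for (int i = 0; i < 8; i++) {
        if ((a & mask) == search)
            a = (a & ~mask) | replace;
        mask <<= 4; search <<= 4; replace <<= 4;
    }
    return a;
}

/* Same arithmetic as nibble_replace_2, annotated step by step. */
static uint32_t nibble_replace_steps(uint32_t x)
{
    const uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111u;
    /* XNOR with 0x55555555: a nibble becomes 0xF exactly where x held SEARCH. */
    uint32_t y = ~(ONES * SEARCH) ^ x;
    /* AND the four bits of each nibble together into its lowest bit. */
    y &= y >> 2;
    y &= y >> 1;
    /* Keep only that lowest bit; the upper bits mix in the next nibble. */
    y &= ONES;
    /* Widen 0x1 to 0xF per nibble; no carries, since each nibble is 0 or 1. */
    y *= 15;
    /* XOR matched nibbles with SEARCH ^ REPLACE, turning SEARCH into REPLACE. */
    return x ^ (((SEARCH ^ REPLACE) * ONES) & y);
}

/* Compare both versions on the question's example and on varied inputs. */
void check_nibble_replace(void)
{
    assert(nibble_replace_steps(0x3d542753u) == 0x3dE427E3u);
    uint32_t x = 0;
    for (int i = 0; i < 1000000; i++) {
        assert(nibble_replace_steps(x) == nibble_replace_ref(x));
        x = x * 1664525u + 1013904223u;  /* simple LCG to vary the input */
    }
}
```

The key point is that the final XOR only touches nibbles where the mask is 0xF, and 0x5 ^ (0x5 ^ 0xE) is 0xE, so every other nibble passes through unchanged.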

Note on SIMD: This kind of code is very, very easy to vectorize. You don't even have to know how to use SSE or MMX. Here is how I vectorized it:

static void nibble_replace_n(uint32_t *restrict p, uint32_t n)
{
    uint32_t i;
    for (i = 0; i < n; ++i) {
        uint32_t x = p[i];
        uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
        uint32_t y = (~(ONES * SEARCH)) ^ x;
        y &= y >> 2;
        y &= y >> 1;
        y &= ONES;
        y *= 15;
        p[i] = x ^ (((SEARCH ^ REPLACE) * ONES) & y);
    }
}

With GCC, this function is automatically converted to SSE code at -O3, assuming proper use of the -march flag. You can pass -ftree-vectorizer-verbose=2 to GCC to ask it to print out which loops were vectorized, e.g.:

$ gcc -std=gnu99 -march=native -O3 -ftree-vectorizer-verbose=2 -Wall -Wextra -o opt opt.c
opt.c:66: note: LOOP VECTORIZED.

Automatic vectorization gave me an extra speed gain of about 64%, and I didn't even have to reach for the processor manual.

Edit: I noticed an additional 48% speedup after changing the types in the auto-vectorized version from uint32_t to uint16_t. This brings the total speedup to about 12x over the original. Changing to uint8_t causes vectorization to fail. I suspect there is significant extra speed to be found with hand-written assembly, if it's that important.
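The uint16_t version is not shown in the answer. The following is a guess at what such a variant might look like, assuming the same steps with a 16-bit ONES constant; the names nibble_replace_n16 and check_nibble_replace_n16 are hypothetical. Because nibbles are processed independently, the element width does not change the result, only how many nibbles are handled per element. Whether a given compiler vectorizes this loop, and how fast it runs, would need to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-bit variant: four nibbles per element instead of eight. */
void nibble_replace_n16(uint16_t *restrict p, size_t n)
{
    const uint16_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x1111;
    for (size_t i = 0; i < n; ++i) {
        uint16_t x = p[i];
        /* Operands promote to int; the cast truncates back to 16 bits. */
        uint16_t y = (uint16_t)(~(ONES * SEARCH) ^ x);
        y &= y >> 2;
        y &= y >> 1;
        y &= ONES;   /* 0x1 in every nibble that matched SEARCH */
        y *= 15;     /* widen to 0xF; at most 0xFFFF, so it fits */
        p[i] = (uint16_t)(x ^ (((SEARCH ^ REPLACE) * ONES) & y));
    }
}

/* Small self-check on two 16-bit values. */
void check_nibble_replace_n16(void)
{
    uint16_t buf[2] = { 0x2753, 0x3d54 };
    nibble_replace_n16(buf, 2);
    assert(buf[0] == 0x27E3 && buf[1] == 0x3dE4);
}
```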

Edit 2: Changed *= 7 to *= 15, this invalidates the speed tests.

Edit 3: Here's a change that is obvious in retrospect:

static inline uint32_t nibble_replace_2(uint32_t x)
{
    uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
    uint32_t y = (~(ONES * SEARCH)) ^ x;
    y &= y >> 2;
    y &= y >> 1;
    y &= ONES;
    return x ^ (y * (SEARCH ^ REPLACE));
}
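Since the question mentions that other search/replace pairs are possible and known at compile time, the same steps can be written with the pair as parameters. This is a sketch under that assumption; nibble_replace_pair and check_nibble_replace_pair are hypothetical names, and both nibble arguments are assumed to be in the range 0..15. When the function is inlined with literal arguments, a compiler can typically fold the constants, though that is worth confirming in the generated code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical generalization: search and replace are nibble values 0..15. */
static inline uint32_t nibble_replace_pair(uint32_t x, uint32_t search,
                                           uint32_t replace)
{
    const uint32_t ONES = 0x11111111u;
    uint32_t y = ~(ONES * search) ^ x;    /* 0xF where a nibble equals search */
    y &= y >> 2;
    y &= y >> 1;
    y &= ONES;                            /* 0x1 per matching nibble */
    return x ^ (y * (search ^ replace));  /* flip only the differing bits */
}

/* Small self-check with the question's pair and one other pair. */
void check_nibble_replace_pair(void)
{
    assert(nibble_replace_pair(0x3d542753u, 0x5, 0xE) == 0x3dE427E3u);
    assert(nibble_replace_pair(0x10203040u, 0x0, 0xF) == 0x1F2F3F4Fu);
}
```

The multiplication cannot carry between nibbles, because each nibble of y is 0 or 1 and search ^ replace is at most 0xF.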
