快速搜索并替换INT [C一些蚕食; microoptimisation] [英] Fast search and replace some nibble in int [c; microoptimisation]

查看：111 发布时间：2016/8/21 20:33:50 c optimization micro-optimization

本文介绍了快速搜索并替换INT [C一些蚕食; microoptimisation]的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

这是<一个变种href=\"http://stackoverflow.com/questions/5187816/fast-search-of-some-nibbles-in-two-ints-at-same-offset-c-microoptimisation\">Fast搜索在相同的两个整数一些半字节偏移量（C，microoptimisation）具有不同的任务问题：

的任务是找到一个INT32 predefined蚕食，并与其他四位替换它。例如，四位以搜索为0x5的;四位用来替换是0xe：

The task is to find a predefined nibble in int32 and replace it with other nibble. For example, nibble to search is 0x5; nibble to replace with is 0xe:

int:   0x3d542753 (input)
           ^   ^
output:0x3dE427E3 (output int)

有可能是另一对四位以搜索和四位以替代（在编译时已知）。

There can be other pair of nibble to search and nibble to replace (known at compile time).

我检查了我的计划，这部分是最炎热的地方之一（gprof的证明，时间的75％是在功能）;并且它被称为一个非常-非常多次（gcov的证明）。其实这是第三或嵌套循环的第四个环，具有运行计数估计的（N ^ 3）*（2 ^ n）的的，对于n = 18..24。

I checked my program, this part is one of most hot place (gprof proven, 75% of time is in the function); and it is called a very-very many times (gcov proven). Actually it is the 3rd or 4th loop of nested loops, with run count estimation of (n^3)*(2^n), for n=18..24.

我目前的code是慢（我把它改写为功能，但是它是一个循环code）：

My current code is slow (I rewrite it as function, but it is a code from loop):

static inline uint32_t nibble_replace (uint32_t A) __attribute__((always_inline))
{
  int i;
  uint32_t mask = 0xf;
  uint32_t search = 0x5;
  uint32_t replace = 0xe;
  for(i=0;i<8;i++) {
    if( (A&mask) == search ) 
        A = (A & (~mask) )   // clean i-th nibble
           | replace;        // and replace
    mask <<= 4; search <<= 4; replace <<= 4;
  }
  return A;
}

是否有可能重写此功能和宏观并行的方式，使用一些位逻辑的魔力？魔术是像（T-0x11111111）及（〜T）-0x88888888 还可能与SSE *使用。检查链接的问题的接受的答案得到感慨一下需要魔法。

Is it possible to rewrite this function and macro in parallel way, using some bit logic magic? Magic is something like (t-0x11111111)&(~t)-0x88888888 and possibly usable with SSE*. Check the accepted answer of linked question to get feeling about needed magic.

我的编译器是gcc452和CPU是Intel酷睿2个32位中的模式（86）或（在不久的将来）在64位模式（X86-64）。

My compiler is gcc452 and cpu is Intel Core2 Solo in 32bit mode (x86) or (in near future) in 64bit mode (x86-64).

推荐答案

这似乎是一个有趣的问题，所以我写了一个解决方案，不看其他的答案。这似乎是大约4.9x快我的系统上。在我的系统，它也比DigitalRoss的溶液（〜25％的速度）。

This seemed like a fun question, so I wrote a solution without looking at other answers. This appears to be about 4.9x as fast on my system. On my system, it's also slightly faster than DigitalRoss's solution (~25% faster).

static inline uint32_t nibble_replace_2(uint32_t x)
{
    uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
    uint32_t y = (~(ONES * SEARCH)) ^ x;
    y &= y >> 2;
    y &= y >> 1;
    y &= ONES;
    y *= 15; /* This is faster than y |= y << 1; y |= y << 2; */
    return x ^ (((SEARCH ^ REPLACE) * ONES) & y);
}

我要解释它是如何工作的，但是......我想解释它败坏的乐趣。

I would explain how it works, but... I think explaining it spoils the fun.

这是SIMD 注意：这种东西是非常，非常容易量化。你甚至不必知道如何使用SSE或MMX。这是我如何向量化的：

Note on SIMD: This kind of stuff is very, very easy to vectorize. You don't even have to know how to use SSE or MMX. Here is how I vectorized it:

static void nibble_replace_n(uint32_t *restrict p, uint32_t n)
{
    uint32_t i;
    for (i = 0; i < n; ++i) {
        uint32_t x = p[i];
        uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
        uint32_t y = (~(ONES * SEARCH)) ^ x;
        y &= y >> 2;
        y &= y >> 1;
        y &= ONES;
        y *= 15;
        p[i] = x ^ (((SEARCH ^ REPLACE) * ONES) & y);
    }
}

使用GCC，该功能将自动转换为SSE code。在 -O3 ，假设正确使用 -march 标记。你可以通过 -ftree-矢量化-详细= 2 海湾合作委员会要求它打印出哪些循环已矢量化的，例如：


Using GCC, this function will automatically be converted to SSE code at -O3, assuming proper use of the -march flag.  You can pass -ftree-vectorizer-verbose=2 to GCC to ask it to print out which loops are vectorized, e.g.:
$ gcc -std=gnu99 -march=native -O3 -Wall -Wextra -o opt opt.c
opt.c:66: note: LOOP VECTORIZED.

自动向量化给了我大约64％的额外速度增益，我甚至没有到达的处理器手册。
Automatic vectorization gave me an extra speed gain of about 64%, and I didn't even have to reach for the processor manual.

编辑：我从 uint32_t的在自动量化版本变更类型 uint16_t 。这使总加速到超过原来的12倍左右。更改为 uint8_t有导致矢量失败。我怀疑还有用手工装配找到一些显著额外的速度，如果它是非常重要的。

I noticed an additional 48% speedup by changing the types in the auto-vectorized version from uint32_t to uint16_t. This brings the total speedup to about 12x over the original. Changing to uint8_t causes vectorization to fail. I suspect there's some significant extra speed to be found with hand assembly, if it's that important.

编辑2：更改 * = 7 到 * = 15 ，这个的失效速度测试。

Edit 2: Changed *= 7 to *= 15, this invalidates the speed tests.

修改3：这里有一个变化是明显的在回顾：

Edit 3: Here's a change that is obvious in retrospect:

static inline uint32_t nibble_replace_2(uint32_t x)
{
    uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
    uint32_t y = (~(ONES * SEARCH)) ^ x;
    y &= y >> 2;
    y &= y >> 1;
    y &= ONES;
    return x ^ (y * (SEARCH ^ REPLACE));
}

这篇关于快速搜索并替换INT [C一些蚕食; microoptimisation]的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

快速搜索并替换INT [C一些蚕食; microoptimisation] [英] Fast search and replace some nibble in int [c; microoptimisation]

问题描述

推荐答案

相关文章

C/C++最新文章

热门教程

热门工具

登录关闭

快速搜索并替换INT [C一些蚕食; microoptimisation] [英] Fast search and replace some nibble in int [c; microoptimisation]

问题描述

推荐答案

相关文章

C/C++最新文章

热门教程

热门工具

登录 关闭

登录关闭