快速搜索并替换INT [C一些蚕食; microoptimisation] [英] Fast search and replace some nibble in int [c; microoptimisation]
问题描述
这是<一个变种href=\"http://stackoverflow.com/questions/5187816/fast-search-of-some-nibbles-in-two-ints-at-same-offset-c-microoptimisation\">Fast搜索在相同的两个整数一些半字节偏移量(C,microoptimisation)具有不同的任务问题:
的任务是找到一个INT32 predefined蚕食,并与其他四位替换它。例如,四位以搜索为0x5的;四位用来替换是0xe:
The task is to find a predefined nibble in int32 and replace it with other nibble. For example, nibble to search is 0x5; nibble to replace with is 0xe:
int: 0x3d542753 (input)
^ ^
output:0x3dE427E3 (output int)
有可能是另一对四位以搜索和四位以替代(在编译时已知)。
There can be other pair of nibble to search and nibble to replace (known at compile time).
我检查了我的计划,这部分是最炎热的地方之一(gprof的证明,时间的75%是在功能);并且它被称为一个非常-非常多次(gcov的证明)。其实这是第三或嵌套循环的第四个环,具有运行计数估计的(N ^ 3)*(2 ^ n)的的,对于n = 18..24。
I checked my program, this part is one of most hot place (gprof proven, 75% of time is in the function); and it is called a very-very many times (gcov proven). Actually it is the 3rd or 4th loop of nested loops, with run count estimation of (n^3)*(2^n), for n=18..24.
我目前的code是慢(我把它改写为功能,但是它是一个循环code):
My current code is slow (I rewrite it as function, but it is a code from loop):
static inline uint32_t nibble_replace (uint32_t A) __attribute__((always_inline))
{
int i;
uint32_t mask = 0xf;
uint32_t search = 0x5;
uint32_t replace = 0xe;
for(i=0;i<8;i++) {
if( (A&mask) == search )
A = (A & (~mask) ) // clean i-th nibble
| replace; // and replace
mask <<= 4; search <<= 4; replace <<= 4;
}
return A;
}
是否有可能重写此功能和宏观并行的方式,使用一些位逻辑的魔力?魔术是像(T-0x11111111)及(〜T)-0x88888888
还可能与SSE *使用。检查链接的问题的接受的答案得到感慨一下需要魔法。
Is it possible to rewrite this function and macro in parallel way, using some bit logic magic? Magic is something like (t-0x11111111)&(~t)-0x88888888
and possibly usable with SSE*. Check the accepted answer of linked question to get feeling about needed magic.
我的编译器是gcc452和CPU是Intel酷睿2个32位中的模式(86)或(在不久的将来)在64位模式(X86-64)。
My compiler is gcc452 and cpu is Intel Core2 Solo in 32bit mode (x86) or (in near future) in 64bit mode (x86-64).
推荐答案
这似乎是一个有趣的问题,所以我写了一个解决方案,不看其他的答案。这似乎是大约4.9x快我的系统上。在我的系统,它也比DigitalRoss的溶液(〜25%的速度)。
This seemed like a fun question, so I wrote a solution without looking at other answers. This appears to be about 4.9x as fast on my system. On my system, it's also slightly faster than DigitalRoss's solution (~25% faster).
static inline uint32_t nibble_replace_2(uint32_t x)
{
uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
uint32_t y = (~(ONES * SEARCH)) ^ x;
y &= y >> 2;
y &= y >> 1;
y &= ONES;
y *= 15; /* This is faster than y |= y << 1; y |= y << 2; */
return x ^ (((SEARCH ^ REPLACE) * ONES) & y);
}
我要解释它是如何工作的,但是......我想解释它败坏的乐趣。
I would explain how it works, but... I think explaining it spoils the fun.
这是SIMD 注意:这种东西是非常,非常容易量化。你甚至不必知道如何使用SSE或MMX。这是我如何向量化的:
Note on SIMD: This kind of stuff is very, very easy to vectorize. You don't even have to know how to use SSE or MMX. Here is how I vectorized it:
static void nibble_replace_n(uint32_t *restrict p, uint32_t n)
{
uint32_t i;
for (i = 0; i < n; ++i) {
uint32_t x = p[i];
uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
uint32_t y = (~(ONES * SEARCH)) ^ x;
y &= y >> 2;
y &= y >> 1;
y &= ONES;
y *= 15;
p[i] = x ^ (((SEARCH ^ REPLACE) * ONES) & y);
}
}
使用GCC,该功能将自动转换为SSE code。在 -O3
,假设正确使用 -march 的code>标记。你可以通过
-ftree-矢量化-详细= 2
海湾合作委员会要求它打印出哪些循环已矢量化的,例如:
Using GCC, this function will automatically be converted to SSE code at -O3
, assuming proper use of the -march
flag. You can pass -ftree-vectorizer-verbose=2
to GCC to ask it to print out which loops are vectorized, e.g.:
$ gcc -std=gnu99 -march=native -O3 -Wall -Wextra -o opt opt.c
opt.c:66: note: LOOP VECTORIZED.
自动向量化给了我大约64%的额外速度增益,我甚至没有到达的处理器手册。
Automatic vectorization gave me an extra speed gain of about 64%, and I didn't even have to reach for the processor manual.
编辑:我从 uint32_t的
在自动量化版本变更类型 uint16_t
。这使总加速到超过原来的12倍左右。更改为 uint8_t有
导致矢量失败。我怀疑还有用手工装配找到一些显著额外的速度,如果它是非常重要的。
I noticed an additional 48% speedup by changing the types in the auto-vectorized version from uint32_t
to uint16_t
. This brings the total speedup to about 12x over the original. Changing to uint8_t
causes vectorization to fail. I suspect there's some significant extra speed to be found with hand assembly, if it's that important.
编辑2:更改 * = 7
到 * = 15
,这个的失效速度测试。
Edit 2: Changed *= 7
to *= 15
, this invalidates the speed tests.
修改3:这里有一个变化是明显的在回顾:
Edit 3: Here's a change that is obvious in retrospect:
static inline uint32_t nibble_replace_2(uint32_t x)
{
uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
uint32_t y = (~(ONES * SEARCH)) ^ x;
y &= y >> 2;
y &= y >> 1;
y &= ONES;
return x ^ (y * (SEARCH ^ REPLACE));
}
这篇关于快速搜索并替换INT [C一些蚕食; microoptimisation]的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!