转换为0x1234到0x11223344 [英] Convert 0x1234 to 0x11223344

查看:465
本文介绍了转换为0x1234到0x11223344的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我如何十六进制数为0x1234扩大到0x11223344在高性能的方式?

How do I expand the hexadecimal number 0x1234 to 0x11223344 in a high-performance way?

unsigned int c = 0x1234, b;
b = (c & 0xff) << 4 | c & 0xf | (c & 0xff0) << 8
        | (c & 0xff00) << 12 | (c & 0xf000) << 16;
printf("%p -> %p\n", c, b);

输出:

0x1234 -> 0x11223344

我需要这个颜色转换。用户提供其数据形式0xARGB,我需要将其转换为 0xAARRGGBB 。是的,有可能是数以百万计,因为每一个可能是一个像素。 1000×1000像素等于一百万。

I need this for color conversion. Users provide their data in the form 0xARGB, and I need to convert it to 0xAARRGGBB. And yes, there could be millions, because each could be a pixel. 1000x1000 pixels equals to one million.

实际的情况更加复杂,因为一个32位值包含前景色和背景色。因此, 0xARGBargb 变成了: [0xAARRGGBB,0xaarrggbb]

The actual case is even more complicated, because a single 32-bit value contains both foreground and background colors. So 0xARGBargb become: [ 0xAARRGGBB, 0xaarrggbb ]

哦,是的,还有一件事,在实际应用中我也否定阿尔法,因为在的OpenGL 0xFF的是非透明和0x00为最透明的,这是在大多数情况下不方便,因为通常你只需要一个 RGB 部分和透明度被认为是非present。

Oh yes, one more thing, in a real application I also negate alpha, because in OpenGL 0xFF is non-transparent and 0x00 is most transparent, which is inconvenient in most cases, because usually you just need an RGB part and transparency is assumed to be non-present.

推荐答案

这可以通过使用SSE2做如下:

This can be done using SSE2 as follows:

void ExpandSSE2(unsigned __int64 in, unsigned __int64 &outLo, unsigned __int64 &outHi) {
  __m128i const mask = _mm_set1_epi16((short)0xF00F);
  __m128i const mul0 = _mm_set1_epi16(0x0011);
  __m128i const mul1 = _mm_set1_epi16(0x1000);
  __m128i       v;

  v = _mm_cvtsi64_si128(in);    // move the 64-bit value to a 128-bit register
  v = _mm_unpacklo_epi8(v, v);  // 0x12   -> 0x1212
  v = _mm_and_si128(v, mask);   // 0x1212 -> 0x1002
  v = _mm_mullo_epi16(v, mul0); // 0x1002 -> 0x1022
  v = _mm_mulhi_epu16(v, mul1); // 0x1022 -> 0x0102
  v = _mm_mullo_epi16(v, mul0); // 0x0102 -> 0x1122

  outLo = _mm_extract_epi64(v, 0);
  outHi = _mm_extract_epi64(v, 1);
}

当然,你想要把函数的胆量在一个内部循环,并拉出常量。你也想​​直接跳过64位寄存器和负载值到128位SSE寄存器。有关如何做一个例子这是指在下面的性能测试SSE2执行。

Of course you’d want to put the guts of the function in an inner loop and pull out the constants. You will also want to skip the x64 registers and load values directly into 128-bit SSE registers. For an example of how to do this refer to the SSE2 implementation in the performance test below.

在其核心,有5个指令,这在时间上执行4颜色值的操作。所以,这是每个颜色值只有约1.25指令。还应当指出的是,SSE2是可在任何地方的x64是可用的。

At its core, there are 5 instructions, which perform the operation on 4 color values at a time. So, that is only about 1.25 instructions per color value. It should also be noted that SSE2 is available anywhere x64 is available.

作为解决方案的各式各样这里性能测试结果
有几个人提到,只有这样,才能知道什么是快是运行code,这是无可争议的事实。所以我编了几个解决方案,成为一个性能测试,所以我们可以比较苹果和苹果。我选择,我觉得和别人到需要测试显著不同的解决方案。从存储器读出的所有解决方案,操作上的数据,并写回存储器。在实践中,当有不另一个完整的16个字节在输入数据来处理一些证方案将需要在定位和处理情况格外小心。我测试了code是使用的x64上VS2013 4 + GHz的酷睿i7下运行发行版编译。

Performance tests for an assortment of the solutions here
A few people have mentioned that the only way to know what's faster is to run the code, this is unarguably true. So I've compiled a few of the solutions into a performance test so we can compare apples to apples. I chose solutions which I felt were significantly different from the others enough to require testing. All the solutions read from memory, operate on the data, and write back to memory. In practice some of the SSE solutions will require additional care around the alignment and handling cases when there aren't another full 16 bytes to process in the input data. The code I tested is x64 compiled under release using VS2013 running on a 4+GHz i7.

这里是我的结果:

ExpandOrig:               56.234 seconds  // from askers original question
ExpandSmallLUT:           30.209 seconds  // from Dmitry's answer
ExpandLookupSmallOneLUT:  33.689 seconds  // from Dmitry's answer
ExpandLookupLarge:        51.312 seconds  // a straightforward lookup table
ExpandAShelly:            43.829 seconds  // from AShelly's answer
ExpandAShellyMulOp:       43.580 seconds  // AShelly's answer with an optimization
ExpandSSE4:               17.854 seconds  // my original SSE4 answer
ExpandSSE4Unroll:         17.405 seconds  // my original SSE4 answer with loop unrolling
ExpandSSE2:               17.281 seconds  // my current SSE2 answer
ExpandSSE2Unroll:         17.152 seconds  // my current SSE2 answer with loop unrolling

在测试结果上面你会看到我包括提问者code,三查找表的实现,包括德米特里的回答提出的小查找表实现。 AShelly的溶液被包括太,以及与我提出的优化版本(操作可以消除)。我包括我原来的SSE4的实施,以及卓越的SSE2版本,后来我发(现体现为答案),以及中展开两种版本,因为他们是最快的在这里,我想看看有多少展开加速起来。我还包括了SSE4实现AShelly的回答。

In the test results above you'll see I included the askers code, three lookup table implementations including the small lookup table implementation proposed in Dmitry's answer. AShelly's solution is included too, as well as a version with an optimization I made (an operation can be eliminated). I included my original SSE4 implementation, as well as a superior SSE2 version I made later (now reflected as the answer), as well as unrolled versions of both since they were the fastest here and I wanted to see how much unrolling sped them up. I also included an SSE4 implementation of AShelly's answer.

到目前为止,我必须声明自己是赢家。但来源的下方,因此任何人都可以测试一下自己的平台上,包括自己的解决方案进入测试,看看他们做了一个解决方案,甚至更快。

So far I have to declare myself the winner. But the source is below, so anyone can test it out on their platform, and include their own solution into the testing to see if they've made a solution that's even faster.

#define DATA_SIZE_IN  ((unsigned)(1024 * 1024 * 128))
#define DATA_SIZE_OUT ((unsigned)(2 * DATA_SIZE_IN))
#define RERUN_COUNT   500
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <utility>
#include <emmintrin.h> // SSE2
#include <tmmintrin.h> // SSSE3
#include <smmintrin.h> // SSE4

void ExpandOrig(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  unsigned u, v;
  do {
    // read in data
    u  = *(unsigned const*)in;
    v  = u >> 16;
    u &= 0x0000FFFF;

    // do computation
    u  =   (u & 0x00FF) << 4
         | (u & 0x000F)
         | (u & 0x0FF0) << 8
         | (u & 0xFF00) << 12
         | (u & 0xF000) << 16;
    v  =   (v & 0x00FF) << 4
         | (v & 0x000F)
         | (v & 0x0FF0) << 8
         | (v & 0xFF00) << 12
         | (v & 0xF000) << 16;

    // store data
    *(unsigned*)(out)      = u;
    *(unsigned*)(out + 4)  = v;
    in                    += 4;
    out                   += 8;
  } while (in != past);
}

unsigned LutLo[256],
         LutHi[256];
void MakeLutLo(void) {
  for (unsigned i = 0, x; i < 256; ++i) {
    x        = i;
    x        = ((x & 0xF0) << 4) | (x & 0x0F);
    x       |= (x << 4);
    LutLo[i] = x;
  }
}
void MakeLutHi(void) {
  for (unsigned i = 0, x; i < 256; ++i) {
    x        = i;
    x        = ((x & 0xF0) << 20) | ((x & 0x0F) << 16);
    x       |= (x << 4);
    LutHi[i] = x;
  }
}

void ExpandLookupSmall(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  unsigned u, v;
  do {
    // read in data
    u  = *(unsigned const*)in;
    v  = u >> 16;
    u &= 0x0000FFFF;

    // do computation
    u = LutHi[u >> 8] | LutLo[u & 0xFF];
    v = LutHi[v >> 8] | LutLo[v & 0xFF];

    // store data
    *(unsigned*)(out)      = u;
    *(unsigned*)(out + 4)  = v;
    in                    += 4;
    out                   += 8;
  } while (in != past);
}

void ExpandLookupSmallOneLUT(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  unsigned u, v;
  do {
    // read in data
    u = *(unsigned const*)in;
    v = u >> 16;
    u &= 0x0000FFFF;

    // do computation
    u = ((LutLo[u >> 8] << 16) | LutLo[u & 0xFF]);
    v = ((LutLo[v >> 8] << 16) | LutLo[v & 0xFF]);

    // store data
    *(unsigned*)(out) = u;
    *(unsigned*)(out + 4) = v;
    in  += 4;
    out += 8;
  } while (in != past);
}

unsigned LutLarge[256 * 256];
void MakeLutLarge(void) {
  for (unsigned i = 0; i < (256 * 256); ++i)
    LutLarge[i] = LutHi[i >> 8] | LutLo[i & 0xFF];
}

void ExpandLookupLarge(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  unsigned u, v;
  do {
    // read in data
    u  = *(unsigned const*)in;
    v  = u >> 16;
    u &= 0x0000FFFF;

    // do computation
    u = LutLarge[u];
    v = LutLarge[v];

    // store data
    *(unsigned*)(out)      = u;
    *(unsigned*)(out + 4)  = v;
    in                    += 4;
    out                   += 8;
  } while (in != past);
}

void ExpandAShelly(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  unsigned u, v, w, x;
  do {
    // read in data
    u  = *(unsigned const*)in;
    v  = u >> 16;
    u &= 0x0000FFFF;

    // do computation
    w  = (((u & 0xF0F) * 0x101) & 0xF000F) + (((u & 0xF0F0) * 0x1010) & 0xF000F00);
    x  = (((v & 0xF0F) * 0x101) & 0xF000F) + (((v & 0xF0F0) * 0x1010) & 0xF000F00);
    w += w * 0x10;
    x += x * 0x10;

    // store data
    *(unsigned*)(out)      = w;
    *(unsigned*)(out + 4)  = x;
    in                    += 4;
    out                   += 8;
  } while (in != past);
}

void ExpandAShellyMulOp(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  unsigned u, v;
  do {
    // read in data
    u = *(unsigned const*)in;
    v = u >> 16;
    u &= 0x0000FFFF;

    // do computation
    u = ((((u & 0xF0F) * 0x101) & 0xF000F) + (((u & 0xF0F0) * 0x1010) & 0xF000F00)) * 0x11;
    v = ((((v & 0xF0F) * 0x101) & 0xF000F) + (((v & 0xF0F0) * 0x1010) & 0xF000F00)) * 0x11;

    // store data
    *(unsigned*)(out) = u;
    *(unsigned*)(out + 4) = v;
    in += 4;
    out += 8;
  } while (in != past);
}

void ExpandSSE4(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  __m128i const mask0 = _mm_set1_epi16((short)0x8000),
                mask1 = _mm_set1_epi8(0x0F),
                mul = _mm_set1_epi16(0x0011);
  __m128i       u, v, w, x;
  do {
    // read input into low 8 bytes of u and v
    u = _mm_load_si128((__m128i const*)in);

    v = _mm_unpackhi_epi8(u, u);      // expand each single byte to two bytes
    u = _mm_unpacklo_epi8(u, u);      // do it again for v
    w = _mm_srli_epi16(u, 4);         // copy the value into w and shift it right half a byte
    x = _mm_srli_epi16(v, 4);         // do it again for v
    u = _mm_blendv_epi8(u, w, mask0); // select odd bytes from w, and even bytes from v, giving the the desired value in the upper nibble of each byte
    v = _mm_blendv_epi8(v, x, mask0); // do it again for v
    u = _mm_and_si128(u, mask1);      // clear the all the upper nibbles
    v = _mm_and_si128(v, mask1);      // do it again for v
    u = _mm_mullo_epi16(u, mul);      // multiply each 16-bit value by 0x0011 to duplicate the lower nibble in the upper nibble of each byte
    v = _mm_mullo_epi16(v, mul);      // do it again for v

    // write output
    _mm_store_si128((__m128i*)(out     ), u);
    _mm_store_si128((__m128i*)(out + 16), v);
    in  += 16;
    out += 32;
  } while (in != past);
}

void ExpandSSE4Unroll(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  __m128i const mask0  = _mm_set1_epi16((short)0x8000),
                mask1  = _mm_set1_epi8(0x0F),
                mul    = _mm_set1_epi16(0x0011);
  __m128i       u0, v0, w0, x0,
                u1, v1, w1, x1,
                u2, v2, w2, x2,
                u3, v3, w3, x3;
  do {
    // read input into low 8 bytes of u and v
    u0 = _mm_load_si128((__m128i const*)(in     ));
    u1 = _mm_load_si128((__m128i const*)(in + 16));
    u2 = _mm_load_si128((__m128i const*)(in + 32));
    u3 = _mm_load_si128((__m128i const*)(in + 48));

    v0 = _mm_unpackhi_epi8(u0, u0);      // expand each single byte to two bytes
    u0 = _mm_unpacklo_epi8(u0, u0);      // do it again for v
    v1 = _mm_unpackhi_epi8(u1, u1);      // do it again 
    u1 = _mm_unpacklo_epi8(u1, u1);      // again for u1
    v2 = _mm_unpackhi_epi8(u2, u2);      // again for v1
    u2 = _mm_unpacklo_epi8(u2, u2);      // again for u2
    v3 = _mm_unpackhi_epi8(u3, u3);      // again for v2
    u3 = _mm_unpacklo_epi8(u3, u3);      // again for u3
    w0 = _mm_srli_epi16(u0, 4);          // copy the value into w and shift it right half a byte
    x0 = _mm_srli_epi16(v0, 4);          // do it again for v
    w1 = _mm_srli_epi16(u1, 4);          // again for u1
    x1 = _mm_srli_epi16(v1, 4);          // again for v1
    w2 = _mm_srli_epi16(u2, 4);          // again for u2
    x2 = _mm_srli_epi16(v2, 4);          // again for v2
    w3 = _mm_srli_epi16(u3, 4);          // again for u3
    x3 = _mm_srli_epi16(v3, 4);          // again for v3
    u0 = _mm_blendv_epi8(u0, w0, mask0); // select even bytes from w, and odd bytes from v, giving the the desired value in the upper nibble of each byte
    v0 = _mm_blendv_epi8(v0, x0, mask0); // do it again for v
    u1 = _mm_blendv_epi8(u1, w1, mask0); // again for u1
    v1 = _mm_blendv_epi8(v1, x1, mask0); // again for v1
    u2 = _mm_blendv_epi8(u2, w2, mask0); // again for u2
    v2 = _mm_blendv_epi8(v2, x2, mask0); // again for v2
    u3 = _mm_blendv_epi8(u3, w3, mask0); // again for u3
    v3 = _mm_blendv_epi8(v3, x3, mask0); // again for v3
    u0 = _mm_and_si128(u0, mask1);       // clear the all the upper nibbles
    v0 = _mm_and_si128(v0, mask1);       // do it again for v
    u1 = _mm_and_si128(u1, mask1);       // again for u1
    v1 = _mm_and_si128(v1, mask1);       // again for v1
    u2 = _mm_and_si128(u2, mask1);       // again for u2
    v2 = _mm_and_si128(v2, mask1);       // again for v2
    u3 = _mm_and_si128(u3, mask1);       // again for u3
    v3 = _mm_and_si128(v3, mask1);       // again for v3
    u0 = _mm_mullo_epi16(u0, mul);       // multiply each 16-bit value by 0x0011 to duplicate the lower nibble in the upper nibble of each byte
    v0 = _mm_mullo_epi16(v0, mul);       // do it again for v
    u1 = _mm_mullo_epi16(u1, mul);       // again for u1
    v1 = _mm_mullo_epi16(v1, mul);       // again for v1
    u2 = _mm_mullo_epi16(u2, mul);       // again for u2
    v2 = _mm_mullo_epi16(v2, mul);       // again for v2
    u3 = _mm_mullo_epi16(u3, mul);       // again for u3
    v3 = _mm_mullo_epi16(v3, mul);       // again for v3

    // write output
    _mm_store_si128((__m128i*)(out      ), u0);
    _mm_store_si128((__m128i*)(out +  16), v0);
    _mm_store_si128((__m128i*)(out +  32), u1);
    _mm_store_si128((__m128i*)(out +  48), v1);
    _mm_store_si128((__m128i*)(out +  64), u2);
    _mm_store_si128((__m128i*)(out +  80), v2);
    _mm_store_si128((__m128i*)(out +  96), u3);
    _mm_store_si128((__m128i*)(out + 112), v3);
    in  += 64;
    out += 128;
  } while (in != past);
}

void ExpandSSE2(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  __m128i const mask = _mm_set1_epi16((short)0xF00F),
                mul0 = _mm_set1_epi16(0x0011),
                mul1 = _mm_set1_epi16(0x1000);
  __m128i       u, v;
  do {
    // read input into low 8 bytes of u and v
    u = _mm_load_si128((__m128i const*)in);

    v = _mm_unpackhi_epi8(u, u);      // expand each single byte to two bytes
    u = _mm_unpacklo_epi8(u, u);      // do it again for v

    u = _mm_and_si128(u, mask);
    v = _mm_and_si128(v, mask);
    u = _mm_mullo_epi16(u, mul0);
    v = _mm_mullo_epi16(v, mul0);
    u = _mm_mulhi_epu16(u, mul1);     // this can also be done with a right shift of 4 bits, but this seems to mesure faster
    v = _mm_mulhi_epu16(v, mul1);
    u = _mm_mullo_epi16(u, mul0);
    v = _mm_mullo_epi16(v, mul0);

    // write output
    _mm_store_si128((__m128i*)(out     ), u);
    _mm_store_si128((__m128i*)(out + 16), v);
    in  += 16;
    out += 32;
  } while (in != past);
}

void ExpandSSE2Unroll(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  __m128i const mask = _mm_set1_epi16((short)0xF00F),
                mul0 = _mm_set1_epi16(0x0011),
                mul1 = _mm_set1_epi16(0x1000);
  __m128i       u0, v0,
                u1, v1;
  do {
    // read input into low 8 bytes of u and v
    u0 = _mm_load_si128((__m128i const*)(in     ));
    u1 = _mm_load_si128((__m128i const*)(in + 16));

    v0 = _mm_unpackhi_epi8(u0, u0);      // expand each single byte to two bytes
    u0 = _mm_unpacklo_epi8(u0, u0);      // do it again for v
    v1 = _mm_unpackhi_epi8(u1, u1);      // do it again 
    u1 = _mm_unpacklo_epi8(u1, u1);      // again for u1

    u0 = _mm_and_si128(u0, mask);
    v0 = _mm_and_si128(v0, mask);
    u1 = _mm_and_si128(u1, mask);
    v1 = _mm_and_si128(v1, mask);

    u0 = _mm_mullo_epi16(u0, mul0);
    v0 = _mm_mullo_epi16(v0, mul0);
    u1 = _mm_mullo_epi16(u1, mul0);
    v1 = _mm_mullo_epi16(v1, mul0);

    u0 = _mm_mulhi_epu16(u0, mul1);
    v0 = _mm_mulhi_epu16(v0, mul1);
    u1 = _mm_mulhi_epu16(u1, mul1);
    v1 = _mm_mulhi_epu16(v1, mul1);

    u0 = _mm_mullo_epi16(u0, mul0);
    v0 = _mm_mullo_epi16(v0, mul0);
    u1 = _mm_mullo_epi16(u1, mul0);
    v1 = _mm_mullo_epi16(v1, mul0);

    // write output
    _mm_store_si128((__m128i*)(out     ), u0);
    _mm_store_si128((__m128i*)(out + 16), v0);
    _mm_store_si128((__m128i*)(out + 32), u1);
    _mm_store_si128((__m128i*)(out + 48), v1);

    in  += 32;
    out += 64;
  } while (in != past);
}

void ExpandAShellySSE4(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  __m128i const zero      = _mm_setzero_si128(),
                v0F0F     = _mm_set1_epi32(0x0F0F),
                vF0F0     = _mm_set1_epi32(0xF0F0),
                v0101     = _mm_set1_epi32(0x0101),
                v1010     = _mm_set1_epi32(0x1010),
                v000F000F = _mm_set1_epi32(0x000F000F),
                v0F000F00 = _mm_set1_epi32(0x0F000F00),
                v0011 = _mm_set1_epi32(0x0011);
  __m128i       u, v, w, x;
  do {
    // read in data
    u = _mm_load_si128((__m128i const*)in);
    v = _mm_unpackhi_epi16(u, zero);
    u = _mm_unpacklo_epi16(u, zero);

    // original source: ((((a & 0xF0F) * 0x101) & 0xF000F) + (((a & 0xF0F0) * 0x1010) & 0xF000F00)) * 0x11;
    w = _mm_and_si128(u, v0F0F);
    x = _mm_and_si128(v, v0F0F);
    u = _mm_and_si128(u, vF0F0);
    v = _mm_and_si128(v, vF0F0);
    w = _mm_mullo_epi32(w, v0101); // _mm_mullo_epi32 is what makes this require SSE4 instead of SSE2
    x = _mm_mullo_epi32(x, v0101);
    u = _mm_mullo_epi32(u, v1010);
    v = _mm_mullo_epi32(v, v1010);
    w = _mm_and_si128(w, v000F000F);
    x = _mm_and_si128(x, v000F000F);
    u = _mm_and_si128(u, v0F000F00);
    v = _mm_and_si128(v, v0F000F00);
    u = _mm_add_epi32(u, w);
    v = _mm_add_epi32(v, x);
    u = _mm_mullo_epi32(u, v0011);
    v = _mm_mullo_epi32(v, v0011);

    // write output
    _mm_store_si128((__m128i*)(out     ), u);
    _mm_store_si128((__m128i*)(out + 16), v);
    in  += 16;
    out += 32;
  } while (in != past);
}

int main() {
  unsigned char *const indat   = new unsigned char[DATA_SIZE_IN ],
                *const outdat0 = new unsigned char[DATA_SIZE_OUT],
                *const outdat1 = new unsigned char[DATA_SIZE_OUT],
                *      curout  = outdat0,
                *      lastout = outdat1,
                *      place;
  unsigned             start,
                       stop;

  place = indat + DATA_SIZE_IN - 1;
  do {
    *place = (unsigned char)rand();
  } while (place-- != indat);
  MakeLutLo();
  MakeLutHi();
  MakeLutLarge();

  for (unsigned testcount = 0; testcount < 1000; ++testcount) {
    // solution posted by asker
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandOrig(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandOrig:\t\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);

    // Dmitry's small lookup table solution
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandLookupSmall(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandSmallLUT:\t\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // Dmitry's small lookup table solution using only one lookup table
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandLookupSmallOneLUT(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandLookupSmallOneLUT:\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // large lookup table solution
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandLookupLarge(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandLookupLarge:\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // AShelly's Interleave bits by Binary Magic Numbers solution
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandAShelly(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandAShelly:\t\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // AShelly's Interleave bits by Binary Magic Numbers solution optimizing out an addition
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandAShellyMulOp(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandAShellyMulOp:\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // my SSE4 solution
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandSSE4(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandSSE4:\t\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // my SSE4 solution unrolled
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandSSE4Unroll(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandSSE4Unroll:\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // my SSE2 solution
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandSSE2(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandSSE2:\t\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // my SSE2 solution unrolled
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandSSE2Unroll(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandSSE2Unroll:\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // AShelly's Interleave bits by Binary Magic Numbers solution implemented using SSE2
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandAShellySSE4(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandAShellySSE4:\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;
  }

  delete[] indat;
  delete[] outdat0;
  delete[] outdat1;
  return 0;
}

注意:结果
我在这里最初的SSE4实现。我找到了一种方法来实现这一点使用SSE2,这是更好,因为它会在更多的平台上运行。该SSE2实施也比较快。这样,在顶部psented溶液$ P $现在是SSE2实施,而不是SSE4之一。该SSE4实施仍然可以看到在性能测试或编辑历史。

NOTE:
I had an SSE4 implementation here initially. I found a way to implement this using SSE2, which is better because it will run on more platforms. The SSE2 implementation is also faster. So, the solution presented at the top is now the SSE2 implementation and not the SSE4 one. The SSE4 implementation can still be seen in the performance tests or in the edit history.

这篇关于转换为0x1234到0x11223344的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆