Efficiently packing 10-bit data on unaligned byte boundaries

Question

I'm trying to do some bit packing on data whose width doesn't align to byte boundaries. Here's specifically what I'm trying to do.

I have a 512-bit array (eight 64-bit integers) of data. Inside that array is 10-bit data, with each value aligned to 2 bytes. What I need to do is strip those 512 bits down to 320 bits containing just the 10-bit data (five 64-bit integers).

I can think of the manual way to do that, where I go through each 2-byte section of the 512-bit array, mask out the 10 bits, OR them together taking byte boundaries into account, and create the output 64-bit integers. Something like this:

void pack512to320bits(uint64 (&array512bits)[8], uint64 (&array320bits)[5])
{
    static const uint64 maskFor10bits = 0x3FF;
    // Each 10-bit value sits in the low bits of a 16-bit field;
    // mask it in place, then shift it to its packed position.
    array320bits[0] = (array512bits[0] & maskFor10bits) |
                  ((array512bits[0] & (maskFor10bits << 16)) >> 6) |
                  ((array512bits[0] & (maskFor10bits << 32)) >> 12) |
                  ((array512bits[0] & (maskFor10bits << 48)) >> 18) |
                  ((array512bits[1] & maskFor10bits) << 40) |
                  ((array512bits[1] & (maskFor10bits << 16)) << 34) |
                  ((array512bits[1] & (uint64(0xF) << 32)) << 28); // low 4 bits of the 7th value
    array320bits[1] = 0;
    array320bits[2] = 0;
    array320bits[3] = 0;
    array320bits[4] = 0;
}

I know this will work, but it seems error-prone and doesn't easily expand to larger byte sequences.

Alternatively, I could go through the input array, strip out all the 10-bit values into a vector, then concatenate them at the end, again making sure I align to byte boundaries. Something like this:

void pack512to320bits(uint64 (&array512bits)[8], uint64 (&array320bits)[5])
{
    static const uint64 maskFor10bits = 0x3FF;
    // Stored as uint64 so that the shifts below are carried out in 64 bits.
    std::vector<uint64> maskedPixelBytes(8 * 4);

    for (unsigned int qword = 0; qword < 8; ++qword)
    {
        for (unsigned int pixelBytes = 0; pixelBytes < 4; ++pixelBytes)
        {
            // Shift the 16-bit field down first, then keep its low 10 bits.
            maskedPixelBytes[qword * 4 + pixelBytes] = (array512bits[qword] >> (16 * pixelBytes)) & maskFor10bits;
        }
    }
    array320bits[0] = maskedPixelBytes[0] | (maskedPixelBytes[1] << 10) | (maskedPixelBytes[2] << 20) | (maskedPixelBytes[3] << 30) |
                  (maskedPixelBytes[4] << 40) | (maskedPixelBytes[5] << 50) | (maskedPixelBytes[6] << 60);
    array320bits[1] = (maskedPixelBytes[6] >> 4) | (maskedPixelBytes[7] << 6) ...


    array320bits[2] = 0;
    array320bits[3] = 0;
    array320bits[4] = 0;
}

This way is a little easier to debug and read, but it is inefficient and again can't be expanded to larger byte sequences. I'm wondering if there is an easier, more algorithmic way to do this sort of bit packing.

Recommended answer

What you want can be done, but it depends on certain conditions and on what you consider efficient.

First, if the two arrays will always be one 512-bit and one 320-bit array, that is, if the arrays being passed will always be uint64 (&array512bits)[8] and uint64 (&array320bits)[5], then it's actually orders of magnitude more efficient to hard-code the padding.

If you wanted to take larger byte sequences into account, though, you could create an algorithm that takes the padding into account and shifts off the bits accordingly while iterating through the uint64 values of the larger bit array. Going with this method, however, introduces branches in the assembly that add computational time (e.g. if (total_shifted < bit_size), etc.). Even with optimizations on, the generated assembly would still be more complex than manually doing the shifts. The code would also need to account for the size of each array to ensure they can fit into each other appropriately, adding more compute time (or general code complexity).
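As a rough illustration of what such a general algorithm might look like, here is a minimal sketch that keeps pending bits in an accumulator and flushes it whenever 64 bits are ready. The function name pack10_generic and its pointer-plus-count interface are assumptions made for this example (they are not part of the code in this answer), and it assumes a C++11 compiler for <cstdint>.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the original answer): packs `count` 10-bit
// values, each stored in the low bits of a 16-bit field of `src`, into `dst`.
// `dst` must have room for ceil(count * 10 / 64) words.
inline void pack10_generic(const std::uint64_t* src, std::size_t count,
                           std::uint64_t* dst)
{
    assert(src != 0 && dst != 0);
    std::uint64_t acc = 0;  // bits waiting to be written out
    unsigned filled = 0;    // number of valid bits in acc (always < 64 here)
    std::size_t out = 0;    // index of the next output word

    for (std::size_t n = 0; n < count; ++n) {
        // Value n lives in input word n / 4, in 16-bit field n % 4.
        const std::uint64_t v = (src[n / 4] >> (16 * (n % 4))) & 0x3FF;
        acc |= v << filled; // bits that do not fit fall off the top
        filled += 10;
        if (filled >= 64) { // this is the kind of branch the text refers to
            dst[out++] = acc;
            filled -= 64;
            // Carry over the high bits of v that did not fit (if any).
            acc = filled ? (v >> (10 - filled)) : 0;
        }
    }
    if (filled) { dst[out] = acc; } // flush a partially filled last word
}
```

Called as pack10_generic(a512, 32, a320), this should produce the same layout as the hard-coded version below, at the cost of a branch per value.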

As an example, consider this manual shift code:

static void pack512to320_manual(uint64 (&a512)[8], uint64 (&a320)[5])
{
    a320[0] = (
        (a512[0] & 0x00000000000003FF)         | // 10 -> 10
        ((a512[0] & 0x0000000003FF0000) >> 6)  | // 10 -> 20
        ((a512[0] & 0x000003FF00000000) >> 12) | // 10 -> 30
        ((a512[0] & 0x03FF000000000000) >> 18) | // 10 -> 40
        ((a512[1] & 0x00000000000003FF) << 40) | // 10 -> 50
        ((a512[1] & 0x0000000003FF0000) << 34) | // 10 -> 60
        ((a512[1] & 0x0000000F00000000) << 28)); // 4  -> 64

    a320[1] = (
        ((a512[1] & 0x000003F000000000) >> 36) | // 6  -> 6
        ((a512[1] & 0x03FF000000000000) >> 42) | // 10 -> 16
        ((a512[2] & 0x00000000000003FF) << 16) | // 10 -> 26
        ((a512[2] & 0x0000000003FF0000) << 10) | // 10 -> 36
        ((a512[2] & 0x000003FF00000000) << 4)  | // 10 -> 46
        ((a512[2] & 0x03FF000000000000) >> 2)  | // 10 -> 56
        ((a512[3] & 0x00000000000000FF) << 56)); // 8  -> 64

    a320[2] = (
        ((a512[3] & 0x0000000000000300) >> 8)  | // 2  -> 2
        ((a512[3] & 0x0000000003FF0000) >> 14) | // 10 -> 12
        ((a512[3] & 0x000003FF00000000) >> 20) | // 10 -> 22
        ((a512[3] & 0x03FF000000000000) >> 26) | // 10 -> 32
        ((a512[4] & 0x00000000000003FF) << 32) | // 10 -> 42
        ((a512[4] & 0x0000000003FF0000) << 26) | // 10 -> 52
        ((a512[4] & 0x000003FF00000000) << 20) | // 10 -> 62
        ((a512[4] & 0x0003000000000000) << 14)); // 2  -> 64

    a320[3] = (
        ((a512[4] & 0x03FC000000000000) >> 50) | // 8  -> 8
        ((a512[5] & 0x00000000000003FF) << 8)  | // 10 -> 18
        ((a512[5] & 0x0000000003FF0000) << 2)  | // 10 -> 28
        ((a512[5] & 0x000003FF00000000) >> 4)  | // 10 -> 38
        ((a512[5] & 0x03FF000000000000) >> 10) | // 10 -> 48
        ((a512[6] & 0x00000000000003FF) << 48) | // 10 -> 58
        ((a512[6] & 0x00000000003F0000) << 42)); // 6  -> 64

    a320[4] = (
        ((a512[6] & 0x0000000003C00000) >> 22) | // 4  -> 4
        ((a512[6] & 0x000003FF00000000) >> 28) | // 10 -> 14
        ((a512[6] & 0x03FF000000000000) >> 34) | // 10 -> 24
        ((a512[7] & 0x00000000000003FF) << 24) | // 10 -> 34
        ((a512[7] & 0x0000000003FF0000) << 18) | // 10 -> 44
        ((a512[7] & 0x000003FF00000000) << 12) | // 10 -> 54
        ((a512[7] & 0x03FF000000000000) << 6));  // 10 -> 64
}

This code only accepts uint64 arrays of exactly the right sizes (8 and 5 elements) and shifts each 10-bit value into place so that the 512-bit array is packed into the 320-bit array. Doing something like uint64* a512p = a512; pack512to320_manual(a512p, a320); will fail at compile time, since a512p is not a uint64 (&)[8] (i.e. type safety). Note that this code is fully expanded to show the bit-shifting sequences, but you could use #defines or an enum to avoid "magic numbers" and potentially make the code clearer.
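To give an idea of what replacing the magic numbers might look like, here is a small sketch of the first output word written with named constants. The names VALUE_BITS, FIELD_BITS, VALUE_MASK, field_mask and pack_word0 are made up for this illustration, and the uint64 typedef is an assumption, since the listing here does not show where uint64 is defined.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t uint64; // assumption: uint64 is a 64-bit unsigned integer

enum {
    VALUE_BITS = 10, // payload bits per value
    FIELD_BITS = 16  // bits each value occupies in the unpacked input
};
static const uint64 VALUE_MASK = (uint64(1) << VALUE_BITS) - 1; // 0x3FF

// Hypothetical helper: mask selecting the k-th value (k = 0..3) of an input word.
inline uint64 field_mask(unsigned k)
{
    assert(k < 4);
    return VALUE_MASK << (FIELD_BITS * k);
}

// First output word of the manual version, using the named mask.
inline uint64 pack_word0(const uint64 (&a512)[8])
{
    return  (a512[0] & field_mask(0))               // value 0 -> bits 0..9
         | ((a512[0] & field_mask(1)) >> 6)         // value 1 -> bits 10..19
         | ((a512[0] & field_mask(2)) >> 12)        // value 2 -> bits 20..29
         | ((a512[0] & field_mask(3)) >> 18)        // value 3 -> bits 30..39
         | ((a512[1] & field_mask(0)) << 40)        // value 4 -> bits 40..49
         | ((a512[1] & field_mask(1)) << 34)        // value 5 -> bits 50..59
         | ((a512[1] & (uint64(0xF) << 32)) << 28); // low 4 bits of value 6 -> bits 60..63
}
```

The shift amounts are still literals; whether naming them as well helps readability is a judgment call.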

If you wanted to expand this to take larger byte sequences into account, you could do something like the following:

template < std::size_t X, std::size_t Y >
static void pack512to320_loop(const uint64 (&array512bits)[X], uint64 (&array320bits)[Y])
{
    const uint64* start = array512bits;
    const uint64* end = array512bits + (X-1);
    uint64 tmp = *start;
    uint64 tmask = 0;
    int i = 0, tot = 0, stot = 0, rem = 0, z = 0;
    bool excess = false;
    while (start <= end) {
        while (stot < bit_size) {
            array320bits[i] |= ((tmp & 0x00000000000003FF) << tot);
            tot += 10; // increase shift left by 10 bits
            tmp = tmp >> 16; // shift off 2 bytes
            stot += 16; // increase shifted total
            if ((excess = ((tot + 10) >= bit_size))) { break; }
        }
        if (stot == bit_size) {
            ++start; // advance to the next input word
            if (start <= end) { tmp = *start; } // do not read past the last word
            stot = 0;
        }
        if (excess) {
            rem = (bit_size - tot); // remainder bits to shift off
            tot = 0;
            // create the mask
            tmask = 0;
            for (z = 0; z < rem; ++z) { tmask |= (1 << z); }
            // get the last bits
            array320bits[i++] |= ((tmp & tmask) << (bit_size - rem));
            // shift off and adjust
            tmp = tmp >> rem;
            rem = (10 - rem);
            // new mask
            tmask = 0;
            for (z = 0; z < rem; ++z) { tmask |= (1 << z); }
            if (i < static_cast<int>(Y)) { array320bits[i] = (tmp & tmask); } // skip the write once the output is full

            tot += rem; // increase shift left by remainder bits
            tmp = tmp >> (rem + 6); // shift off 2 bytes
            stot += 16;
            excess = false;
        }
    }
}

This code also takes the byte boundaries into account and packs the values into the 320-bit array. It ORs into the output, so the output array must be zeroed first, and it relies on the bit_size constant defined in the test code further down. This code, however, does not do any error checking to ensure the sizes will properly match, so if X % 8 != 0 or Y % 5 != 0 (where X and Y > 0), you could get invalid results! Additionally, it's much slower than the manual version due to the looping, temporaries and shifting involved. It could also take a first-time reader more time to decipher the full intent and context of the loop code than that of the bit-shifting version.

If you want something in between the two, you could use the manual packing function and iterate over the larger arrays in groups of 8 input words and 5 output words to ensure the bytes align properly; something similar to the following:

template < std::size_t X, std::size_t Y >
static void pack512to320_manual_loop(const uint64 (&array512bits)[X], uint64 (&array320bits)[Y])
{
    if ((X == 0) || (X % 8 != 0) || (Y == 0) || (Y % 5 != 0) || (X / 8 != Y / 5)) {
        // handle invalid sizes how you need here
        std::cerr << "Invalid sizes!" << std::endl;
        return;
    }
    uint64* a320 = array320bits;
    const uint64* end = array512bits + (X-1);
    for (const uint64* a512 = array512bits; a512 < end; a512 += 8) {
        *a320 = (
            (a512[0] & 0x00000000000003FF)         | // 10 -> 10
            ((a512[0] & 0x0000000003FF0000) >> 6)  | // 10 -> 20
            ((a512[0] & 0x000003FF00000000) >> 12) | // 10 -> 30
            ((a512[0] & 0x03FF000000000000) >> 18) | // 10 -> 40
            ((a512[1] & 0x00000000000003FF) << 40) | // 10 -> 50
            ((a512[1] & 0x0000000003FF0000) << 34) | // 10 -> 60
            ((a512[1] & 0x0000000F00000000) << 28)); // 4  -> 64
        ++a320;

        *a320 = (
            ((a512[1] & 0x000003F000000000) >> 36) | // 6  -> 6
            ((a512[1] & 0x03FF000000000000) >> 42) | // 10 -> 16
            ((a512[2] & 0x00000000000003FF) << 16) | // 10 -> 26
            ((a512[2] & 0x0000000003FF0000) << 10) | // 10 -> 36
            ((a512[2] & 0x000003FF00000000) << 4)  | // 10 -> 46
            ((a512[2] & 0x03FF000000000000) >> 2)  | // 10 -> 56
            ((a512[3] & 0x00000000000000FF) << 56)); // 8  -> 64
        ++a320;

        *a320 = (
            ((a512[3] & 0x0000000000000300) >> 8)  | // 2  -> 2
            ((a512[3] & 0x0000000003FF0000) >> 14) | // 10 -> 12
            ((a512[3] & 0x000003FF00000000) >> 20) | // 10 -> 22
            ((a512[3] & 0x03FF000000000000) >> 26) | // 10 -> 32
            ((a512[4] & 0x00000000000003FF) << 32) | // 10 -> 42
            ((a512[4] & 0x0000000003FF0000) << 26) | // 10 -> 52
            ((a512[4] & 0x000003FF00000000) << 20) | // 10 -> 62
            ((a512[4] & 0x0003000000000000) << 14)); // 2  -> 64
        ++a320;

        *a320 = (
            ((a512[4] & 0x03FC000000000000) >> 50) | // 8  -> 8
            ((a512[5] & 0x00000000000003FF) << 8)  | // 10 -> 18
            ((a512[5] & 0x0000000003FF0000) << 2)  | // 10 -> 28
            ((a512[5] & 0x000003FF00000000) >> 4)  | // 10 -> 38
            ((a512[5] & 0x03FF000000000000) >> 10) | // 10 -> 48
            ((a512[6] & 0x00000000000003FF) << 48) | // 10 -> 58
            ((a512[6] & 0x00000000003F0000) << 42)); // 6  -> 64
        ++a320;

        *a320 = (
            ((a512[6] & 0x0000000003C00000) >> 22) | // 4  -> 4
            ((a512[6] & 0x000003FF00000000) >> 28) | // 10 -> 14
            ((a512[6] & 0x03FF000000000000) >> 34) | // 10 -> 24
            ((a512[7] & 0x00000000000003FF) << 24) | // 10 -> 34
            ((a512[7] & 0x0000000003FF0000) << 18) | // 10 -> 44
            ((a512[7] & 0x000003FF00000000) << 12) | // 10 -> 54
            ((a512[7] & 0x03FF000000000000) << 6));  // 10 -> 64
        ++a320;
    }
}

This is similar to the manual packing function and only adds a trivial amount of time for the checks, but it can handle larger arrays that pack into each other cleanly (again, expanded to show the sequence).

Timing the examples above with g++ 4.2.1 using -O3 on an i7 @ 2.2GHz yielded these average times (microseconds per call):

pack512to320_loop: 0.135 us

pack512to320_manual: 0.0017 us

pack512to320_manual_loop: 0.0020 us
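The numbers above were taken with platform-specific timers. If a C++11 compiler is available, a portable way to take a similar average might look like the sketch below; average_us is a hypothetical helper name, not something from the test code, and the optimizer may remove or hoist work whose result is never used, so any figure it reports should be read with care.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` `iterations` times and returns the
// average wall-clock time per call in microseconds.
template <typename F>
double average_us(F&& work, int iterations)
{
    assert(iterations > 0);
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) { work(); }
    const auto t1 = std::chrono::steady_clock::now();
    // Convert the elapsed time to a floating-point microsecond count.
    const std::chrono::duration<double, std::micro> total = t1 - t0;
    return total.count() / iterations;
}
```

It could be used as average_us([&] { pack512to320_manual(a512, a320); }, 1000000), mirroring the loop in the test code.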

And here is the test code used to test the input/output and general timing:

#include <iostream>
#include <cstddef>
#include <ctime>
#include <stdint.h>
#if defined(_MSC_VER)
    #include <windows.h>
    #define timesruct LARGE_INTEGER
    #define dotick(v) QueryPerformanceCounter(&v)
    timesruct freq;
#else
    #define timesruct struct timespec
    #define dotick(v) clock_gettime(CLOCK_MONOTONIC, &v)
#endif

// The listing does not show where uint64 comes from;
// a 64-bit unsigned typedef is assumed here.
typedef uint64_t uint64;

static const std::size_t bit_size = sizeof(uint64) * 8;

template < std::size_t X, std::size_t Y >
static void pack512to320_loop(const uint64 (&array512bits)[X], uint64 (&array320bits)[Y])
{
    const uint64* start = array512bits;
    const uint64* end = array512bits + (X-1);
    uint64 tmp = *start;
    uint64 tmask = 0;
    int i = 0, tot = 0, stot = 0, rem = 0, z = 0;
    bool excess = false;
    // this line is only here for validities sake,
    // it was commented out during testing for performance
    for (z = 0; z < Y; ++z) { array320bits[z] = 0; }
    while (start <= end) {
        while (stot < bit_size) {
            array320bits[i] |= ((tmp & 0x00000000000003FF) << tot);
            tot += 10; // increase shift left by 10 bits
            tmp = tmp >> 16; // shift off 2 bytes
            stot += 16; // increase shifted total
            if ((excess = ((tot + 10) >= bit_size))) { break; }
        }
        if (stot == bit_size) {
            ++start; // advance to the next input word
            if (start <= end) { tmp = *start; } // do not read past the last word
            stot = 0;
        }
        if (excess) {
            rem = (bit_size - tot); // remainder bits to shift off
            tot = 0;
            // create the mask
            tmask = 0;
            for (z = 0; z < rem; ++z) { tmask |= (1 << z); }
            // get the last bits
            array320bits[i++] |= ((tmp & tmask) << (bit_size - rem));
            // shift off and adjust
            tmp = tmp >> rem;
            rem = (10 - rem);
            // new mask
            tmask = 0;
            for (z = 0; z < rem; ++z) { tmask |= (1 << z); }
            if (i < static_cast<int>(Y)) { array320bits[i] = (tmp & tmask); } // skip the write once the output is full

            tot += rem; // increase shift left by remainder bits
            tmp = tmp >> (rem + 6); // shift off 2 bytes
            stot += 16;
            excess = false;
        }
    }
}

template < std::size_t X, std::size_t Y >
static void pack512to320_manual_loop(const uint64 (&array512bits)[X], uint64 (&array320bits)[Y])
{
    if ((X == 0) || (X % 8 != 0) || (Y == 0) || (Y % 5 != 0) || (X / 8 != Y / 5)) {
        // handle invalid sizes how you need here
        std::cerr << "Invalid sizes!" << std::endl;
        return;
    }
    uint64* a320 = array320bits;
    const uint64* end = array512bits + (X-1);
    for (const uint64* a512 = array512bits; a512 < end; a512 += 8) {
        *a320 = (
            (a512[0] & 0x00000000000003FF)         | // 10 -> 10
            ((a512[0] & 0x0000000003FF0000) >> 6)  | // 10 -> 20
            ((a512[0] & 0x000003FF00000000) >> 12) | // 10 -> 30
            ((a512[0] & 0x03FF000000000000) >> 18) | // 10 -> 40
            ((a512[1] & 0x00000000000003FF) << 40) | // 10 -> 50
            ((a512[1] & 0x0000000003FF0000) << 34) | // 10 -> 60
            ((a512[1] & 0x0000000F00000000) << 28)); // 4  -> 64
        ++a320;

        *a320 = (
            ((a512[1] & 0x000003F000000000) >> 36) | // 6  -> 6
            ((a512[1] & 0x03FF000000000000) >> 42) | // 10 -> 16
            ((a512[2] & 0x00000000000003FF) << 16) | // 10 -> 26
            ((a512[2] & 0x0000000003FF0000) << 10) | // 10 -> 36
            ((a512[2] & 0x000003FF00000000) << 4)  | // 10 -> 46
            ((a512[2] & 0x03FF000000000000) >> 2)  | // 10 -> 56
            ((a512[3] & 0x00000000000000FF) << 56)); // 8  -> 64
        ++a320;

        *a320 = (
            ((a512[3] & 0x0000000000000300) >> 8)  | // 2  -> 2
            ((a512[3] & 0x0000000003FF0000) >> 14) | // 10 -> 12
            ((a512[3] & 0x000003FF00000000) >> 20) | // 10 -> 22
            ((a512[3] & 0x03FF000000000000) >> 26) | // 10 -> 32
            ((a512[4] & 0x00000000000003FF) << 32) | // 10 -> 42
            ((a512[4] & 0x0000000003FF0000) << 26) | // 10 -> 52
            ((a512[4] & 0x000003FF00000000) << 20) | // 10 -> 62
            ((a512[4] & 0x0003000000000000) << 14)); // 2  -> 64
        ++a320;

        *a320 = (
            ((a512[4] & 0x03FC000000000000) >> 50) | // 8  -> 8
            ((a512[5] & 0x00000000000003FF) << 8)  | // 10 -> 18
            ((a512[5] & 0x0000000003FF0000) << 2)  | // 10 -> 28
            ((a512[5] & 0x000003FF00000000) >> 4)  | // 10 -> 38
            ((a512[5] & 0x03FF000000000000) >> 10) | // 10 -> 48
            ((a512[6] & 0x00000000000003FF) << 48) | // 10 -> 58
            ((a512[6] & 0x00000000003F0000) << 42)); // 6  -> 64
        ++a320;

        *a320 = (
            ((a512[6] & 0x0000000003C00000) >> 22) | // 4  -> 4
            ((a512[6] & 0x000003FF00000000) >> 28) | // 10 -> 14
            ((a512[6] & 0x03FF000000000000) >> 34) | // 10 -> 24
            ((a512[7] & 0x00000000000003FF) << 24) | // 10 -> 34
            ((a512[7] & 0x0000000003FF0000) << 18) | // 10 -> 44
            ((a512[7] & 0x000003FF00000000) << 12) | // 10 -> 54
            ((a512[7] & 0x03FF000000000000) << 6));  // 10 -> 64
        ++a320;
    }
}

static void pack512to320_manual(uint64 (&a512)[8], uint64 (&a320)[5])
{
    a320[0] = (
        (a512[0] & 0x00000000000003FF)         | // 10 -> 10
        ((a512[0] & 0x0000000003FF0000) >> 6)  | // 10 -> 20
        ((a512[0] & 0x000003FF00000000) >> 12) | // 10 -> 30
        ((a512[0] & 0x03FF000000000000) >> 18) | // 10 -> 40
        ((a512[1] & 0x00000000000003FF) << 40) | // 10 -> 50
        ((a512[1] & 0x0000000003FF0000) << 34) | // 10 -> 60
        ((a512[1] & 0x0000000F00000000) << 28)); // 4  -> 64

    a320[1] = (
        ((a512[1] & 0x000003F000000000) >> 36) | // 6  -> 6
        ((a512[1] & 0x03FF000000000000) >> 42) | // 10 -> 16
        ((a512[2] & 0x00000000000003FF) << 16) | // 10 -> 26
        ((a512[2] & 0x0000000003FF0000) << 10) | // 10 -> 36
        ((a512[2] & 0x000003FF00000000) << 4)  | // 10 -> 46
        ((a512[2] & 0x03FF000000000000) >> 2)  | // 10 -> 56
        ((a512[3] & 0x00000000000000FF) << 56)); // 8  -> 64

    a320[2] = (
        ((a512[3] & 0x0000000000000300) >> 8)  | // 2  -> 2
        ((a512[3] & 0x0000000003FF0000) >> 14) | // 10 -> 12
        ((a512[3] & 0x000003FF00000000) >> 20) | // 10 -> 22
        ((a512[3] & 0x03FF000000000000) >> 26) | // 10 -> 32
        ((a512[4] & 0x00000000000003FF) << 32) | // 10 -> 42
        ((a512[4] & 0x0000000003FF0000) << 26) | // 10 -> 52
        ((a512[4] & 0x000003FF00000000) << 20) | // 10 -> 62
        ((a512[4] & 0x0003000000000000) << 14)); // 2  -> 64

    a320[3] = (
        ((a512[4] & 0x03FC000000000000) >> 50) | // 8  -> 8
        ((a512[5] & 0x00000000000003FF) << 8)  | // 10 -> 18
        ((a512[5] & 0x0000000003FF0000) << 2)  | // 10 -> 28
        ((a512[5] & 0x000003FF00000000) >> 4)  | // 10 -> 38
        ((a512[5] & 0x03FF000000000000) >> 10) | // 10 -> 48
        ((a512[6] & 0x00000000000003FF) << 48) | // 10 -> 58
        ((a512[6] & 0x00000000003F0000) << 42)); // 6  -> 64

    a320[4] = (
        ((a512[6] & 0x0000000003C00000) >> 22) | // 4  -> 4
        ((a512[6] & 0x000003FF00000000) >> 28) | // 10 -> 14
        ((a512[6] & 0x03FF000000000000) >> 34) | // 10 -> 24
        ((a512[7] & 0x00000000000003FF) << 24) | // 10 -> 34
        ((a512[7] & 0x0000000003FF0000) << 18) | // 10 -> 44
        ((a512[7] & 0x000003FF00000000) << 12) | // 10 -> 54
        ((a512[7] & 0x03FF000000000000) << 6));  // 10 -> 64
}

template < std::size_t N >
static void printit(uint64 (&arr)[N])
{
    for (std::size_t i = 0; i < N; ++i) {
        std::cout << "arr[" << i << "] = " << arr[i] << std::endl;
    }
}

static double elapsed_us(timesruct init, timesruct end)
{
    #if defined(_MSC_VER)
        if (freq.LowPart == 0) { QueryPerformanceFrequency(&freq); }
        return (static_cast<double>(((end.QuadPart - init.QuadPart) * 1000000)) / static_cast<double>(freq.QuadPart));
    #else
        return ((end.tv_sec - init.tv_sec) * 1000000) + (static_cast<double>((end.tv_nsec - init.tv_nsec)) / 1000);
    #endif
}

int main(int argc, char* argv[])
{
    uint64 val = 0x039F039F039F039F;
    uint64 a512[] = { val, val, val, val, val, val, val, val };
    uint64 a320[] = { 0, 0, 0, 0, 0 };
    int max_cnt = 1000000;
    timesruct init, end;
    std::cout << std::hex;

    dotick(init);
    for (int i = 0; i < max_cnt; ++i) {
        pack512to320_loop(a512, a320);
    }
    dotick(end);
    printit(a320);
    // rough estimate of timing / divide by iterations
    std::cout << "avg. us = " << (elapsed_us(init, end) / max_cnt) << " us" << std::endl;

    dotick(init);
    for (int i = 0; i < max_cnt; ++i) {
        pack512to320_manual(a512, a320);
    }
    dotick(end);
    printit(a320);
    // rough estimate of timing / divide by iterations
    std::cout << "avg. us = " << (elapsed_us(init, end) / max_cnt) << " us" << std::endl;

    dotick(init);
    for (int i = 0; i < max_cnt; ++i) {
        pack512to320_manual_loop(a512, a320);
    }
    dotick(end);
    printit(a320);
    // rough estimate of timing / divide by iterations
    std::cout << "avg. us = " << (elapsed_us(init, end) / max_cnt) << " us" << std::endl;

    return 0;
}

Again, this is just generic test code and your results might vary.
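The test code only prints the packed words, so correctness has to be checked by eye. One possible way to check a packer automatically is to read each 10-bit value back out of the packed array and compare it with the input. The helpers packed_value and packing_matches below are hypothetical names introduced for this sketch, which assumes a C++11 compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns the n-th 10-bit value from a packed array
// of `words` 64-bit words.
inline unsigned packed_value(const std::uint64_t* packed, std::size_t words,
                             std::size_t n)
{
    const std::size_t bit = n * 10;    // absolute bit position of value n
    const std::size_t word = bit / 64; // word holding its low bits
    const unsigned off = static_cast<unsigned>(bit % 64);
    assert(word < words);
    std::uint64_t v = packed[word] >> off;
    if (off > 54) {                    // value straddles two words
        assert(word + 1 < words);
        v |= packed[word + 1] << (64 - off);
    }
    return static_cast<unsigned>(v & 0x3FF);
}

// Hypothetical helper: true if every one of `count` values in the unpacked
// input (10 bits in each 16-bit field) matches the packed output.
inline bool packing_matches(const std::uint64_t* src, std::size_t count,
                            const std::uint64_t* packed, std::size_t words)
{
    for (std::size_t n = 0; n < count; ++n) {
        const unsigned expected =
            static_cast<unsigned>((src[n / 4] >> (16 * (n % 4))) & 0x3FF);
        if (packed_value(packed, words, n) != expected) { return false; }
    }
    return true;
}
```

After a call such as pack512to320_manual(a512, a320), packing_matches(a512, 32, a320, 5) would be expected to return true.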

Hope that helps.
