快速实现大型整数计数器(在C / C ++中) [英] Fast implementation of a large integer counter (in C/C++)

查看:96
本文介绍了快速实现大型整数计数器(在C / C ++中)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的目标如下:

生成连续值,这样就不会生成每个新值,直到生成所有可能的值为止。此时,计数器再次开始相同的序列。这里的要点是,所有可能的值是无重复地生成的()(直到周期耗尽)。顺序是简单的0、1、2、3,...还是其他顺序都没关系。

Generate successive values, such that each new one was never generated before, until all possible values are generated. At this point, the counter start the same sequence again. The main point here is that, all possible values are generated without repetition (until the period is exhausted). It does not matter if the sequence is simple 0, 1, 2, 3,..., or in other order.

例如,是否可以表示范围只需通过 unsigned ,然后

For example, if the range can be represented simply by an unsigned, then

void increment (unsigned &n) {++n;}

就足够了。但是,整数范围大于64位。例如,在一个地方,我需要生成256位序列。一个简单的实现如下所示,只是为了说明我要做什么,

is enough. However, the integer range is larger than 64-bits. For example, in one place, I need to generated 256-bits sequence. A simple implementation is like the following, just to illustrate what I am trying to do,

typedef std::array<uint64_t, 4> ctr_type;
static constexpr uint64_t max = ~((uint64_t) 0);
void increment (ctr_type &ctr)
{
    if (ctr[0] < max) {++ctr[0]; return;}
    if (ctr[1] < max) {++ctr[1]; return;}
    if (ctr[2] < max) {++ctr[2]; return;}
    if (ctr[3] < max) {++ctr[3]; return;}
    ctr[0] = ctr[1] = ctr[2] = ctr[3] = 0;
}

因此,如果 ctr 从全零开始,然后首先将 ctr [0] 递增一个,直到达到 max ,然后 ctr [1] ,依此类推。如果设置了所有256位,则我们将其重置为全零,然后重新开始。

So if ctr start with all zeros, then first ctr[0] is increased one by one until it reach max, and then ctr[1], and so on. If all 256-bits are set, then we reset it to all zero, and start again.

问题是,这种实现的速度出奇地慢。我当前的改进版本等效于以下内容,

The problem is that, such implementation is surprisingly slow. My current improved version is sort of equivalent to the following,

void increment (ctr_type &ctr)
{
    std::size_t k = (!(~ctr[0])) + (!(~ctr[1])) + (!(~ctr[2])) + (!(~ctr[3]))
    if (k < 4)
        ++ctr[k];
    else
        memset(ctr.data(), 0, 32);

}

如果仅使用上述<$ c操作计数器$ c>递增函数,并且始终从零开始,然后 ctr [k] == 0 如果 ctr [k -1] == 0 。因此,值 k 将是小于最大值的第一个元素的索引。

If the counter is only manipulated with the above increment function, and always start with zero, then ctr[k] == 0 if ctr[k - 1] == 0. And thus the value k will be the index of the first element that is less than the maximum.

我期望第一个要更快,因为分支错误预测将在每2 ^ 64次迭代中仅发生一次。第二个错误预测虽然仅每2 ^ 256次迭代发生一次,但不会有任何不同。除分支外,还需要四个按位取反,四个布尔取反和三个加法运算。

I expected the first to be faster, since branch mis-prediction shall happen only once in every 2^64 iterations. The second, though mis-predication only happen every 2^256 iterations, it shall not make a difference. And apart from the branching, it needs four bitwise negation, four boolean negation, and three addition. Which might cost much more than the first.

但是, clang gcc 或intel icpc 生成的二进制文件第二个要快得多。

However, both clang, gcc, or intel icpc generate binaries that the second was much faster.

我的主要问题有人知道是否有更快的方法来实现这种计数器吗?只要该算法生成所有2 ^ 256个256位组合,该计数器是否以增加第一个整数开始还是将其实现为整数数组都没有关系。

My main question is that does anyone know if there any faster way to implement such a counter? It does not matter if the counter start by increasing the first integers or if it is implemented as an array of integers at all, as long as the algorithm generate all 2^256 combinations of 256-bits.

是什么使事情变得更复杂,我还需要非均匀增量。例如,每次计数器增加 K ,其中 K> 1 ,但几乎始终保持不变。我当前的实现与上述类似。

What makes things more complicated, I also need non uniform increment. For example, each time the counter is incremented by K where K > 1, but almost always remain a constant. My current implementation is similar to the above.

为了提供更多上下文,我使用计数器的一个地方是将它们用作AES-NI aesenc指令的输入。如此不同的128位整数(加载到 __ m128i 中),经过10次(或12或14个,取决于密钥大小)轮回指令后,一个独特的<$生成c $ c> 128位整数。如果我一次生成一个 __ m128i 整数,那么增量的成本几乎没有关系。但是,由于aesenc有很多延迟,因此我按块生成整数。例如,我可能有4个块, ctr_type块[4] ,其初始化等效于以下内容,

To provide some more context, one place I am using the counters is using them as input to AES-NI aesenc instructions. So distinct 128-bits integer (loaded into __m128i), after going through 10 (or 12 or 14, depending on the key size) rounds of the instructions, a distinct 128-bits integer is generated. If I generate one __m128i integer at once, then the cost of increment matters little. However, since aesenc has quite a bit latency, I generate integers by blocks. For example, I might have 4 blocks, ctr_type block[4], initialized equivalent to the following,

block[0]; // initialized to zero
block[1] = block[0]; increment(block[1]);
block[2] = block[1]; increment(block[2]);
block[3] = block[2]; increment(block[3]);

每当我需要新的输出时,我增量每个 block [i] 减4,并一次生成4个 __ m128i 输出。通过使用指令交织,总体上,当使用2个64位整数作为计数器和8个块时,我能够提高吞吐量,并将每字节输出(cpB)的周期从6减少到0.9。但是,如果改为使用4个32位整数作为计数器,则以字节/秒为单位的吞吐量将降低为一半。我知道一个事实,在x86-64上,在某些情况下64位整数可能比32位更快。但是我没想到如此简单的增量操作会带来如此大的变化。我已经仔细地对应用程序进行了基准测试,而 increment 的确是降低程序速度的一个原因。由于加载到 __ m128i 并将 __ m128i 输出存储到可用的32位或64位整数中是通过对齐来完成的指针,则32位和64位版本之间的唯一区别是计数器的递增方式。我希望在将整数加载到 __ m128i 之后,期望的AES-NI将主导性能。但是,当使用4或8个块时,显然不是这种情况。

And each time I need new output, I increment each block[i] by 4, and generate 4 __m128i output at once. By interleaving instructions, overall I was able to increase the throughput, and reduce the cycles per bytes of output (cpB) from 6 to 0.9 when using 2 64-bits integers as the counter and 8 blocks. However, if instead, use 4 32-bits integers as counter, the throughput, measured as bytes per sec is reduced to half. I know for a fact that on x86-64, 64-bits integers could be faster than 32-bits in some situations. But I did not expect such simple increment operation makes such a big difference. I have carefully benchmarked the application, and the increment is indeed the one slow down the program. Since the loading into __m128i and store the __m128i output into usable 32-bits or 64-bits integers are done through aligned pointers, the only difference between the 32-bits and 64-bits version is how the counter is incremented. I expected that the AES-NI expected, after loading the integers into __m128i, shall dominate the performance. But when using 4 or 8 blocks, it was clearly not the case.

因此,总而言之,我的主要问题是,如果有人知道一种改进上述计数器的方法

So to summary, my main question is that, if anyone know a way to improve the above counter implementation.

推荐答案

这不仅很慢,而且是不可能的。宇宙的总能量不足以进行2 ^ 256位的更改。

It's not only slow, but impossible. The total energy of universe is insufficient for 2^256 bit changes. And that would require gray counter.

优化之前的下一步是修复原始实现

Next thing before optimization is to fix the original implementation

void increment (ctr_type &ctr)
{
    if (++ctr[0] != 0) return;
    if (++ctr[1] != 0) return;
    if (++ctr[2] != 0) return;
    ++ctr[3];
}

如果每个 ctr [i] 不允许溢出到零,周期仅为4 *(2 ^ 32),如 0-9 19 ,29,39,49,... 99 199,299,... 1999,2999,3999, ...,9999

If each ctr[i] was not allowed to overflow to zero, the period would be just 4*(2^32), as in 0-9, 19,29,39,49,...99, 199,299,... and 1999,2999,3999,..., 9999.

作为对评论的答复-第一次溢出需要2 ^ 64次迭代。由于慷慨,在一秒钟内最多可以进行2 ^ 32次迭代,这意味着该程序应运行2 ^ 32秒才能执行第一次。这大约是136年。

As a reply to the comment -- it takes 2^64 iterations to have the first overflow. Being generous, upto 2^32 iterations could take place in a second, meaning that the program should run 2^32 seconds to have the first carry out. That's about 136 years.

EDIT

如果原始实现带有2 ^ 66个州确实是我们想要的,然后建议将界面和功能更改为:

If the original implementation with 2^66 states is really what is wanted, then I'd suggest to change the interface and the functionality to something like:

  (*counter) += 1;
  while (*counter == 0)
  {
     counter++;  // Move to next word
     if (counter > tail_of_array) {
        counter = head_of_array;
        memset(counter,0, 16);
        break;
     }
  }

重点是,溢出仍然很少发生。几乎总是只有一个单词要增加。

The point being, that the overflow is still very infrequent. Almost always there's just one word to be incremented.

这篇关于快速实现大型整数计数器(在C / C ++中)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆