关于无分支二进制搜索 [英] About the branchless binary search

查看:100
本文介绍了关于无分支二进制搜索的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我问了一个有关减少未命中预测的问题.

I asked a question about reducing the miss prediction.

杰里·科芬(Jerry Coffin)给了我一个令人印象深刻的答案.

Jerry Coffin give me an impressive answer.

关于减少分支未命中的前提

二进制搜索是无分支的,但是当我在设置交集算法中使用它时,我发现它比原始二进制搜索要慢得多.是什么原因?

The binary search is branchless, but when I use it in my set intersection algorithm, I found it much slower than the original binary search. What's the reason?

更新:

我使用以下事件来测试i7处理器的分支未命中预测数:BR_MISS_PRED_RETIRED.我发现无分支版本的丢失率是原始分支的一半.

I use the following event to test number of branch miss prediction of i7 processor: BR_MISS_PRED_RETIRED. I found the branchless version is about half of the branch miss than the original one.

对于高速缓存未命中:我使用LLC_MISSES测试最后一级高速缓存未命中的数量,也是一半.

For cache miss: I use LLC_MISSES to test the number of last level cache misses, also half.

但是时间大约是原始时间的2.​​5倍.

But the time is about 2.5 times than the original one.

推荐答案

相对于分支错误预测,当数组较大且内存访问时间较大时,将发生条件移动(无分支)搜索问题.

The problem with the conditional move (branchless) search occurs when then arrays are large and the memory access time is large relative to a branch misprediction.

条件移动搜索类似于:

int needle; // value we are searching for
int *base = ...; // base pointer
int n; // number of elements in the current region
while (n > 1) {
  int middle = n/2;
  base += (needle < *base[middle]) ? 0 : middle;
  n -= middle;
}

请注意,我们有条件地更新了base而不使用分支(至少假设编译器没有决定将三进制运算符实现为分支).问题在于,每次迭代中base的值与上次迭代中的比较结果与数据有关,因此一次访问内存一次,并通过数据依赖性.

Note that we conditionally update base without using a branch (at least assuming the compiler doesn't decide to implement the ternary operator as a branch). The problem is that the value of base in each iteration is data-dependent on the result of the comparison in the previous iteration, and so accesses to memory occur one at a time, serialized via a the data dependency.

对于大型数组的搜索,这消除了内存级并行性的可能性,并且您的搜索需要使用log2(N) * average_access_time之类的东西.基于分支的搜索没有这种数据相关性:它仅在迭代之间具有推测的控制相关性:CPU会选择一个方向并顺其自然.如果猜测正确,则将同时加载当前迭代和下一个迭代的结果!事情还没有结束:猜测仍在继续,您可能一次要乘载十几架飞机.

For a search over large array, this removes the possibility of memory-level parallelism and your search takes something like log2(N) * average_access_time. A branch-based search has no such data dependency: it has only a speculated control dependency between iterations: the CPU picks a direction and goes with it. If it guesses right, you'll be loading the result from the current iteration and the next at the same time! It doesn't end there: the speculation continues and you might have have a dozen loads in flight at once.

当然,CPU并不总是会猜对!在最坏的情况下,如果分支完全不可预测(您的数据和指针值没有偏差),那么一半的时间将是错误的.尽管如此,这意味着平均而言,它将在飞行中维持0.5 + 0.25 + 0.125 + ... = ~1额外的访问权限,超过当前的访问权限.这不仅是理论上的:尝试对随机数据进行二进制搜索,由于并行度提高了一倍,您可能会看到无分支搜索的基于分支的速度提高了2倍.

Of course, the CPU doesn't always guess right! In the worst case, if the branches are totally unpredictable (your data and needle value don't have kind of bias), it will be wrong half the time. Still, that means that on average it will sustain 0.5 + 0.25 + 0.125 + ... = ~1 additional accesses in flight beyond the current one. This isn't just theoretical: try a binary search over random data and you'll probably see the 2x speedup for branch-based over the branchless search, due to double the parallelism.

对于许多数据集,分支方向并不是完全随机的,因此您可以看到2倍以上的加速比.

For many data sets the branch direction isn't entirely random, so you can see more than 2x speedup, as in your case.

对于适合缓存的小型阵列,情况则相反.无分支搜索仍然存在相同的串行依赖"问题,但是负载等待时间很小:很少的周期.另一方面,基于分支的搜索经常会出现错误的预测,其代价约为20个周期,因此在这种情况下,无分支的搜索通常会更快.

The situation is reversed for small arrays that fit in cache. The branchless search still has this same "serial dependency" problem, but the load latency is small: a handful of cycles. The branch-based search, on the other hand suffers constant mispredictions, which cost on the order of ~20 cycles, so branchless usually ends up faster in this case.

这篇关于关于无分支二进制搜索的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆