How can I resolve data dependency in pointer arrays?


Problem description

If we have an array of integer pointers which all point to the same int, and we loop over it doing a ++ operation on the pointed-to value, it is about 100% slower than when the pointers point to two different ints. Here is a concrete example:

int* data[2];
int a, b;
a = b = 0;
for (auto i = 0ul; i < 2; ++i) {
    // Case 3: 2.5 sec
    data[i] = &a;

    // Case 2: 1.25 sec
    // if (i & 1)
    //     data[i] = &a;
    // else
    //     data[i] = &b;
}

for (auto i = 0ul; i < 1000000000; ++i) {
    // Case 1: 0.5sec
    // asm volatile("" : "+g"(i)); // deoptimize
    // ++*data[0];

    ++*data[i & 1];
}

In summary, the observations are (the comments above describe each loop body):

case 1 (fast): ++*pointer[0]

case 2 (medium): ++*pointer[i], with half the pointers pointing to one int and the other half pointing to another

case 3 (slow): ++*pointer[i], with all the pointers pointing to the same int

Here are my current thoughts. Case 1 is fast because a modern CPU knows we are reading/writing the same memory location and thus buffers the operation, while in Case 2 and Case 3 we need to write the result out on each iteration. The reason Case 3 is slower than Case 2 is that when we write to a memory location through pointer a and then try to read it through pointer b, we have to wait for the write to finish, which stalls superscalar execution.

Do I understand this correctly? Is there any way to make Case 3 faster without changing the pointer array? (Perhaps by adding some CPU hints?)

The question is extracted from a real problem: https://github.com/ClickHouse/ClickHouse/pull/7550

Answer

You've discovered one of the effects that causes bottlenecks in histograms. A workaround for that problem is to keep multiple arrays of counters and rotate through them, so repeated runs of the same index are distributed over 2 or 4 different counters in memory.

(Then loop over the arrays of counters to sum them down into one final set of counts. This part can benefit from SIMD.)


Case 1 is fast because a modern CPU knows we are reading/writing the same memory location, thus buffering the operation

No, it's not the CPU; it's a compile-time optimization.

++*pointer[0] is fast because the compiler can hoist the store/reload out of the loop and just increment a register. (If you don't use the result, it might even optimize away the increment entirely.)

The assumption of no data-race UB lets the compiler assume that nothing else is modifying pointer[0], so it's definitely the same object being incremented every time. And the as-if rule lets it keep *pointer[0] in a register instead of actually doing a memory-destination increment.

That means 1 cycle of latency for the increment, and of course the compiler can combine multiple increments into one and do *pointer[0] += n if it fully unrolls and optimizes away the loop.


when we write to a memory location through pointer a and then try to read it through pointer b, we have to wait for the write to finish. This stops superscalar execution.

Yes, the data dependency through that memory location is the problem. Without knowing at compile time that the pointers all point to the same place, the compiler will emit asm that actually increments the pointed-to memory location.

"Wait for the write to finish" isn't strictly accurate, though. The CPU has a store buffer that decouples store execution from cache misses, and decouples out-of-order speculative execution from stores actually committing to L1d and becoming visible to other cores. A reload of recently stored data doesn't have to wait for it to commit to cache: store forwarding from the store buffer to the reload kicks in once the CPU detects the overlap.

On modern Intel CPUs, store-forwarding latency is about 5 cycles, so a memory-destination add has about 6 cycles of latency (1 for the add, plus 5 for the store/reload on the critical path).

And yes, out-of-order execution lets two of these 6-cycle-latency dependency chains run in parallel. The loop overhead is hidden under that latency, again by OoO exec.

Related:

  • Store-to-Load Forwarding and Memory Disambiguation in x86 Processors on stuffedcow.net
  • Store forwarding Address vs Data: What the difference between STD and STA in the Intel Optimization guide?
  • How does store to load forwarding happens in case of unaligned memory access?
  • Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?
  • Why is execution time of a process shorter when another process shares the same HT core (On Sandybridge-family, store-forwarding latency can be reduced if you don't try to reload right away.)

Is there any way to make Case 3 faster without changing the pointer array?

Yes, if that case is expected, maybe branch on it:

    int *current_pointer = pointer[0];
    int repeats = 1;
    ...

    for (size_t i = 1; i < n; ++i) {      // n = number of entries in pointer[]
        if (pointer[i] == current_pointer) {
            repeats++;
        } else {
            *current_pointer += repeats;  // flush the finished run
            current_pointer = pointer[i];
            repeats = 1;
        }
    }
    *current_pointer += repeats;          // flush the final run

We optimize runs of the same pointer repeated back-to-back by counting the run length instead of incrementing through memory each time.

This is totally defeated by Case 2, and it will perform poorly if long runs are not common.

Short runs can be hidden by out-of-order exec; the pipeline only truly stalls when the dependency chain becomes long enough to fill the ROB (reorder buffer).

