AVX-512和分支 [英] AVX-512 and Branching

查看:268
本文介绍了AVX-512和分支的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

对于在分支机构中理论上的屏蔽能力,我感到困惑。假设我有一个Skylake-SP(哈,我希望......),并且我们忽略了编译器的功能,理论上可能是这样:

如果一个分支条件依赖于静态标志,并且所有分支都将数组设置为计算结果,假设编译器不会将其优化为两个单独的循环,它是否可以进行矢量化?如果(my_flag .eq。0)然后
a(i),则i = 1,nx


  = b(i)** 2 
else
a(i)= b(i)** 3
end if
end do



如果只是作为分支的子集来设置有问题的值,它可以进行矢量化吗?

  do i = 1,nx 
if(my_flag .eq。0)then
a(i)= b(i)** 2
end如果
end do

如果分支条件本身依赖于矢量数据,是否可以如果(c(i)> 0)则

  do i = 1,nx 
ba(i)= b(i)** 2
else
a(i)= b(i)** 3
end if
end do


解决方案

- 涉及矢量化时的访问问题,它主要应用于概念级别,用于将一系列标量访问转换为数组,并将其转换为矢量化访问假定有关底层数组的哪些部分被映射。在像Fortran这样的语言中,语言本身的语义本身可以保证数组是连续映射的,或者在进入循环之前进行边界检查可能足以避免下面提到的问题。



这个答案不应该被看作是一般的矢量化的一种很好的处理方式,当然不能在Fortran中专门做。对另一个答案的更全面的矢量化问题的处理,也是专门针对AVX-512。






一个经常被忽视的矢量化条件问题是编译器可以通过混合或其他基于元素的预测来矢量化你感兴趣的类型的条件循环只有在 时,他们才可以证明矢量化访问与标量逐元素实现中访问的相同元素。如果指令集没有提供按照元素的方式来进行与此条件相关的矢量加载,或者编译器无法使用它们,则可以有效地阻止矢量化。



<换句话说,如果通过循环体的所有路径访问相同的元素,编译器通常只能用简单的向量加载完全向量化。



其根本原因在于编译后的代码不能访问原始代码语义不能访问的元素,即使它们稍后被混合了,因为这样做可能会导致错误!如果指令集没有提供有条件地访问内存中的元素并抑制未选定元素的错误的指令,这是优化的一个重要障碍。在示例中(2)不能,因为(2)访问 a [i] ,这意味着(1)和(3)和 b [i] 仅在 if 主体中,但如果 if 不被执行。当然,真正的编译器只会在循环中提取一个简单的标志检查,并且根本不会在 myflag == false 中执行循环,所以它不是真的一个很好的例子。



我们来看几个包含所有示例的案例。首先,我们需要一个无法悬挂的标志 - 让我们使用一个 bool 值的数组。因此,一个有趣的输出数组 a ,两个输入数组 c 和一个标志数组 f 可能类似于:

 <$如果(f(i)> 0),则
a(i)= g(b(i),c(i));
else
a(i)= h(b(i),c(i));
end if
end do

根据标志 f(i)对应于每个元素,我们应用函数 g h 到输入元素 b(i) c(i)。根据我上面的条件,只有当 g h 实际访问相同时,我们才可以向量化元素 b c



让我们继续前进以上两个实际工作示例:

  void example1(bool * f,int * __restrict__ a,int * __restrict__ b, (f [i]){
a [i] = b [int] * b_f(int * __restrict__ c,size_t n){
for(size_t i = 0; i } else {
a [i] = c [i];
}
}
}

void example2(bool * f,int * __restrict__ a,int * __restrict__ b,int * __restrict__ c,size_t n){$如果(f [i]){
a [i] = b [i] + c [i],则b $ b表示(size_t i = 0; i
} else {
a [i] = b [i] - c [i] * 2 + 1;
}
}
}

两者具有相同的基本形式,但是难以矢量化?第一种是根据标志对 b [i] c [i] 进行简单的直接分配。第二个是 b [i] c [i] 这两个路径显着不同。



那么第二个更容易矢量化,因为它访问 b [i] c [i] 无条件。事实上,由于某些原因, gcc 不管理任何一个。 clang 仅将第二个向量化。有点令人惊讶的是 icc 管理向量化两者 - 因为它足够聪明可以使用 vpmaskmovd which是一个掩盖的负载,可以抑制卸载元素的错误。



您可以检查在godbolt上生成的程序集

我最初开始这个答案的想法是访问不同的数组元素目前是一个无法克服的矢量化障碍目前的编译器,但那是因为我通常不检查 icc 。这对我来说实际上是新闻, icc 以这种方式使用蒙版移动。所以这个障碍在那里,但至少有些编译器可能会对它产生错误。



作为开发者,你通常知道这两个数组都是完全可访问的,因此可以安全地访问范围内 b c 的所有元素> [0,n),将其传递给编译器将会很好。我尝试添加无条件的伪语句,如 b [i] = b [i]; c [i] = c [i]; ... + c [i] * 0 编译器在语义上查看所有元素都被访问。确实编译,但代码生成没有改进:不会发生额外的矢量化。可能在矢量化分析完成之前,它们在编译过程的早期就已经被清除了,这样信息就会丢失给矢量化器。除了蒙面移动指令之外,是不是自由的,并不完全一般,有没有其他方式可以改善这种情况?那么编译器可以利用其对平台内存保护模型的了解。例如,一旦访问了x86上的4K页面中的任何字节,就可以自由读取该页面上的所有其他字节。人们可以想象一个复杂的实现,它以安全的标量代码开始,但是一旦写入这两个数组被注意到切换到页面其余部分的矢量化循环中。



如果数组访问是对齐的,则可以使用类似的技巧:向量化的循环可以检查flag数组是否统一为0或一致1,如果不是,则使用直接的无条件非屏蔽读实现是安全的,回到更加谨慎的实施。如果口罩很少统一,或几乎总是统一的,那么这种转变显然只会是有利可图的,所以可能不太可能在实践中实施。






2 至少如果AVX可用: icc 仍然无法矢量化第一个例子,如果你将它限制在AVX之前的说明中,因为那是 vpmaskmovd / q vmaskmovps / pd

3 因为在这种情况下,如果你已经确定掩码是统一的,那么你可以无条件地实现这个操作,只要的选定的一边根据是否统一 - 0 或制服 - 1 没有任何掩蔽/混合。所以你最终得到了三个在内部实现的循环:全零标志情况,全一标志情况和混合标志情况,当下一个标志向量与当前循环不相同时,在它们之间跳转。


I'm confused as to what masking can do in theory in relation to branches. Let's say I have a Skylake-SP (ha, I wish..), and we're ignoring compiler capabilities, just what's possible in theory:

If a branch conditional is dependant on a static flag, and all branches set an array to a computational result, assuming the compiler does not optimize this to two separate loops anyways, can it vectorize?

do i = 1, nx
  if (my_flag .eq. 0) then
    a(i) = b(i) ** 2
  else
    a(i) = b(i) ** 3
  end if
end do

If only as subset of the branches are setting the value in question, can it vectorize?

do i = 1, nx
  if (my_flag .eq. 0) then
    a(i) = b(i) ** 2
  end if
end do

If a branch conditional is in itself dependent on vector data, can it vectorize?

do i = 1, nx
  if (c(i) > 0) then
    a(i) = b(i) ** 2
  else
    a(i) = b(i) ** 3
  end if
end do

解决方案

Note: This answer mostly discusses a very specific memory-access issue when it comes to vectorization and it applies mostly at a conceptual level to transforming a series of scalar accesses to arrays into vectorized accesses without assuming anything about what portions of the underlying arrays are mapped. In languages like Fortran, the semantics of the language itself may guarantee that arrays are contiguously mapped, or bounds-checks before entering the loop might be enough to avoid the problem mentioned below.

This answer shouldn't be seen as a good treatment of vectorization in general and certainly not in Fortran specifically. A more comprehensive treatment of vectorization issues appears in another answer, which also specifically addresses AVX-512.


One often overlooked issue with vectorizing conditions is that compilers can vectorize conditional loops of the type you are interested in, via blending or other element-wise predication techniques, only if they can prove that the vectorization accesses the same elements as are accessed as in the scalar element-by-element implementation. If the instruction set doesn't offer an element-wise way to do vector loads respecting this condition, or if the compiler is unable to use them, this can effectively block vectorization.

Said another way, compilers can generally only fully vectorize with plain vector loads if all paths through the loop body access the same elements.

The underlying reason is that the compiled code must not access elements that aren't accessed by the semantics of the original code, even if they are later "blended away" since doing so might cause a fault! If the instruction set doesn't provide instructions to conditionally access elements in memory and suppress faults from not-selected elements, this is a significant barrier to optimization.

In the examples you gave this means that (1) and (3) could be vectorized "without hoisting the condition" while (2) could not, since (2) accesses a[i] and b[i] only in the if body, but not if the if isn't executed. Of course, a real compiler would just hoist a trivial flag check out of the loop and just not execute the loop at all in the myflag == false case, so it's not really a good example.

Let's just look at a couple of cases that subsumes all your examples. First, we need a flag that cannot be hoisted - let's just use an array of bool values. So an interesting somewhat general loop with an output array a, two input arrays b and c and a flag array f might look something like:

do i = 1, nx
  if (f(i) > 0) then
    a(i) = g(b(i), c(i));
  else
    a(i) = h(b(i), c(i));
  end if
end do

Depending on the flag f(i) corresponding to each element, we apply either the function g or h to the input elements b(i) and c(i). Per my condition above we can vectorize only if both g and h actually access the same elements of b and c.

Let's move on to two real work examples of the above:

void example1(bool* f, int* __restrict__ a, int* __restrict__ b, int* __restrict__ c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (f[i]) {
            a[i] = b[i];
        } else {
            a[i] = c[i];
        }
    }
}

void example2(bool* f, int* __restrict__ a, int* __restrict__ b, int* __restrict__ c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (f[i]) {
            a[i] = b[i] + c[i] ;
        } else {
            a[i] = b[i] - c[i] * 2 + 1 ;
        }
    }
}

Both have the same basic form, but which is harder to vectorize? The first is a simple direct assignment of either b[i] or c[i] depending on the flag. The second is a more complex function of both b[i] and c[i] which are significantly different on both paths.

Well the second is much easier to vectorize since it accesses b[i] and c[i] unconditionally. In fact, gcc doesn't manage to vectorize either one for some reason. clang only vectorizes the second. Somewhat surprisingly icc manages to vectorize both - since it is smart enough to use vpmaskmovd which is a masked load that suppresses faults for unloaded elements.

You can examine the generated assembly on godbolt.

I had originally started this answer with the idea that accessing different array elements is currently an insurmountable barrier to vectorization for current compilers, but that's because I usually don't check icc. It's actually news to me that icc uses masked moves in this way. So the barrier is there, but at least some compilers can fault over it2.

As the developer, you usually know that both arrays are fully accessible, such that it is safe to access all elements of b and c in the range [0, n) and it would be nice to communicate that to the compiler. I've tried adding unconditional dummy statements like b[i] = b[i]; c[i] = c[i]; or ... + c[i] * 0 which should compile to nothing but at least allow the compiler to see that semantically all elements are accessed. The do indeed "compile away" but the code-generation is not improved: additional vectorization doesn't occur. Probably they are already eliminated early in the compilation process before the vectorizaton analysis is done, so that information is lost to the vectorizer.

Other than the masked-move instructions, which aren't free and are not fully general, are there any other ways this situation could be improved? Well a compiler could take advantage of its knowledge of the platform memory protection model. For example, once any byte in a 4K page on x86 has been accessed, it is free to read all other bytes on that page. One could imagine a complicated implementation that started out in safe scalar code but as soon as a write to both arrays was "noticed" switched over to a vectorized loop for the remainder of the page.

Similar tricks could be played if the array accesses were aligned: the vectorized loop could check that if flag array was uniformly 0 or uniformly 1, if not it is safe to use the straightforward unconditional unmasked read implementation, otherwise it would fall back to the more careful implementation. Such a transformation would evidently only be profitable if the masks were rarely uniform, or almost always uniform3, and so are probably unlikely to be implemented in practice.


2 At least if AVX is available: icc will still fail to vectorize the first example if you restrict it to pre-AVX instructions, since that's when vpmaskmovd/q and vmaskmovps/pd were introduced.

3 Since in that case if you've already determined the mask is uniform, you can implement the operation unconditionally by just doing the selected side of the if without any masking/blending based on whether it was uniform-0 or uniform-1. So you end up with three loops that internally implement: the all-zeros flag case, the all-ones flag case, and the mixed flag case, with jumps between them when the next vector of flags isn't the same as the current loop.

这篇关于AVX-512和分支的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆