What are some rules of thumb for when SIMD would be faster? (SSE2, AVX)
Question
I have some code that operates on 3 symmetric sets of 3 asymmetric integer values at a time. There is a significant amount of conditional code and lots of constants.
This has become a perf bottleneck and I'm looking for some rules of thumb for when SIMD on 64-bit Intel/AMD CPUs would yield perf wins. The code is pretty long and I've never used SSE2 or AVX before, so it would be nice to have some idea of whether perf wins are possible or likely before I invest the time.
If you're willing to list the rules of thumb or point to an existing whitepaper on this, I'd appreciate it.
Answer
The sse tag wiki has a couple of guides to vectorization, including these slides from a talk, which are comprehensible on their own and have some great examples of transforming your data structures to enable vectorization (and classic pitfalls, like putting [x,y,z,w] geometry vectors into single SIMD vectors).
The classic use-case for SIMD is when there are a lot of independent operations, i.e. no serial dependencies inside the loop, like z[i] = d * x[i] + y[i]. Or, if there are, then only with associative operations that let you reorder (e.g. summing an array, or a similar reduction).
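The independent-operations case can be sketched with SSE2 intrinsics; this is an illustrative example (the function name is made up, and it assumes n is a multiple of 4 — a real version would add a scalar tail loop):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* z[i] = d * x[i] + y[i], 4 floats per iteration.  Every lane is
 * independent, so the loop maps straight onto SIMD registers. */
static void saxpy_sse2(float *z, const float *x, const float *y,
                       float d, int n)
{
    __m128 vd = _mm_set1_ps(d);          /* broadcast the scalar d */
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i); /* 4 contiguous elements */
        __m128 vy = _mm_loadu_ps(y + i);
        __m128 vz = _mm_add_ps(_mm_mul_ps(vd, vx), vy);
        _mm_storeu_ps(z + i, vz);        /* no cross-lane dependency */
    }
}
```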
Also important: being able to do it without a lot of shuffling; ideally all your data lines up "vertically" in vectors after loading from contiguous memory.
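A quick sketch of why layout matters (struct and function names here are invented for illustration): with an array-of-structs layout the x components are strided and need shuffles to gather, while a struct-of-arrays layout puts them contiguously so one load fills a vector with no shuffling.

```c
#include <emmintrin.h>

struct points_aos { float x, y, z; };          /* x values 12 bytes apart */
struct points_soa { float x[4], y[4], z[4]; }; /* x values contiguous */

/* With SoA, the four x components line up "vertically" in one vector
 * after a single contiguous load -- no shuffles required. */
static __m128 load_x_soa(const struct points_soa *p)
{
    return _mm_loadu_ps(p->x);
}
```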
Having many conditions that go different ways for adjacent elements is usually bad for SIMD. That requires a branchless implementation, so you have to do all the work of both sides of every branch, plus merging. Unless you can check that all 4 (or all 16, or whatever) elements in your vector go the same way, in which case you can branch once for the whole vector.
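The branchless pattern described above can be sketched like this (a hypothetical example: both sides of the branch are computed for every lane, then merged with a compare mask; scalar meaning is r[i] = (a[i] > b[i]) ? a[i] - b[i] : 0):

```c
#include <emmintrin.h>

static __m128i branchless_diff(__m128i a, __m128i b)
{
    __m128i mask = _mm_cmpgt_epi32(a, b); /* all-ones where a > b */
    __m128i diff = _mm_sub_epi32(a, b);   /* "taken" side, every lane */
    /* Merge: keep diff where the condition held, zero elsewhere
     * (the else-side here is just 0, so a single AND suffices). */
    return _mm_and_si128(mask, diff);
}
```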
There are some things that can be vectorized even though you might not have expected it, because they're exceptions to the usual rules of thumb. e.g. converting an IPv4 dotted-quad string into a 32-bit IPv4 address, or converting a string of decimal digits to an integer, i.e. implementing atoi(). These are vectorized with clever use of multiple different tricks, including a lookup table of shuffle masks for PSHUFB, with a vector-compare bitmap as the index into the LUT.
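One building block of those string tricks can be shown in isolation (a simplified sketch, not the full atoi: a vector compare turned into a bitmap with PMOVMSKB, the kind of value those implementations use as a LUT index). Here, bit i of the result is set iff byte i of a 16-byte chunk is an ASCII digit:

```c
#include <emmintrin.h>

static unsigned digit_bitmap(const char *s /* at least 16 bytes */)
{
    __m128i v = _mm_loadu_si128((const __m128i *)s);
    /* Bias so '0'..'9' maps to 0..9, then range-check with signed
     * compares (SSE2 has no unsigned byte compare). */
    __m128i biased = _mm_sub_epi8(v, _mm_set1_epi8('0'));
    __m128i ge0 = _mm_cmpgt_epi8(biased, _mm_set1_epi8(-1)); /* >= 0 */
    __m128i le9 = _mm_cmpgt_epi8(_mm_set1_epi8(10), biased); /* <= 9 */
    /* Collapse the per-byte compare result into a 16-bit bitmap. */
    return (unsigned)_mm_movemask_epi8(_mm_and_si128(ge0, le9));
}
```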
So once you know some tricks, you won't always rule out a vectorized implementation quickly just based on a few rules of thumb. Even serial dependencies can sometimes be worked around, like for SIMD prefix sums.
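The prefix-sum trick mentioned above can be sketched for a single 4-float vector (an illustrative example: log2(4) = 2 shift-and-add steps replace what looks like a serial dependency; a full array version would also carry the last lane's total into the next vector):

```c
#include <emmintrin.h>

/* After this, lane i holds v[0] + ... + v[i]. */
static __m128 prefix_sum4(__m128 v)
{
    /* Step 1: add a copy shifted up by one lane (4 bytes). */
    v = _mm_add_ps(v, _mm_castsi128_ps(
            _mm_slli_si128(_mm_castps_si128(v), 4)));
    /* Step 2: add a copy shifted up by two lanes (8 bytes). */
    v = _mm_add_ps(v, _mm_castsi128_ps(
            _mm_slli_si128(_mm_castps_si128(v), 8)));
    return v;
}
```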