What are some rules of thumb for when SIMD would be faster? (SSE2, AVX)
Question
I have some code that operates on 3 symmetric sets of 3 asymmetric integer values at a time. There is a significant amount of conditional code and lots of constants.
This has become a perf bottleneck and I'm looking for some rules of thumb for when SIMD on 64-bit Intel/AMD CPUs would yield perf wins. The code is pretty long and I've never used SSE2 or AVX before, so it would be nice to have some idea of whether perf wins are possible or likely before I invest the time.
If you're willing to list the rules of thumb or point to an existing whitepaper on this, I'd appreciate it.
Answer
The sse tag wiki has a couple of guides to vectorization, including these slides from a talk, which are comprehensible on their own and have some great examples of transforming your data structures to enable vectorization (and classic pitfalls, like putting [x,y,z,w] geometry vectors into single SIMD vectors).
The classic use-case for SIMD is when there are a lot of independent operations, i.e. no serial dependencies inside the loop, like z[i] = d * x[i] + y[i]. Or, if there are, then only with associative operations that let you reorder (e.g. summing an array, or a similar reduction).
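The independent-operations case can be sketched with SSE2 intrinsics; this is an illustrative example (the function name is made up, and it assumes n is a multiple of 4 — a real version would add a scalar tail loop):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* z[i] = d * x[i] + y[i], 4 floats per iteration.  Every lane is
 * independent, so the loop maps straight onto SIMD registers. */
static void saxpy_sse2(float *z, const float *x, const float *y,
                       float d, int n)
{
    __m128 vd = _mm_set1_ps(d);          /* broadcast the scalar d */
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i); /* 4 contiguous elements */
        __m128 vy = _mm_loadu_ps(y + i);
        __m128 vz = _mm_add_ps(_mm_mul_ps(vd, vx), vy);
        _mm_storeu_ps(z + i, vz);        /* no cross-lane dependency */
    }
}
```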
Also important: being able to do it without a lot of shuffling; ideally all your data lines up "vertically" in vectors after loading from contiguous memory.
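A quick sketch of why layout matters (struct and function names here are invented for illustration): with an array-of-structs layout the x components are strided and need shuffles to gather, while a struct-of-arrays layout puts them contiguously so one load fills a vector with no shuffling.

```c
#include <emmintrin.h>

struct points_aos { float x, y, z; };          /* x values 12 bytes apart */
struct points_soa { float x[4], y[4], z[4]; }; /* x values contiguous */

/* With SoA, the four x components line up "vertically" in one vector
 * after a single contiguous load -- no shuffles required. */
static __m128 load_x_soa(const struct points_soa *p)
{
    return _mm_loadu_ps(p->x);
}
```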
Having many conditions that go different ways for adjacent elements is usually bad for SIMD. That requires a branchless implementation, so you have to do all the work of both sides of every branch, plus merging. Unless you can check that all 4 (or all 16, or whatever) elements in your vector go the same way, in which case you can branch once for the whole vector.
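The branchless pattern described above can be sketched like this (a hypothetical example: both sides of the branch are computed for every lane, then merged with a compare mask; scalar meaning is r[i] = (a[i] > b[i]) ? a[i] - b[i] : 0):

```c
#include <emmintrin.h>

static __m128i branchless_diff(__m128i a, __m128i b)
{
    __m128i mask = _mm_cmpgt_epi32(a, b); /* all-ones where a > b */
    __m128i diff = _mm_sub_epi32(a, b);   /* "taken" side, every lane */
    /* Merge: keep diff where the condition held, zero elsewhere
     * (the else-side here is just 0, so a single AND suffices). */
    return _mm_and_si128(mask, diff);
}
```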
There are some things that can be vectorized even though you might not have expected it, because they're exceptions to the usual rules of thumb. e.g. converting an IPv4 dotted-quad string into a 32-bit IPv4 address, or converting a string of decimal digits to an integer, i.e. implementing atoi(). These are vectorized with clever use of multiple different tricks, including a lookup table of shuffle masks for PSHUFB, with a vector-compare bitmap as the index into the LUT.
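One building block of those string tricks can be shown in isolation (a simplified sketch, not the full atoi: a vector compare turned into a bitmap with PMOVMSKB, the kind of value those implementations use as a LUT index). Here, bit i of the result is set iff byte i of a 16-byte chunk is an ASCII digit:

```c
#include <emmintrin.h>

static unsigned digit_bitmap(const char *s /* at least 16 bytes */)
{
    __m128i v = _mm_loadu_si128((const __m128i *)s);
    /* Bias so '0'..'9' maps to 0..9, then range-check with signed
     * compares (SSE2 has no unsigned byte compare). */
    __m128i biased = _mm_sub_epi8(v, _mm_set1_epi8('0'));
    __m128i ge0 = _mm_cmpgt_epi8(biased, _mm_set1_epi8(-1)); /* >= 0 */
    __m128i le9 = _mm_cmpgt_epi8(_mm_set1_epi8(10), biased); /* <= 9 */
    /* Collapse the per-byte compare result into a 16-bit bitmap. */
    return (unsigned)_mm_movemask_epi8(_mm_and_si128(ge0, le9));
}
```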
So once you know some tricks, you won't always rule out a vectorized implementation quickly just based on a few rules of thumb. Even serial dependencies can sometimes be worked around, like for SIMD prefix sums.
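The prefix-sum trick mentioned above can be sketched for a single 4-float vector (an illustrative example: log2(4) = 2 shift-and-add steps replace what looks like a serial dependency; a full array version would also carry the last lane's total into the next vector):

```c
#include <emmintrin.h>

/* After this, lane i holds v[0] + ... + v[i]. */
static __m128 prefix_sum4(__m128 v)
{
    /* Step 1: add a copy shifted up by one lane (4 bytes). */
    v = _mm_add_ps(v, _mm_castsi128_ps(
            _mm_slli_si128(_mm_castps_si128(v), 4)));
    /* Step 2: add a copy shifted up by two lanes (8 bytes). */
    v = _mm_add_ps(v, _mm_castsi128_ps(
            _mm_slli_si128(_mm_castps_si128(v), 8)));
    return v;
}
```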