Why does this simple C++ SIMD benchmark run slower when SIMD instructions are used?
Question
I'm thinking about writing a SIMD vector math library, so as a quick benchmark I wrote a program that does 100 million (4 float) vector element-wise multiplications and adds them to a cumulative total. For my classic, non-SIMD variation I just made a struct with 4 floats and wrote my own multiply function "multiplyTwo" that multiplies two such structs element wise and returns another struct. For my SIMD variation I used "immintrin.h" along with __m128, _mm_set_ps, and _mm_mul_ps. I'm running on an i7-8565U processor (whiskey lake) and compiling with: g++ main.cpp -mavx -o test.exe
to enable the AVX extension instructions in GCC.
The weird thing is that the SIMD version takes about 1.4 seconds, while the non-SIMD version takes only 1 second. I feel as though I'm doing something wrong, as I thought the SIMD version should run 4 times faster. Any help is appreciated; the code is below. I've placed the non-SIMD code in comments; the code in its current form is the SIMD version.
#include "immintrin.h" // for AVX
#include <iostream>
struct NonSIMDVec {
    float x, y, z, w;
};

NonSIMDVec multiplyTwo(const NonSIMDVec& a, const NonSIMDVec& b);

int main() {
    union { __m128 result; float res[4]; };
    // union { NonSIMDVec result; float res[4]; };

    float total = 0;
    for (unsigned i = 0; i < 100000000; ++i) {
        __m128 a4 = _mm_set_ps(0.0000002f, 1.23f, 2.0f, (float)i);
        __m128 b4 = _mm_set_ps((float)i, 1.3f, 2.0f, 0.000001f);
        // NonSIMDVec a4 = {0.0000002f, 1.23f, 2.0f, (float)i};
        // NonSIMDVec b4 = {(float)i, 1.3f, 2.0f, 0.000001f};

        result = _mm_mul_ps(a4, b4);
        // result = multiplyTwo(a4, b4);

        total += res[0];
        total += res[1];
        total += res[2];
        total += res[3];
    }
    std::cout << total << '\n';
}
NonSIMDVec multiplyTwo(const NonSIMDVec& a, const NonSIMDVec& b) {
    // Element-wise multiply, matching the SIMD _mm_mul_ps version.
    return {a.x*b.x, a.y*b.y, a.z*b.z, a.w*b.w};
}
Answer
With optimization disabled (the gcc default is -O0), intrinsics are often terrible. Anti-optimized -O0 code-gen for intrinsics usually hurts a lot (even more than for scalar), and some of the function-like intrinsics introduce extra store/reload overhead. Plus, the extra store-forwarding latency of -O0 tends to hurt more because there's less ILP when you do things with 1 vector instead of 4 scalars.
Use gcc -march=native -O3
But even with optimization enabled, your code is still written to destroy the performance of SIMD by doing a horizontal add of each vector inside the loop. See How to Calculate Vector Dot Product Using SSE Intrinsic Functions in C for how not to do that: use _mm_add_ps to accumulate a __m128 total vector, and only horizontally sum it outside the loop.
You bottleneck your loop on FP-add latency by doing scalar total += inside the loop. That loop-carried dependency chain means your loop can't run any faster than 1 float per 4 cycles on your Skylake-derived microarchitecture, where addss latency is 4 cycles. (https://agner.org/optimize/)
Even better than a single __m128 total, use 4 or 8 vectors to hide FP add latency, so your SIMD loop can bottleneck on mul/add (or FMA) throughput instead of latency.
Once you fix that, then as @harold points out, the way you're using _mm_set_ps inside the loop will result in pretty bad asm from the compiler. It's not a good choice inside a loop when the operands aren't constants, or at least loop-invariant.
Your example here is clearly artificial; normally you'd be loading SIMD vectors from memory. But if you did need to update a loop counter in a __m128 vector, you might use tmp = _mm_add_ps(tmp, _mm_set_ps(1.0, 0, 0, 0)). Or unroll with adding 1.0, 2.0, 3.0, and 4.0 so the loop-carried dependency is only the += 4.0 in the one element.
x + 0.0 is the identity operation even for FP (except maybe with signed zero), so you can do it to the other elements without changing them.
Or for the low element of a vector, you can use _mm_add_ss (scalar) to modify only it.